PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications
- Slides: 24

PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications
Martin Burtscher, Byoung-Do Kim, Jeff Diamond, John McCalpin, Lars Koesterke, and James Browne
The University of Texas at Austin

Motivation
§ Problem: HPC systems operate far below peak
  § Performance optimization complexity is growing
§ Status: Most performance tools are hard to use
  § Require detailed performance and system expertise
  § HPC application developers are domain experts
§ Result: HPC programmers do not use these tools
  § 75% of users haven't used a performance tool on Ranger
  § Do not know how to apply the information

Performance Counter Tool Workflow
§ Basic tools [mostly manual]
  § Selecting performance counters
  § Running multiple measurements
  § Collecting performance data
  § Identifying bottlenecks
  § Searching for proper optimization method
  § Implementing optimization
§ Basic tools provide no aid with:
  § Counter selection
    § 100s of possibilities
    § cryptic descriptions
    § unclear what is counted
  § Result interpretation
    § Is there a problem?
    § What is the problem?
  § Solution finding
    § How do I fix it?
§ PerfExpert [mostly automated]
  § Automatic (for core, chip, & node-level bottlenecks) performance counter selection, measurement execution, data collection, bottleneck diagnosis, and optimization suggestion based on several categories
  § Selecting and implementing optimization
§ PerfExpert features:
  § Automatic bottleneck detection & analysis
    § at core/chip/node level
  § Recommends remedy
    § includes code examples & compiler switches
  § Simple user interface
    § use provided job script
    § intuitive output

Overview
§ PerfExpert case studies on four Ranger codes
  § Mangll: mantle advection production code (C)
  § Homme: atmospherics acceptance benchmark (F95)
  § Libmesh: Navier-Stokes example code (C++)
  § Asset: astrophysics production code (F90)
§ Step-by-step usage example
§ Internal operation and performance metric
§ Future work
§ Summary

Mantle Advection Case Study
§ Found code to be memory bound
§ 40% speedup due to node-level optimizations

    total runtime in dgelastic.xml is 75.70 seconds

    Suggestions on how to alleviate performance bottlenecks are available at:
    http://www.tacc.utexas.edu/perfexpert/       <- URL to suggested optimizations

    dgae_RHS (59.8% of the total runtime)        <- procedure identifier
    --------------------------------------
    performance assessment   great...good...okay...bad....problematic
    - overall                >>>>>>>>>>>>>>>     <- overall loop performance is bad
    upper bound by category
    - data accesses          >>>>>>>>>>>>>>>>>>>>>>>>>+     <- most of runtime is due to data accesses
    - instruction accesses   >>>>>
    - floating-point instr   >>>>>>>>>>>>>>>>>>>>
    - branch instructions    >>
    - data TLB               >
    - instruction TLB        >

Atmospheric Circulation Case Study
§ Highlights scaling problem due to shared resources

    total runtime in homme-4x64.xml is 356.73 seconds
    total runtime in homme-16x16.xml is 555.43 seconds     <- comparing two experiments
    ...

    prim_advance_mod_mp_preq_advance_exp_ (runtimes are 86.35s and 159.20s)     <- second much worse than first
    --------------------------------------
    performance assessment   great...good...okay...bad....problematic
    - overall                >>>>>>>>>>>>>>>>222222222+
    upper bound by category
    - data accesses          >>>>>>>>>>>>>>>>>>>>>>>>>+
    - instruction accesses   >>>>>
    - floating-point instr   >>>>>>>>>>>>>>>>>>>>>>1
    - branch instructions    >
    - data TLB               >
    - instruction TLB        >

Navier-Stokes Case Study
§ Illustrates optimization-benefit tracking ability

    total runtime in ex18.xml is 144.78 seconds
    total runtime in ex18-cse.xml is 137.91 seconds
    ...

    NavierSystem::element_time_derivative (runtimes are 33.29s and 25.24s)
    --------------------------------------
    performance assessment   great...good...okay...bad....problematic
    - overall                >>>>>>>>222
    upper bound by category
    - data accesses          >>>>>>>>>>>>>>>>>>>>>2
    - instruction accesses   >>>>>>>
    - floating-point instr   >>>>>>>>>11111     <- optimization benefit
    - branch instructions    >
    - data TLB               >
    - instruction TLB        >

Astrophysical Case Study
§ Code has already been aggressively optimized

    total runtime in asset.xml is 52.25 seconds

    Suggestions on how to alleviate performance bottlenecks are available at:
    http://www.tacc.utexas.edu/perfexpert/

    calc_intens3s_vec_mexp (27.6% of the total runtime)
    --------------------------------------
    performance assessment   great...good...okay...bad....problematic
    - overall                >>>>>>>>>     <- performance is already good
    upper bound by category
    - data accesses          >>>>>>>>>>>>>>>>>>>>>>
    - instruction accesses   >>>>>
    - floating-point instr   >>>>>>>>>>>>>>>>>>>
    - branch instructions    >
    - data TLB               >
    - instruction TLB        >

Step-by-Step Usage Example
§ Scenario
  § Developer's HPC code performs poorly
  § May know code section but not how to accelerate it
§ Example: matrix-matrix multiplication
  § Coded inefficiently for illustration purposes
§ PerfExpert reports where the slow code is, why it performs poorly, and suggests how to improve it

PerfExpert Output for MMM Code

    total runtime in mmm1.xml is 3.74 seconds

    Suggestions on how to alleviate performance bottlenecks are available at:
    http://www.tacc.utexas.edu/perfexpert/

    matrixproduct (100.0% of the total runtime)
    --------------------------------------
    ...

    loop at line 25 in matrixproduct (99.7% of the total runtime)     <- loop identifier (if compiled with "-g")
    --------------------------------------
    performance assessment    LCPI  good...okay...fair...poor...bad..
    - overall                  9.6  >>>>>>>>>>>>>>>>>>>>>>>+     <- overall loop performance is bad
    upper bound by category
    - data accesses           14.7  >>>>>>>>>>>>>>>>>>>>>>>+
    - instruction accesses     0.6  >>>>>>
    - data TLB                 9.9  >>>>>>>>>>>>>>>>>>>>>>>+     <- most of runtime is due to data TLB and data accesses
    - instruction TLB          0.0  >
    - branch instructions      0.1  >
    - floating-point instr     3.0  >>>>>>>>>>>>>>>

Optimize Critical Code Section
§ Loop nest around line 25

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
          c[i][j] += a[i][k] * b[k][j];

§ Identified main bottleneck
  § Cause: memory accesses & data TLB
§ Focus on data TLB problem first
  § No need to know what a data TLB is; it is just used as a label to locate the corresponding optimizations on the web page

Data TLB Optimization Suggestions
1) Improve the data locality
  a) use superpages (larger page sizes)
       not yet enabled on all Ranger nodes
  b) change the order of loops
       loop i {...} loop j {...} → loop j {...} loop i {...}
  c) employ loop blocking and interchange (change the order of the memory accesses)
       loop i {loop k {loop j {c[i][j] = c[i][j] + a[i][k] * b[k][j];}}}
       → loop k step s {loop j step s {loop i {
           for (kk = k; kk < k + s; kk++) {for (jj = j; jj < j + s; jj++) {
             c[i][jj] = c[i][jj] + a[i][kk] * b[kk][jj];}}}}}
2) Reduce the data size                                   <- suggested remedy
  a) use smaller types (e.g., float instead of double or short instead of int)
       double a[n]; → float a[n];                         <- code example
       use the "-fpack-struct" compiler flag              <- compiler flag example
  b) allocate an array of elements instead of each element individually
       loop {... c = malloc(1); ...}
       → top = n; loop {if (top == n) {tmp = malloc(n); top = 0;} ... c = &tmp[top++]; ...}
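Suggestion 2b is given only as pseudocode on the slide. A minimal C sketch of the same idea follows; the names elem_t, pool_t, pool_init, and pool_alloc are hypothetical and do not come from PerfExpert. Elements are handed out from one contiguous array, so consecutive allocations land next to each other in memory instead of wherever separate malloc calls place them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element type, used only for illustration. */
typedef struct { double value; } elem_t;

/* Hands out elements from one contiguous array and requests a new
   array only when the current one is used up. */
typedef struct {
    elem_t *chunk;  /* current array of elements */
    size_t  top;    /* index of the next free element in chunk */
    size_t  n;      /* number of elements per chunk */
} pool_t;

static void pool_init(pool_t *p, size_t n) {
    assert(n > 0);
    p->chunk = NULL;
    p->top = n;     /* top == n forces an allocation on first use */
    p->n = n;
}

/* Returns a pointer to the next element, or NULL if malloc fails.
   Simplification: exhausted chunks are not tracked, so this sketch
   never frees them. */
static elem_t *pool_alloc(pool_t *p) {
    if (p->top == p->n) {
        elem_t *fresh = malloc(p->n * sizeof *fresh);
        if (fresh == NULL) return NULL;
        p->chunk = fresh;
        p->top = 0;
    }
    return &p->chunk[p->top++];
}
```

A production version would also need to remember every chunk so that the memory can be released; that bookkeeping is left out here to keep the correspondence with the slide's pseudocode visible.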

Eliminate Inapplicable Suggestions
1) Improve the data locality
  a) use superpages (larger page sizes)
       not yet enabled on all Ranger nodes
  b) change the order of loops
       loop i {...} loop j {...} → loop j {...} loop i {...}
  c) employ loop blocking and interchange (change the order of the memory accesses)
       loop i {loop k {loop j {c[i][j] = c[i][j] + a[i][k] * b[k][j];}}}
       → loop k step s {loop j step s {loop i {
           for (kk = k; kk < k + s; kk++) {for (jj = j; jj < j + s; jj++) {
             c[i][jj] = c[i][jj] + a[i][kk] * b[kk][jj];}}}}}
2) Reduce the data size
  a) use smaller types (e.g., float instead of double or short instead of int)
       double a[n]; → float a[n];
       use the "-fpack-struct" compiler flag
  b) allocate an array of elements instead of each element individually
       loop {... c = malloc(1); ...}
       → top = n; loop {if (top == n) {tmp = malloc(n); top = 0;} ... c = &tmp[top++]; ...}

Try Remaining Suggestions
§ Start with suggestion 1b because it is simpler

    1) Improve the data locality
      b) change the order of loops
           loop i {...} loop j {...} → loop j {...} loop i {...}

§ Exchange the j and k loops of the loop nest

    for (i = 0; i < n; i++)
      for (k = 0; k < n; k++)
        for (j = 0; j < n; j++)
          c[i][j] += a[i][k] * b[k][j];

§ Assess transformed code with PerfExpert
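The loop exchange can be written out as two complete functions to show that only the traversal order changes. This is a sketch under assumptions not stated on the slides: the matrices are stored as flat row-major arrays of double, and the function names mmm_ijk and mmm_ikj are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Original order (i, j, k): the innermost loop steps through b with a
   stride of n elements, so each iteration touches a different row of b. */
static void mmm_ijk(size_t n, const double *a, const double *b, double *c) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Exchanged order (i, k, j): the innermost loop walks along one row of b
   and one row of c, i.e. through consecutive memory locations. */
static void mmm_ikj(size_t n, const double *a, const double *b, double *c) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Both versions add the products into each c element in the same k order, so they compute the same result; the caller is expected to zero c beforehand if a plain product is wanted.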

Output after Loop Exchange

    total runtime in mmm2.xml is 1.45 seconds     <- runtime is lower

    Suggestions on how to alleviate performance bottlenecks are available at:
    http://www.tacc.utexas.edu/perfexpert/
    ...

    loop at line 25 in matrixproduct (99.5% of the total runtime)
    --------------------------------------
    performance assessment    LCPI  good...okay...fair...poor...bad..
    - overall                  4.1  >>>>>>>>>>>>>>>>>>>>>     <- overall loop performance is better but still bad
    upper bound by category
    - data accesses            3.6  >>>>>>>>>>>>>>>>>>     <- data accesses should be optimized next
    - instruction accesses     0.5  >>>>>
    - data TLB                 0.0  >     <- data TLB is no longer a problem
    - instruction TLB          0.0  >
    - branch instructions      0.1  >
    - floating-point instr     3.3  >>>>>>>>>>>>>>>>>

Data Access Optimization Suggestions
1) Reduce the number of memory accesses
  a) move loop invariant memory accesses out of loop
       loop i {a[i] = b[i] * c[j];} → temp = c[j]; loop i {a[i] = b[i] * temp;}
  b) ...
2) Improve the data locality
  a) componentize important loops by factoring them into their own subroutines
       ... loop i {...} ... loop j {...} ... → void li() {...}; void lj() {...}; ... li(); ... lj(); ...
  b) employ loop blocking and interchange (change the order of the memory accesses)     <- we will pick this one as it was already suggested before
       loop i {loop k {loop j {c[i][j] = c[i][j] + a[i][k] * b[k][j];}}}
       → loop k step s {loop j step s {loop i {
           for (kk = k; kk < k + s; kk++) {for (jj = j; jj < j + s; jj++) {
             c[i][jj] = c[i][jj] + a[i][kk] * b[kk][jj];}}}}}
  c) ...
3) Reduce the data size
  ...
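Suggestion 1a can be illustrated with a before/after pair in C. This is a sketch rather than PerfExpert's own example text: the function names are hypothetical, and the two versions are equivalent only under the assumption that a does not overlap c[j].

```c
#include <assert.h>
#include <stddef.h>

/* Before: c[j] is named inside the loop. Unless the compiler can prove
   that writes to a[] never change c[j], it has to reload c[j] from
   memory on every iteration. */
static void scale_in_loop(size_t n, double *a, const double *b,
                          const double *c, size_t j) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[j];
}

/* After: the loop-invariant value is read once into a local variable,
   which the compiler can keep in a register.
   Assumption: a[] does not alias c[j]. */
static void scale_hoisted(size_t n, double *a, const double *b,
                          const double *c, size_t j) {
    assert(a != NULL && b != NULL && c != NULL);
    const double temp = c[j];
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * temp;
}
```

Optimizing compilers often perform this hoisting on their own when aliasing can be ruled out; doing it by hand makes the intent explicit when it cannot.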

Try Loop Blocking Suggestion
§ Blocked loop code (blocking factor s = 70)

    /* assumes n is a multiple of s */
    for (k = 0; k < n; k += s) {
      for (j = 0; j < n; j += s) {
        for (i = 0; i < n; i++) {
          for (kk = k; kk < k + s; kk++) {
            for (jj = j; jj < j + s; jj++) {
              c[i][jj] += a[i][kk] * b[kk][jj];
            }
          }
        }
      }
    }
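The slide's blocked loop reads past the end of the arrays unless n is a multiple of the blocking factor s. A minimal sketch of a variant that clips the last block is shown here; it assumes flat row-major arrays of double, and the names mmm_blocked and min_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked (tiled) product c += a * b for row-major n x n matrices.
   s is the blocking factor; blocks at the matrix edge are clipped to n,
   so n does not have to be a multiple of s. */
static void mmm_blocked(size_t n, size_t s,
                        const double *a, const double *b, double *c) {
    assert(s > 0);
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t k = 0; k < n; k += s) {
        const size_t k_end = min_size(k + s, n);
        for (size_t j = 0; j < n; j += s) {
            const size_t j_end = min_size(j + s, n);
            for (size_t i = 0; i < n; i++) {
                for (size_t kk = k; kk < k_end; kk++) {
                    for (size_t jj = j; jj < j_end; jj++) {
                        c[i * n + jj] += a[i * n + kk] * b[kk * n + jj];
                    }
                }
            }
        }
    }
}
```

The blocking factor is a tuning parameter. The slide uses s = 70 for its test case; a suitable value on another machine depends on cache and TLB sizes and is best found by measurement.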

Output after Loop Blocking

    total runtime in mmm3.xml is 0.28 seconds     <- runtime is much lower

    Suggestions on how to alleviate performance bottlenecks are available at:
    http://www.tacc.utexas.edu/perfexpert/
    ...

    loop at line 28 in matrixproduct (98.8% of the total runtime)
    --------------------------------------
    performance assessment    LCPI  good...okay...fair...poor...bad..
    - overall                  0.6  >>>>>>     <- overall loop performance is now good
    upper bound by category
    - data accesses            2.1  >>>>>>>>>>>
    - instruction accesses     0.6  >>>>>>
    - data TLB                 0.0  >
    - instruction TLB          0.0  >
    - branch instructions      0.2  >>
    - floating-point instr     2.5  >>>>>>>>>>>>>

Usage Example Summary
§ Performance is greatly improved
  § Optimization process guided by PerfExpert
  § Runtime dropped by 13x
§ Memory access and data TLB problems fixed
  § PerfExpert correctly identified these bottlenecks
  § Suggested useful code optimizations
  § Helped verify the resolution of the problem
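The 13x figure can be checked against the total runtimes that PerfExpert printed for mmm1.xml, mmm2.xml, and mmm3.xml (3.74 s, 1.45 s, and 0.28 s). The per-step factors below are derived from those numbers and rounded to two decimals; they are not stated on the slides.

```latex
\frac{3.74}{1.45} \approx 2.58 \;\text{(loop exchange)}, \qquad
\frac{1.45}{0.28} \approx 5.18 \;\text{(loop blocking)}, \qquad
\frac{3.74}{0.28} \approx 13.36 \;\text{(overall)}
```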

Internal PerfExpert Operation
§ Gather performance counter measurements
  § Multiple runs with HPCToolkit (PAPI & native counters)
  § Sampling-based results for procedures and loops
§ Combine and check results
  § Check variability, runtime, consistency, and integrity
§ Compute metrics and output assessment
  § Only for most important code sections
  § Correlate results from different runs

PerfExpert's Performance Metric
§ Local Cycles Per Instruction (LCPI)
  § Compute upper bounds on CPI contribution for various categories (e.g., branches, memory) and code sections
  § Branches: (BR_INS * BR_lat + BR_MSP * BR_miss_lat) / TOT_INS
  § Data accesses: (L1_DCA * L1_dlat + L2_DCA * L2_lat + L2_DCM * Mem_lat) / TOT_INS
  § Performance counter results: BR_INS, BR_MSP, L1_DCA, L2_DCA, L2_DCM, TOT_INS
  § System parameters: BR_lat, BR_miss_lat, L1_dlat, L2_lat, Mem_lat
§ Benefits
  § Highlights key aspects and hides misleading details
  § Relative metric (less susceptible to non-determinism)
  § Easily extensible (to refine or add more categories)
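The two LCPI bounds on this slide translate directly into code. The sketch below is an illustration, not PerfExpert's implementation: the struct and function names are hypothetical, and the latency values have to be supplied by the caller for the machine being measured.

```c
#include <assert.h>

/* Performance counter results for one code section
   (field names follow the counter names on the slide). */
typedef struct {
    double tot_ins;  /* TOT_INS: total instructions */
    double br_ins;   /* BR_INS:  branch instructions */
    double br_msp;   /* BR_MSP:  mispredicted branches */
    double l1_dca;   /* L1_DCA:  L1 data cache accesses */
    double l2_dca;   /* L2_DCA:  L2 data cache accesses */
    double l2_dcm;   /* L2_DCM:  L2 data cache misses */
} counters_t;

/* System parameters, in cycles. */
typedef struct {
    double br_lat;       /* BR_lat */
    double br_miss_lat;  /* BR_miss_lat */
    double l1_dlat;      /* L1_dlat */
    double l2_lat;       /* L2_lat */
    double mem_lat;      /* Mem_lat */
} sys_params_t;

/* Upper bound on the CPI contribution of branch instructions. */
static double lcpi_branch(const counters_t *c, const sys_params_t *p) {
    assert(c->tot_ins > 0);
    return (c->br_ins * p->br_lat + c->br_msp * p->br_miss_lat) / c->tot_ins;
}

/* Upper bound on the CPI contribution of data accesses. */
static double lcpi_data(const counters_t *c, const sys_params_t *p) {
    assert(c->tot_ins > 0);
    return (c->l1_dca * p->l1_dlat + c->l2_dca * p->l2_lat
            + c->l2_dcm * p->mem_lat) / c->tot_ins;
}
```

Because each bound charges the full latency for every event and ignores overlap between operations, a category's bound can exceed the measured overall LCPI, as in the mmm1.xml output where data accesses show 14.7 against an overall value of 9.6.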

Related Work
§ Automatic bottleneck analysis and remediation
  § PERCS project at IBM Research
    § Less automation for bottleneck identification and analysis
    § Not open source
  § PERI Autotuning project
  § Parallel Performance Wizard
    § Event trace analysis, program instrumentation
§ Analysis tools with automated diagnosis
§ Projects that target multicore optimizations

Future Work
§ More case studies
  § Applications with various bottlenecks to harden tool
§ Port to other systems: AMD, Intel, Power & GPU
§ Make PerfExpert available for general download
§ Improve and expand capabilities
  § Finer-grained recommendations
  § Add data structure based analyses and optimizations
  § Automatic implementation of solutions to common core, chip and node-level performance bottlenecks

Summary
§ PerfExpert performance diagnosis tool
  § Automates measurement and analysis
  § Uses new LCPI metric to compare counter results
  § Recommends optimizations for each bottleneck
§ Easy-to-use interface and understandable output
  § code sections sorted by importance
  § longer bars mean more important to optimize
§ Try it out on Ranger!
§ Acknowledgments
  § Omar Ghattas' group, John Mellor-Crummey's group
  § National Science Foundation OCI award #0622780
Source: https://slidetodoc.com/perf-expert-an-easytouse-performance-diagnosis-tool-for/