CrayPat
Quick start guide
The following instructions are for those who want to get CrayPat up and running quickly.
- module load xt-craypat
- ftn -c myprogram.f90 ; ftn -o myprogram myprogram.o
- pat_build -O apa ./myprogram
- /usr/bin/time -p aprun -n 128 ./myprogram+pat
- pat_report ./myprogram+pat*.xf
- module load apprentice2
- app2 ./myprogram+pat*.ap2
- pat_build -O <apafile>.apa
- /usr/bin/time -p aprun -n 128 ./myprogram+apa
- pat_report ./myprogram+apa*.xf
Read on for complete details.
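The quick start sequence can also be collected into a small script. The following is only a sketch: the variables PROG, NPROCS and DRY_RUN and the helper names run_step, first_pass and apa_pass are illustrative assumptions and are not part of CrayPat. By default it only prints the commands, and the name of the .apa file has to be supplied by you.

```shell
#!/bin/bash
# Dry-run sketch of the quick start sequence. With DRY_RUN=1 (the default
# here) each command is only printed; set DRY_RUN=0 on the Cray system to
# execute the commands. PROG, NPROCS, DRY_RUN and the function names are
# illustrative assumptions, not part of CrayPat.
PROG=${PROG:-./myprogram}
NPROCS=${NPROCS:-128}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# First pass: instrument with -O apa, run, and generate the report.
first_pass() {
    run_step module load xt-craypat
    run_step pat_build -O apa "$PROG"
    run_step /usr/bin/time -p aprun -n "$NPROCS" "$PROG+pat"
    run_step pat_report "$PROG"+pat*.xf
}

# Second pass: re-instrument from the .apa file given as $1, run, report.
apa_pass() {
    local apa_file=$1
    run_step pat_build -O "$apa_file"
    run_step /usr/bin/time -p aprun -n "$NPROCS" "$PROG+apa"
    run_step pat_report "$PROG"+apa*.xf
}

first_pass
```

Once the first pass has produced a .apa file, calling apa_pass with that file name prints (or runs) the second pass.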
Description
CrayPat is a performance analysis tool developed by Cray for the XT platform.
The CrayPat tool provides detailed information about application performance. It can be used for basic profiling, MPI tracing and hardware performance counter based analysis. CrayPat provides access to a wide variety of performance experiments that measure how an executable program consumes resources while it is running, as well as several different user interfaces that provide access to the experiment and reporting functions.
CrayPat consists of three major components:
- pat_build - used to instrument the program to be analyzed.
- pat_report - a standalone text report generator that can be used to further explore the data generated by instrumented program execution.
- Apprentice2 - a graphical analysis tool that can be used, in addition to pat_report, to further explore and visualize the data generated by instrumented program execution.
Usage
CrayPat enables you to sample, trace, measure and evaluate your program's behaviour during execution, and may help you find opportunities to significantly improve program performance.
- First, load the xt-craypat module to get the correct environment variables set.
module load xt-craypat
- Compile and link your application as usual.
ftn -c fcode.f90 ; ftn -o fexe fcode.o
cc -c ccode.c ; cc -o cexe ccode.o
- Instrument your code with pat_build. Use the -u flag to trace all user-defined functions in your program. Use the -g [group] flag to instrument all functions belonging to a specified trace group. For instance, to instrument MPI, I/O, heap and user function calls, type:
pat_build -u -g mpi,io,heap ./myprogram
or:
pat_build -O apa ./myprogram
- The name of the instrumented version of the executable will end with +pat, so in the example, the result will be ./myprogram+pat.
- Run the instrumented executable (by modifying your PBS batch job script).
/usr/bin/time -p aprun -n 128 ./myprogram+pat
- Upon successful execution, an experiment data file with the extension .xf is written, so in the example the result will be ./myprogram+pat+*.xf. By default, when you use pat_report to generate a report from one or more .xf files, pat_report also generates a corresponding .ap2 file with the same base name as the original executable. Data in .ap2 format can be viewed in text form using pat_report, or viewed and manipulated using the GUI tools in Cray Apprentice2. The most significant difference between the two formats is that .xf files require the original instrumented executable to be available, in order to map addresses to function names and source line numbers, while .ap2 files incorporate this mapping and are self-contained. The .ap2 format is therefore recommended if you wish to preserve the data for future reference. Use pat_report to generate a human-readable performance report.
pat_report ./myprogram+pat*.xf
pat_report ./myprogram+pat*.ap2
- Apprentice2 can also be used to analyze the results. Read on for more details about Apprentice2.
- If the program was instrumented with the -O apa flag, a .apa file is also produced once pat_report has processed the .xf data. That .apa file allows you to re-instrument your application for further analysis.
pat_build -O ./<apafile>.apa
/usr/bin/time -p aprun -n 128 ./myprogram+apa
pat_report ./myprogram+apa*.xf
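The run step above is normally placed in a PBS batch job script. As a minimal sketch, the hypothetical helper make_pat_job below prints such a script for a given instrumented executable and process count. The job name, the mppwidth and walltime requests, and the -j oe option are assumptions that must be adapted to the local batch configuration; only the aprun line comes from this page.

```shell
#!/bin/bash
# Print a PBS job script that runs an instrumented executable under aprun.
# make_pat_job is a hypothetical helper; the #PBS directives are assumed
# values and must be adapted to the local batch system.
make_pat_job() {
    local exe=$1      # instrumented executable, e.g. ./myprogram+pat
    local nprocs=$2   # number of processes passed to aprun -n
    cat <<EOF
#!/bin/bash
#PBS -N craypat_run
#PBS -l mppwidth=${nprocs}
#PBS -l walltime=00:30:00
#PBS -j oe

# Run from the directory the job was submitted from.
cd "\$PBS_O_WORKDIR"
/usr/bin/time -p aprun -n ${nprocs} ${exe}
EOF
}

# Write the generated job script to standard output.
make_pat_job ./myprogram+pat 128
```

Redirecting the output to a file and submitting that file with qsub would then run the instrumented executable.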
The trace groups that can be passed to the -g flag are:
Trace group | Description |
|---|---|
biolib | Cray Bioinformatics library routines |
blacs | Basic Linear Algebra communication subprograms |
blas | Basic Linear Algebra subprograms |
caf | Co-Array Fortran (Cray X2 systems only) |
fftw | Fast Fourier Transform library (64-bit only) |
hdf5 | manages extremely large and complex data collections |
heap | dynamic heap |
io | includes stdio and sysio groups |
lapack | Linear Algebra Package |
lustre | Lustre File System |
math | ANSI math |
mpi | MPI |
netcdf | network common data form (manages array-oriented scientific data) |
omp | OpenMP API (not supported on Catamount) |
omp-rtl | OpenMP runtime library (not supported on Catamount) |
portals | Lightweight message passing API |
pthreads | POSIX threads (not supported on Catamount) |
scalapack | Scalable LAPACK |
shmem | SHMEM |
stdio | all library functions that accept or return the FILE* construct |
sysio | I/O system calls |
system | system calls |
upc | Unified Parallel C (Cray X2 systems only) |
For more information, see Troubleshooting below.
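Since the -g flag takes a comma-separated list of trace groups, a small check against the table above can catch a mistyped group name before pat_build is called. This is only a sketch: KNOWN_GROUPS and check_trace_groups are hypothetical names, and the list is copied from the table on this page, so it may not match the groups supported by the installed CrayPat version.

```shell
#!/bin/bash
# Trace group names copied from the table on this page (assumption: the
# installed CrayPat version supports exactly these groups).
KNOWN_GROUPS="biolib blacs blas caf fftw hdf5 heap io lapack lustre math mpi netcdf omp omp-rtl portals pthreads scalapack shmem stdio sysio system upc"

# Return 0 if every name in the comma-separated list $1 is a known group.
check_trace_groups() {
    local list=$1 group status=0
    local IFS=','              # split the list on commas
    for group in $list; do
        case " $KNOWN_GROUPS " in
            *" $group "*) ;;   # known group, nothing to do
            *) echo "unknown trace group: $group" >&2
               status=1 ;;
        esac
    done
    return $status
}

# Example: only report success when the whole list is valid.
check_trace_groups mpi,io,heap && echo "trace groups OK"
```

A typical use would be to run pat_build -u -g mpi,io,heap ./myprogram only when the check succeeds.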
Cray Apprentice2
- Cray Apprentice2 is a post-processing performance data visualization tool. It will allow you to pinpoint problems in load balance, MPI overhead, I/O strategy and so on.
Usage
After you instrument a program for a performance analysis experiment, execute the instrumented program, and generate one or more performance analysis data files, use Cray Apprentice2 to explore the experiment data and generate a variety of interactive graphical reports. Cray Apprentice2 is a GUI tool that requires that your workstation support the X Window System. Depending on your system configuration, you may need to use the ssh -X option to enable X Window System support in your shell session.
- Load the apprentice2 module to get the correct environment variables set.
module load apprentice2
- Launch the Cray Apprentice2 graphical interface in order to visualize data.
app2 ./myprogram+pat*.ap2
Apprentice2 can produce a number of very informative plots of performance data. For more information, see Troubleshooting below.
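Because app2 needs an X display, it can be convenient to check for one before launching it. The wrapper below is a minimal sketch; launch_app2 is a hypothetical name and is not part of Cray Apprentice2.

```shell
#!/bin/bash
# Refuse to start the GUI when no X display is available.
# launch_app2 is a hypothetical wrapper, not part of Cray Apprentice2.
launch_app2() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; log in again with ssh -X" >&2
        return 1
    fi
    # Pass all arguments (the .ap2 files) through to app2.
    app2 "$@"
}
```

After module load apprentice2, calling launch_app2 ./myprogram+pat*.ap2 behaves like app2 itself when a display is available.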
Cray Apprentice2 on your PC
app2 is a 64-bit executable, so it cannot run on a 32-bit desktop PC. A 32-bit Linux desktop copy of Cray Apprentice2 is available in /apps/apprentice2-desktop for use on your local machine rather than on CSCS machines. This desktop version is provided on a convenience basis only.
Hardware performance counters
CrayPat can also provide hardware counter data.
- pat_hwpc is an alternative to pat_build and pat_report, used specifically to perform simplified hardware counter analysis experiments and generate reports from the resulting data.
The basic process of using pat_hwpc follows these steps.
- Load the xt-craypat module and compile your code. Programs compiled without the CrayPat module loaded cannot be used with pat_hwpc.
module load xt-craypat
ftn -c myprogram.f
ftn -o myprogram myprogram.o
- Set the runtime environment variable PAT_RT_HWPC to monitor hardware counter group 1.
export PAT_RT_HWPC=1
- The available counter groups are (see man hwpc):
Group | Description |
|---|---|
0 | Summary with instruction metrics |
1 | Summary with TLB metrics |
2 | L1 and L2 metrics |
3 | Bandwidth information |
4 | HyperTransport information (not supported on quad-core AMD Opteron processors) |
5 | Floating point mix |
6 | Cycles stalled, resources idle |
7 | Cycles stalled, resources full |
8 | Instructions and branches |
9 | Instruction cache |
10 | Cache hierarchy |
11 | Floating point operations mix (2) |
12 | Floating point operations mix (vectorization) |
13 | Floating point operations mix (SP) |
14 | Floating point operations mix (DP) |
15 | L3 (socket-level) |
16 | L3 (core-level reads) |
17 | L3 (core-level misses) |
18 | L3 (core-level fills caused by L2 evictions) |
19 | Prefetches |
- Instrument and run the program. pat_hwpc cannot be used to run an executable that has already been instrumented using pat_build.
pat_hwpc -E aprun -n 128 ./myprogram
- In this example, the -E option is required to force the pat_hwpc command to use the contents of the PAT_RT_HWPC environment variable. It is also possible to specify a hardware performance counter group directly on the command line:
pat_hwpc -g 1 aprun -n 128 ./myprogram
- By default, pat_hwpc monitors the following hardware counter and derived events (group 1): PAPI_L1_DCM, PAPI_L1_DCA, PAPI_TLB_DM, PAPI_FP_OPS and CYCLES_USER.
- Of 103 possible PAPI events, 40 are available, of which 8 are derived:
# | Name | Description |
|---|---|---|
01 | PAPI_L1_DCM | Level 1 data cache misses |
02 | PAPI_L1_ICM | Level 1 instruction cache misses |
03 | PAPI_L2_DCM | Level 2 data cache misses |
04 | PAPI_L2_ICM | Level 2 instruction cache misses |
05 | PAPI_L1_TCM | Level 1 cache misses (derived) |
06 | PAPI_L2_TCM | Level 2 cache misses |
07 | PAPI_L3_TCM | Level 3 cache misses |
08 | PAPI_FPU_IDL | Cycles floating point units are idle |
09 | PAPI_TLB_DM | Data translation lookaside buffer misses |
10 | PAPI_TLB_IM | Instruction translation lookaside buffer misses |
11 | PAPI_TLB_TL | Total translation lookaside buffer misses (derived) |
12 | PAPI_STL_ICY | Cycles with no instruction issue |
13 | PAPI_HW_INT | Hardware interrupts |
14 | PAPI_BR_TKN | Conditional branch instructions taken |
15 | PAPI_BR_MSP | Conditional branch instructions mispredicted |
16 | PAPI_TOT_INS | Instructions completed |
17 | PAPI_FP_INS | Floating point instructions |
18 | PAPI_BR_INS | Branch instructions |
19 | PAPI_VEC_INS | Vector/SIMD instructions |
20 | PAPI_RES_STL | Cycles stalled on any resource |
21 | PAPI_TOT_CYC | Total cycles |
22 | PAPI_L1_DCH | Level 1 data cache hits (derived) |
23 | PAPI_L2_DCH | Level 2 data cache hits (derived) |
24 | PAPI_L1_DCA | Level 1 data cache accesses |
25 | PAPI_L2_DCA | Level 2 data cache accesses |
26 | PAPI_L1_ICH | Level 1 instruction cache hits (derived) |
27 | PAPI_L2_ICH | Level 2 instruction cache hits |
28 | PAPI_L1_ICA | Level 1 instruction cache accesses |
29 | PAPI_L2_ICA | Level 2 instruction cache accesses |
30 | PAPI_L1_ICR | Level 1 instruction cache reads |
31 | PAPI_L1_TCH | Level 1 total cache hits (derived) |
32 | PAPI_L2_TCH | Level 2 total cache hits (derived) |
33 | PAPI_L1_TCA | Level 1 total cache accesses (derived) |
34 | PAPI_L2_TCA | Level 2 total cache accesses |
35 | PAPI_L3_TCR | Level 3 total cache reads |
36 | PAPI_FML_INS | Floating point multiply instructions |
37 | PAPI_FAD_INS | Floating point add instructions (Also includes subtract instructions) |
38 | PAPI_FDV_INS | Floating point divide instructions (Counts both divide and square root instructions) |
39 | PAPI_FSQ_INS | Floating point square root instructions (Counts both divide and square root instructions) |
40 | PAPI_FP_OPS | Floating point operations |
- Upon execution, pat_hwpc generates a report showing hardware counter data including a number of derived metrics and calculated values. These data files can be viewed and examined with pat_report or apprentice2.
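To collect data for several counter groups, the pat_hwpc command has to be repeated once per group. The following sketch prints one command line per requested group, using the -g form shown above. hwpc_commands is a hypothetical helper; the 0 to 19 range check simply mirrors the counter group table on this page and is an assumption about the installed version.

```shell
#!/bin/bash
# Print one pat_hwpc command line per requested counter group.
# hwpc_commands is a hypothetical helper; it only prints the commands so
# the list can be reviewed before it is placed in a batch job script.
hwpc_commands() {
    local prog=$1 nprocs=$2 group
    shift 2
    for group in "$@"; do
        # Accept only non-negative integers.
        case $group in
            ''|*[!0-9]*)
                echo "not a counter group number: $group" >&2
                return 1 ;;
        esac
        # Groups 0 to 19 are the ones listed in the counter group table.
        if [ "$group" -gt 19 ]; then
            echo "counter group out of range: $group" >&2
            return 1
        fi
        echo "pat_hwpc -g $group aprun -n $nprocs $prog"
    done
}

# Example: TLB metrics, L1/L2 metrics and floating point mix.
hwpc_commands ./myprogram 128 1 2 5
```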
Troubleshooting
- This page describes the basics of xt-craypat and apprentice2. For further information, please check the Cray documentation.
- See also the following man pages: intro_craypat, craypat, pat_build, pat_report, pat_hwpc, hwpc and app2. The pat_help command provides additional help.
- If all else fails, please contact the helpdesk.
