
Performance Analysis

As high-performance computing resources become larger and more heterogeneous, using them to their full potential for scientific research becomes increasingly challenging. CSCS intends to extend the production projects allocation process to include performance information about user applications in order to improve the allocation of compute resources, thus enabling more results of scientific merit.

CRAY Performance Analysis Tool (CrayPat)

CrayPat is the recommended performance analysis tool developed by Cray for the XT/XE/XK platforms.

CrayPat provides detailed information about application performance. It can be used for basic profiling, MPI tracing and hardware performance counter based analysis. CrayPat provides access to a wide variety of performance experiments that measure how an executable program consumes resources while it is running, as well as several different user interfaces that provide access to the experiment and reporting functions.

In the following paragraphs we describe how to collect the performance data required for the extended production project proposal form.

First, you need to instrument your application with CrayPat. Load the perftools module,

$ module load perftools
 

and then recompile your application as normal (you may need to compile and link in separate steps).

Then, use pat_build to produce an instrumented executable:

$ pat_build -g mpi,io <executable>  
 

This will create a new binary called "<executable>+pat". This needs to be launched with aprun, just like a normal executable.
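
The build steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the program name myapp and its single source file myapp.c are hypothetical, and cc is taken to be the Cray C compiler wrapper (use ftn or CC for Fortran or C++). Setting DRY_RUN=1 prints the commands instead of running them, which is handy for checking the sequence before a real build.

```shell
#!/bin/bash
# Sketch: compile, link and instrument a hypothetical MPI program.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

instrument_app() {
    local app="$1"                       # hypothetical executable name, e.g. myapp
    run cc -c "${app}.c" -o "${app}.o"   # compile step
    run cc "${app}.o" -o "${app}"        # separate link step
    run pat_build -g mpi,io "${app}"     # produces ${app}+pat
}

# Usage, after "module load perftools":
#   instrument_app myapp
```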

Run your benchmarks with the new executable. Make sure that the environment variable PAT_RT_HWPC=1 is set in your SLURM job script and that the job runs from /scratch:

$ cd $SCRATCH ; export PAT_RT_HWPC=1 
or, for csh/tcsh:
$ cd $SCRATCH ; setenv PAT_RT_HWPC 1
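
A complete job script might look like the one generated by the helper below. It is a sketch, not a CSCS template: the job name, the time limit, and the assumption that the instrumented binary has been copied to $SCRATCH are all illustrative, and write_job_script is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: write a SLURM job script that runs the instrumented binary.
# Job name, time limit and binary location are assumptions.
write_job_script() {
    local exe="$1" ntasks="$2" out="$3"
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=craypat
#SBATCH --ntasks=${ntasks}
#SBATCH --time=00:30:00

export PAT_RT_HWPC=1
cd \$SCRATCH
aprun -n ${ntasks} ./${exe}+pat
EOF
}

# Usage:
#   write_job_script myapp 1024 job.sh
#   sbatch job.sh
```

Generating the script from a function keeps the core count in one place, so the #SBATCH request and the aprun launch cannot drift apart between benchmark runs.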
 

After the batch job has completed, one or more files with the extension ".xf" will have been produced. Run pat_report on the command line to process these files and produce the required performance data:

$ pat_report <executable>+pat+* > xf.txt
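
If the job failed or ran in a different directory, the glob above matches nothing and pat_report receives a literal pattern. A minimal guard, assuming the experiment data sits in the current directory and using a hypothetical helper name make_report, could look like this:

```shell
#!/bin/bash
# Sketch: run pat_report only when experiment data for the executable exists.
make_report() {
    local exe="$1" out="${2:-xf.txt}"
    # ls fails if the pattern matches no file or directory
    if ! ls -d "${exe}"+pat+* >/dev/null 2>&1; then
        echo "no CrayPat experiment data found for ${exe}" >&2
        return 1
    fi
    pat_report "${exe}"+pat+* > "$out"
}

# Usage:
#   make_report myapp xf.txt
```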
 

A summary of the performance data should be extracted and attached to the proposal as follows:

Number of cores                    1024
Wallclock time (sec)               6.28
Memory (MB/process)                33.65
MPI (% of total walltime)          38.1 %
MPI_SYNC (% of total walltime)     0.1 %
MPI call1 (% of total walltime)    MPI_WAIT 25.8 %
MPI call2 (% of total walltime)    MPI_FILE_ISEND 5.9 %
%peak (DP)                         2.8
PAPI FP OPS / process              2904681343
PAPI L1 DCM / process              130130744
Write Rate (MB; MB/sec)            0.34; 4.74

The data required in this table can be found in the output file produced by pat_report:

* Wallclock time (sec) => Table 2: Profile by Function Group and Function (line "Total")

* Memory (MB/process) => Table 7: Wall Clock Time, Memory High Water Mark

* MPI (% of total walltime) => Table 1: Profile by Function Group and Function (First column, line "MPI")

* MPI_SYNC (% of total walltime) => Table 1: Profile by Function Group and Function (First column, line "MPI_SYNC")

* MPI call1 => Table 1: Profile by Function Group and Function (first column, first MPI call listed)

* MPI call2 => Table 1: Profile by Function Group and Function (first column, second MPI call listed)

* %peak (DP) => Table 2: Profile by Function Group and Function (Group Total, line HW FP Ops => %peak)

* PAPI L1 DCM / process => Table 2: Profile by Function Group and Function (Third column)

* PAPI FP OPS / process => Table 2: Profile by Function Group and Function (Third column)

* Write Rate (MB; MB/sec) => Table 6: File Output Stats by Filename (Total line, second and third columns)
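
To find these tables quickly in a long report, the headings can be listed with standard text tools. The sketch below assumes that each table heading starts at the beginning of a line in the form "Table N:", as in the names quoted above; list_tables and show_table are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: locate tables in the text output of pat_report.
# Assumes headings of the form "Table N: ..." at the start of a line.

# Print the line number and title of every table heading.
list_tables() {
    grep -n '^Table [0-9][0-9]*:' "${1:-xf.txt}"
}

# Print table N, from its heading up to the next table heading.
show_table() {
    awk -v n="$1" '
        /^Table [0-9]+:/ { printing = ($2 == (n ":")) }
        printing
    ' "${2:-xf.txt}"
}

# Usage:
#   list_tables xf.txt
#   show_table 6 xf.txt
```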

An example output can be found in /apps/rosa/bt/2012/xf.txt

In addition to the table, you need to attach the text output of pat_report (xf.txt) for your most representative run.

For hybrid MPI/OpenMP applications, you also need to add the -g omp option to the pat_build command.
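
As an illustration, the full pat_build command line for both cases can be assembled as follows. The helper name pat_build_cmd and the executable name myapp are hypothetical; the helper only prints the command and relies on pat_build accepting a comma-separated list of trace groups after -g.

```shell
#!/bin/bash
# Sketch: print the pat_build command for a pure MPI or a hybrid MPI/OpenMP code.
pat_build_cmd() {
    local exe="$1" model="${2:-mpi}"
    local groups="mpi,io"
    if [ "$model" = "hybrid" ]; then
        groups="${groups},omp"   # add the OpenMP trace group
    fi
    echo "pat_build -g ${groups} ${exe}"
}

pat_build_cmd myapp hybrid   # prints: pat_build -g mpi,io,omp myapp
```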

We strongly encourage you to choose meaningful and representative job sizes and configurations. Proposals can only be evaluated on the data you provide: the better the data, the easier it is to pass the review process. Performance data must come from jobs run on CSCS systems, for a problem (or problems) similar to that proposed in the project description (same machine, same computational model, same core count).

If you use centrally installed applications, you can use the instrumented versions that we provide on Rosa. These applications have their usual names, suffixed with "+pat". The following modules are available: namd/2.8, gromacs/4.5.5, cp2k, cpmd 3.13.2 and 3.15.1, espresso/4.3.2.

If your application does not run with CrayPat, or if you need help recompiling it, please do not hesitate to contact help@cscs.ch for assistance.