
CrayPat

 

Quick start guide

The following instructions are for those who want to get CrayPat up and running quickly.

Read on for complete details.

Description

CrayPat is a performance analysis tool developed by Cray for the XT platform.

The CrayPat tool provides detailed information about application performance. It can be used for basic profiling, MPI tracing, and analysis based on hardware performance counters. CrayPat provides access to a wide variety of performance experiments that measure how an executable program consumes resources while it is running, as well as several user interfaces to the experiment and reporting functions.

CrayPat consists of three major components:

Usage

CrayPat enables you to sample, trace, measure and evaluate your program's behaviour during execution, and may help you find opportunities to significantly improve program performance.

                   module load xt-craypat

                   ftn -c fcode.f90 ; ftn -o fexe fcode.o

                   cc -c ccode.c     ; cc -o cexe ccode.o

                   /usr/bin/time -p aprun -n 128 ./myprogram+pat

                   pat_report  ./myprogram+pat*.xf

                   pat_report  ./myprogram+pat*.ap2

             pat_build -O apa                 ./<apafile>.apa

             /usr/bin/time -p aprun -n 128 ./myprogram+apa

             pat_report ./myprogram+apa*.xf
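The sequence above runs myprogram+pat without showing how that binary is produced. As a minimal sketch, assuming the instrumented binary is created with pat_build -O apa (the automatic program analysis mode) and that the program is called myprogram, the helper below only prints the first pass of commands in order, so it can be read without a Cray system at hand. The function name craypat_steps and its arguments are hypothetical.

```shell
#!/bin/sh
# Dry run: print the first CrayPat pass (sampling experiment) for one program.
# craypat_steps is a hypothetical helper; it executes none of the commands.
craypat_steps() {
    prog=${1:?usage: craypat_steps PROGRAM [RANKS]}
    ranks=${2:-128}
    echo "module load xt-craypat"
    # Assumption: pat_build -O apa writes PROGRAM+pat next to the original binary.
    echo "pat_build -O apa ./${prog}"
    echo "aprun -n ${ranks} ./${prog}+pat"
    echo "pat_report ./${prog}+pat*.xf"
}

craypat_steps myprogram 128
```

Printing the commands rather than running them keeps the sketch independent of the batch system; on a real system each printed line would be issued in turn, followed by the second pass with the generated .apa file.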

xt-craypat trace groups

Trace group   Description
-----------   -----------
biolib        Cray Bioinformatics library routines
blacs         Basic Linear Algebra communication subprograms
blas          Basic Linear Algebra subprograms
caf           Co-Array Fortran (Cray X2 systems only)
fftw          Fast Fourier Transform library (64-bit only)
hdf5          manages extremely large and complex data collections
heap          dynamic heap
io            includes stdio and sysio groups
lapack        Linear Algebra Package
lustre        Lustre File System
math          ANSI math
mpi           MPI
netcdf        network common data form (manages array-oriented scientific data)
omp           OpenMP API (not supported on Catamount)
omp-rtl       OpenMP runtime library (not supported on Catamount)
portals       Lightweight message passing API
pthreads      POSIX threads (not supported on Catamount)
scalapack     Scalable LAPACK
shmem         SHMEM
stdio         all library functions that accept or return the FILE* construct
sysio         I/O system calls
system        system calls
upc           Unified Parallel C (Cray X2 systems only)

 

For more information, see Troubleshooting below.
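Trace groups are normally handed to pat_build as a comma-separated list through its -g option. As a hedged sketch, the hypothetical helper below checks each requested name against the trace group table and prints the resulting command instead of running it. The -g option is the only pat_build option assumed here, and the program name is a placeholder.

```shell
#!/bin/sh
# Trace group names taken from the xt-craypat trace groups table.
KNOWN_GROUPS="biolib blacs blas caf fftw hdf5 heap io lapack lustre math mpi netcdf omp omp-rtl portals pthreads scalapack shmem stdio sysio system upc"

# trace_cmd LIST PROGRAM: print a pat_build command, or fail on an unknown group.
# Hypothetical helper; it does not call pat_build itself.
trace_cmd() {
    list=$1
    prog=$2
    for g in $(printf '%s' "$list" | tr ',' ' '); do
        case " $KNOWN_GROUPS " in
            *" $g "*) ;;
            *) echo "unknown trace group: $g" >&2; return 1 ;;
        esac
    done
    echo "pat_build -g ${list} ./${prog}"
}

trace_cmd mpi,io myprogram
```

Checking the names first catches typos such as "mpii" before an instrumentation step is attempted.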

Cray Apprentice2

Usage

After you instrument a program for a performance analysis experiment, execute the instrumented program, and generate one or more performance analysis data files, use Cray Apprentice2 to explore the experiment data and generate a variety of interactive graphical reports. Cray Apprentice2 is a GUI tool that requires that your workstation support the X Window System. Depending on your system configuration, you may need to use the ssh -X option to enable X Window System support in your shell session.

                   module load apprentice2

                   app2  ./myprogram+pat*.ap2

Apprentice2 can produce a number of very informative plots of performance data. For more information, see Troubleshooting below.
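Because app2 needs an X display, it can save time to check for one before loading the module. The following is a small sketch, assuming that a session started with ssh -X has the DISPLAY variable set; the function name have_display is hypothetical.

```shell
#!/bin/sh
# have_display: succeed only if this shell session has an X display to draw on.
have_display() {
    [ -n "${DISPLAY:-}" ]
}

if have_display; then
    echo "X display available: $DISPLAY"
else
    echo "DISPLAY is not set; reconnect with ssh -X before starting app2"
fi
```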

Cray Apprentice2 on your PC

app2 is a 64-bit executable, so you cannot run it on a desktop PC unless that machine has a 64-bit CPU. A 32-bit Linux desktop copy of Cray Apprentice2 is available in /apps/apprentice2-desktop for use on your local machine rather than on CSCS machines. This desktop version is provided on a convenience basis only.

Hardware performance counters

CrayPat can also provide hardware counter data.

The basic process of using pat_hwpc follows these steps.

                   module load xt-craypat

                   ftn -c myprogram.f

                   ftn -o myprogram myprogram.o

                   export PAT_RT_HWPC=1

Hardware counter groups

Group   Description
-----   -----------
0       Summary with instruction metrics
1       Summary with TLB metrics
2       L1 and L2 metrics
3       Bandwidth information
4       Hypertransport information (not supported on Quad-core AMD Opteron processors)
5       Floating point mix
6       Cycles stalled, resources idle
7       Cycles stalled, resources full
8       Instructions and branches
9       Instruction cache
10      Cache hierarchy
11      Floating point operations mix (2)
12      Floating point operations mix (vectorization)
13      Floating point operations mix (SP)
14      Floating point operations mix (DP)
15      L3 (socket-level)
16      L3 (core-level reads)
17      L3 (core-level misses)
18      L3 (core-level fills caused by L2 evictions)
19      Prefetches

 

                   pat_hwpc -E aprun -n 128 ./myprogram

                   pat_hwpc -g 1 aprun -n 128 ./myprogram
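The group numbers in the hardware counter groups table are the values given to PAT_RT_HWPC or to pat_hwpc -g. As a small sketch, the hypothetical function below rejects anything outside 0 to 19 before exporting the variable, so that a mistyped group number is caught before a job is submitted. It is not part of CrayPat.

```shell
#!/bin/sh
# set_hwpc_group N: export PAT_RT_HWPC=N if N is a group number from the table (0-19).
# Hypothetical helper, not part of CrayPat.
set_hwpc_group() {
    case $1 in
        ''|*[!0-9]*) echo "not a number: $1" >&2; return 1 ;;
    esac
    if [ "$1" -gt 19 ]; then
        echo "no such hardware counter group: $1" >&2
        return 1
    fi
    PAT_RT_HWPC=$1
    export PAT_RT_HWPC
}

# Select the L1 and L2 metrics group.
set_hwpc_group 2 && echo "PAT_RT_HWPC=$PAT_RT_HWPC"
```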

Hardware counter events

#    Name           Description
--   ------------   -----------
01   PAPI_L1_DCM    Level 1 data cache misses
02   PAPI_L1_ICM    Level 1 instruction cache misses
03   PAPI_L2_DCM    Level 2 data cache misses
04   PAPI_L2_ICM    Level 2 instruction cache misses
05   PAPI_L1_TCM    Level 1 cache misses (derived)
06   PAPI_L2_TCM    Level 2 cache misses
07   PAPI_L3_TCM    Level 3 cache misses
08   PAPI_FPU_IDL   Cycles floating point units are idle
09   PAPI_TLB_DM    Data translation lookaside buffer misses
10   PAPI_TLB_IM    Instruction translation lookaside buffer misses
11   PAPI_TLB_TL    Total translation lookaside buffer misses (derived)
12   PAPI_STL_ICY   Cycles with no instruction issue
13   PAPI_HW_INT    Hardware interrupts
14   PAPI_BR_TKN    Conditional branch instructions taken
15   PAPI_BR_MSP    Conditional branch instructions mispredicted
16   PAPI_TOT_INS   Instructions completed
17   PAPI_FP_INS    Floating point instructions
18   PAPI_BR_INS    Branch instructions
19   PAPI_VEC_INS   Vector/SIMD instructions
20   PAPI_RES_STL   Cycles stalled on any resource
21   PAPI_TOT_CYC   Total cycles
22   PAPI_L1_DCH    Level 1 data cache hits (derived)
23   PAPI_L2_DCH    Level 2 data cache hits (derived)
24   PAPI_L1_DCA    Level 1 data cache accesses
25   PAPI_L2_DCA    Level 2 data cache accesses
26   PAPI_L1_ICH    Level 1 instruction cache hits (derived)
27   PAPI_L2_ICH    Level 2 instruction cache hits
28   PAPI_L1_ICA    Level 1 instruction cache accesses
29   PAPI_L2_ICA    Level 2 instruction cache accesses
30   PAPI_L1_ICR    Level 1 instruction cache reads
31   PAPI_L1_TCH    Level 1 total cache hits (derived)
32   PAPI_L2_TCH    Level 2 total cache hits (derived)
33   PAPI_L1_TCA    Level 1 total cache accesses (derived)
34   PAPI_L2_TCA    Level 2 total cache accesses
35   PAPI_L3_TCR    Level 3 total cache reads
36   PAPI_FML_INS   Floating point multiply instructions
37   PAPI_FAD_INS   Floating point add instructions (also includes subtract instructions)
38   PAPI_FDV_INS   Floating point divide instructions (counts both divide and square root instructions)
39   PAPI_FSQ_INS   Floating point square root instructions (counts both divide and square root instructions)
40   PAPI_FP_OPS    Floating point operations
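Raw event counts are usually easier to interpret as ratios. The following is a minimal sketch, assuming the per-event totals have already been extracted into lines of the form "EVENT VALUE". That input format and the numbers used are invented for illustration and are not pat_report output; the function name derive_metrics is hypothetical.

```shell
#!/bin/sh
# derive_metrics: read "EVENT VALUE" lines on stdin and print two ratios.
# The input format is an assumption for this sketch, not a CrayPat format.
derive_metrics() {
    awk '
        { count[$1] = $2 }
        END {
            if (count["PAPI_L1_DCA"] > 0)
                printf "L1 data cache miss ratio: %.4f\n", count["PAPI_L1_DCM"] / count["PAPI_L1_DCA"]
            if (count["PAPI_TOT_CYC"] > 0)
                printf "FP operations per cycle: %.4f\n", count["PAPI_FP_OPS"] / count["PAPI_TOT_CYC"]
        }'
}

# Made-up counts, for illustration only.
derive_metrics <<'EOF'
PAPI_L1_DCM 50
PAPI_L1_DCA 1000
PAPI_FP_OPS 400
PAPI_TOT_CYC 1600
EOF
```

A ratio is printed only when its denominator is present and non-zero, so the helper can be fed the counters of a single group without producing divide-by-zero errors.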

Troubleshooting

 

 


 
 
© 09.09.2010 

CSCS Swiss National Supercomputing Centre, Galleria 2 - Via Cantonale, CH-6928 Manno (Switzerland), Phone: +41 (91) 610 8211, Fax: +41 (91) 610 8282