Quick start guide

The following instructions are for those who want to get CrayPat up and running quickly.

  • module load xt-craypat
  • ftn -c fcode.f90 ; ftn -o fexe fcode.o
  • pat_build -O apa ./myprogram
  • /usr/bin/time -p aprun -n 128 ./myprogram+pat
  • pat_report  ./myprogram+pat*.xf
    • module load apprentice2
    • app2  ./myprogram+pat*.ap2
  • pat_build -O  <apafile>.apa
  • /usr/bin/time -p aprun -n 128 ./myprogram+apa
  • pat_report  ./myprogram+apa*.xf

Read on for complete details.

Description

CrayPat is a performance analysis tool developed by Cray for the XT platform.

The CrayPat tool provides detailed information about application performance. It can be used for basic profiling, MPI tracing and hardware performance counter based analysis. CrayPat provides access to a wide variety of performance experiments that measure how an executable program consumes resources while it is running, as well as several different user interfaces that provide access to the experiment and reporting functions.

CrayPat consists of three major components:

  • pat_build - used to instrument the program to be analyzed.
  • pat_report - a standalone text report generator that can be used to further explore the data generated by instrumented program execution.
  • Apprentice2 - a graphical analysis tool that can be used, in addition to pat_report, to further explore and visualize the data generated by instrumented program execution.

Usage

    CrayPat enables you to sample, trace, measure and evaluate your program's behaviour during execution, and may help you find opportunities to significantly improve program performance.

    • First, load the xt-craypat module to get the correct environment variables set.

                       module load xt-craypat

    • Compile and link your application as usual.

                       ftn -c fcode.f90 ; ftn -o fexe fcode.o

                       cc -c ccode.c     ; cc -o cexe ccode.o

    • Instrument your code with pat_build. Use the -u flag to trace all user-defined functions in your program. Use the -g [group] flag to instrument all functions belonging to a specified trace group. For instance, to instrument MPI, I/O, heap and user function calls, type:

                   pat_build -u -g mpi,io,heap ./myprogram

      or:

                   pat_build -O apa ./myprogram

    • The name of the instrumented version of the executable will end with +pat, so in the example, the result will be ./myprogram+pat.
    • Run the instrumented executable (by modifying your PBS batch job script).

                       /usr/bin/time -p aprun -n 128 ./myprogram+pat

    • Upon successful execution, an experiment data file ending in .xf is written, so in the example the result will be ./myprogram+pat+*.xf. By default, when you use pat_report to generate a report from one or more .xf files, pat_report also generates a corresponding .ap2 file with the same base name as the original executable. Data in .ap2 format can be viewed in text form using pat_report, or viewed and manipulated using the GUI tools in Cray Apprentice2. The most significant difference between the two formats is that .xf files require the original instrumented executable to be available, to map addresses to function names and source line numbers, while .ap2 files incorporate this mapping and are self-contained. The .ap2 format is therefore recommended if you wish to preserve the data for future reference. Use pat_report to generate a human-readable performance report:

                       pat_report  ./myprogram+pat*.xf

                       pat_report  ./myprogram+pat*.ap2

    • Apprentice2 can also be used to analyze the results. Read on for more details about apprentice2.
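    • If the program was instrumented with the -O apa flag, a .apa file is created when the first report is generated. Pass that .apa file to pat_build to re-instrument your application for further analysis; the new executable ends with +apa. Then run it and generate a report as before.

                 pat_build -O ./<apafile>.apa

                 /usr/bin/time -p aprun -n 128 ./myprogram+apa

                 pat_report ./myprogram+apa*.xf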
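
The first pass of the workflow above can be sketched as a small shell helper. The function names instrumented_name, run and sampling_pass and the DRY_RUN variable are hypothetical and not part of CrayPat; only the pat_build, aprun and pat_report invocations come from the steps above. The sketch assumes that the xt-craypat module is already loaded and that pat_build writes the +pat executable into the current directory.

```shell
#!/bin/bash
# Sketch of the first (sampling) pass described above.
# All function names and DRY_RUN are illustrative, not CrayPat features.

# Name of the instrumented executable for a given program:
# the program's file name with "+pat" appended (assumed to be
# written into the current directory).
instrumented_name() {
    printf './%s+pat\n' "$(basename "$1")"
}

# Execute a command, or only print it when DRY_RUN=1
# (handy for checking the sequence on a machine without CrayPat).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Instrument, run and report; stops at the first failing step.
# $1 = program, $2 = number of MPI tasks (default 128).
sampling_pass() {
    local prog="$1" ntasks="${2:-128}"
    local inst
    inst="$(instrumented_name "$prog")"
    run pat_build -O apa "$prog" &&
    run /usr/bin/time -p aprun -n "$ntasks" "$inst" &&
    # The data file name carries an extra suffix before .xf, hence the wildcard.
    run pat_report "$inst"*.xf
}
```

Calling `DRY_RUN=1 sampling_pass ./myprogram 128` from a shell only lists the three commands; inside a PBS batch job, `sampling_pass ./myprogram 128` would execute them in order.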

    xt-craypat trace groups

    Trace group   Description
    -----------   -----------
    biolib        Cray Bioinformatics library routines
    blacs         Basic Linear Algebra communication subprograms
    blas          Basic Linear Algebra subprograms
    caf           Co-Array Fortran (Cray X2 systems only)
    fftw          Fast Fourier Transform library (64-bit only)
    hdf5          manages extremely large and complex data collections
    heap          dynamic heap
    io            includes stdio and sysio groups
    lapack        Linear Algebra Package
    lustre        Lustre File System
    math          ANSI math
    mpi           MPI
    netcdf        network common data form (manages array-oriented scientific data)
    omp           OpenMP API (not supported on Catamount)
    omp-rtl       OpenMP runtime library (not supported on Catamount)
    portals       Lightweight message passing API
    pthreads      POSIX threads (not supported on Catamount)
    scalapack     Scalable LAPACK
    shmem         SHMEM
    stdio         all library functions that accept or return the FILE* construct
    sysio         I/O system calls
    system        system calls
    upc           Unified Parallel C (Cray X2 systems only)

    For more information, see Troubleshooting below.

    Cray Apprentice2

    • Cray Apprentice2 is a post-processing performance data visualization tool. It will allow you to pinpoint problems in load balance, MPI overhead, I/O strategy and so on.

    Usage

    After you instrument a program for a performance analysis experiment, execute the instrumented program, and generate one or more performance analysis data files, use Cray Apprentice2 to explore the experiment data and generate a variety of interactive graphical reports. Cray Apprentice2 is a GUI tool that requires that your workstation support the X Window System. Depending on your system configuration, you may need to use the ssh -X option to enable X Window System support in your shell session.

    • Load the apprentice2 module to get the correct environment variables set.

                       module load apprentice2

    • Launch the Cray Apprentice2 graphical interface in order to visualize data.

                       app2  ./myprogram+pat*.ap2

    Apprentice2 can produce a number of very informative plots of performance data. For more information, see Troubleshooting below.
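
Because app2 needs an X display, a quick check before launching can save a confusing error. The wrapper below is a minimal sketch; the function name open_in_app2 is hypothetical, and it only assumes that a missing or empty DISPLAY variable means that X forwarding (for example via ssh -X) is not active in the current session.

```shell
#!/bin/bash
# Sketch: refuse to start the Apprentice2 GUI when no X display is set.
# open_in_app2 is an illustrative name, not part of Cray Apprentice2.
open_in_app2() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set: log in again with 'ssh -X' before running app2" >&2
        return 1
    fi
    # Pass all arguments (typically one or more .ap2 files) through to app2.
    app2 "$@"
}
```

With the apprentice2 module loaded, `open_in_app2 ./myprogram+pat*.ap2` then behaves like the plain app2 call shown above.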

    Cray Apprentice2 on your PC

    app2 is a 64-bit executable, so you cannot run it on your desktop PC unless it has a 64-bit CPU. A 32-bit Linux desktop copy of Cray Apprentice2 is available in /apps/apprentice2-desktop for use on your local machine rather than on CSCS machines. This desktop version is provided on a convenience basis only.

    Hardware performance counters

    CrayPat can also provide hardware counter data.

    • pat_hwpc is an alternative to pat_build and pat_report, used specifically to perform simplified hardware counter analysis experiments and generate reports from the resulting data.

    The basic process of using pat_hwpc follows these steps.

    • Load the xt-craypat module and compile your code. Programs compiled without the CrayPat module loaded cannot be used with pat_hwpc.

                       module load xt-craypat

                       ftn -c myprogram.f

                       ftn -o myprogram myprogram.o

    • Set the runtime environment variable PAT_RT_HWPC to monitor hardware counter group 1.

                       export PAT_RT_HWPC=1

    • The available counter groups are (see man hwpc):

    Hardware counter groups

    Group   Description
    -----   -----------
    0       Summary with instruction metrics
    1       Summary with TLB metrics
    2       L1 and L2 metrics
    3       Bandwidth information
    4       Hypertransport information (not supported on Quad-core AMD Opteron processors!)
    5       Floating point mix
    6       Cycles stalled, resources idle
    7       Cycles stalled, resources full
    8       Instructions and branches
    9       Instruction cache
    10      Cache hierarchy
    11      Floating point operations mix (2)
    12      Floating point operations mix (vectorization)
    13      Floating point operations mix (SP)
    14      Floating point operations mix (DP)
    15      L3 (socket-level)
    16      L3 (core-level reads)
    17      L3 (core-level misses)
    18      L3 (core-level fills caused by L2 evictions)
    19      Prefetches

    • Instrument and run the program. pat_hwpc cannot be used to run an executable that has already been instrumented using pat_build.

                       pat_hwpc -E aprun -n 128 ./myprogram

      • In this example, the -E option is required in order to force the pat_hwpc command to use the contents of the PAT_RT_HWPC environment variable. It is also possible to specify a hardware performance counter group directly on the command line:

                         pat_hwpc -g 1 aprun -n 128 ./myprogram

      • By default, pat_hwpc monitors the following hardware counter and derived events (group 1): PAPI_L1_DCM, PAPI_L1_DCA, PAPI_TLB_DM, PAPI_FP_OPS and CYCLES_USER.
      • Of 103 possible PAPI events, 40 are available, of which 8 are derived:

      Hardware counter events

      #    Name           Description
      --   ------------   -----------
      01   PAPI_L1_DCM    Level 1 data cache misses
      02   PAPI_L1_ICM    Level 1 instruction cache misses
      03   PAPI_L2_DCM    Level 2 data cache misses
      04   PAPI_L2_ICM    Level 2 instruction cache misses
      05   PAPI_L1_TCM    Level 1 cache misses (derived)
      06   PAPI_L2_TCM    Level 2 cache misses
      07   PAPI_L3_TCM    Level 3 cache misses
      08   PAPI_FPU_IDL   Cycles floating point units are idle
      09   PAPI_TLB_DM    Data translation lookaside buffer misses
      10   PAPI_TLB_IM    Instruction translation lookaside buffer misses
      11   PAPI_TLB_TL    Total translation lookaside buffer misses (derived)
      12   PAPI_STL_ICY   Cycles with no instruction issue
      13   PAPI_HW_INT    Hardware interrupts
      14   PAPI_BR_TKN    Conditional branch instructions taken
      15   PAPI_BR_MSP    Conditional branch instructions mispredicted
      16   PAPI_TOT_INS   Instructions completed
      17   PAPI_FP_INS    Floating point instructions
      18   PAPI_BR_INS    Branch instructions
      19   PAPI_VEC_INS   Vector/SIMD instructions
      20   PAPI_RES_STL   Cycles stalled on any resource
      21   PAPI_TOT_CYC   Total cycles
      22   PAPI_L1_DCH    Level 1 data cache hits (derived)
      23   PAPI_L2_DCH    Level 2 data cache hits (derived)
      24   PAPI_L1_DCA    Level 1 data cache accesses
      25   PAPI_L2_DCA    Level 2 data cache accesses
      26   PAPI_L1_ICH    Level 1 instruction cache hits (derived)
      27   PAPI_L2_ICH    Level 2 instruction cache hits
      28   PAPI_L1_ICA    Level 1 instruction cache accesses
      29   PAPI_L2_ICA    Level 2 instruction cache accesses
      30   PAPI_L1_ICR    Level 1 instruction cache reads
      31   PAPI_L1_TCH    Level 1 total cache hits (derived)
      32   PAPI_L2_TCH    Level 2 total cache hits (derived)
      33   PAPI_L1_TCA    Level 1 total cache accesses (derived)
      34   PAPI_L2_TCA    Level 2 total cache accesses
      35   PAPI_L3_TCR    Level 3 total cache reads
      36   PAPI_FML_INS   Floating point multiply instructions
      37   PAPI_FAD_INS   Floating point add instructions (also includes subtract instructions)
      38   PAPI_FDV_INS   Floating point divide instructions (counts both divide and square root instructions)
      39   PAPI_FSQ_INS   Floating point square root instructions (counts both divide and square root instructions)
      40   PAPI_FP_OPS    Floating point operations

        • Upon execution, pat_hwpc generates a report showing hardware counter data, including a number of derived metrics and calculated values. The resulting data files can be viewed and examined with pat_report or apprentice2.
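
Selecting a counter group through PAT_RT_HWPC, as in the steps above, can be wrapped in a small check so that a mistyped group number is caught before the job runs. This is only a sketch: the function name set_hwpc_group is hypothetical, and the accepted range 0-19 is taken from the hardware counter groups table on this page, not queried from the installed CrayPat version.

```shell
#!/bin/bash
# Sketch: validate a hardware counter group number and export it
# through PAT_RT_HWPC. set_hwpc_group is an illustrative name; the
# range 0-19 mirrors the counter group table on this page.
set_hwpc_group() {
    case "$1" in
        ''|*[!0-9]*)
            echo "not a group number: '$1'" >&2
            return 1
            ;;
    esac
    if [ "$1" -gt 19 ]; then
        echo "group $1 is outside the documented range 0-19" >&2
        return 1
    fi
    export PAT_RT_HWPC="$1"
}
```

A possible use is `set_hwpc_group 2 && pat_hwpc -E aprun -n 128 ./myprogram`, which collects the L1 and L2 metrics of group 2 and skips the run if the group number is rejected.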

        Troubleshooting

        • This page describes the basics of xt-craypat and apprentice2. For further information, please check the Cray documentation.
        • See also the following man pages: intro_craypat, craypat, pat_build, pat_report, pat_hwpc, hwpc and app2, and run the command pat_help.
        • If all else fails, please contact the helpdesk.

         

         
