Course Syllabus

1. CPU Architecture vs. GPU Architecture
   CPU architecture basics
   Multicore and multiprocessor basics
   GPU architecture basics
   How is the GPU connected to the host?
   Why is parallel programming for GPUs different than for multicore?
   What is a GPU thread and how does it execute?
   How can I identify my GPU?
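The last question above can be answered programmatically. A minimal sketch using the CUDA runtime's device-query API (assumes a CUDA toolkit and at least one CUDA-capable device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query and print basic properties of each CUDA-capable device.
int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}
```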
Part II. CUDA C and Fortran
1. Low-level GPU Programming and CUDA
   How does data get to the GPU?
   How does a program run on the GPU?
   What kinds of parallelism are appropriate for a GPU?
   The CUDA programming model
      Host code to control the GPU, allocate memory, and launch kernels
      Kernel code to execute on the GPU
         Scalar routine executed on one thread
         Launched in parallel on a grid of thread blocks
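The model above can be sketched with a small example. SAXPY is chosen here purely as an illustration; `d_x` and `d_y` are assumed to already be device pointers:

```cuda
#include <cuda_runtime.h>

// Kernel: a scalar routine executed by one thread, launched in parallel
// on a grid of thread blocks. Each thread handles one array element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        y[i] = a * x[i] + y[i];
}

// Host code: chooses a launch configuration and launches the kernel.
void host_saxpy(int n, float a, const float *d_x, float *d_y) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    saxpy<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}
```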
2. The Host Program
   Declaring and allocating device memory
   Moving data to and from the device
   Launching kernels
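A minimal end-to-end host program illustrating these three steps (error checking omitted for brevity; the `scale` kernel is illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(int n, float s, float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc((void **)&d, bytes);                  // allocate device memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // move data to the device

    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d);     // launch the kernel

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // move results back
    cudaFree(d);
    printf("h[0] = %f\n", h[0]);
    free(h);
    return 0;
}
```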
SHORT LAB
3. Writing Kernels
   What is allowed in a kernel vs. what is not allowed
   Grids, blocks, threads, warps
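As an illustration of the grid/block/thread hierarchy, a hypothetical 2D matrix-add kernel and its launch configuration:

```cuda
// Each thread maps its (block, thread) coordinates to a unique matrix
// element. Warps (groups of 32 threads) are formed from consecutive
// thread indices within a block.
__global__ void add_matrices(int rows, int cols,
                             const float *a, const float *b, float *c) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        int idx = row * cols + col;   // row-major flattening
        c[idx] = a[idx] + b[idx];
    }
}

// Launch configuration: a 2D grid of 2D blocks covering the matrix.
// dim3 block(16, 16);
// dim3 grid((cols + 15) / 16, (rows + 15) / 16);
// add_matrices<<<grid, block>>>(rows, cols, a, b, c);
```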
4. Building and Running CUDA Programs Compiler options Running your program The CUDA Runtime API CUDA Fortran vs. CUDA C
LAB
5. Performance Tuning, Tips and Tricks
   Measuring performance using cudaprof
   Optimizing your kernels
      Optimizing communication between the host and GPU
      Optimizing device memory accesses and shared memory usage
      Optimizing the kernel code
         Loop unrolling
         Thread block unrolling
         Grid unrolling
         Pipelining
   Debugging
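One common shared-memory optimization in this vein is a block-level reduction, sketched below (assumes `blockDim.x` is a power of two; combining the per-block partial sums would happen on the host or in a second launch):

```cuda
// Block-level sum reduction: each block loads a tile into fast on-chip
// shared memory, then halves the number of active threads each step.
__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float tile[];              // sized at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];     // one partial sum per block
}

// Launch, with the third parameter sizing the shared-memory tile:
// block_sum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
```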
PERFORMANCE LAB
Part III. PGI Accelerator Model
1. High-level GPU Programming Using the PGI Accelerator Model
   What role does a high-level model play?
   Basic concepts and directive syntax
   Accelerator compute and data regions
   Appropriate algorithms for a GPU
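A compute region might look like the following sketch: the compiler turns the annotated loop into a GPU kernel and manages the host/GPU data transfers for `a` and `b`. The clause spellings and the `[low:high]` array-range notation follow the PGI Accelerator model and are illustrative:

```c
/* Sketch of a PGI Accelerator compute region with data clauses. */
void scale_add(int n, float *restrict a, float *restrict b) {
#pragma acc region copyin(a[0:n-1]) copy(b[0:n-1])
    {
        for (int i = 0; i < n; ++i)
            b[i] = 2.0f * a[i] + b[i];
    }
}
```

A plain C compiler simply ignores the directive, so the same source also runs serially on the host.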
2. Building and Running Accelerator Programs
   Command-line options
   Enabling compiler feedback
SHORT LAB
3. Accelerator Directive Details
   Compute regions
      Clauses on the compute region directive
      What can appear in a compute region
      Obstacles to successful acceleration
   Loop directive
      Clauses on the loop directive
      Loop schedules
   Data regions
      Clauses on the data region directive
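A sketch combining the three directive kinds listed above: a data region keeps the arrays resident on the GPU across the compute region, and loop directives suggest a schedule. The clause spellings (`copy`, `local`, `parallel`) follow the PGI Accelerator model and are illustrative, not definitive:

```c
/* Data region + compute region + loop directives, PGI Accelerator style. */
void smooth(int n, float *restrict a, float *restrict b) {
#pragma acc data region copy(a[0:n-1]) local(b[0:n-1])
    {
#pragma acc region
        {
#pragma acc for parallel
            for (int i = 1; i < n - 1; ++i)
                b[i] = 0.5f * (a[i - 1] + a[i + 1]);
#pragma acc for parallel
            for (int i = 1; i < n - 1; ++i)
                a[i] = b[i];
        }
    }
}
```

Without the data region, `a` would be copied to and from the GPU around each kernel; with it, only one transfer each way is needed.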
LAB
4. Interpreting Compiler Feedback
   Using the pgprof source browser
   Hindrances to parallelism
   Data movement feedback
   Reading kernel schedules
LAB
5. Performance Tuning, Tips and Tricks
   Choosing an appropriate algorithm
   Optimizing data movement between the host and GPU
      Data regions, mirrored / reflected data, CUDA data
   Optimizing kernel performance
      Tuning the kernel schedule
      Unroll clauses
   Choosing the accelerator device
   PGI Unified Binary
   Performance profiling information
   GPU initialization time on Linux
PERFORMANCE LAB
Part IV. Wrapup, Questions
1. Accelerators in HPC
   Past, present, and future role of accelerators in HPC
   Past, present, and future of programming models for accelerators
Day 3, Molecular Dynamics Codes on GPGPUs - CSCS: Sadaf Alam, Jeff Poznanovic, and Tim Robinson
Prerequisites: familiarity with running parallel MD on clusters
- Introduction to GPGPU technologies for scientific computing
- Overview of parallel classical molecular dynamics software
- Evolution of GPU acceleration for classical molecular dynamics software
- Walkthrough using GPU-accelerated NAMD / pmemd (Rosa vs. Eiger)
- Demo with case studies
- Tips and tricks for optimal usage of GPU-accelerated simulations
- Advanced topics and future outlook
- LAB session