Getting the best out of multi-core - 10-12 December 2012
Modern multi-core x86 processors have 100 times more peak performance than similar single-core processors from ten years ago, but most applications have not been able to exploit this power. This three-day hands-on course shows how to get the most out of Intel Sandy Bridge and AMD Interlagos processors by covering the following topics:
- Code vectorization
  - Understanding processor architecture and the potential speedup from vectorization
  - Using compiler feedback to understand where vectorization is and is not achieved
  - Using compiler feedback, compiler options and pragmas to improve vectorization
- Tuning for the cache hierarchy
  - Understanding the cache and memory hierarchy on modern multi-core processors
  - Analysing performance reports to determine poor cache utilisation
  - Code changes and compiler options to improve cache utilisation
- An example of a threading model - OpenMP
  - Use of tools to help produce multi-threaded code
  - Understanding of threading pitfalls that affect code correctness
  - Understanding of threading performance issues on multi-socket multi-core nodes
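As a small illustration of the vectorization topic, the following C sketch shows a loop of the kind that compilers can usually vectorize. The function name `daxpy` and the use of `restrict` are choices made for this example and are not taken from the course material. Whether the loop is actually vectorized depends on the compiler and its options, which is what compiler feedback reports would confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative kernel (hypothetical, not from the course material):
 * computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The restrict qualifiers promise the compiler that x and y do not
 * overlap, which removes one common obstacle to vectorization. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        /* Each iteration is independent and accesses memory with unit
         * stride, so several iterations can map onto one SIMD instruction. */
        y[i] = a * x[i] + y[i];
    }
}
```

Without `restrict`, the compiler would have to assume that writing `y[i]` might change a later `x[i]`, and it may then fall back to scalar code or add a runtime overlap check.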
We will use powerful tools to help understand code performance and to introduce vectorization and threading: the Cray tools CrayPAT, Apprentice2 and Reveal on a Cray system, and Intel tools on a Sandy Bridge cluster. In particular, we will use Reveal to analyse compiler optimisations and performance reports, and use its OpenMP directive insertion options to help introduce multi-threading into codes.
The course is rich in hands-on practical sessions that demonstrate these tools. It also lets the developer see the critical effects of poor resource utilisation, methods to alleviate these problems, and best practices for implementing multi-process, multi-threaded codes.
We will also give a demonstration of how these techniques can be applied to the Intel Xeon Phi (also known as MIC - Many Integrated Core) architecture.
You will need to bring a laptop computer with the capability of ssh access to CSCS machines and the ability to display output from X11 applications.
Registration deadline: December 4, 2012.
Please contact neil.stringfellow(at)cscs.ch for further technical information.
Neil Stringfellow, Sadaf Alam, Gilles Fourestey, Themis Athanassiadou, Ben Cumming from CSCS
CSCS, Via Trevano 131, Lugano www.cscs.ch/about_us/visitor_information/index.html
Day 1: 9:30 - 18:00; Days 2 and 3: 9:00 - 18:00
Participants are expected to bring a laptop for hands-on training
Competency in C++, Fortran or C; Basic understanding of OpenMP; Basic understanding of MPI
Maximum number of participants
Minimum number of participants
If the minimum number of participants is not reached, we reserve the right to cancel the course.
Participants are kindly requested to make their own arrangements for accommodation
Day 1
Motivation for threading and vectorization on multi-core processors
Potential speedups through vectorization and multi-threading
Realistic expectations of speedup and some examples from real applications
Use of algorithm analysis to determine upper performance bound of compute kernels
Overview of compute node architectures - Cray XE6/XK7 with AMD Interlagos, and cluster node with Intel Sandy Bridge
Cache and memory hierarchy and TLB (translation lookaside buffer)
Understanding of cache and memory hierarchy
- Practical session to show cache and TLB effects
- Techniques to improve cache efficiency and TLB performance
- Introductory examples of code performance with and without vectorization
- Introduction to Cray tools Apprentice2 and Reveal
- Use of Intel compiler reports and compiler recommendations
- ** Guided case study and examples of improvements from vectorization
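To make the cache topic concrete, here is a minimal C sketch of two loops that compute the same sum over a matrix but traverse memory differently. The function names and the row-major layout are assumptions made for this example and are not taken from the course practicals.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an n x n matrix stored in row-major order, walking along rows.
 * Consecutive iterations touch adjacent memory, so every cache line
 * that is loaded gets fully used. */
double sum_row_order(const double *m, size_t n)
{
    assert(n == 0 || m != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            total += m[i * n + j];
    return total;
}

/* Same sum, walking down columns. Consecutive iterations are n doubles
 * apart in memory, so for large n each access tends to hit a different
 * cache line (and, for very large n, a different page, which also
 * puts pressure on the TLB). */
double sum_column_order(const double *m, size_t n)
{
    assert(n == 0 || m != NULL);
    double total = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            total += m[i * n + j];
    return total;
}
```

Both functions visit every element exactly once. Any timing difference between them on a large matrix comes from the memory access pattern rather than from the amount of arithmetic.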
Days 2 and 3
Introduction to multi-threading
An OpenMP primer (a review of OpenMP; participants are expected to have some understanding prior to the course)
- Simple practicals to review OpenMP
Use of tools for producing multi-threaded code
- Practical sessions using Cray Apprentice2 and Reveal to introduce threading directives into code
- ** Guided case study and examples of multi-threading a serial code
Threading correctness issues
- Debugging tools and techniques to identify threading correctness issues
Threading performance issues
Cache coherency and NUMA effects, thread launch overheads, processor and memory affinity
- Use of tools and techniques to analyse performance issues
Architectural differences between AMD and Intel processors
- Performance of code examples on different processors
Introduction to multi-threaded MPI
Introduction and demonstration of the Intel Xeon Phi (MIC) architecture
- Practical sessions
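As an illustration of the kind of threading correctness issue listed in the agenda, the following C sketch sums an array with an OpenMP `reduction` clause. The function name `parallel_sum` is hypothetical and the example is not taken from the course material.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first n elements of a using an OpenMP parallel loop. */
double parallel_sum(const double *a, long n)
{
    assert(n <= 0 || a != NULL);
    double total = 0.0;

    /* Without the reduction clause, all threads would update the shared
     * variable total concurrently. That is a data race, and the result
     * could differ from run to run. With reduction(+:total), each thread
     * accumulates into a private copy and the copies are combined when
     * the loop ends. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += a[i];
    }
    return total;
}
```

When the code is compiled without OpenMP support, compilers typically ignore the pragma and the loop runs serially. Note that a parallel reduction may add the values in a different order than the serial loop, so floating-point results can differ in the last digits.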
** Participants are invited to bring their own codes to use in these highlighted (**) practical sessions
Practicals will be carried out on CSCS Cray systems Rosa/Tödi and a Sandy Bridge cluster named Pilatus.
Please direct specific questions about the course to Neil Stringfellow [neil.stringfellow(at)cscs.ch].