Event Detail



Getting the best out of multi-core - 10-12 December 2012


Modern multi-core x86 processors have 100 times more peak performance than similar single-core processors from ten years ago, but most applications have not been able to exploit this power. This three-day, hands-on course shows how to get the most out of Intel Sandy Bridge and AMD Interlagos processors by covering the following topics:

  • Code vectorization
    • Understanding processor architecture and the potential speedup from vectorization
    • Using compiler feedback to understand where vectorization is and is not achieved
    • Using compiler feedback, compiler options and pragmas to improve vectorization
  • Tuning for the cache hierarchy
    • Understanding the cache and memory hierarchy on modern multi-core processors
    • Analysing performance reports to determine poor cache utilisation
    • Code changes and compiler options to improve cache utilisation
  • Multi-threading
    • An example of a threading model - OpenMP
    • Use of tools to help produce multi-threaded code
    • Understanding of threading pitfalls that affect code correctness
    • Understanding of threading performance issues on multi-socket multi-core nodes

We will use powerful tools to help understand code performance and to introduce vectorization and threading: the Cray tools CrayPAT, Apprentice2 and Reveal on a Cray system, and Intel tools on a Sandy Bridge cluster. In particular, we will use Reveal to analyse compiler optimisations and performance reports, and use its OpenMP directive insertion options to help introduce multi-threading into codes.

The course is rich in hands-on practical sessions that demonstrate these tools. It will also show developers the critical effects of poor resource utilisation, methods to alleviate these problems, and best practices for implementing multi-process, multi-threaded codes.

We will also give a demonstration of how these techniques can be applied to the Intel Xeon Phi (also known as MIC - Many Integrated Core) architecture.

You will need to bring a laptop computer with the capability of ssh access to CSCS machines and the ability to display output from X11 applications.


Registration deadline: December 4, 2012.

Please contact neil.stringfellow(at)cscs.ch for further technical information.


Neil Stringfellow, Sadaf Alam, Gilles Fourestey, Themis Athanassiadou, Ben Cumming from CSCS

Venue: CSCS, Via Trevano 131, Lugano (www.cscs.ch/about_us/visitor_information/index.html)
Day 1: 9:30 - 18:00; Days 2 and 3: 9:00 - 18:00

Participants are expected to bring a laptop for hands-on training

Competency in C++, Fortran or C; Basic understanding of OpenMP; Basic understanding of MPI

Maximum number of participants 

Minimal number of participants

If the minimum number of participants is not reached, we reserve the right to cancel the course.
Participants are kindly requested to make their own arrangements for accommodation




Day 1

Motivation for threading and vectorization on multi-core processors.

Potential speedups through vectorization and multi-threading

Realistic expectations of speedup and some examples from real applications

Use of algorithm analysis to determine upper performance bound of compute kernels

Overview of compute node architectures - Cray XE6/XK7 with AMD Interlagos, and a cluster node with Intel Sandy Bridge

Cache and memory hierarchy and TLB (translation lookaside buffer)

Understanding of cache and memory hierarchy

  • Practical session to show cache and TLB effects
  • Techniques to improve cache efficiency and TLB performance
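
As an illustration of the kind of cache effect these sessions address, the following minimal C sketch sums a row-major matrix using two different loop orders. It is not taken from the course material, and the function names `sum_strided` and `sum_contiguous` are hypothetical.

```c
#include <stddef.h>

/* Column-by-column traversal of a row-major array: consecutive
   accesses are `cols` elements apart, so each cache line that is
   fetched contributes only one element before the next line is needed. */
double sum_strided(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += a[i * cols + j];
    return s;
}

/* Row-by-row traversal: consecutive accesses are adjacent in memory,
   so every element of a fetched cache line is used. */
double sum_contiguous(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += a[i * cols + j];
    return s;
}
```

Both functions visit the same elements, only in a different order. For arrays much larger than the cache, the contiguous version is usually the faster one, although the size of the difference depends on the processor and the compiler.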


Code vectorization

  • Introductory examples of code performance with and without vectorization
  • Introduction to Cray tools Apprentice2 and Reveal
  • Use of Intel compiler reports and compiler recommendations
  • ** Guided case study and examples of improvements from vectorization
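
A small example of the kind of source change that can help a compiler vectorize a loop is sketched below in C. It is an assumption about what such an example might look like, not part of the course material; the function name `saxpy` is chosen for illustration.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   The restrict qualifiers promise the compiler that x and y do not
   overlap. Without that promise the compiler may have to assume that
   writing y[i] could change a later x element, which is a common
   reason for a loop not being vectorized. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Whether a given compiler actually vectorizes this loop depends on the optimisation level and target, which is why checking the compiler's vectorization report is the recommended way to confirm it.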

Days 2 and 3

Introduction to multi-threading

An OpenMP primer (a review of OpenMP; participants are expected to have some understanding prior to the course)

  • Simple practicals to review OpenMP

Use of tools for producing multi-threaded code

  • Practical sessions using Cray Apprentice2 and Reveal to introduce threading directives into code
  • ** Guided case study and examples of multi-threading a serial code

Threading correctness issues

Race conditions

  • Debugging tools and techniques to identify threading correctness issues
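
A typical race condition and one way to remove it can be sketched in C with OpenMP as follows. This is a minimal illustration under stated assumptions, not the course's own example; the function names `sum_racy` and `sum_reduction` are hypothetical.

```c
/* Racy version: when compiled with OpenMP enabled, all threads read and
   write the shared variable `sum` without synchronisation, so updates
   can be lost and the result may differ from run to run. */
double sum_racy(const double *x, long n)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        sum += x[i];   /* data race on sum */
    return sum;
}

/* Corrected version: the reduction clause gives each thread a private
   partial sum and combines the partial sums at the end of the loop. */
double sum_reduction(const double *x, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

If the code is compiled without OpenMP support, the pragmas are ignored and both functions run serially, which hides the race. This is one reason threading bugs can go unnoticed until the code runs with several threads.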

Threading performance issues

Cache coherency and NUMA effects, thread launch overheads, processor and memory affinity

  • Use of tools and techniques to analyse performance issues

Architectural differences between AMD and Intel processors

  • Performance of code examples on different processors

Introduction to multi-threaded MPI

Introduction and demonstration of the Intel Xeon Phi (MIC) architecture

  • Practical sessions

** Participants are invited to bring their own codes to use in these highlighted (**) practical sessions

Practicals will be carried out on CSCS Cray systems Rosa/Tödi and a Sandy Bridge cluster named Pilatus.

For specific questions about the course you should direct queries to Neil Stringfellow [neil.stringfellow(at)cscs.ch].
