Strategies for I/O Tuning and Optimization

The parallel file system available on Rosa is Lustre (/scratch/$USER). It offers a set of user-level commands for tuning and optimizing file access operations.  Details about the Lustre file system and its configuration are available at http://wiki.lustre.org/.

Default Configuration

The following command displays the IDs of the 80 Lustre disks (Object Storage Targets, or OSTs), as well as the default stripe count, stripe size and stripe offset:

> lfs osts
OBDS:
0: scratch-OST0000_UUID ACTIVE
1: scratch-OST0001_UUID ACTIVE

79: scratch-OST004f_UUID ACTIVE
/lus/scratch
(Default) stripe_count: 4 stripe_size: 1048576 stripe_offset: 0

The stripe count defines how many OSTs a single file can be distributed over; the default stripe count on Rosa is 4.  The stripe size (default 1MB) is the number of bytes written to one OST before moving on to the next (relevant only when the stripe count is greater than 1).  The stripe offset is the ID of the OST on which the file starts.
To display the striping information of a file or directory, use the following command:

> lfs getstripe --quiet <dir|file>
        23           2016524         0x1ec50c                 0
        37           2016996         0x1ec6e4                 0
        28           2016138         0x1ec38a                 0
        27           2016397         0x1ec48d                 0

The first column of each row is the OST ID. This example shows a file striped over 4 target OSTs (23, 37, 28 and 27).
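The interplay of stripe size and stripe count can be illustrated with a small Python sketch. It assumes a plain round-robin layout in which consecutive stripes go to the OSTs in the order listed by getstripe; the function name locate_offset is hypothetical and not part of any Lustre tool. The OST IDs and the 1MB stripe size are taken from the examples above.

```python
def locate_offset(offset, stripe_size, ost_ids):
    """Return (ost_id, object_offset) for a byte offset in a striped file.

    Assumption: simple round-robin striping, i.e. stripe k of the file is
    stored on ost_ids[k % len(ost_ids)].
    """
    if offset < 0 or stripe_size <= 0 or not ost_ids:
        raise ValueError("need offset >= 0, stripe_size > 0 and at least one OST")
    stripe_number = offset // stripe_size          # which stripe of the file
    slot = stripe_number % len(ost_ids)            # which OST in the layout
    full_rounds = stripe_number // len(ost_ids)    # complete passes over all OSTs
    # Every complete pass adds one stripe-sized chunk to each OST object.
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return ost_ids[slot], object_offset


STRIPE_SIZE = 1048576          # default stripe size (1MB) from the listing above
OSTS = [23, 37, 28, 27]        # OST IDs from the getstripe example

for megabyte in (0, 1, 3, 4, 5):
    ost, obj_off = locate_offset(megabyte * STRIPE_SIZE, STRIPE_SIZE, OSTS)
    print("file offset %d MB -> OST %d, object offset %d" % (megabyte, ost, obj_off))
```

With a stripe count of 4, the fifth megabyte of the file wraps around to the first OST again, which is why a larger stripe count spreads large sequential transfers over more disks.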

Modifications

Lustre provides the user command lfs setstripe for setting one or more striping parameters of individual files or directories.  For example, the following command sets the stripe size to 2MB:

> lfs setstripe <dir|file> 2m -1 4

where dir is an existing directory, and file is a file that does not yet exist. The first parameter (2m) is the stripe size, the second parameter is the stripe offset (-1 requests a load-balanced, round-robin assignment), and the third parameter (4) is the stripe count, here set to the default value.  It is highly recommended that the default offset value is left unchanged.

When setstripe is invoked on an existing directory, any files created in that directory afterwards inherit the newly defined striping parameters. Existing files in that directory are not affected. When setstripe is invoked for a new file, the file is created with the given striping parameters. Setstripe cannot be invoked for an existing file.
For example, to limit the number of OSTs to 1, issue the following command:

> lfs setstripe <dir|file> 1m -1 1

and to use all available OSTs:

> lfs setstripe <dir|file> 1m -1 -1

Details of the supported Lustre commands are available in the lfs man page.  Note that commands intended for system administrators may not work for regular users.
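When setstripe calls are generated from a job script, it can help to assemble and check the arguments in one place. The following Python sketch builds the positional argument list in the order used in this section (target, stripe size, stripe offset, stripe count). The helper name setstripe_args and the directory name results are hypothetical, and the sketch does not execute anything. Other Lustre versions may expect option flags instead of positional parameters, so the lfs man page on the target system remains the reference.

```python
def setstripe_args(target, stripe_size="1m", stripe_offset=-1, stripe_count=4):
    """Assemble 'lfs setstripe <target> <size> <offset> <count>' as a list.

    Defaults mirror the values used in this section: 1MB stripes, offset -1
    (load-balanced assignment) and a stripe count of 4. Nothing is executed.
    """
    if not target:
        raise ValueError("target must name a directory or a not-yet-existing file")
    if stripe_offset < -1:
        raise ValueError("stripe offset must be -1 or a valid OST ID")
    if stripe_count < -1:
        raise ValueError("stripe count must be -1 (all OSTs) or non-negative")
    return ["lfs", "setstripe", target,
            str(stripe_size), str(stripe_offset), str(stripe_count)]


# 'results' is a hypothetical directory name used only for illustration.
print(" ".join(setstripe_args("results", stripe_count=1)))    # one OST
print(" ".join(setstripe_args("results", stripe_count=-1)))   # all OSTs
```

On a system where lfs is installed, the returned list could be handed to subprocess.run(args, check=True) from the Python standard library.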

 

Note of Caution When Manipulating Stripe Sizes

Once the striping of a file has been set (at file or directory level), it cannot be modified.  Directory-level striping can be changed at any time, but the changes only affect newly created files. An existing file has to be removed and created again if its stripe count or stripe size must change.  It is therefore important to understand the implications of a striping choice for the expected read and write patterns and file sizes.
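Because an existing file keeps its striping, changing it amounts to creating a new file with the desired parameters and copying the data over. The Python sketch below only lists the commands such a procedure might consist of; it does not run them. The function name restripe_plan and the temporary suffix are hypothetical, the positional setstripe form is the one used in this document, and enough free space for a second copy of the file is assumed.

```python
def restripe_plan(path, stripe_size, stripe_count, stripe_offset=-1,
                  suffix=".restripe"):
    """Return the commands for re-creating 'path' with new striping.

    Assumed steps: create an empty temporary file with the new striping,
    copy the data into it, then move it over the original file.
    """
    if stripe_count < -1:
        raise ValueError("stripe count must be -1 (all OSTs) or non-negative")
    tmp = path + suffix
    return [
        # 1. New, empty file that carries the desired striping parameters.
        ["lfs", "setstripe", tmp, str(stripe_size), str(stripe_offset), str(stripe_count)],
        # 2. Copy the data; the existing target file keeps its striping.
        ["cp", path, tmp],
        # 3. Replace the original file with the re-striped copy.
        ["mv", tmp, path],
    ]


# 'output.dat' is a hypothetical file name used only for illustration.
for command in restripe_plan("output.dat", "1m", 10):
    print(" ".join(command))
```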

 

Optimization Tips for a Subset of File Access Patterns

The best strategy for optimizing I/O performance on Rosa depends on how an application implements its file I/O operations, on the behavior and sizes of its data transfers, and on the file sizes.  The table below lists suggestions for file I/O implementations commonly used in scientific applications. Note that these guidelines assume burst reads/writes of files of the specified sizes.

File size  I/O pattern                                             Recommended setting

< 1GB      Single file per MPI task/core                           lfs setstripe <dir|file> 1m -1 1
< 1GB      Single file (read/written by a single MPI task)         lfs setstripe <dir|file> 1m -1 1
< 1GB      Single shared file accessed by multiple MPI tasks/core  Default

< 100GB    Single file per MPI task/core                           Default
< 100GB    Single file (read/written by a single MPI task)         Default
< 100GB    Single shared file accessed by multiple MPI tasks/core  lfs setstripe <dir|file> 1m -1 10

> 100GB    Single file per MPI task/core                           Potential scaling bottleneck
> 100GB    Single file (read/written by a single MPI task)         Potential scaling bottleneck
> 100GB    Single shared file accessed by multiple MPI tasks/core  lfs setstripe <dir|file> 1m -1 40
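The suggestions in the table can also be written as a small lookup, for instance to pick a setting inside a job-preparation script. The Python sketch below is only an illustration: the names recommended_setting, size_class and the pattern keys are hypothetical, 1GB is assumed to mean 10**9 bytes, and a file of exactly 100GB (which the table does not cover) is assumed to fall into the largest class.

```python
GB = 10**9  # assumption: the table's "GB" is taken as 10**9 bytes

# Table rows keyed by (size class, I/O pattern).
# Patterns: one file per MPI task, one file handled by a single task,
# and one shared file accessed by multiple tasks.
TABLE = {
    ("small", "file_per_task"): "lfs setstripe <dir|file> 1m -1 1",
    ("small", "single_task"): "lfs setstripe <dir|file> 1m -1 1",
    ("small", "shared_file"): "Default",
    ("medium", "file_per_task"): "Default",
    ("medium", "single_task"): "Default",
    ("medium", "shared_file"): "lfs setstripe <dir|file> 1m -1 10",
    ("large", "file_per_task"): "Potential scaling bottleneck",
    ("large", "single_task"): "Potential scaling bottleneck",
    ("large", "shared_file"): "lfs setstripe <dir|file> 1m -1 40",
}


def size_class(file_size_bytes):
    """Map a file size to the table's three size classes."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    if file_size_bytes < 1 * GB:
        return "small"
    if file_size_bytes < 100 * GB:
        return "medium"
    return "large"  # assumption: exactly 100GB is treated as the largest class


def recommended_setting(file_size_bytes, pattern):
    """Return the table entry for a file size and an I/O pattern key."""
    key = (size_class(file_size_bytes), pattern)
    if key not in TABLE:
        raise ValueError("unknown I/O pattern: %r" % (pattern,))
    return TABLE[key]


print(recommended_setting(50 * GB, "shared_file"))
```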

 

File Format Considerations

Several serial and parallel I/O libraries and file formats are supported on Rosa, including POSIX, MPI-IO, HDF and NetCDF.  Parallel libraries such as MPI-IO and HDF are optimized on Rosa for collective I/O operations on single files.  Users are advised to consider not only the highest-performance options, such as concurrent POSIX reads and writes to separate files, but also the portability and post-processing benefits of high-level parallel I/O libraries such as NetCDF and HDF, for instance file aggregation and conversion for visualization.

Further Information

Contact CSCS (help@cscs.ch) for questions and details about current file I/O optimization techniques and future scalable I/O design considerations for your application.