Lustre

MOGON II has several Lustre fileservers for different purposes:

  • Project (/lustre/project or /lustre/miifs01)
  • ATLAS (private, /lustre/miifs02)
  • HIMster2_th (private, /lustre/miifs04)
  • HIMster2_exp (private, /lustre/miifs05)

MOGON II projects get access to the project fileserver if it is requested during the application process.

NO BACKUP


There is NO BACKUP AT ALL on any of the Lustre fileservers. The fileservers are not an archive system. Please remove all data from the fileservers that is no longer needed.

Basics

Lustre is awesome, but to get good performance out of it you have to make some design decisions about your IO pattern. As a rule of thumb:

  • Random IO: bad
  • Sequential IO: good
  • Many nodes working on one file: bad
  • Thousands of files: bad
  • Many nodes working on many files: good
  • A striped file accessed by many nodes: good


Architecture

A Lustre filesystem consists of multiple Object Storage Targets (OSTs) and at least one Metadata Target (MDT). The actual data within a file is stored on the OSTs at the block level, while the MDT holds the metadata for each file. Servers providing the MDTs are called Metadata Servers (MDS), servers providing the OSTs are called Object Storage Servers (OSS).

In order to leverage Lustre's parallelization capabilities, it is possible to store single files across several OSTs. This enables a file to be read from multiple sources in parallel, reducing access times and increasing the available bandwidth for very large files.
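
To get an overview of the MDTs and OSTs a filesystem is composed of, including their current usage, you can query them with lfs df (the path below is just an example):

# list all MDTs and OSTs of the filesystem together with their usage
lfs df -h /lustre/miifs01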

Quota

The usage of the project fileserver is limited on a per-project basis. You can find out your project's quota and the amount you are currently using with:

lfs quota -hg demoproject /lustre/miifs01/
  Disk quotas for grp demoproject (gid 1234567):
      Filesystem    used   quota   limit   grace   files   quota   limit   grace
  /lustre/miifs01/
                  6.716T     10T     10T       -  166363       0       0       -

The example shows that the project uses 6.716 TB of its assigned quota of 10 TB. If any limit is exceeded, the corresponding entry is marked with an asterisk.

Please note:
If your project is above its quota, file creation will be prohibited!
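
If you want to see how much of the project's usage can be attributed to your own account, lfs quota can also report per-user numbers. User limits may simply show as 0 if they are not enforced; this is only meant as an illustration:

lfs quota -hu $USER /lustre/miifs01/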

Striping

The process of distributing file blocks across multiple storage targets is called striping.

We have implemented a default striping scheme for all files on the project fileservers. All files are striped across four OSTs, beginning at a file size of 4 GB. This not only improves read performance for these files, but also distributes the load better across storage targets and storage servers.

You can find out the striping pattern of your files with:

lfs getstripe /path/to/file
  /path/to/file
    lcm_layout_gen:    2
    lcm_mirror_count:  1
    lcm_entry_count:   2
      lcme_id:             1
      lcme_mirror_id:      0
      lcme_flags:          init
      lcme_extent.e_start: 0
      lcme_extent.e_end:   1073741824
        lmm_stripe_count:  1
        lmm_stripe_size:   1048576
        lmm_pattern:       raid0
        lmm_layout_gen:    0
        lmm_stripe_offset: 37
        lmm_objects:
        - 0: { l_ost_idx: 37, l_fid: [0x100250000:0x3bf4b:0x0] }

      lcme_id:             2
      lcme_mirror_id:      0
      lcme_flags:          0
      lcme_extent.e_start: 1073741824
      lcme_extent.e_end:   EOF
        lmm_stripe_count:  4
        lmm_stripe_size:   1048576
        lmm_pattern:       raid0
        lmm_layout_gen:    0
        lmm_stripe_offset: -1

The output shows that the file has a stripe count of 1 for the first 1 GB and consists of 4 stripes afterwards. As the test file is empty, only one object has been created on OST 37 at this point. Should the file exceed 1 GB, objects on further OSTs will be assigned.

The current striping pattern is a tradeoff that should give good performance in most scenarios. If you find it necessary, feel free to change the striping layout with the lfs setstripe command, for example as sketched below.
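
For example, a directory holding a few very large files that are read by many nodes in parallel could be given a higher stripe count. New files inherit the layout of the directory they are created in; the directory name below is only a placeholder:

# stripe new files in this directory across 8 OSTs with a 1 MiB stripe size
lfs setstripe -c 8 -S 1M /lustre/miifs01/demoproject/large-files

# alternatively, define a composite (PFL) layout similar to the default shown above:
# one stripe up to 1 GiB, four stripes beyond that
lfs setstripe -E 1G -c 1 -E -1 -c 4 /lustre/miifs01/demoproject/large-files

Running lfs getstripe on the directory afterwards shows the layout that new files will inherit.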

Lustre usage guidelines

Avoid …

  • a large number of files in a single directory
  • random IO
    The project fileserver consists of spinning disks. Reading data randomly from files is considerably slower than sequential reading.
  • using ls or wildcard operations
    Although it is not intuitive, invoking ls on directories triggers a series of operations that cause high load on both MDS and OSS. Please avoid these where possible.
  • too many parallel IO operations on the same file
    It is easy to start hundreds of parallel threads operating on the same file. It is not easy to actually process this IO on the server side due to low throughput and high latency for spinning disks. This applies to read operations and is even more important for write operations due to lock contention.
  • more than 1000 files per directory

Use the node-local scratch if you cannot avoid the issues listed above; a sketch of this pattern follows below. Feel free to contact us at any time if you have questions about your workload and would like advice on how to improve your IO.
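
A minimal sketch of the node-local scratch pattern, assuming a SLURM batch job and a job-specific local scratch directory (the path in JOBDIR and the file names are placeholders; check the documentation of your partition for the actual location):

#!/bin/bash
#SBATCH --job-name=local-scratch-demo

# placeholder for the node-local scratch path of this job
JOBDIR=/localscratch/${SLURM_JOB_ID}

# copy the input once from Lustre to the local disk
cp /lustre/miifs01/demoproject/input.dat "${JOBDIR}/"

# run the IO-heavy part against the local copy
./my_analysis "${JOBDIR}/input.dat" > "${JOBDIR}/result.out"

# copy only the final result back to Lustre
cp "${JOBDIR}/result.out" /lustre/miifs01/demoproject/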

Performance considerations …

  • accessing files on Lustre involves significant overhead
  • accessing a lot of files in the same directory causes file locking conflicts
    These conflicts heavily reduce the efficiency and speed of the operations involved.
  • reading small files is not efficient
    Every open and read operation on a file comes with significant overhead and network latency. Use fewer, larger files instead of many small ones wherever possible, for example by bundling small files into an archive as sketched below.
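
One simple way to follow the last point is to bundle many small input files into a single archive on Lustre and unpack it to node-local scratch at the start of a job. The paths below are placeholders:

# pack the small files into one archive (done once, e.g. on a login node)
tar -cf /lustre/miifs01/demoproject/inputs.tar -C /lustre/miifs01/demoproject small_inputs

# inside a job: unpack the single large archive to local scratch and work from there
tar -xf /lustre/miifs01/demoproject/inputs.tar -C /path/to/local/scratch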

Recent changes

During February and March 2022, we extended the storage in the project filesystem by adding more OSTs. The available space was increased by a significant amount, which initially left the OSTs extremely unbalanced.

Prior to this addition, we did not use any striping by default. Usage statistics showed that load was often quite unbalanced towards single OSTs, creating bottlenecks in terms of bandwidth and IOPS. Since a data migration towards the new storage targets was necessary to restore balance, we also restriped the migrated files according to the aforementioned striping pattern.

Please note:

The access times of migrated files may have changed to the date of the migration. This is an expected side effect of the migration process and no cause for concern.