====== FS: Lustre02 ======

===== Characteristics =====
 
<data_cdcl>
name:Lustre Phase2
</data_cdcl>


===== Description =====

The DKRZ system was procured in two phases of roughly the same size.
The second phase consists of [[http://www.seagate.com/files/www-content/product-content/xyratex-branded/clustered-file-systems/_shared/datasheets/seagate-clusterStor-l300-datasheet-11-06-15.pdf|ClusterStor L300]] equipped with Seagate Enterprise Capacity V5 disks (8 TB, ST8000NM0095).

Both systems are configured in Scalable System Units (SSUs): each SSU is a pair of servers in active/active fail-over mode that also manages an extension unit (a JBOD containing additional devices), resulting in two OSTs per OSS.

Initially, we planned to create one big shared file system, but we now use two file systems (one for the storage of phase 1 and one for that of phase 2). Both file systems are mounted on all compute nodes.

===== Measurement protocols =====

==== Peak performance ====

The peak performance is derived from the maximum performance possible on an L300, which is 5.4 GiB/s, multiplied by the number of servers in the SSU/extension pairs we have installed (34 in phase 2).
The L300 actually achieves better performance and operates at InfiniBand speed. Still, for the theoretical maximum, we use the limit of 5.4 GiB/s.
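
The calculation described above can be sketched as follows. This is only an illustration: the function name is hypothetical, and because the text leaves open whether "34" counts individual servers or SSU/extension pairs (each SSU being a pair of servers), both readings are shown as labeled assumptions rather than as the official figure.

```python
# Per-server limit used for the theoretical maximum (from the text above).
PER_SERVER_GIB_S = 5.4


def peak_gib_s(n_servers: int, per_server: float = PER_SERVER_GIB_S) -> float:
    """Theoretical peak = per-server limit multiplied by the number of servers."""
    return per_server * n_servers


# Reading A (assumption): 34 is the number of servers.
peak_a = peak_gib_s(34)

# Reading B (assumption): 34 is the number of SSU/extension pairs,
# each served by a pair of servers.
peak_b = peak_gib_s(34 * 2)

print(f"Reading A: {peak_a:.1f} GiB/s")
print(f"Reading B: {peak_b:.1f} GiB/s")
```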

Lustre's obdfilter-survey demonstrates that the phase 2 system alone is able to deliver 480 GB/s and 580 GB/s for write and read, respectively.

==== Sustained metadata performance ====

Performance has been measured using [[tools:benchmarks:parabench|Parabench]], see the description in [[lustre01]].
The benchmark runs for a considerable time on 16 nodes with 16 processes per node but does not explicitly synchronize between the individual Parabench runs.
Theoretically, a single Parabench run could handle this setup, but the simpler approach has been chosen.

In phase 2, we received 7 additional metadata servers. Each delivers between 30 and 35 kOps/s when stressed individually, resulting in about 210 kOps/s in aggregate (7 × 30 kOps/s).
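
As a small sketch of the arithmetic behind the aggregate figure, assuming the per-server rates simply add up (the variable names are illustrative only):

```python
# Values from the text: 7 metadata servers, 30 to 35 kOps/s each.
N_MDS = 7
PER_MDS_KOPS = (30, 35)

# Aggregate range under the assumption that per-server rates add up.
low_kops, high_kops = (N_MDS * rate for rate in PER_MDS_KOPS)

print(f"Aggregate: {low_kops} to {high_kops} kOps/s")
```

The reported 210 kOps/s corresponds to the lower end of this range.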

While both benchmarks have been executed individually, there is strong evidence that the way measurement is done allows us to add up the results of both runs.

==== Sustained performance ====

The reported performance result is only for the new phase 2 system.

Performance of the phase 1 system has been measured with [[tools:benchmarks:ior|IOR]] (see also [[lustre01]]).

The performance of the phase 2 system has been measured in the same way.
The configuration was as follows:
  * Striping 128 OSTs = 32 SSUs
  * 852 compute nodes, 4 IOR procs per node
  * Arguments to IOR: -b 2000000 -t 2000000
  * The amount of data was about 3x main memory of the used nodes
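
To make the scale of this configuration concrete, the following sketch derives the process count and the data volume per IOR segment from the values listed above. It assumes that ''-b'' and ''-t'' are given in bytes. The helper ''segments_for'' and the memory size in the usage example are hypothetical, since the page states neither the main memory per node nor how the total volume of about 3x main memory was reached.

```python
# Values from the configuration list above.
NODES = 852
PROCS_PER_NODE = 4
BLOCK_BYTES = 2_000_000      # IOR -b (assumed to be bytes)
TRANSFER_BYTES = 2_000_000   # IOR -t (assumed to be bytes)

total_procs = NODES * PROCS_PER_NODE
# With -b equal to -t, each block is written in a single transfer.
transfers_per_block = BLOCK_BYTES // TRANSFER_BYTES
# Data written by all processes together in one segment.
bytes_per_segment = total_procs * BLOCK_BYTES


def segments_for(mem_per_node_bytes: int, factor: int = 3) -> int:
    """Hypothetical helper: segments needed to write `factor` times the
    aggregate main memory of the nodes used (rounded up)."""
    target = factor * mem_per_node_bytes * NODES
    return -(-target // bytes_per_segment)  # ceiling division


print(total_procs, transfers_per_block, bytes_per_segment)

# Usage with a hypothetical memory size of 64 GiB per node (not from this page).
print(segments_for(64 * 2**30))
```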

The measurement was conducted while production on phase 1 was active. Since both systems share the InfiniBand tree network, the observed performance is lower than what the system is capable of.