====== FS: Lustre02 ======

===== Characteristics =====

<data_cdcl>
name:Lustre Phase2
</data_cdcl>

===== Description =====

The DKRZ system was procured in two phases of roughly the same size.
The second phase consists of [[http://www.seagate.com/files/www-content/product-content/xyratex-branded/clustered-file-systems/_shared/datasheets/seagate-clusterStor-l300-datasheet-11-06-15.pdf|ClusterStor L300]] systems equipped with Seagate Enterprise Capacity V5 disks (8 TB, ST8000NM0095).

Both systems are configured in Scalable System Units (SSUs): pairs of servers in active/active fail-over mode that manage an extension unit (a JBOD containing additional devices), resulting in two OSTs per OSS.

Initially, we planned to create one big shared file system, but we are now using two file systems (one for the storage of phase 1 and one for phase 2). Both file systems are mounted on all compute nodes.

===== Measurement protocols =====

==== Peak performance ====

The peak performance is derived from the maximum performance possible on an L300, which is 5.4 GiB/s, multiplied by the number of servers in the SSU/extension pairs we have installed (34 in phase 2).
The L300 actually achieves better performance and operates at InfiniBand speed; still, for the theoretical maximum, we use the limit of 5.4 GiB/s.
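
For illustration, the arithmetic can be reproduced as a short sketch, assuming the peak is simply the per-L300 limit multiplied by the count of 34 quoted above:

<code python>
# Back-of-the-envelope check of the theoretical peak described above.
# Assumption: peak = per-L300 limit (5.4 GiB/s) * installed count (34, as quoted).
PER_L300_LIMIT_GIB_S = 5.4
UNITS_PHASE2 = 34

peak_gib_s = PER_L300_LIMIT_GIB_S * UNITS_PHASE2
print(f"Theoretical peak (phase 2): {peak_gib_s:.1f} GiB/s")
</code>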

Lustre's obdfilter-survey demonstrates that the phase 2 system alone is able to deliver 480 GB/s and 580 GB/s for write and read, respectively.

==== Sustained metadata performance ====

Performance has been measured using [[tools:benchmarks:parabench|Parabench]]; see the description in [[lustre01]].
The benchmark runs for a considerable time on 16 nodes with 16 processes per node, but does not explicitly synchronize between the individual Parabench runs.
Theoretically, a single Parabench run could handle this setup, but the simpler approach has been chosen.

In phase 2, we received 7 additional metadata servers; stressed individually, they each deliver between 30 and 35 kOps/s, resulting in about 210 kOps/s in aggregate.
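
A minimal sketch of that aggregation, assuming the per-server rates simply add up as described:

<code python>
# Aggregate metadata rate implied by the per-server measurements above.
# Assumption: the 7 phase-2 metadata servers scale additively.
PER_MDS_KOPS = (30, 35)   # kOps/s measured per server when stressed individually
NUM_MDS = 7

low, high = (rate * NUM_MDS for rate in PER_MDS_KOPS)
print(f"Aggregate metadata rate: {low}-{high} kOps/s")
</code>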

While both benchmarks have been executed individually, there is strong evidence that the way the measurement is done allows us to add up the results of both runs.

==== Sustained performance ====

The reported performance result is only for the new phase 2 system.

Performance of the phase 1 system has been measured with [[tools:benchmarks:ior|IOR]] (see also [[lustre01]]).

The performance of the phase 2 system has been measured in the same way.
The configuration was as follows (a sketch of the resulting data volume is given after the list):
  * Striping across 128 OSTs = 32 SSUs
  * 852 compute nodes, 4 IOR processes per node
  * Arguments to IOR: -b 2000000 -t 2000000
  * The amount of data was about 3x the main memory of the nodes used
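
A rough sketch of the data volume implied by this configuration; the per-node main memory is not stated above, so the 64 GiB used here is only an assumption:

<code python>
# Rough data-volume estimate for the IOR run described above.
# ASSUMPTION: 64 GiB of main memory per compute node (hypothetical, not stated).
NODES = 852
PROCS_PER_NODE = 4
MEM_PER_NODE_GIB = 64
MEMORY_MULTIPLE = 3            # "about 3x main memory"

total_gib = NODES * MEM_PER_NODE_GIB * MEMORY_MULTIPLE
per_proc_gib = total_gib / (NODES * PROCS_PER_NODE)

print(f"Total data per run:   ~{total_gib / 1024:.0f} TiB")
print(f"Data per IOR process: ~{per_proc_gib:.0f} GiB")
</code>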

The measurement has been conducted while production on the phase 1 system was active. Since both systems share the InfiniBand tree network, the observed performance is lower than the system's capabilities.