Saturday, 16 May 2020

Running Linux and IBM Spectrum Scale on IBM supercomputers

Overview


Almost all of the world’s top 500 supercomputers today run Linux®. Most of them use batch job submission systems, which partition the supercomputer as required for the applications and run those applications in sequence in their allocated partitions, in an attempt to keep the expensive machine at maximum utilization.

It is also possible to run Linux in the compute fabric as a multiuser operating system. This standard programming environment broadens the set of applications that can run on the leadership hardware and makes it easy to put the supercomputing capability in the hands of scientists, engineers, and other business personnel who need it.

This article shows a Linux application running on an IBM® POWER9™ (model 8335-GTW) supercomputer cluster, and presents the software you need if you have a machine like this and want to get started with Linux.

Supercomputers and cloud computers


Having your own supercomputer is like having your own Amazon Elastic Compute Cloud. The benchmarks and test cases that you use to measure previous generations of computers (mainframes, PCs, games consoles, cellphones) don’t really apply in this new world.

Fortunately, some of the software developed for those other types of computers can be pressed into service to make some basic measurements, to showcase these new computers, and to illustrate who in a modern competitive business needs access to these facilities.

Writing this article in five years’ time would be simple; we would most likely have oil reservoir models, airline seat pricing models, gas turbine flow visualizations, and similar applications to show off, because the market would be mature. Today, however, we are in at the ground floor of a new and growing business, so we are adapting IBM General Parallel File System (IBM GPFS™) for the purpose.

IBM General Parallel File System is now IBM Spectrum Scale


IBM GPFS, now IBM Spectrum Scale™, started life as the multimedia file system, intended for streaming video at predictable bandwidth from server farms. It is now actively marketed for data management in enterprise data centers.

A typical IBM Spectrum Scale installation consists of maybe 10 servers, each with up to a few hundred disk spindles. These servers provide POSIX file system services for hundreds to thousands of network-connected client systems.

IBM Spectrum Scale provides data replication, volume management, backup/restore, continuous operation through disk and server failures, improved serviceability, and scalability. These are the features enterprises need, and they are what distinguish this IBM technology from open technologies such as Network File System (NFS).

In our scenario with the POWER9 cluster, we allocate a 1.5 TB solid-state disk on each POWER9 node as if it were a disk spindle. The whole IBM Spectrum Scale system consists of 16 server nodes, each with one 1.5 TB disk, providing a coherent POSIX file system image to client applications running on those same 16 nodes. This is an unusual geometry for an IBM Spectrum Scale cluster, but it is viable.

I had access to 16 server nodes; cluster sizes vary from two nodes all the way up to several thousand nodes depending on the intended application.
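
To give a flavour of how a geometry like this is described to IBM Spectrum Scale, the sketch below uses the standard administration commands (mmcrcluster, mmcrnsd, mmcrfs, and mmmount) to build a small file system from one local SSD per node. The host names, device name, cluster name, and block size are illustrative assumptions only, and the file system name simply echoes the /gpfs/ssdfilesys path that appears in the benchmark below; a real installation has its own naming, licensing, and tuning steps.

# nodes.list -- one node descriptor per line (host names are hypothetical)
#   c699c001:quorum-manager
#   c699c002:quorum-manager
#   ... and so on for the remaining nodes ...
#
# nsd.stanza -- one NSD per node, backed by that node's local 1.5 TB SSD
#   %nsd: nsd=nsd_c699c001 device=/dev/nvme0n1 servers=c699c001 usage=dataAndMetadata failureGroup=1
#   %nsd: nsd=nsd_c699c002 device=/dev/nvme0n1 servers=c699c002 usage=dataAndMetadata failureGroup=2

mmcrcluster -N nodes.list -C ssdcluster    # create the cluster from the node list
mmchlicense server --accept -N all         # accept server licenses on all nodes
mmstartup -a                               # start the Spectrum Scale daemon everywhere
mmcrnsd -F nsd.stanza                      # turn each local SSD into a Network Shared Disk
mmcrfs ssdfilesys -F nsd.stanza -B 4M      # create one file system across all 16 NSDs
mmmount ssdfilesys -a                      # mount it on every node in the cluster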

Interleaved or Random


Interleaved or Random (IOR) is a file system benchmark from the University of California. It builds with a standard MPI toolchain, as sketched below, and Figure 1 shows a screen capture of it running on the 16 nodes of the POWER9 cluster.
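
If you want to try IOR on a cluster of your own, a typical way to obtain and build a current release is sketched here; the run below used the older IOR 2.10.3 tree (note the src/C path), so paths and options may differ slightly, and mpicc stands in for whatever MPI compiler wrapper your system provides.

git clone https://github.com/hpc/ior.git   # current upstream home of the IOR benchmark
cd ior
./bootstrap                                # generate the configure script (requires autotools)
./configure MPICC=mpicc                    # point the build at your MPI compiler wrapper
make                                       # the resulting binary is src/ior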

Figure 1. Running IOR


Refer to Listing 1 for the text from Figure 1.

Listing 1. Running IOR

+ jsrun --rs_per_host 1 --nrs 16 -a 1 /gpfs/wscgpfs01/tjcw/workspace/IOR/src/C/IOR -r -w -o /gpfs/ssdfilesys/tjcw//iorfile -b 921G
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Fri Nov 16 03:10:53 2018
Command line used: /gpfs/wscgpfs01/tjcw/workspace/IOR/src/C/IOR -r -w -o /gpfs/ssdfilesys/tjcw//iorfile -b 921G
Machine: Linux c699c010

Summary:
        api                = POSIX
        test filename      = /gpfs/ssdfilesys/tjcw//iorfile
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 16 (1 per node)
        repetitions        = 1
        xfersize           = 262144 bytes
        blocksize          = 921 GiB
        aggregate filesize = 14736 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write       33291.72   33291.72    33291.72      0.00  133166.87  133166.87   133166.87      0.00 453.25578   EXCEL
read        86441.95   86441.95    86441.95      0.00  345767.81  345767.81   345767.81      0.00 174.56413   EXCEL

Max Write: 33291.72 MiB/sec (34908.90 MB/sec)
Max Read:  86441.95 MiB/sec (90640.96 MB/sec)

Run finished: Fri Nov 16 03:21:21 2018

real    10m28.375s
user    0m0.084s
sys     0m0.017s

This shows a session from a desktop to the supercomputer. c699c010 is one of the 16 nodes allocated to this job, each with 44 POWER9 processors, six NVIDIA Tesla GPUs, a 1.5 TB solid-state disk, and 605 GB of RAM; in total that is 704 POWER9 processors, 96 GPUs, 24 TB of solid-state disk, and 9.6 TB of RAM.

Log on to the launch node named c699launch01, and issue the jsrun command to ask for one processor core on each processing node to be joined up over TCP/IP as a Message Passing Interface (MPI) job.

jsrun --rs_per_host 1 --nrs 16 -a 1 \
    /gpfs/wscgpfs01/tjcw/workspace/IOR/src/C/IOR \
    -r -w -o /gpfs/ssdfilesys/tjcw//iorfile -b 921G
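
The options break down roughly as follows; this reading is based on the jsrun and IOR documentation, so treat your own system's man pages as authoritative.

# jsrun options:
#   --rs_per_host 1 : place one resource set on each host in the allocation
#   --nrs 16        : 16 resource sets in total, that is one per node
#   -a 1            : launch one MPI task (rank) in each resource set
# IOR options:
#   -w, -r          : perform a write pass and then a read pass over the test file
#   -o <path>       : the shared test file, here on the SSD-backed Spectrum Scale file system
#   -b 921G         : each task writes and reads a 921 GiB block, so 16 tasks produce
#                     the 14736 GiB aggregate file size reported in Listing 1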

MPI runs IOR, a distributed file system benchmark which you could run over NFS among a cluster of workstations. In this case, IOR is running over the IBM Spectrum Scale File System with its data in solid-state disk, and it achieves an average write data rate of 33.3 GBps and an average read data rate of 86.4 GBps over Mellanox InfiniBand among the 16 nodes. These data rates are limited by the transfer speeds to and from the solid-state disks.

It would be possible to ask jsrun to run the MPI job over all 704 processor cores on the 16 nodes by specifying --rs_per_host 44 --nrs 704, but one core per node is sufficient in this benchmark to use the whole capability of the solid-state disks.
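
For reference, a full-scale run might look like the sketch below. The --rs_per_host and --nrs values come from the paragraph above; the reduced -b value is an illustrative assumption, because 704 tasks each writing 921 GiB would need roughly 633 TiB, far more than the 24 TB of solid-state disk available.

# Hypothetical 704-task run: 44 resource sets per host x 16 hosts, one MPI task per core.
# -b 20G keeps the aggregate file at about 13.8 TiB so that it still fits on the SSDs.
jsrun --rs_per_host 44 --nrs 704 -a 1 \
    /gpfs/wscgpfs01/tjcw/workspace/IOR/src/C/IOR \
    -r -w -o /gpfs/ssdfilesys/tjcw//iorfile -b 20G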

Source: ibm.com
