HPC Tech Tips Archives - Microway
https://www.microway.com/category/hpc-tech-tips/

DGX A100 review: Throughput and Hardware Summary
https://www.microway.com/hardware/dgx-a100-review-throughput-and-hardware-summary/
Fri, 26 Jun 2020

When NVIDIA launched the Ampere GPU architecture, they also launched their new flagship system for HPC and deep learning – the DGX A100. This system offers exceptional performance, but also new capabilities. We’ve seen immediate interest and have already shipped to some of the first adopters. Given our early access, we wanted to share a deeper dive into this impressive new system.

Photo of NVIDIA DGX A100 packaged, being lifted out of packaging, and being tested

The focus of this NVIDIA DGX™ A100 review is on the hardware inside the system – the server includes a number of features and improvements not available in any other server at the moment. DGX will be the “go-to” server for 2020. But hardware only tells part of the story, particularly for NVIDIA’s DGX products. NVIDIA employs more software engineers than hardware engineers, so be certain that application and GPU library performance will continue to improve through updates to the DGX Operating System and to the whole catalog of software containers provided through the NGC hub. Expect more details as the year continues.

Overall DGX A100 System Architecture

This new DGX system offers top-bin parts across-the-board. Here’s the high-level overview:

  • Dual 64-core AMD EPYC 7742 CPUs
  • 1TB DDR4 system memory (upgradeable to 2TB)
  • Eight NVIDIA A100 SXM4 GPUs with NVLink
  • NVIDIA NVSwitch connectivity between all GPUs
  • 15TB high-speed NVMe SSD Scratch Space (upgradeable to 30TB)
  • Eight Mellanox 200Gbps HDR InfiniBand/Ethernet Single-Port Adapters
  • One or Two Mellanox 200Gbps Ethernet Dual-Port Adapter(s)

As you’ll see from the block diagram, there is a lot to break down within such a complex system. Though it’s a very busy diagram, it becomes apparent that the design is balanced and well laid out. Breaking down the connectivity within DGX A100 we see:

  • The eight NVIDIA A100 GPUs are depicted at the bottom of the diagram, with each GPU fully linked to all other GPUs via six NVSwitches
  • Above the GPUs are four PCI-Express switches which act as nexuses between the GPUs and the rest of the system devices
  • Linking into the PCI-E switch nexuses, there are eight 200Gbps network adapters and eight high-speed SSD devices – one for each GPU
  • The devices are broken into pairs, with 2 GPUs, 2 network adapters, and 2 SSDs per PCI-E nexus
  • Each of the AMD EPYC CPUs connects to two of the PCI-E switch nexuses
  • At the top of the diagram, each EPYC CPU is shown with a link to system memory and a link to a 200Gbps network adapter

We’ll dig into each aspect of the system in turn, starting with the CPUs and making our way down to the new NVIDIA A100 GPUs. Readers should note that throughput and performance numbers are only useful when put into context. You are encouraged to run the same tests on your existing systems/servers to better understand how the performance of DGX A100 will compare to your existing resources. And as always, reach out to Microway’s DGX experts for additional discussion, review, and design of a holistic solution.

Index of our DGX A100 review:

  • AMD EPYC CPUs and System Memory
  • High-speed NVMe Storage
  • High-Throughput and Low-Latency Communications with Mellanox 200Gbps
  • GPU-to-GPU Transfers with NVSwitch and NVLink
  • Host-to-Device Transfer Speeds with PCI-Express generation 4.0
  • Diving into the NVIDIA A100 SXM4 GPUs
  • Multi-Instance GPU (MIG)
  • DGX A100 Review Summary

AMD EPYC CPUs and System Memory

Diagram depicting the CPU cores, cache, and memory in the NVIDIA DGX A100
DGX A100 CPU/Memory topology

With two 64-core EPYC CPUs and 1TB or 2TB of system memory, the DGX A100 boasts respectable performance even before the GPUs are considered. The architecture of the AMD EPYC “Rome” CPUs is outside the scope of this article, but offers an elegant design of its own. Each CPU provides 64 processor cores (supporting up to 128 threads), 256MB L3 cache, and eight channels of DDR4-3200 memory (which provides the highest memory throughput of any mainstream x86 CPU).

Most users need not dive further, but experts will note that each EPYC 7742 CPU has four NUMA nodes (for a total of eight nodes). This allows best performance for parallelized applications and can also reduce the impact of noisy neighbors. Pairs of GPUs are connected to NUMA nodes 1, 3, 5, and 7. Here’s a snapshot of CPU capabilities from the lscpu utility:

Architecture:        x86_64
CPU(s):              256
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
CPU MHz:             3332.691
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
NUMA node0 CPU(s):   0-15,128-143
NUMA node1 CPU(s):   16-31,144-159
NUMA node2 CPU(s):   32-47,160-175
NUMA node3 CPU(s):   48-63,176-191
NUMA node4 CPU(s):   64-79,192-207
NUMA node5 CPU(s):   80-95,208-223
NUMA node6 CPU(s):   96-111,224-239
NUMA node7 CPU(s):   112-127,240-255
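
To take advantage of this NUMA layout, CPU-side processes can be pinned to a single node so that they use local cores and local memory. Below is a minimal sketch using the standard numactl utility; the application name ./my_app is a hypothetical placeholder.

# Show the NUMA layout reported by the system (nodes, CPUs, memory per node)
numactl --hardware

# Run a (hypothetical) application bound to NUMA node 3, allocating memory
# only from that node's local DDR4
numactl --cpunodebind=3 --membind=3 ./my_app

# Print the binding policy that applies to a command launched this way
numactl --cpunodebind=3 --membind=3 numactl --show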

High-speed NVMe Storage

Although DGX A100 is designed to support extremely high-speed connectivity to network/cluster storage, it also provides internal flash storage drives. Redundant 2TB NVMe SSDs are provided to host the Operating System. Four non-redundant striped NVMe SSDs provide a 14TB space for scratch storage (which is most frequently used to cache data coming from a centralized storage system).

Here’s how the filesystems look on a fresh DGX A100:

Filesystem      Size  Used Avail Use%    Mounted on
/dev/md0        1.8T   14G  1.7T   1%    /
/dev/md1         14T   25M   14T   1%    /raid

The industry is trending towards Linux software RAID rather than hardware controllers for NVMe SSDs (as such controllers present too many performance bottlenecks). Here’s what the above md0 and md1 arrays look like when healthy:

md0 : active raid1 nvme1n1p2[0] nvme2n1p2[1]
      1874716672 blocks super 1.2 [2/2] [UU]
      bitmap: 1/14 pages [4KB], 65536KB chunk

md1 : active raid0 nvme5n1[2] nvme3n1[1] nvme4n1[3] nvme0n1[0]
      15002423296 blocks super 1.2 512k chunks

It’s worth noting that although all the internal storage devices are high-performance, the scratch drives making up the /raid filesystem support the newer PCI-E generation 4.0 bus which doubles I/O throughput. NVIDIA leads the pack here, as they’re the first we’ve seen to be shipping these new super-fast SSDs.
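
If you want to inspect these arrays and drives on your own system, a few standard Linux utilities will do it. This is a minimal sketch; the device names match the output above, and the nvme-cli package is assumed to be installed.

# Check the health and layout of the software RAID arrays
cat /proc/mdstat
sudo mdadm --detail /dev/md1

# List the NVMe drives and their capacities/firmware (requires nvme-cli)
sudo nvme list

# Confirm that the drives negotiated PCIe Gen4 (16GT/s) links
sudo lspci -vv | grep -E "Non-Volatile memory controller|LnkSta:"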

High-Throughput and Low-Latency Communications with Mellanox 200Gbps

Photo of the internal system sled of DGX A100 with CPUs, Memory, and HCAs
Sled from DGX A100 showing ten 200Gbps adapters

Depending upon the deployment, nine or ten Mellanox 200Gbps adapters are present in each DGX A100. These adapters support Mellanox VPI, which enables each port to be configured for 200G Ethernet or HDR InfiniBand. Though Ethernet is prevalent in certain sectors (healthcare and other industry verticals), InfiniBand tends to be the fabric of choice when the highest performance is required.

In practice, a common configuration is for the GPU-adjacent adapters to be connected to an InfiniBand fabric (which allows for high-performance RDMA GPU-Direct and Magnum IO communications). The adapter(s) attached to the CPUs are then used for Ethernet connectivity (often matching the speed of the existing facility Ethernet, which might be any one of 10GbE, 25GbE, 40GbE, 50GbE, 100GbE, or 200GbE).

Leveraging the fast PCI-E 4.0 bus available in DGX A100, each 200Gbps port is able to push up to 24.6GB/s of throughput (with latencies ranging from 1.09 microseconds for small messages up to roughly 202 microseconds for the largest message sizes, as measured by OSU’s osu_bw and osu_latency benchmarks). Thus, a properly tuned application running across a cluster of DGX systems could push upwards of 200 gigabytes per second to the fabric!
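
The figures above were measured with the OSU micro-benchmarks. A rough sketch of how such a point-to-point test is typically launched between two DGX A100 systems is shown below; the hostnames are placeholders and the exact mpirun flags depend on your MPI installation.

# Bandwidth between one rank on dgx01 and one rank on dgx02
mpirun -np 2 --host dgx01,dgx02 ./osu_bw

# Small- and large-message latency between the same pair of hosts
mpirun -np 2 --host dgx01,dgx02 ./osu_latency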

GPU-to-GPU Transfers with NVSwitch and NVLink

NVIDIA built a new generation of NVIDIA NVLink into the NVIDIA A100 GPUs, which provides double the throughput of NVLink in the previous “Volta” generation. Each NVIDIA A100 GPU supports up to 300GB/s throughput (600GB/s bidirectional). Combined with NVSwitch, which connects each GPU to all other GPUs, the DGX A100 provides full connectivity between all eight GPUs.

Running NVIDIA’s p2pBandwidthLatencyTest utility, we can examine the transfer speeds between each set of GPUs:

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1180.14 254.47 258.80 254.13 257.67 247.62 257.21 251.53
     1 255.35 1173.05 261.04 243.97 257.09 247.20 258.64 257.51
     2 253.79 260.46 1155.70 241.66 260.23 245.54 259.49 255.91
     3 256.19 261.29 253.87 1142.18 257.59 248.81 250.10 259.44
     4 252.35 260.44 256.82 249.11 1169.54 252.46 257.75 255.62
     5 256.82 257.64 256.37 249.76 255.33 1142.18 259.72 259.95
     6 261.78 260.25 261.81 249.77 258.47 248.63 1173.05 255.47
     7 259.47 261.96 253.61 251.00 259.67 252.21 254.58 1169.54

The above values show GPU-to-GPU transfer throughput ranging from 247GB/s to 262GB/s. Running the same test in bidirectional mode shows results between 473GB/s and 508GB/s. Execution within the same GPU (running down the diagonal) shows data rates around 1,150GB/s.

Turning to latencies, we see fairly uniform communication times between GPUs at ~3 microseconds:

P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.63   2.98   2.99   2.96   3.01   2.96   2.96   3.00
     1   3.02   2.59   2.96   3.00   3.03   2.96   2.96   3.03
     2   3.02   2.95   2.51   2.97   3.03   3.04   3.02   2.96
     3   3.05   3.01   2.99   2.49   2.99   2.98   3.06   2.97
     4   2.88   2.88   2.95   2.87   2.39   2.87   2.90   2.88
     5   2.87   2.95   2.89   2.87   2.94   2.49   2.87   2.87
     6   2.89   2.86   2.86   2.88   2.93   2.93   2.53   2.88
     7   2.90   2.90   2.94   2.89   2.87   2.87   2.87   2.54

   CPU     0      1      2      3      4      5      6      7
     0   4.54   3.86   3.94   4.10   3.92   3.93   4.07   3.92
     1   3.99   4.52   4.00   3.96   3.98   4.05   3.92   3.93
     2   4.09   3.99   4.65   4.01   4.00   4.01   4.00   3.97
     3   4.10   4.01   4.03   4.59   4.02   4.03   4.04   3.95
     4   3.89   3.91   3.83   3.88   4.29   3.77   3.76   3.77
     5   4.20   3.87   3.83   3.83   3.89   4.31   3.89   3.84
     6   3.76   3.72   3.77   3.71   3.78   3.77   4.19   3.77
     7   3.86   3.79   3.78   3.78   3.79   3.83   3.81   4.27

As with the bandwidths, the values down the diagonal show execution within that particular GPU. Latencies are lower when executing within a single GPU as there’s no need to hop across the bus to NVSwitch or another GPU. These values show that the same-device latencies are 0.3~0.5 microseconds faster than when communicating with a different GPU via NVSwitch.
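
These bandwidth and latency matrices can be reproduced with the p2pBandwidthLatencyTest sample that ships with the CUDA toolkit. A minimal sketch, assuming the CUDA 11 samples are installed under /usr/local/cuda/samples (on other setups the samples come from NVIDIA's cuda-samples repository):

# Copy the bundled CUDA samples somewhere writable and build the test
cp -r /usr/local/cuda/samples ~/cuda-samples
cd ~/cuda-samples/1_Utilities/p2pBandwidthLatencyTest
make

# Measure unidirectional/bidirectional bandwidth and latency between all GPU pairs
./p2pBandwidthLatencyTest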

Finally, we want to share the full DGX A100 topology as reported by the nvidia-smi topo --matrix utility. While a lot to digest, the main takeaways from this connectivity matrix are the following:

  • all GPUs have full NVLink connectivity (12 links each)
  • each pair of GPUs is connected to a pair of Mellanox adapters via a PXB PCI-E switch
  • each pair of GPUs is closest to a particular set of CPU cores (CPU and NUMA affinity)
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
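
One practical use of this matrix is keeping a job close to its GPU's CPU cores, memory, and network adapter. The sketch below launches a hypothetical single-GPU application on GPU0, which the matrix places on NUMA node 3 next to mlx5_0/mlx5_1; the application name and the UCX variable are illustrative assumptions.

# Bind a (hypothetical) GPU0 job to its local NUMA node and, for UCX-based
# MPI workloads, prefer the adjacent InfiniBand adapter
CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 \
    numactl --cpunodebind=3 --membind=3 ./my_gpu_app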

Host-to-Device Transfer Speeds with PCI-Express generation 4.0

Just as it’s important for the GPUs to be able to communicate with each other, the CPUs must be able to communicate with the GPUs. A100 is the first NVIDIA GPU to support the new PCI-E gen4 bus speed, which doubles the transfer speeds of generation 3. True to expectations, NVIDIA’s bandwidthTest utility demonstrates ~2X speedups on transfer speeds from the system to each GPU and from each GPU to the system:

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			24.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			26.1

As you might notice, these performance values are right in line with the throughput of each Mellanox 200Gbps adapter. Having eight network adapters with the exact same bandwidth as each of the eight GPUs allows for perfect balance. Data can stream into each GPU from the fabric at line rate (and vice versa).
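
These host-to-device numbers come from the bandwidthTest CUDA sample; it can be rebuilt and rerun the same way as the peer-to-peer test above (the path below assumes the same samples copy).

cd ~/cuda-samples/1_Utilities/bandwidthTest
make

# Pinned-memory host<->device bandwidth against GPU 0
./bandwidthTest --device=0 --memory=pinned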

Diving into the NVIDIA A100 SXM4 GPUs

The DGX A100 is unique in leveraging NVSwitch to provide the full 300GB/s NVLink bandwidth (600GB/s bidirectional) between all GPUs in the system. Although it’s possible to examine a single GPU within this platform, it’s important to keep in mind the context that the GPUs are tightly connected to each other (as well as their linkage to the EPYC CPUs and the Mellanox adapters). The single-GPU information we share below will likely match that shown for A100 SXM4 GPUs in other non-DGX systems. However, their overall performance will depend on the complete system architecture.

To start, here is the ‘brief’ dump of GPU information as provided by nvidia-smi on DGX A100:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    60W / 400W |      0MiB / 40537MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 40537MiB |     14%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      0MiB / 40537MiB |     24%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    58W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    62W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   33C    P0    60W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    65W / 400W |      0MiB / 40537MiB |     22%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    63W / 400W |      0MiB / 40537MiB |     21%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The clock speed and power consumption of each GPU will vary depending upon the workload (running low when idle to conserve energy and running as high as possible when executing applications). The idle, default, and max boost speeds are shown below. You will note that memory speeds are fixed at 1215 MHz.

    Clocks
        Graphics                          : 420 MHz (GPU is idle)
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1215 MHz

Those who have particularly stringent efficiency or power requirements will note that the NVIDIA A100 SXM4 GPU supports 81 different clock speeds between 210 MHz and 1410MHz. Power caps can be set to keep each GPU within preset limits between 100 Watts and 400 Watts. Microway’s post on nvidia-smi for GPU control offers more details for those who need such capabilities.
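
As a reference, here is a minimal sketch of the relevant nvidia-smi commands for inspecting and adjusting clocks and power on one GPU; the specific values shown (250 Watts, 1215MHz memory, 1095MHz graphics) are just examples drawn from the ranges above.

# List every supported memory/graphics clock pair on GPU 0
nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS

# Cap GPU 0 at 250 Watts (A100 SXM4 accepts limits between 100W and 400W)
sudo nvidia-smi -i 0 -pl 250

# Pin application clocks to memory=1215MHz, graphics=1095MHz, then restore defaults
sudo nvidia-smi -i 0 -ac 1215,1095
sudo nvidia-smi -i 0 -rac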

Each new generation of NVIDIA GPUs introduces new architecture capabilities and adjustments to existing features (such as resized caches). Some details can be found through the deviceQuery utility, which reports the CUDA capabilities of each NVIDIA A100 GPU device:

  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40537 MBytes (42506321920 bytes)
  (108) Multiprocessors, ( 64) CUDA Cores/MP:     6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes

In the NVIDIA A100 GPU, NVIDIA increased cache & global memory size, introduced new instruction types, enabled new asynchronous data copy capabilities, and more. More complete information is available in our Knowledge Center article which summarizes the features of the Ampere GPU architecture. However, it could be argued that the biggest architecture change is the introduction of MIG.

Multi-Instance GPU (MIG)

For years, virtualization has allowed CPUs to be virtually broken into chunks and shared between a wide group of users and/or applications. One physical CPU device might be simultaneously running jobs for a dozen different users. The flexibility and security offered by virtualization have spawned billion-dollar businesses and whole new industries.

NVIDIA GPUs have supported multiple users and virtualization for a couple of generations, but NVIDIA A100 GPUs with MIG are the first to support physical separation of those tasks. In essence, one GPU can now be sliced into up to seven distinct hardware instances. Each instance then runs its own completely independent applications with no interruption or “noise” from other applications running on the GPU:

Diagram of NVIDIA Multi-Instance GPU demonstrating seven separate user instances on one GPU
NVIDIA Multi-Instance GPU supports seven separate user instances on one GPU

The MIG capabilities are significant enough that we won’t attempt to address them all here. Instead, we’ll highlight the most important aspects of MIG. Readers needing complete implementation details are encouraged to reference NVIDIA’s MIG documentation.

Each GPU can have MIG enabled or disabled (which means a DGX A100 system might have some shared GPUs and some dedicated GPUs). Enabling MIG on a GPU has the following effects:

  • One NVIDIA A100 GPU may be split into anywhere between 2 and 7 GPU Instances
  • Each of the GPU Instances receives a dedicated set of hardware units: GPU compute resources (including streaming multiprocessors/SMs, and GPU engines such as copy engines or NVDEC video decoders), and isolated paths through the entire memory system (L2 cache, memory controllers, and DRAM address busses, etc)
  • Each of the GPU Instances can be further divided into Compute Instances, if desired. Each Compute Instance is provided a set of dedicated compute resources (SMs), but all the Compute Instances within the GPU Instance share the memory and GPU engines (such as the video decoders)
  • A unique CUDA_VISIBLE_DEVICES identifier will be created for each Compute Instance and the corresponding parent GPU Instance. The identifier follows this convention:
    MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>
  • Graphics API support (e.g. OpenGL etc.) is disabled
  • GPU to GPU P2P (either PCI-Express or NVLink) is disabled
  • CUDA IPC across GPU instances is not supported (though IPC across the Compute Instances within one GPU Instance is supported)

Though the above caveats are important to note, they are not expected to be significant pain points in practice. Applications which require NVLink will be workloads that require significant performance and should not be run on a shared GPU. Applications which need to virtualize GPUs for graphical applications are likely to use a different type of NVIDIA GPU.

Also note that the caveats don’t extend all the way through the CUDA capabilities and software stack. The following features are supported when MIG is enabled:

  • MIG is transparent to CUDA and existing CUDA programs can run under MIG unchanged
  • CUDA MPS is supported on top of MIG
  • GPUDirect RDMA is supported when used from GPU Instances
  • CUDA debugging (e.g. using cuda-gdb) and memory/race checking (e.g. using cuda-memcheck or compute-sanitizer) is supported

When MIG is fully-enabled on the DGX A100 system, up to 56 separate GPU Instances can be executed simultaneously. That could be 56 unique workloads, 56 separate users each running a Jupyter notebook, or some other combination of users and applications. And if some of the users/workloads have more demanding needs than others, MIG can be reconfigured to issue larger slices of the GPU to those particular applications.
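
For readers who want to try this, the following is a minimal sketch of the MIG workflow on a single GPU, based on NVIDIA's MIG documentation. The 3g.20gb profile name is one of the A100 40GB profiles, the trailing identifiers are placeholders, and train.py is a hypothetical workload.

# Enable MIG mode on GPU 0 (the GPU must be idle; a reset or reboot may be required)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles available on this GPU (1g.5gb through 7g.40gb on A100 40GB)
sudo nvidia-smi mig -i 0 -lgip

# Create two 3g.20gb GPU instances, each with a default compute instance
sudo nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C

# Show the resulting MIG devices and their identifiers
nvidia-smi -L

# Run a (hypothetical) job on one MIG slice using the identifier convention above
CUDA_VISIBLE_DEVICES=MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID> python train.py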

DGX A100 Review Summary

DGX-POD with DGX A100

As mentioned at the top, this new hardware is quite impressive, but is only one part of the DGX story. NVIDIA has multiple software stacks to suit the broad range of possible uses for this system. If you’re just getting started, there’s a lot left to learn. Depending upon what you need next, there are a few different directions to explore – reach out to Microway’s DGX experts and we can point you toward the right one.

Deploying GPUs for Classroom and Remote Learning
https://www.microway.com/hpc-tech-tips/deploying-gpus-for-classroom-and-remote-learning-2/
Fri, 22 May 2020

As one of NVIDIA’s Elite partners, we see a lot of GPU deployments in higher education. GPUs have been proving themselves in HPC for over a decade, and they are the de-facto standard for deep learning research. They’re also becoming essential for other types of machine learning and data science. But GPUs are not always available to students, particularly undergraduate students.

GPU-accelerated Classrooms at MSOE

Photo of the ROSIE cluster, with artwork featuring a rose tattoo
Photo of MSOE’s ROSIE cluster

One deployment I’m particularly proud of runs at the Milwaukee School of Engineering, where it is used for undergraduate education, as well as for faculty and industry research. This cluster leverages a combination of NVIDIA’s Volta-generation DGX systems, as well as NVIDIA Tesla T4 GPUs, Mellanox Ethernet, and NetApp storage.

Rather than having to learn a more arcane supercomputer interface, students are able to start GPU-accelerated Jupyter sessions with the click of a button in their web browser.

The cluster is connected to NVIDIA’s NGC hub, providing pre-built containers with the latest HPC & AI software stacks. The DGX systems do the heavy lifting and the Tesla T4 systems service less demanding needs (such as student sessions during class).

Microway’s team delivered all of this fully integrated and ready-to-run, allowing MSOE’s undergrads to get hands on the latest, highest-performing hardware and software tools. And they don’t have to dive down into huge levels of complexity until they’re ready.

Close up photo of the equipment in the ROSIE cluster
Close up photo of the DGX-1, servers, and storage in ROSIE

Multi-Instance GPU amplifies Remote Learning

What changed this month is that NVIDIA’s new DGX A100 simplifies your infrastructure. Institutions won’t need one set of systems for the most demanding work and a separate set of systems for less intensive classes/labs. Instead, DGX A100 wraps all these capabilities into one powerful and configurable HPC/AI system. It can handle anything from a huge neural network training job to a classroom of 56 students. Or a combination of the two.

NVIDIA calls this capability Multi-Instance GPU (MIG). The details might sound a bit hairy, but think of MIG as providing the same kinds of flexibility that virtualization has been providing for years. You can use the whole GPU, or divide it up to support several different applications/users.

DGX A100 is the only system currently offering this capability, providing anywhere from 8 to 56 GPU instances (other NVIDIA A100 GPU systems will be shipping later this year).

The diagram below depicts seven students/users each running their own GPU-accelerated workload on a single NVIDIA A100 GPU. Each of the eight GPUs in the DGX A100 supports up to seven GPU instances, for a total of 56 instances.

Diagram of NVIDIA Multi-Instance GPU demonstrating seven separate user instances on one GPU
NVIDIA Multi-Instance GPU supports seven separate user instances on one GPU

Consider how these new capabilities might enable your institution: for example, by offering GPU-accelerated sessions to each student in a remote learning course. The traditional classroom of lab PCs might be replaced by a single DGX system.

Each DGX A100 system can serve 56 separate Jupyter notebooks, each with GPU performance similar to a Tesla T4. Microway deploys these systems with a workload manager that supports resource sharing between classroom requests and other types of work, so the full horsepower of the DGX can be leveraged for faculty research when class is not in session. Further, your IT team no longer needs to support dozens of physical workstations – the computer resources are centralized and can be managed from a single location.

Flexible Platforms Support Diverse Workloads

These types of high-performance computer labs are likely familiar for curriculums in traditionally compute-demanding fields (e.g., computer science, engineering, computational chemistry). However, we hear increasing calls for these computational resources from other departments across campuses. As the power of data analytics and machine learning become utilized in other fields, this type of deployment might even be an opportunity for cost-sharing between traditionally disconnected departments.

This year, we’re all being challenged to conceive of new, seamless methods for remote access, collaboration, and instruction. Our team would be thrilled to be a part of the transformation at your institution. The first DGX A100 units in academia will be at the University of Florida next month, where Microway will be performing the integration. I know NVIDIA’s DGX A100 systems will prove invaluable to current GPU users, and I hope they will also extend into the hands of graduate and even undergraduate students. Let’s talk about what’s possible now.

What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science
https://www.microway.com/hpc-tech-tips/what-can-you-do-with-a-15k-nvidia-data-science-workstation-change-healthcare/
Wed, 04 Mar 2020

NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive Pubmed literature search project on a Data Science WhisperStation in hours versus weeks.

Kyle Gallatin, an engineer at Pfizer, has deep data science credentials. He’s been working on projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:

When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

Experimentation, Performance, and GPU Accelerated Data Science Tooling

Gallatin used the Data Science WhisperStation to rapidly create an accelerated data science workflow for healthcare—and tell us about his experience. And it was a remarkable one.

Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!

The Data Science Workstation allowed him to design a Pubmed healthcare article search engine where he:

  1. Ingested a larger database than ever imagined (30,000,000 research article abstracts!)
  2. Didn’t require massive code changes to GPU accelerate the algorithm
  3. Used familiar looking tools for his workflow
  4. Had unsurpassed agility—he could search large portions of the abstract database in 0.1 seconds!

This last point is really critical and shows why we believe the NVIDIA Data Science Workstation Platform and its RAPIDS tools are so special. As Kyle put it:

Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally-intensive process, only to be blindsided by an unforeseen error literal hours into running it. Then you have to correct it and start all over again.

Walkthrough with Step-by-Step Instructions

The new article is available on Medium. It provides a complete step-by-step walkthrough of how NVIDIA RAPIDS tools and the NVIDIA Quadro RTX 6000 with NVLink were utilized to revolutionize this process.

A short set of Kyle’s key findings about the environment and the hardware are below. We’re excited about how this kind of rapid development could change healthcare:

Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run

Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDS, TensorFlow, PyTorch and everything else installed and configured out-of-the-box saved hours in setup time

..

With these general-purpose data science libraries offering massive computational enhancements for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc…), the path is paved to an entirely new frontier of data science.

Read on at Medium.com

Multi-GPU Scaling of MLPerf Benchmarks on NVIDIA DGX-1
https://www.microway.com/hpc-tech-tips/multi-gpu-scaling-of-mlperf-benchmarks-on-nvidia-dgx-1/
Fri, 23 Aug 2019

In this post, we discuss how the training of deep neural networks scales on DGX-1. Considering 6 models across 4 out of 5 popular domains covered in the MLPerf v0.5 benchmarking suite, we discuss the time to state-of-the-art accuracy as set by MLPerf. We also highlight the models that scale well and should be trained on larger numbers of GPUs. Models with poor scalability should be trained on fewer GPUs, which allows for resource sharing among multiple users. As such, we provide insight into common deep learning workloads and how to best leverage the multi-GPU DGX-1 deep learning system for training the models.

MLPerf – a benchmarking suite for deep learning applications

Just as HPC system design is evolving to achieve good performance for Deep Learning applications, there is also an ever-increasing need to have a good set of benchmarks to quantify this performance. Many benchmarking tools have been proposed. For example, Baidu Research released DeepBench, which focuses on basic operations involved in neural networks like convolution, GEMM, recurrent layers, and all-reduce. Yet there is no provision to compare different systems/workstations or even software frameworks. TensorFlow introduced TF_CNN_BENCH, which is single-domain and benchmarks only convolutional network-based deep-learning workloads. With a diversity of workloads and a variety of different hardware configurations, we need a more general approach to benchmarking deep learning applications.

With support from both industry and universities, and inspired by SPEC and TPC standards, MLPerf is a leading choice as a set of benchmarks covering different areas of Machine Learning. The goals here are multi-fold: a fair comparison of different hardware configurations and software frameworks, while encouraging innovation and easy reproducibility of results.

MLPerf suite includes Image Classification, Object Detection (light and heavy), Language Translation (Recurrent and Non-Recurrent), Recommendation Systems, and Reinforcement Learning benchmarks. The suite is divided into two divisions: Closed and Open. In the Closed division the data preprocessing, training method, and model must be the same as the MLPerf reference implementation. Only very limited changes to hyperparameters are allowed. This aims for fair comparison of different deep learning hardware platforms. In the Open division any model, preprocessing, or training method can be used.

Version v0.5 received no submissions to the Open division. However, Google, NVIDIA, and Intel made submissions to the Closed division. Only Google (on a cloud instance) and NVIDIA submitted GPU-accelerated results. No GPU submissions were made for the reinforcement learning benchmark, but Intel did submit a CPU-only result on Skylake processors. Software frameworks varied from TensorFlow v1.12, to MXNet for image classification, to PyTorch for the rest of the domains.

The results discussed in this post largely replicate NVIDIA’s submission in the Closed Model Division of MLPerf v0.5.0 for training. This division places restrictions on modifying hyperparameters like learning rate and batch size to provide a fair comparison of hardware/software systems. However, minor changes were required to successfully train on small numbers of GPUs. All our changes are reflected in the below log files for interested folks who want to dive deeper. We performed scaling analysis on 1, 4, and 8 GPUs on DGX-1. Our findings help deep learning practitioners and researchers determine the best options for their deep learning problem/application(s).
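
The MLPerf v0.5 reference implementations and NVIDIA's submission are containerized, so restricting a run to 1 or 4 GPUs mostly comes down to limiting the devices handed to the container. A rough sketch of the idea is below; the image name, mounted dataset path, and entry-point script are placeholders rather than the actual submission artifacts.

# Give the training container only four of the eight GPUs (Docker 19.03+ syntax)
docker run --gpus '"device=0,1,2,3"' --rm -it \
    -v /raid/datasets:/data \
    mlperf-nvidia:image_classification \
    ./run_and_time.sh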

Training Deep Neural Networks

Training deep neural networks can be a formidable task. With millions of parameters, the model risks overfitting the training data. The deep layers in the model can have extreme gradients that lead to vanishing/exploding gradient problems. Even after accounting for all these pitfalls, the training of a network can be really slow. As a non-convex optimization problem, there can be multiple solutions, and training neural networks boils down to finding the right selection of hyperparameters in order to achieve a certain threshold of accuracy. This can be done by manually tuning parameters, observing a low generalization error, and iterating with a different combination of values until reaching the desired accuracy. When there are only a few hyperparameters, a grid search can be applied, though it is more computationally intensive: a range of discrete values for each parameter is selected, and the model is trained on every combination of parameters as described by the Cartesian product (grid) of the values chosen.
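
As a concrete illustration of a grid search, the shell loop below trains one model per combination in the Cartesian product of a few learning rates and batch sizes; train.py and its flags are hypothetical stand-ins for whatever training script you use.

# Exhaustive grid search over two hyperparameters; every (learning rate, batch size)
# combination is trained and its output logged separately
for lr in 0.1 0.01 0.001; do
  for bs in 128 256 512; do
    python train.py --learning-rate "$lr" --batch-size "$bs" \
      > "log_lr${lr}_bs${bs}.txt" 2>&1
  done
done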

The following is a brief description of each model being used in the MLPerf benchmarks:

  1. Convolutional Neural Networks (CNN):  Most widely used for image processing and pattern recognition applications like object detection/localization, human pose estimation, scene recognition; also for certain non-image workflows (e.g., processing acoustic, seismic, radio, or radar signals). In general, any data that has a grid-like topology can be processed using CNNs. Typical CNNs consist of convolutional layers, pooling layers, and fully connected layers. The convolution operation involves convolving a filter on the image, which extracts features in a local region of the image. In any image the pixels at large distances are randomly related, as opposed to smaller distances where they are correlated. The size of the filter, stride, and padding are some of the hyperparameters that need proper tuning. Pooling layers are used to reduce the number of parameters in the network, in turn reducing the number of computations. Fully connected layers help in classifying images based on the features extracted by the convolution layers. The MLPerf benchmarks Image Classification, Single Stage Detector, and Object Detection make use of a special type of CNN called ResNet. Introduced by Microsoft, ResNet [1] won the ILSVRC 2015 challenge and continues to lead. ResNets consist of residual blocks which ease the process of training extremely deep networks. A residual connection is a shortcut from one layer to another usually after skipping a few layers, basically copying the output from one layer and adding it to another layer just before applying non-linearity. MLPerf benchmarks Image Classification and Object Detection use ResNet-50 (50 layers) while the Single-Stage detector uses ResNet-34 (34 layers) as the backbone.
  2. Recurrent Neural Network (RNN): RNNs are interesting neural networks that offer a lot of flexibility in designing the model. It lets you operate with sequenced data at input, output, or both.  For example, in image captioning with a fixed-size image input, where the RNN model generates a sequence of words describing the contents of the image. In the case of sentiment analysis, the input is a sequence of words and the output is the sentiment of the sentence: whether it is good (positive) or bad (negative).  The MLPerf RNN benchmark uses the sequenced input and sequenced output model, similar to Google’s Neural Machine Translation (GNMT). GNMT has 3 components: an encoder, a decoder, and an attention network. The encoder modifies the input sequence into a list of vectors and the decoder decodes the vector into another sequence of words as an output. The encoder and decoder are connected via an attention network that allows for giving attention to different parts of the input sentence/sequence while decoding. For a more detailed description of the model, read the GNMT [2] paper.
  3. Transformers : A Transformer is a new type of sequence-to-sequence architecture for machine translation that uses both an encoder and a decoder, but does not use Recurrent layers like LSTMs or GRUs. Transformers are a new advancement in NLP which perform better than RNNs. A typical Transformer model would have an encoder and a decoder, with both containing modules like ‘Multi-Head Attention’ and ‘Feed Forward layers’. Since there is no RNN, there is no way of knowing the order of the words fed to the network. Therefore, we need part of the model to have a positional encoding of the words in the sequence. The source language sequence is fed to the encoder and the corresponding target language sequence is fed into the decoder, but shifted by a position. The model tries to predict the next word in the target sequence while having seen only the words prior to that position, and avoids simply copying the decoder sequence as the output. For more detailed model description, read the Attention is all you need [3] paper.
  4. Neural Collaborative Filtering (NCF) : Many online services (e.g., e-commerce, social networking) provide their customers with millions of options to choose from. With digital transformation resulting in huge amounts of data overload, it’s almost impossible to browse through an entire online collection. Recommender systems are needed to filter these options and help users make selections. Collaborative Filtering models the past interactions between the user and the collection. This essentially boils down to a Matrix Factorization problem where the user and collection are projected onto a latent space and the similarity (using the inner product) between the latent vectors is computed. The predictions are based on similarities. However, ‘Inner Product‘ is not a good choice of function to model complex interactions and an alternate approach of using a neural architecture to learn the arbitrary function from the data was devised. This approach is known as Neural Collaborative Filtering (NCF) [4]. Both the user and collection are represented as one-hot encoded in the input layer (sparse). A fully-connected (Embedding) layer projects this sparse representation to a dense vector. The output of the embedding layer is then fed into the Neural CF layers where each layer can learn certain structure among the interactions.

MLPerf Scaling on NVIDIA DGX-1

The MLPerf results submitted by NVIDIA make use of single-node and multi-node DGX-1 and DGX-2 systems, utilizing the entirety of the systems to train a single network. Our post discusses how performance scales when using a single DGX-1 (using 1, 4, or all 8 NVIDIA Tesla GPUs). This is important to understand how a single DGX-1 system can be used as a shared resource among multiple users, or to be used to run multiple cases of the same problem. It also helps establish which deep learning domains require the training to be done on a large scale.

Image Classification

Trained on the ILSVRC2012 dataset with 1.2 million images, this benchmark scales well. It achieves better than linear speedups going from 1 to 4 (~5x) and 1 to 8 GPUs (~10x). DGX users will achieve better throughput if they use the full system for each job.

Figure 1. Evaluation accuracy vs Epochs for Image Classification.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 512        | 16.21                        | 83               | fp-16
4      | 1664       | 4.566                        | 63               | fp-16
8      | 1664       | 2.2002                       | 63               | fp-16

Table 1. Synopsis of Image Classification benchmarks

Figure 1 shows the validation accuracy versus the number of epochs it took to reach that accuracy. The accuracy target set by MLPerf for this benchmark is 74.9%. The 4- and 8-GPU runs achieve this accuracy in the same number of epochs; however, the average time for each epoch is different, as reported in Table 1. For a single-GPU run, the batch size needed to be reduced in order to avoid “Out of Memory (OOM)” errors. With the smaller batch size on a single GPU (compared to 4 and 8 GPUs), it took more epochs to train the model to the same accuracy.

Object Detection – Heavy

This is the heaviest workload among all the benchmarks considered in MLPerf. Utilizing the full DGX-1, it takes ~325 minutes to train on the COCO2014 dataset. The model used is the same ResNet-50 as the Image-Classification benchmark. The speedup obtained is ~2.5x going from 1 to 4 GPUs and ~6x when going from 1 to 8 GPUs (which is sub-linear).

Figure 2. Mask mAP and Bounding Box mAP vs Epochs for heavy Object Detection.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 2          | 179.183                      | 11               | fp-16
4      | 4          | 44.315                       | 18               | fp-16
8      | 4          | 24.9895                      | 13               | fp-16

Table 2. Synopsis of Object Detection (heavy) benchmarks

Figures 2a and 2b show the accuracy plots for the heavy object detection benchmark. There are two different accuracy thresholds here: BBOX (Fig. 2b), which stands for Bounding Box accuracy, and SEGM (Fig. 2a), which stands for Instance Segmentation. Simply put, an object detection problem requires that the object be correctly located within the image and also that the object be correctly identified/categorized. Instance segmentation refers to labeling each pixel with the object instance it belongs to.

Object Detection – Light

The lightweight object detection benchmark makes use of the COCO2017 dataset and scales with close to linear speedups: about ~3.7x going from 1 to 4 GPUs, and ~7.3x going from 1 to 8 GPUs. Total runtime varies from more than 3 hours on a single GPU to less than half an hour on 8 GPUs.

Figure 3. Accuracy vs Epoch for Single Stage Detector

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 152        | 4.080                        | 49               | fp-16
4      | 152        | 1.115                        | 49               | fp-16
8      | 152        | 0.5628                       | 49               | fp-16

Table 3. Synopsis of SSD benchmark.

Figure 3 shows the accuracy plot for the single stage detector benchmark. The evaluation of the model occurs only at epochs 32, 43, and 48 – hence the 3 data points in the plot. This, of course, can be changed to evaluate more often and produce more data points for the plot; however, we stuck to the default values.

Language Translation – Recurrent (GNMT) and Non-Recurrent (Transformer)

The Recurrent model is trained on the WMT16 English-German dataset and the Transformer model is trained on the WMT17 EN-DE dataset. Both language translation models scale well; however, the Transformer not only scales better but also achieves a higher accuracy target, at the cost of a longer total training time.

Figure 4. BLEU score vs Epochs for Google’s NMT and Transformer Translation models.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 512        | 39.98                        | 5                | fp-16
4      | 512        | 12.31                        | 5                | fp-16
8      | 1024       | 6.4                          | 3                | fp-16

Table 4. Synopsis of RNN benchmark for Language Translation (GNMT)

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 5120       | 60.34                        | 8                | fp-16
4      | 5120       | 22.38                        | 4                | fp-16
8      | 5120       | 7.648                        | 4                | fp-16

Table 5. Synopsis of Non-Recurrent benchmark for Language Translation (Transformer)

Figures 4a and 4b show the validation accuracy plots versus epochs for the language translation models. Google’s NMT uses a Recurrent Neural Network based model and achieves an accuracy of 21.80 BLEU. The Transformer model, a newer advancement in language translation that does not use recurrent layers, performs better, reaching a higher quality target of 25.00 BLEU.

Tables 4 and 5 show the synopses for these benchmarks. The length of the sequence is a key parameter for a Recurrent model and does affect the scaling.

Recommendation Systems

This is the quickest benchmark to run. Even on a single GPU, it only takes a little over a minute to train to the desired accuracy of 0.635. The speedups are ~1.8x and ~2.8x when going from 1 to 4 and 1 to 8 GPUs, respectively.

Figure 5. Evaluation accuracy vs Epochs of the Neural Collaborative Filtering model for Recommendation Systems.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 1048576    | 0.135384615                  | 13               | fp-16
4      | 1048576    | 0.076923077                  | 13               | fp-16
8      | 1048576    | 0.048461538                  | 13               | fp-16

Table 6. Synopsis of Recommendation Systems benchmark

Figure 5 shows the accuracy plots for the recommendation benchmark. All the plots in the figure are quite close to each other, which suggests that it’s not a cost-effective strategy to use multiple GPUs for this type of workload. The benefit of using a machine like DGX-1 for such workloads is to run multiple cases, each on a single GPU. Dedicating an entire DGX-1 to a single training will reduce the training time, but is not as efficient if overall throughput is the goal.

MLPerf Scaling Results

This section summarizes the scaling results and discusses the speedups. Figure 6 shows the scaling analysis of six MLPerf benchmarks on 1, 4, and 8 GPUs on an NVIDIA DGX-1 (with Tesla V100 32GB GPUs). A general conclusion to draw from the Figure is that “all the models do not scale the same way”. Most of the models scale well. The better a model scales, the more efficiently you can train networks on large resources (an entire DGX or a cluster of DGX).

Figure 6. Scaling plots on 1-4-8 GPUs for MLPerf v0.5 Closed Model Division Benchmarks submitted by NVIDIA. The X-axis shows the number of GPUs and the Y-axis shows the training time to desired accuracy in minutes (the metric set by MLPerf). The inset axis shows a zoomed in view of the plot.

We see substantial speedups for the Image Classification and Transformer Translation benchmarks (both are super-linear, running more quickly the more GPUs are added). The Single-Stage Detector and Mask R-CNN Object Detection benchmarks remain close to linear, while the RNN benchmark goes from linear speedup on 4 GPUs to super-linear speedup on 8 GPUs (which indicates that all of the above will scale efficiently). The Recommendation benchmark scales poorly, with fairly insignificant time savings when run on many GPUs. Table 7 lists the speedups for all benchmarks, with each speed-up calculated as the ratio of total training time on a single GPU to the total training time on multiple GPUs.
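
As a concrete example of how those ratios are computed: using the rounded values from Table 1, the Image Classification run takes roughly 16.21 min/epoch × 83 epochs ≈ 1,345 minutes on one GPU versus 2.20 min/epoch × 63 epochs ≈ 139 minutes on eight GPUs, which works out to the ≈9.7x speedup reported in Table 7.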

For a more detailed understanding of hyperparameters used to train these models, please reference the log files below [10].

Benchmark                    | Speed Up (1-4 GPU) | Speed Up (1-8 GPU)
Image Classification         | 4.76               | 9.70
Single Stage Detector        | 3.66               | 7.25
Object Detection             | 2.47               | 6.066
RNN GNMT                     | 3.24               | 10.411
Transformer Translation      | 5.392              | 15.778
Recommendation Systems (NCF) | 1.76*              | 2.789*

(*) Recommendation systems is not a good benchmark for studying scaling analysis of deep learning workloads, since it is the quickest of the bunch and the achieved speedup is on the order of seconds.

Table 7. Speed Ups for all the benchmarks going from 1 to 4 to 8 GPUs

Based on the results, a general takeaway message would be to select systems based on the type of deep learning application one is trying to build. We see that the recommendation systems benchmark doesn’t scale well, which suggests that such projects should limit multi-GPU training and instead share the resources (either shared between multiple users or between multiple models).  On the other hand, if your team trains neural networks on large image sets (image classification, object localization, object detection, instance segmentation), using multi-GPU systems is crucial for quick results.

Next Steps for Successful Deep Learning Deployment

Of course, a powerful compute resource is just one part of successful deep learning implementation. Depending upon your project needs and the anticipated growth of your datasets, storage requirements may eclipse compute requirements. Connectivity also becomes critical, as neural network training stresses system and network I/O.

Whether you are planning a new project or looking to improve your existing deep learning practice, Microway’s team would be happy to help you define the requirements and deliver a successful solution. With experience in everything from GPU workstations to DGX-2 SuperPODS, our experts can ensure the deployment meets your needs. Contact an AI expert today!

References

[1] Deep Residual Learning for Image Recognition
[2] Google's Neural Machine Translation System
[3] Attention Is All You Need
[4] Neural Collaborative Filtering
[5] Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
[6] Demystifying Hardware Infrastructure Choices for Deep Learning Using MLPerf
[7] MLPerf v0.5 Training Results
[8] Mask R-CNN for Object Detection
[9] Single Shot MultiBox Detector
[10] Training results log files

2nd Gen AMD EPYC “Rome” CPU Review: A Groundbreaking Leap for HPC https://www.microway.com/hpc-tech-tips/amd-epyc-rome-cpu-review/ https://www.microway.com/hpc-tech-tips/amd-epyc-rome-cpu-review/#comments Wed, 07 Aug 2019 23:00:00 +0000 https://www.microway.com/?p=11787 The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD […]


The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.

2nd Generation AMD EPYC "Rome" CPUs are offered in 8-64 cores and clock speeds from 2.2-3.2GHz. They are available in dual-socket configurations as well as a select number of single-socket-only SKUs.

Important changes in AMD EPYC “Rome” CPUs include:

  • Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
  • PCI-E Gen 4 support for 2X the I/O bandwidth of the x86 competition, a first for an x86 server CPU
  • 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture (a rough peak-FLOPS sketch follows this list)
  • DDR4-3200 support for improved memory bandwidth across 8 channels, reaching up to 208GB/sec per socket
  • Next Generation Infinity Fabric with higher bandwidth for intra and inter-die connection, with roots in PCI-E Gen4
  • New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node
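To put the per-core FLOPS item in the list above into rough numbers, here is a back-of-the-envelope sketch. The FLOP-per-cycle figures (16 for Zen2 via two 256-bit FMA units, 8 for the prior generation) and the EPYC 7601 comparison point are our own working assumptions for illustration, not values taken from this article.

```python
# Rough peak FP64 estimate: cores x base clock x FLOP-per-cycle.
# FLOP/cycle values are assumptions for illustration:
#   Zen2 "Rome":   two 256-bit FMA units -> 4 doubles x 2 ops x 2 units = 16
#   Zen1 "Naples": half the FP datapath width                          ->  8
def peak_fp64_tflops(cores, base_clock_ghz, flop_per_cycle):
    return cores * base_clock_ghz * flop_per_cycle / 1000.0

rome_7742   = peak_fp64_tflops(cores=64, base_clock_ghz=2.25, flop_per_cycle=16)
naples_7601 = peak_fp64_tflops(cores=32, base_clock_ghz=2.2,  flop_per_cycle=8)
print(f"EPYC 7742 (Rome):   ~{rome_7742:.1f} TFLOPS FP64 peak")    # ~2.3 TFLOPS
print(f"EPYC 7601 (Naples): ~{naples_7601:.1f} TFLOPS FP64 peak")  # ~0.6 TFLOPS
```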

Leadership HPC Performance

There's no other way to say it: the 2nd Generation AMD EPYC "Rome" CPUs (EPYC 7xx2) break new ground for HPC performance. In our experience, we haven't seen this type of advancement in CPU performance in many years, at least not without exotic architectural changes. This leap applies across both floating point and integer applications.

Note: This article focuses on SPEC benchmark performance (which is rooted in real integer and floating point applications). If you’re hunting for a more raw FLOPS/dollar calculation, please visit our Knowledge Center Article on AMD EPYC 7xx2 “Rome” CPUs.

Floating Point Benchmark Performance

In short: at the top bin, you may see up to 2.12X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_fp_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.79X the performance.
AMD Rome SPECfp 2017 vs Xeon CPUs - Top Bin

Integer Benchmark Performance

Integer performance largely mirrors the same story. At the top bin, you may see up to 2.49X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_int_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.90X the performance.
AMD Rome SPECint 2017 vs Xeon CPUs - Top Bin

What Makes EPYC 7xx2 Series Perform Strongly?

Contributions towards this leap in performance come from a combination of:

  • 2X the FLOPS per core available in the new architecture
  • Improved performance of the Zen2 microarchitecture
  • Moderate increases in clock speeds
  • Most importantly, dramatic increases in core count

These last 2 items are facilitated by the new 7nm process node and the chiplet architecture of EPYC. Couple that with the advantages in memory bandwidth, and you have a recipe for HPC performance.

Performance Outlook


The dramatic increase in core count coupled with Zen2 means that we predict most of the 32-core models and above, about half of AMD's SKU stack, are likely to outperform the top Xeon Platinum 8200 series SKU. Stay tuned for the SPEC benchmarks that confirm this assertion.

If you're comparing against more modest Xeon Gold 62xx or Silver 52xx/42xx SKUs, we predict an even more dramatic performance uplift. This is the first time in many years we've seen such an incredibly competitive product from the AMD Server Group.

Class Leading Price/Performance

AMD EPYC 7xx2 series isn’t just impressive from an absolute performance perspective. It’s also a price performance machine.

Examine these same two top-bin SKUs once again:
AMD Rome SPECfp 2017 vs Xeon CPUs - Price Performance

The top-bin AMD SKU does 1.79X the floating point work at approximately 2/3 the price of the Xeon Platinum 8280. It delivers 2.13X the floating point performance of the Xeon Gold 6252, at roughly similar price/performance.

Should you be willing to accept more modest core counts with the lower-cost SKUs, these comparisons only get better.

Finally, if you’re looking to roughly match or exceed the performance of the top-bin Xeon Gold 6252 SKU, we predict you’ll be able to do so with the 24-core EPYC 7352. This will be at just over 1/3 the price of the Xeon socket.

This much more typical comparison is emblematic of the price-performance advantage AMD has delivered in the new generation of CPUs. Stay tuned for more benchmark results and charts to support the prediction.

A Few Caveats: Performance Tuning & Out of the Box

Application Performance Engineers have spent years optimizing applications for the most widely available x86 server CPU. For a number of years now, that has meant Intel’s Xeon processors. The benchmarks presented here represent performance-tuned results.

We don't yet have great data on how easy it is to achieve optimized performance with these new AMD "Rome" CPUs. For those of us who have been in HPC for some time, we know that out-of-the-box performance and optimized performance can often mean very different things.

AMD does recommend specific compilers (AOCC, GCC, LLVM) and libraries (BLIS over BLAS and FLAME over LAPACK) to achieve optimized results with all EPYC CPUs. We don't yet have a complete understanding of how much these help end users achieve these superior results. Does it require a lot of tuning for the most exceptional performance?

AMD, however, has released a new Compiler Options Quick Reference Guide for the new CPUs. We strongly recommend using these flags and options when tuning your application.

Chiplet and Multi-Die Architecture: IO and Compute Dies

AMD EPYC Rome Die

One of the chief innovations in the 2nd Generation AMD EPYC CPUs is in the evolution of the multi-die architecture pioneered in the first EPYC CPUs.

Rather than create one monolithic, hard-to-yield die, AMD has opted to lash "chiplets" together in a single socket with Infinity Fabric technology.

Compute Dies (now in 7nm)

8 compute chiplets (formally, Core Complex Dies or CCDs) are brought together to create a single socket. These CCDs take advantage of the latest 7nm TSMC process node. By using 7nm for the compute cores in 2nd Generation EPYC, AMD takes advantage of the space and power efficiencies of the latest process, without the yield issues of a single monolithic die.

What does it mean for you? More cores than anticipated in a single socket, a reasonable power efficiency for the core count, and a less costly CPU.

The 14nm IO Die

In 2nd Generation EPYC CPUs, AMD has gone a step further with the chiplet architecture. These chiplets are now complemented by a separate I/O die. The I/O die contains the memory controllers, PCI-Express controllers, and the Infinity Fabric connection to the remote socket. This design also resolves the NUMA affinity quirks of the 1st generation EPYC processors.

Moreover, the I/O die is created on the established 14nm process node, where the 7nm power efficiencies are less important.

DDR4-3200 and Improved Memory Bandwidth

AMD EPYC 7xx2 series improves its theoretical memory bandwidth when compared to both its predecessor and the competition.

DDR4-3200 DIMMs are supported, and they are clocked 20% faster than DDR4-2666 and 9% faster than DDR4-2933.
In summary, the platform offers:

  • Compared to Cascade Lake-SP (Xeon Platinum/Gold 82xx, 62xx): Up to a 45% improvement in memory bandwidth
  • Compared to Skylake-SP (Xeon Platinum/Gold 81xx, 61xx): Up to a 60% improvement in memory bandwidth
  • Compared to AMD EPYC 7xx1 Series (Naples): Up to a 20% improvement in memory bandwidth



These comparisons are created for a system where only the first DIMM per channel is populated. Part of this memory bandwidth advantage is derived from the increase in DIMM speeds (DDR4-3200 vs 2933/2666); part of it is derived from EPYC’s 8 memory channels (vs 6 on Xeon Skylake/Cascade Lake-SP).
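The percentages above follow directly from channel count times DIMM transfer rate times 8 bytes per 64-bit transfer. A minimal sketch of that arithmetic (theoretical peak only, one DIMM per channel, as noted above):

```python
# Theoretical peak memory bandwidth per socket:
#   channels x transfer rate (MT/s) x 8 bytes per 64-bit transfer
def peak_mem_bw_gbs(channels, mts):
    return channels * mts * 8 / 1000.0

rome         = peak_mem_bw_gbs(8, 3200)  # ~204.8 GB/s
naples       = peak_mem_bw_gbs(8, 2666)  # ~170.6 GB/s
cascade_lake = peak_mem_bw_gbs(6, 2933)  # ~140.8 GB/s
skylake      = peak_mem_bw_gbs(6, 2666)  # ~128.0 GB/s

print(f"Rome vs Cascade Lake-SP: +{(rome / cascade_lake - 1) * 100:.0f}%")  # ~45%
print(f"Rome vs Skylake-SP:      +{(rome / skylake - 1) * 100:.0f}%")       # ~60%
print(f"Rome vs Naples:          +{(rome / naples - 1) * 100:.0f}%")        # ~20%
```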

While we’ve yet to see final STREAM testing numbers for the new CPUs, we do anticipate them largely reflecting the changes in theoretical memory bandwidth.

PCI-E Gen4 Support: 2X the I/O bandwidth

EPYC “Rome” CPUs have an integrated PCI-E generation 4.0 controller on the I/O die. Each PCI-E lane doubles in maximum theoretical bandwidth to 4GB/sec (bidirectional).

A 16 lane connection (PCI-E x16 4.0 slot) can now deliver up to 64GB/sec of bidirectional bandwidth (32GB/sec in each direction). That's 2X the bandwidth compared to first generation EPYC and the x86 competition.
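As a rough sanity check of those figures: PCI-E Gen4 signals at 16 GT/s per lane with 128b/130b encoding, so a x16 link works out to roughly 31.5 GB/s per direction (commonly rounded to 32GB/sec, or 64GB/sec bidirectional). A minimal sketch of the arithmetic:

```python
# Approximate theoretical PCI-Express bandwidth per direction:
#   lanes x transfer rate (GT/s) x encoding efficiency / 8 bits-per-byte
def pcie_bw_gbs(lanes, gts, encoding=128 / 130):
    return lanes * gts * encoding / 8

gen3_x16 = pcie_bw_gbs(16, 8)   # ~15.8 GB/s per direction
gen4_x16 = pcie_bw_gbs(16, 16)  # ~31.5 GB/s per direction

print(f"Gen3 x16: ~{gen3_x16:.1f} GB/s uni, ~{2 * gen3_x16:.0f} GB/s bidirectional")
print(f"Gen4 x16: ~{gen4_x16:.1f} GB/s uni, ~{2 * gen4_x16:.0f} GB/s bidirectional")
```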

Broadening Support for High Bandwidth I/O Devices

Mellanox ConnectX-6 Adapter
The new support allows for higher-bandwidth connections to InfiniBand and other fabric adapters, storage adapters, NVMe SSDs, and, in the future, GPU accelerators and FPGAs.

Some of these devices, like Mellanox ConnectX-6 200Gb HDR InfiniBand adapters, were unable to realize their maximum bandwidth in a PCI-E Gen3 x16 slot. Their performance should improve in PCI-E Gen4 x16 slot with 2nd Generation AMD EPYC Processors.

2nd Generation AMD EPYC “Rome” is the only x86 server CPU with PCI-E Gen4 support at its launch in 3Q 2019. However, we have seen PCI-E Gen4 support before in the POWER9 platform.

System Support for PCI-E Gen4

Unlike in the previous generation AMD EPYC "Naples" CPUs, there is no strong affinity of PCI-E lanes to a particular chiplet inside the processor. In Rome, all I/O traffic routes through the I/O die, and all chiplets reach PCI-E devices through this die.

In order to support PCI-E Gen4, server and motherboard manufacturers are producing brand new versions of their platforms. Not every Rome-ready platform supports Gen4, so if this is a requirement be sure to specify this to your hardware vendor. Our team can help you select a server with full Gen4 capability.

Infinity Fabric

AMD Infinity Fabric Diagram

Deeply interrelated with PCI-Express Gen4, AMD has also improved the Infinity Fabric Link between chiplets and sockets with the new generation of EPYC CPUs.

AMD’s Infinity Fabric has many commonalities with PCI-Express used to connect I/O devices. With 2nd Generation AMD EPYC “Rome” CPUs, the link speed of Infinity Fabric has doubled. This allows for higher bandwidth communication between dies on the same socket and to dies on remote sockets.

The result should be improved application performance for NUMA-aware and especially non-NUMA-aware applications. The increased bandwidth should also help hide any transport bandwidth issues to I/O devices on a remote socket. The overall result is "smoother" performance when applications scale across multiple chiplets and sockets.

SKUs and Strategies to Consider for HPC Clusters

Here is the complete list of SKUs and 1KU (1000 unit) prices (Source: AMD). Please note that these are prices for CPUs sold to channel integrators, not for fully integrated systems built with these CPUs.

Dual Socket SKUs

SKU  | Cores | Base Clock | Boost Clock | L3 Cache | TDP  | Price
7742 | 64    | 2.25GHz    | 3.4GHz      | 256MB    | 225W | $6950
7702 | 64    | 2.0GHz     | 3.35GHz     | 256MB    | 200W | $6450
7642 | 48    | 2.3GHz     | 3.3GHz      | 256MB    | 225W | $4775
7552 | 48    | 2.2GHz     | 3.3GHz      | 192MB    | 200W | $4025
7542 | 32    | 2.9GHz     | 3.4GHz      | 128MB    | 225W | $3400
7502 | 32    | 2.5GHz     | 3.35GHz     | 128MB    | 180W | $2600
7452 | 32    | 2.35GHz    | 3.35GHz     | 128MB    | 155W | $2025
7402 | 24    | 2.8GHz     | 3.35GHz     | 128MB    | 180W | $1783
7352 | 24    | 2.3GHz     | 3.2GHz      | 128MB    | 155W | $1350
7302 | 16    | 3.0GHz     | 3.3GHz      | 128MB    | 155W | $978
7282 | 16    | 2.8GHz     | 3.2GHz      | 64MB     | 120W | $650
7272 | 12    | 2.9GHz     | 3.2GHz      | 64MB     | 120W | $625
7262 | 8     | 3.2GHz     | 3.4GHz      | 128MB    | 155W | $575
7252 | 8     | 3.2GHz     | 3.4GHz      | 64MB     | 120W | $475

EPYC 7742 or 7702 (64c): Select a High-End SKU, yield up to 2X the performance

Assuming your application scales with core count and maximum performance at a premium cost fits your budget, you can't beat the top 64-core EPYC 7742 or 7702 SKUs. These will deliver greater throughput on a wide variety of multi-threaded applications.

Anything above EPYC 7452 (32c, 48c): Select a Mid-High Level SKU, reach new performance heights

While these SKUs aren't inexpensive, they take application performance to new heights and break new benchmark ground. If your application is multi-threaded, you can take advantage of that performance. From a price/performance perspective, these SKUs may also be attractive.

EPYC 7452 (32c): Select a Mid Level SKU, improve price performance vs previous generation EPYC

Previous generation AMD EPYC 7xx1 Series CPUs also featured 32 cores. However, the 32 core entrant in the new 7xx2 stack is far less costly than the prior generation while delivering greater memory bandwidth and 2X the FLOPS per core.

EPYC 7452 (32c): Select a Mid Level SKU, match top Xeon Gold and Platinum with far better price/performance

If you’re optimizing for price/performance compared to the top Intel Xeon Platinum 8200 or Xeon Gold 6200 series SKUs, consider this SKU or ones near it. We predict this to be at or near the price/performance sweet-spot for the new platform.

EPYC 7402 (24c): Select a Mid Level SKU, come close to top Xeon Gold and Platinum SKUs

The higher clock speed of this SKU also makes it well suited to applications that value per-core performance in addition to core count.

EPYC 7272-7402 (12, 16, 24c): Select an affordable SKU, yield better performance and price/performance

Treat these SKUs as much more affordable alternatives to most Xeon Gold or Silver CPUs. We'll await further benchmarks to see exactly where the sweet-spots are compared to these SKUs. They also compare favorably from a price/performance standpoint to 1st Generation EPYC 7xx1 processors with 12, 16, or 24 cores. Same performance, fewer dollars!

Single Socket Performance

As with the previous generation, AMD is heavily promoting the concept of replacing dual-socket Intel Xeon servers with single sockets of 2nd Generation AMD EPYC "Rome." They are producing "P" SKUs, which support single-socket platforms only, at reduced prices to further boost the price-performance advantage of these systems.

Single Socket SKUs

SKU   | Cores | Base Clock | Boost Clock | L3 Cache | TDP  | Price
7702P | 64    | 2.0GHz     | 3.35GHz     | 256MB    | 200W | $4425
7502P | 32    | 2.5GHz     | 3.35GHz     | 128MB    | 180W | $2300
7402P | 24    | 2.8GHz     | 3.35GHz     | 128MB    | 180W | $1250
7302P | 16    | 3.0GHz     | 3.3GHz      | 128MB    | 155W | $825
7232P | 8     | 3.1GHz     | 3.2GHz      | 32MB     | 120W | $450

Due to the boosted capability of the new CPUs, a single-socket configuration may be an increasingly viable alternative to a dual-socket Xeon platform for many workloads.

Next Steps: get started today!

Read More

If you’d like to read more speeds and feeds about these new processors, check out our article with detailed specifications of the 2nd Gen AMD EPYC “Rome” CPUs. We summarize and compare the specifications of each model, and provide guidance over and beyond what you’ve seen here.

Try 2nd Gen AMD EPYC CPUs for Yourself

Groups which prefer to verify performance before making a design are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with AMD EPYC CPUs, large-memory, and more.

Browse Our Navion AMD EPYC Product Line

  • WhisperStation: Ultra-Quiet AMD EPYC workstations
  • Servers: High performance AMD EPYC rackmount servers
  • Clusters: Leadership performance clusters from 5-500 nodes

Improvements in scaling of Bowtie2 alignment software and implications for RNA-Seq pipelines https://www.microway.com/hpc-tech-tips/improvements-in-scaling-of-bowtie2-alignment-software-and-implications-for-rna-seq-pipelines/ https://www.microway.com/hpc-tech-tips/improvements-in-scaling-of-bowtie2-alignment-software-and-implications-for-rna-seq-pipelines/#respond Fri, 28 Jun 2019 17:48:05 +0000 https://www.microway.com/?p=11665  This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries What is Bowtie2? Bowtie2 is a commonly used, open-source, fast, and memory efficient application used as part of a Next Generation Sequencing (NGS) workflow.It aligns the sequencing reads, which are the genomic data output […]

 
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

What is Bowtie2?

Bowtie2 is a commonly used, open-source, fast, and memory efficient application used as part of a Next Generation Sequencing (NGS) workflow. It aligns the sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq Sequencer, to a reference genome. Applications like Bowtie2 are used as the first step in pipelines such as those for variant determination, and an area of continuously growing research interest, RNA-Seq.

What is RNA-Seq?

RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.

There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.

Plot of the number of RNA-Seq research publications accepted each year
Figure 1. RNA-Seq research publications per year, as of April 2019. Note the continuous growth. At the current rate, there will be 60% more publications in 2019 as compared to 2018. Source: NCBI PubMed

RNA-Seq is being applied to many research areas and diseases, and a few notable examples of using the technology include:

  • Oral Cancer: Researchers used an RNA-Seq approach to identify differences in gene expression between oral cancer and normal tissue samples.
  • Alzheimer's Disease: Researchers compared the gene expression of different brain lobes of deceased Alzheimer's Disease patients with the brains of healthy individuals. They were able to identify genomic differences between the diseased and unaffected individuals.
  • Diabetes: Researchers identified novel gene expression information from pancreatic beta-cells, which are cells critical for glycemic control.

Compute Infrastructure for aligning with Bowtie2

Designing a compute resource to meet the sequence analysis needs of Bioinformatics researchers can be a daunting task for IT staff. Limited information is available about multithreading and performance increases in the diverse portfolio of software related to NGS analysis. To further complicate things, processors are now available in a variety of models, with a large range of core counts and clock speeds, from both AMD and Intel. See, for example, the latest Intel Xeon "Cascade Lake" CPUs: Intel Xeon Scalable "Cascade Lake SP" Processor Review

Though many sequence analysis tools have multithreading options, the ability to scale is often limited, and rarely linear. In some cases, performance can decrease as more threads are added. Multithreading an application does not guarantee a performance improvement.

Threads | Run Time (seconds)
8       | 620
16      | 340
32      | 260
48      | 385
64      | 530

Table 1. Research data showing how a previous version of Bowtie2 scales with thread count. Performance decreases above 32 threads.

Plot of Bowtie2 run time as the number of threads increases
Figure 2. Plot of thread scaling for a previous version of Bowtie2. Performance decreases after 32 threads due to a variety of factors. Non-linear scaling and performance decreases with core count have been shown in other scientific applications as well.
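One way to quantify the drop-off shown in Table 1 and Figure 2 is to compute speed-up and parallel efficiency relative to the 8-thread run. A minimal sketch using the run times from Table 1:

```python
# Run times (seconds) for the previous version of Bowtie2, from Table 1.
run_times = {8: 620, 16: 340, 32: 260, 48: 385, 64: 530}

base_threads = 8
base_time = run_times[base_threads]

for threads in sorted(run_times):
    speedup = base_time / run_times[threads]
    # Efficiency: achieved speed-up divided by the ideal speed-up vs. 8 threads.
    efficiency = speedup / (threads / base_threads)
    print(f"{threads:2d} threads: {speedup:.2f}x speed-up, {efficiency:.0%} efficiency")
# 32 threads is the sweet spot (~2.4x); 48 and 64 threads are slower in absolute terms.
```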

However, researchers have recently greatly improved the thread scaling of Bowtie2. Original versions of this tool did not scale linearly, and demonstrated reduced performance when using more than 32 threads. Aware of these problems, the developers of Bowtie2 have implemented superior multithreaded scaling in their applications. Depending on processor type, their results show:

  • Removal of performance decreases over 32 threads
  • An increase in read throughput of up to 44%
  • Reduced memory usage with thread scaling
  • Up to a 4 hour reduction in time to align 40x coverage human genome

This new version of the software is open-source and available for download.

Right Sizing your NGS Cluster

With the recent release of Intel's Cascade Lake-AP Xeons providing up to 112 threads per socket, as well as high-density AMD EPYC processors, it can be tempting to assume that more cores will result in more performance for NGS applications. However, this is not always the case, and some applications will show reduced performance at higher thread counts.

When selecting compute systems for NGS analysis, researchers and IT staff need to evaluate which software products will be used and how they scale with threads. Depending on the use cases, more nodes with fewer, faster threads could provide better performance than high-thread-density nodes. Unfortunately, there is no "one size fits all" solution, and applications are in constant development, so research into the most recent versions of analysis software is always required.

References

[1] https://www.ncbi.nlm.nih.gov/pubmed/
[2] https://doi.org/10.1371/journal.pone.0016266
[3] https://doi.org/10.1101/205328
[4] https://link.springer.com/article/10.1007/s10586-017-1015-0


If you are interested in testing your NGS workloads on the latest Intel and AMD HPC systems, please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

CryoEM takes center stage: how compute, storage, and networking needs are growing with CryoEM research https://www.microway.com/hpc-tech-tips/cryoem-takes-center-stage-how-compute-storage-networking-needs-growing/ https://www.microway.com/hpc-tech-tips/cryoem-takes-center-stage-how-compute-storage-networking-needs-growing/#respond Thu, 11 Apr 2019 14:05:48 +0000 https://www.microway.com/?p=11409  This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries Background and history Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice.Though CryoEM experiments have been performed […]

 
This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

Background and history

Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques, X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR has been that structures could be determined at very high resolution, several fold better than historical CryoEM results.

However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of CryoEM over X-ray crystallography and NMR are:

  • Much larger structures can be determined than by X-ray or NMR
  • Structures can be determined in a more native state than by using X-ray

The ability to generate these high-resolution structures of large molecules through CryoEM enables a better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful that its inventors won the 2017 Nobel Prize in Chemistry.

CryoEM structure and publication growth

While the number of molecular structures determined by CryoEM is much lower than the number determined by X-ray crystallography and NMR, the rate at which these structures are released has greatly increased in the past decade. In 2016, the number of CryoEM structures deposited in the Protein Data Bank (PDB) exceeded those of NMR for the first time.

The importance of CryoEM in research is growing, as shown by the steady increase in publications over the past decade (Table 1, Figure 1). Interestingly, though there are ~10 times as many X-ray crystallography publications per year, the number of X-ray-related publications per year has decreased consistently since 2013.

Experimental Structure Type | Approximate total number of structures as of March 2019
X-ray Crystallography       | 134,000
NMR                         | 12,500
Cryo-EM                     | 3,000

Table 1. Approximate number of structures available in the publicly accessible Protein Data Bank (PDB), by experimental technique. Source: RCSB

Chart showing the growth of CryoEM structures available in the Protein Data Bank
Figure 1. Total number of structures available in the Protein Data Bank (PDB) by year.
Note the rapid and consistent growth. Source: RCSB
Plot of the number of CryoEM publications accepted each year
Figure 2. Number of CryoEM publications accepted each year.
Note the rapid increase in publications. Source: NCBI PubMed
Plot showing the declining number of X-ray crystallography publications per year
Figure 3. Number of X-ray crystallography publications per year. Note the steady decline in publications. While publications related to X-ray crystallography may be decreasing, opportunities exist for integrating both CryoEM and X-ray crystallography data to further our understanding of molecular structure. Source: NCBI PubMed

CryoEM is part of our research – do I need to add GPUs to my infrastructure?

A major challenge facing researchers and IT staff is how to appropriately build out infrastructure for CryoEM demands. There are several software products used for CryoEM analysis, with RELION being one of the most widely used open-source packages. While GPUs can greatly accelerate RELION workflows, support for them has only existed since Version 2 (released in 2016). Worldwide, the vast majority of individual servers and centralized resources available to researchers are not GPU accelerated. Those systems that do have professional-grade GPUs are often oversubscribed and can have considerable queue wait times. The relatively high cost of server-grade GPU systems can put those devices out of the reach of many individual research labs.

While advanced GPU hardware like the DGX-1 continues to give the best analysis times, not every GPU system provides the same throughput. Large datasets can create issues with consumer-grade GPUs, in that the dataset must fit within the GPU memory to fully take advantage of the acceleration. Though RELION can parallelize the datasets, GPU memory is still limited when compared to the large amounts of system memory available to the CPUs in a single device (DGX-1 provides 256GB GPU memory; DGX-2 provides 512GB). This problem is amplified if the researcher has access to only a single consumer-grade graphics card (e.g., an NVIDIA GeForce GTX 1080 Ti GPU with 11GB memory).

With the Version 3 release of the software (late 2018), the RELION authors have implemented CPU acceleration to broaden the usable hardware for efficient CryoEM reconstruction. The authors have shown a 1.5x improvement on Broadwell processors and a 2.5x improvement on Skylake over the previous code. However, taking advantage of AVX instructions during compilation can further improve performance, with the authors demonstrating a 5.4x improvement on Skylake processors. This improvement approaches the performance increases of professional-grade GPUs without the additional cost.

Additional infrastructure considerations

CryoEM datasets are being generated at a higher rate and with larger data sizes than ever before. Currently, the largest raw dataset in the Electron Microscopy Public Image Archive (EMPIAR) is 12.4TB, with a median dataset size of approximately 2TB. Researchers and IT staff can expect datasets of this order of magnitude to become the norm as CryoEM continues to grow as an experimental resource in the life sciences space.

Many CryoEM labs function as microscopy cores, where they provide the service of generating the 2D datasets for different researchers, which are then analyzed by individual labs. Given the high cost of professional GPUs as compared to the ubiquitous availability of multicore CPU systems, researchers may consider modern multicore servers or centralized clusters to meet their CryoEM analysis needs. This is with the caveat that they use Version 3 of the RELION software with appropriate compilation flags.

Dataset transfer is also a concern, and organizations that have a centralized Cryo-EM core would greatly benefit from upgraded networking (10Gbps+) from the core location to centralized compute resources, or to individual labs.
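To illustrate why 10Gbps+ networking matters at these dataset sizes, here is a rough estimate of raw transfer times at different link speeds (theoretical line rate only, ignoring protocol overhead and storage throughput limits):

```python
# Approximate transfer time for a dataset at a given network link speed.
def transfer_hours(dataset_tb, link_gbps):
    bits = dataset_tb * 1e12 * 8            # dataset size in bits
    return bits / (link_gbps * 1e9) / 3600  # seconds -> hours

for size_tb, label in [(2.0, "median EMPIAR dataset (~2TB)"),
                       (12.4, "largest EMPIAR dataset (12.4TB)")]:
    for gbps in (1, 10, 100):
        print(f"{label} over {gbps:>3}Gbps: ~{transfer_hours(size_tb, gbps):.1f} hours")
```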

Visualization of the structure of beta-galactosidase from the EMPIAR database
Figure 4. A 2.2 angstrom resolution CryoEM structure of beta-galactosidase. This is currently the largest dataset in the EMPIAR database, totaling 12.4 TB. Source: EMPIAR

CryoEM takes center stage

The increase in capabilities, interest, and research related to CryoEM shows it is now a mainstream experimental technique. IT staff and scientists alike are rapidly becoming aware of this fact as they face the data analysis, transfer, and storage challenges associated with this technique. Careful consideration must be given to the infrastructure of an organization that is engaging in CryoEM research.

In an organization that is performing exclusively CryoEM experiments, a GPU cluster would be the most cost-effective solution for rapid analysis. Researchers with access to advanced professional-grade GPU systems, such as a DGX-1, will see analysis times that are even faster than modern CPU-optimized RELION. While these professional GPUs can greatly accelerate CryoEM analysis, it is unlikely in the short term that all researchers wanting to use CryoEM data will have access to such high-spec GPU hardware, as compared to the mixed-use commodity clusters that are ubiquitous at life science organizations. A large multicore CPU machine, when properly configured, can give better performance than a low-core workstation or server with a single consumer-grade GPU (e.g., an NVIDIA GeForce GPU).

IT departments and researchers must work together to define expected turnaround time, analysis workflow requirements, budget, and configuration of existing hardware. In doing so, researcher needs will be met and IT can implement the most effective architecture for CryoEM.

References

[1] https://doi.org/10.7554/eLife.42166.001
[2] https://febs.onlinelibrary.wiley.com/doi/10.1111/febs.12796
[3] https://www.ncbi.nlm.nih.gov/pubmed/
[4] https://www.ebi.ac.uk/pdbe/emdb/empiar/
[5] https://www.rcsb.org/


If you are interested in trying out RELION performance on some of the latest CPU and GPU-accelerated systems (including NVIDIA DGX-1), please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

Intel Xeon Scalable “Cascade Lake SP” Processor Review https://www.microway.com/hpc-tech-tips/intel-xeon-scalable-cascade-lake-sp-processor-review/ https://www.microway.com/hpc-tech-tips/intel-xeon-scalable-cascade-lake-sp-processor-review/#comments Tue, 02 Apr 2019 17:00:45 +0000 https://www.microway.com/?p=11305 With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. […]

With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. Starting today, Microway is shipping these new CPUs across our entire line of turn-key Xeon workstations, systems, and clusters.

Important changes in Intel Xeon Scalable “Cascade Lake SP” Processors include:

  • Higher CPU core counts for many SKUs in the product stack
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Introduction of the new AVX-512 VNNI instruction for Intel Deep Learning Boost (VNNI),
    which provides significantly more efficient deep learning inference acceleration
  • Higher memory capacity & performance:
    • Most CPU models provide increased memory speeds
    • Support for DDR4 memory speeds up to 2933MHz
    • Large-memory capabilities with Intel Optane DC Persistent Memory
    • Support for up to 4.5TB-per-socket system memory
  • Integrated hardware-based security mitigations against side-channel attacks

More for Your Dollar: performance uplift

With an increase in core counts, clock speeds, and memory speeds, applications will achieve better performance across the board. Particularly in the lower-end Xeon 4200- and 5200-series CPUs, the cost-effectiveness of the processors has increased considerably. The plot below compares the price of each processor against its performance. Both the current "Cascade Lake SP" and previous-generation "Skylake-SP" are shown:

Comparison chart of Intel Xeon Cascade Lake SP cost-effectiveness vs Skylake-SP for applications with AVX-512 instructions
In the diagram above, the wide colored bars indicate the price performance of these new Xeon CPUs. The dots indicate the price performance of the previous generation, which allows us to compare the two generations SKU by SKU (though a few of the newer models do not have previous-generation counterparts). In this comparison, lower values are better and indicate a higher quantity of computation per dollar spent.
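For readers who want to build a similar chart from their own quotes, the underlying metric is simply dollars divided by a performance figure (for AVX-512 workloads, peak FLOPS is a common proxy). The sketch below uses made-up prices and specifications purely to show the arithmetic; none of the numbers correspond to an actual Xeon SKU, and the 32 FLOP/cycle figure assumes two AVX-512 FMA units per core.

```python
# Dollars-per-GFLOPS sketch. All numbers below are hypothetical placeholders,
# not real Intel Xeon prices or specifications.
def peak_fp64_gflops(cores, clock_ghz, flop_per_cycle=32):
    # flop_per_cycle=32 assumes two AVX-512 FMA units (8 doubles x 2 ops x 2 units).
    return cores * clock_ghz * flop_per_cycle

def dollars_per_gflops(price_usd, cores, clock_ghz):
    return price_usd / peak_fp64_gflops(cores, clock_ghz)

# A hypothetical "previous-generation" vs. "new" SKU at the same price point:
old_sku = dollars_per_gflops(price_usd=3000, cores=12, clock_ghz=2.3)
new_sku = dollars_per_gflops(price_usd=3000, cores=16, clock_ghz=2.5)
print(f"old SKU: ${old_sku:.2f}/GFLOPS   new SKU: ${new_sku:.2f}/GFLOPS (lower is better)")
```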

Same SKU – More Performance

As shown above, many models offer more performance than their previous-generation counterpart. Here we highlight models which are showing particularly substantial improvements:

  • Xeon 4210 is 34% more price-performant than Xeon 4110
  • Xeon 4214 is 30% more price-performant than Xeon 4114
  • Xeon 4216 is 25% more price-performant than Xeon 4116
  • Xeon 5218 is 40% more price-performant than Xeon 5118
  • Xeon 5220 is 34% more price-performant than Xeon 5120
  • Xeon 6242 saw an 8% increase in clock speed and ~10% reduction in price
  • Xeon 8270 is 28% more price-performant than Xeon 8170

To summarize: this latest generation will provide more performance for the same cost if you stick with the model numbers you’ve been using. In the next section, we’ll review opportunities for cost reduction.

More for Less: Select a more modest Cascade Lake SKU for the same core count or performance

With generational improvements, it’s not unusual for a new CPU to replace a higher-end version of the older generation. There are many cases where this is true in the Cascade Lake Xeon CPUs, so be sure to consider if you can leverage such savings.

Guaranteed savings

  • Xeon 4208 replaces the Xeon 4110: providing the same 8 cores for a lower price
  • Xeon 4210 replaces the Xeon 4114: providing the same 10 cores for a lower price
  • Xeon 4214 surpasses the Xeon 4116: providing the same 12 cores at higher clock speeds
  • Xeon 5218 surpasses the Xeon 5120: providing more cores, higher clock speeds, and faster memory speeds

Worthy of consideration

  • Xeon 4216 may replace most of the 5100-series: Xeon 5115, 5118 and 5120
    Nearly all specifications are equivalent, but the UPI speed of the Xeon 4216 is 9.6GT/s rather than 10.4GT/s
  • Xeon 6230 likely replaces the Xeon 6130, 6138, 6140: providing the same or more cores for a lower price
  • Xeon 6240 competes with every Xeon 6100-series model
    with the exception that it does not provide 3+GHz processor frequencies

Greater Memory Bandwidth

For computationally-intensive applications, rapid access to data is critical. Thus, memory speed increases are valuable improvements. This generation of CPUs brings a 10% improvement to the Xeon 5200-series (2666MHz; up from 2400MHz) and the Xeon 6200-/8200-series (2933MHz; up from 2666MHz). This means that the Xeon 5200-series CPUs are more competitive (they’re running memory at the same speed as last generation’s Xeon 6100- and 8100-series processors). And the higher-end Xeon 6200-/8200-series CPUs have a 10% memory performance advantage over all others.

While a 10% improvement may seem to be only a modest improvement, keep in mind that it’s essentially a free upgrade. Combined with the other features and improvements discussed above, you can be confident you’re making the right choice by upgrading to these newest Intel Xeon Scalable CPUs.

Enabling Very Large Memory Capacity

With the official launch of Intel Optane DC Persistent Memory, it is now possible to deploy systems with multiple terabytes of system memory. Well-equipped systems provide each Xeon CPU with six Optane memory modules (alongside six standard memory modules). This results in up to 3TB of Optane memory and 1.5TB of standard DRAM per CPU! Look for more information on these possibilities as HPC sites begin adopting and exploring this new technology.

Transitioning from the “Skylake-SP” Intel Xeon Scalable CPUs

Because the new “Cascade Lake SP” CPUs are socket-compatible with the previous-generation “Skylake SP” CPUs, the upgrade path is simple. All existing platforms that support the earlier CPUs can also accept these new CPUs. This also simplifies the choice for those considering a new system: the new CPUs use existing, proven platforms. There’s little risk in selecting the latest and highest-performance components. HPC sites adding to existing clusters will find they have a choice: spend the same for increased performance or spend less for the same performance. Below are peak performance comparisons of the previous generation CPUs with the new generation:

The wider/colored bars indicate peak performance for the new Xeon CPUs. The slim grey bars indicate peak performance for the previous-generation Xeon CPUs. Without exception, the new CPUs are expected to outperform their predecessors. The widest margins of improvement are in the lower-end Xeon 4200- and 5200-series.

Standout performance in a single socket

This generation introduces three CPU models designed for single-socket systems (providing very high throughput at relatively low-cost). They provide 20+ CPU cores at prices as much as $2,000 less than their multi-socket counterparts. If your workload performs well with a single CPU, these SKUs will be incredibly valuable:

  • Xeon 6209U outperforms nearly all of last generation’s Xeon Gold 6100-series CPUs
  • Xeon 6210U outperforms all Xeon 6100-series and many 6200-series CPUs
  • Xeon 6212U outperforms several of the Xeon 8100-series CPUs

The only exception to the above would be for applications which require very high clock speeds, as these single-socket CPU models do not provide base processor frequencies higher than 2.5GHz. The strength of these single-socket processors is in high throughput (via high core count) and decent clock speeds.

Next Steps: get started today!

Read More

If you’d like to read more about these new processors, check out our article with detailed specifications of the Intel Xeon “Cascade Lake SP” CPUs. We summarize and compare the specifications of each model, and provide guidance on which models are likely to be best suited to computationally-intensive HPC & Deep Learning applications.

Try Intel Xeon Scalable CPUs for Yourself

Groups which prefer to verify performance before making a design are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with Intel Xeon Scalable CPUs, large-memory, and more.

Speak with an Expert

If you’re expecting to be upgrading or deploying new systems in the coming months, our experts would be happy to help you consider your options and design a custom cluster optimized to your workloads. We also help groups writing budget proposals to ensure they’re requesting the correct resources. Please get in touch!

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/ https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/#respond Fri, 15 Mar 2019 17:06:57 +0000 https://www.microway.com/?p=11118 Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no […]

Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

GPU                             | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)              | 253.38   | 7072.86    | 4736.76
Fast Fourier Transform (GFLOPS) | 132.60   | 1148.75    | 756.29
Matrix Multiplication (GFLOPS)  | 249.57   | 5920.01    | 4256.08
Molecular Dynamics (GFLOPS)     | 105.26   | 908.62     | 402.96
S3D (GFLOPS)                    | 59.97    | 227.85     | 161.54

 

Single Precision Results

GPU                             | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)              | 8073.26  | 14016.50   | 9322.46
Fast Fourier Transform (GFLOPS) | 660.05   | 2301.32    | 1510.49
Matrix Multiplication (GFLOPS)  | 3290.94  | 13480.40   | 8793.33
Molecular Dynamics (GFLOPS)     | 572.91   | 997.61     | 480.02
S3D (GFLOPS)                    | 99.42    | 434.78     | 295.20

 

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.
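To make that price and power context concrete, here is a short sketch that normalizes the single-precision matrix multiplication results above by the approximate prices (~$2.5k vs ~$10k) and board power (70W vs 250-300W) quoted earlier in this article. These are the article's rough figures, not exact list prices or measured power draw.

```python
# Single-precision matrix multiplication results (GFLOPS) from the table above,
# normalized by the approximate price and board power quoted in this article.
gpus = {
    #              GFLOPS,   approx price ($), board power (W)
    "Tesla T4":   (3290.94,   2500,  70),
    "Tesla P100": (8793.33,  10000, 250),
    "Tesla V100": (13480.40, 10000, 300),
}

for name, (gflops, price, watts) in gpus.items():
    print(f"{name}: {gflops / price:5.2f} GFLOPS per dollar, "
          f"{gflops / watts:5.1f} GFLOPS per watt")
```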

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 offers substantial single- and mixed-precision performance for machine learning, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. The single-precision performance available will strongly benefit machine learning algorithms, particularly those that can be adapted to mixed precision. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithm adaptation, the T4 may be a strong contender for scientific applications that also want to utilize machine learning capabilities to analyze results, or that run a variety of algorithms from both machine learning and scientific computing on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

Next Steps for leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload but you'd like to verify, please sign up for a Test Drive to benchmark it for yourself. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on the best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

NVIDIA Tesla V100 Price Analysis https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/#respond Wed, 09 May 2018 00:52:23 +0000 https://www.microway.com/?p=10150 Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?” Tesla V100 32GB GPUs are shipping in volume, and our full line of […]

Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

Tesla GPU model               | Price                        | Double-Precision Performance (FP64) | Dollars per TFLOPS       | Deep Learning Performance (TensorFLOPS or 1/2 Precision) | Dollars per DL TFLOPS
Tesla V100 PCI-E 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7 TFLOPS                            | $1,523 ($1,637 for 32GB) | 112 TFLOPS                                               | $95.21 ($102.30 for 32GB)
Tesla P100 PCI-E 16GB         | $7,374*                      | 4.7 TFLOPS                          | $1,569                   | 18.7 TFLOPS                                              | $394.33
Tesla V100 SXM 16GB or 32GB   | $10,664* ($11,458* for 32GB) | 7.8 TFLOPS                          | $1,367 ($1,469 for 32GB) | 125 TFLOPS                                               | $85.31 ($91.66 for 32GB)
Tesla P100 SXM2 16GB          | $9,428*                      | 5.3 TFLOPS                          | $1,779                   | 21.2 TFLOPS                                              | $444.72

* single-unit list price before any applicable discounts (ex: EDU, volume)
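The dollars-per-TFLOPS columns are simply the list price divided by the corresponding peak TFLOPS figure. A quick check against a few of the table entries:

```python
# Dollars per TFLOPS = list price / peak TFLOPS, reproducing a few table entries.
def dollars_per_tflops(price_usd, tflops):
    return price_usd / tflops

print(f"Tesla V100 PCI-E 16GB, FP64: ${dollars_per_tflops(10664, 7):,.0f}/TFLOPS")    # ~$1,523
print(f"Tesla V100 PCI-E 16GB, DL:   ${dollars_per_tflops(10664, 112):,.2f}/TFLOPS")  # ~$95.21
print(f"Tesla P100 SXM2 16GB, FP64:  ${dollars_per_tflops(9428, 5.3):,.0f}/TFLOPS")   # ~$1,779
```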

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration

Tesla V100 Double Precision HPC: Pay More for the GPU, Get More Performance

VMD visualization of a nucleosome

You’ll notice that Tesla V100 delivers an almost 50% increase in double precision performance. This is crucial for many HPC codes. A variety of applications have been shown to mirror this performance boost. In addition, Tesla V100 now offers the option of 2X the memory of Tesla P100 16GB for memory bound workloads.

Tesla V100 is a compelling choice for HPC workloads: it will almost always deliver the greatest absolute performance. However, in the right situation a Tesla P100 can still deliver reasonable price/performance as well.

Both Tesla P100 and V100 GPUs should be considered for GPU-accelerated HPC clusters and servers. A Microway expert can help you evaluate what's best for your needs and applications, and/or provide you with remote benchmarking resources.

Tesla V100 for Deep Learning: Enormous Advancement & Value- The New Standard


If your goal is maximum Deep Learning performance, Tesla V100 represents an enormous on-paper leap. The dedicated TensorCores have huge performance potential for deep learning applications. NVIDIA has even coined a new term, the "TensorFLOP," to measure this gain. Tesla V100 delivers a 6X on-paper advancement.

If your budget allows you to purchase at least 1 Tesla V100, it’s the right GPU to invest in for deep learning performance. For the first time, the beefy Tesla V100 GPU is compelling for not just AI Training, but AI Inference as well (unlike Tesla P100).

Moreover, only a selection of Deep Learning frameworks are fully taking advantage of the TensorCore today. As more and more DL Frameworks are optimized to use these new TensorCores and their instructions, the gains will grow. Even before many major optimizations, many workloads have advanced 3X-4X.

Finally, there is no longer an SXM cost premium for Tesla V100 GPUs (and only a modest premium for SXM-enabled host servers). Nearly all DL applications benefit greatly from the GPU-to-GPU NVLink interface; today a selection of HPC applications (ex: AMBER) do as well.

If you’re running DL frameworks, select Tesla V100 and if possible the SXM-enabled GPUs and servers.

FLOPS vs Real Application Performance

Unless you know for certain that your workload's performance correlates with raw FLOPS, we strongly discourage making purchasing decisions strictly based upon $/FLOP calculations.

While the generalizations above are useful, application performance can differ dramatically from any simplistic FLOPS calculation. Device-to-device bandwidth, host-to-device bandwidth, GPU memory bandwidth, and code maturity are all levers on realized application performance that matter as much as FLOPS.

Here is some of NVIDIA's own application performance testing across real applications:


You’ll see that some codes scale similarly to the on-paper FLOPS gains, and others are frankly far more removed.

At most, use such simplistic FLOPS and price/performance calculations to guide higher-level decision-making: to predict how new hardware compares against prior testing of FLOPS vs. actual performance, to steer which GPUs to consider, to decide what to purchase for POCs, or to identify appropriate GPUs to test remotely in order to validate actual application performance.

No one should buy based upon price/performance per FLOP; most should buy based upon price/performance per workload (or basket of workloads).

When Paper Performance + Intuition Collide with Reality

While the above guidelines are helpful, there is still a wide diversity of workloads out in the field. Apart from testing that steers you to one GPU or another, here are some good reasons we've seen customers use, or advised them to use, to make other selections:

Tesla V100 SXM 2.0 GPU
  • Your application has shown diminishing returns to advances in GPU performance in the past (Tesla P100 might be a price/performance choice)
  • Your budget doesn’t allow for even a single Tesla V100 (pick Tesla P100, still great speedups)
  • Your budget allows for a server with 2 Tesla P100s, but not 2 Tesla V100s (Pick 2 Tesla P100s vs 1 Tesla V100)
  • Your application is GPU memory capacity-bound (pick Tesla V100 32GB)
  • There are workload sharing considerations (ex: preferred scheduler only allocates whole GPUs)
  • Your application isn’t multi-GPU enabled (pick Tesla V100, the most powerful single GPU)
  • Your application is GPU memory bandwidth limited (test it, but potential case for Tesla P100)

Further Resources

You may wish to reference our Comparison of Tesla “Volta” GPUs, which summarizes the technical improvements made in these new GPUs or Tesla V100 GPU Review for more extended discussion.

If you’re looking to see how these GPUs will be deployed in production, read our NVIDIA GPU Clusters page. As always, please feel free to reach out to us if you’d like to get a better understanding of these latest HPC systems and what they can do for you.
