hpc-tech-tips Archives - Microway
https://www.microway.com/tag/hpc-tech-tips/

DGX A100 review: Throughput and Hardware Summary
https://www.microway.com/hardware/dgx-a100-review-throughput-and-hardware-summary/
Fri, 26 Jun 2020

When NVIDIA launched the Ampere GPU architecture, they also launched their new flagship system for HPC and deep learning – the DGX A100. This system offers not only exceptional performance, but also new capabilities. We’ve seen immediate interest and have already shipped to some of the first adopters. Given our early access, we wanted to share a deeper dive into this impressive new system.

Photo of NVIDIA DGX A100 packaged, being lifted out of packaging, and being tested

The focus of this NVIDIA DGX™ A100 review is on the hardware inside the system – the server offers a number of features and improvements not available in any other type of server at the moment. DGX will be the “go-to” server for 2020. But hardware only tells part of the story, particularly for NVIDIA’s DGX products. NVIDIA employs more software engineers than hardware engineers, so you can be certain that application and GPU library performance will continue to improve through updates to the DGX Operating System and to the whole catalog of software containers provided through the NGC hub. Expect more details as the year continues.

Overall DGX A100 System Architecture

This new DGX system offers top-bin parts across-the-board. Here’s the high-level overview:

  • Dual 64-core AMD EPYC 7742 CPUs
  • 1TB DDR4 system memory (upgradeable to 2TB)
  • Eight NVIDIA A100 SXM4 GPUs with NVLink
  • NVIDIA NVSwitch connectivity between all GPUs
  • 15TB high-speed NVMe SSD Scratch Space (upgradeable to 30TB)
  • Eight Mellanox 200Gbps HDR InfiniBand/Ethernet Single-Port Adapters
  • One or Two Mellanox 200Gbps Ethernet Dual-Port Adapter(s)

As you’ll see from the block diagram, there is a lot to break down within such a complex system. Though it’s a very busy diagram, it becomes apparent that the design is balanced and well laid out. Breaking down the connectivity within DGX A100 we see:

  • The eight NVIDIA A100 GPUs are depicted at the bottom of the diagram, with each GPU fully linked to all other GPUs via six NVSwitches
  • Above the GPUs are four PCI-Express switches which act as nexuses between the GPUs and the rest of the system devices
  • Linking into the PCI-E switch nexuses, there are eight 200Gbps network adapters and eight high-speed SSD devices – one for each GPU
  • The devices are broken into pairs, with 2 GPUs, 2 network adapters, and 2 SSDs per PCI-E nexus
  • Each of the AMD EPYC CPUs connects to two of the PCI-E switch nexuses
  • At the top of the diagram, each EPYC CPU is shown with a link to system memory and a link to a 200Gbps network adapter

We’ll dig into each aspect of the system in turn, starting with the CPUs and making our way down to the new NVIDIA A100 GPUs. Readers should note that throughput and performance numbers are only useful when put into context. You are encouraged to run the same tests on your existing systems/servers to better understand how the performance of DGX A100 will compare to your existing resources. And as always, reach out to Microway’s DGX experts for additional discussion, review, and design of a holistic solution.

AMD EPYC CPUs and System Memory

Diagram depicting the CPU cores, cache, and memory in the NVIDIA DGX A100
DGX A100 CPU/Memory topology (Click to expand)

With two 64-core EPYC CPUs and 1TB or 2TB of system memory, the DGX A100 boasts respectable performance even before the GPUs are considered. The architecture of the AMD EPYC “Rome” CPUs is outside the scope of this article, but offers an elegant design of its own. Each CPU provides 64 processor cores (supporting up to 128 threads), 256MB L3 cache, and eight channels of DDR4-3200 memory (which provides the highest memory throughput of any mainstream x86 CPU).

Most users need not dive further, but experts will note that each EPYC 7742 CPU has four NUMA nodes (for a total of eight nodes per system). This allows for the best performance from parallelized applications and can also reduce the impact of noisy neighbors. Pairs of GPUs are connected to NUMA nodes 1, 3, 5, and 7. Here’s a snapshot of CPU capabilities from the lscpu utility:

Architecture:        x86_64
CPU(s):              256
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
CPU MHz:             3332.691
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
NUMA node0 CPU(s):   0-15,128-143
NUMA node1 CPU(s):   16-31,144-159
NUMA node2 CPU(s):   32-47,160-175
NUMA node3 CPU(s):   48-63,176-191
NUMA node4 CPU(s):   64-79,192-207
NUMA node5 CPU(s):   80-95,208-223
NUMA node6 CPU(s):   96-111,224-239
NUMA node7 CPU(s):   112-127,240-255
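
If you are scheduling work onto specific GPUs, it can pay to keep the host-side process on the CPU cores and memory local to that GPU. Below is a minimal Python sketch (our own helper, not part of the DGX software stack) that wraps numactl using the GPU-to-NUMA mapping described above; verify the mapping against nvidia-smi topo -m on your own system before relying on it.

import os
import subprocess

# GPU -> NUMA node mapping on DGX A100, as reported by `nvidia-smi topo -m`
# (GPU pairs attach to NUMA nodes 3, 1, 7, and 5 -- see the topology matrix
# later in this article). Verify the mapping on your own system before use.
GPU_TO_NUMA = {0: 3, 1: 3, 2: 1, 3: 1, 4: 7, 5: 7, 6: 5, 7: 5}

def run_on_local_numa(gpu_id, command):
    """Run `command` with CPU threads and memory bound to the NUMA node local to gpu_id."""
    node = GPU_TO_NUMA[gpu_id]
    numactl_cmd = [
        "numactl",
        f"--cpunodebind={node}",  # keep CPU threads on the GPU-local node
        f"--membind={node}",      # allocate host memory on the same node
    ] + command
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.run(numactl_cmd, env=env, check=True)

# Example: pin a hypothetical training script to GPU 4 and its local NUMA node (7)
# run_on_local_numa(4, ["python", "train.py"])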

High-speed NVMe Storage

Although DGX A100 is designed to support extremely high-speed connectivity to network/cluster storage, it also provides internal flash storage drives. Redundant 2TB NVMe SSDs are provided to host the Operating System. Four non-redundant striped NVMe SSDs provide a 14TB space for scratch storage (which is most frequently used to cache data coming from a centralized storage system).

Here’s how the filesystems look on a fresh DGX A100:

Filesystem      Size  Used Avail Use%    Mounted on
/dev/md0        1.8T   14G  1.7T   1%    /
/dev/md1         14T   25M   14T   1%    /raid

The industry is trending towards Linux software RAID rather than hardware controllers for NVMe SSDs (as such controllers present too many performance bottlenecks). Here’s what the above md0 and md1 arrays look like when healthy:

md0 : active raid1 nvme1n1p2[0] nvme2n1p2[1]
      1874716672 blocks super 1.2 [2/2] [UU]
      bitmap: 1/14 pages [4KB], 65536KB chunk

md1 : active raid0 nvme5n1[2] nvme3n1[1] nvme4n1[3] nvme0n1[0]
      15002423296 blocks super 1.2 512k chunks

It’s worth noting that although all the internal storage devices are high-performance, the scratch drives making up the /raid filesystem support the newer PCI-E generation 4.0 bus which doubles I/O throughput. NVIDIA leads the pack here, as they’re the first we’ve seen to be shipping these new super-fast SSDs.
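
The same array status can also be checked programmatically rather than by eyeballing /proc/mdstat. Here is a minimal sketch (our own monitoring script, with its assumptions noted in the comments) that flags any md array reporting a missing member:

import re

# Read /proc/mdstat and flag any array whose member status shows a failed disk.
# A healthy mirror on this system reports "[2/2] [UU]"; a degraded one would
# report something like "[2/1] [U_]".  RAID-0 arrays such as md1 carry no
# redundancy and therefore print no [UU] status line.
with open("/proc/mdstat") as f:
    lines = f.read().splitlines()

current = None
for line in lines:
    if re.match(r"^md\d+ :", line):
        current = line.split()[0]          # e.g. "md0"
    status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
    if current and status:
        total, active, flags = status.groups()
        state = "healthy" if total == active else f"DEGRADED ({flags})"
        print(f"{current}: {active}/{total} members active -> {state}")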

High-Throughput and Low-Latency Communications with Mellanox 200Gbps

Photo of the internal system sled of DGX A100 with CPUs, Memory, and HCAs
Sled from DGX A100 showing ten 200Gbps adapters

Depending upon the deployment, nine or ten Mellanox 200Gbps adapters are present in each DGX A100. These adapters support Mellanox VPI, which enables each port to be configured for 200G Ethernet or HDR InfiniBand. Though Ethernet is prevalent in certain sectors (healthcare and other industry verticals), InfiniBand tends to be the mode of choice when the highest performance is required.

In practice, a common configuration is for the GPU-adjacent adapters to be connected to an InfiniBand fabric (which allows for high-performance RDMA GPU-Direct and Magnum IO communications). The adapter(s) attached to the CPUs are then used for Ethernet connectivity (often matching the speed of the existing facility Ethernet, which might be any one of 10GbE, 25GbE, 40GbE, 50GbE, 100GbE, or 200GbE).

Leveraging the fast PCI-E 4.0 bus available in DGX A100, each 200Gbps port is able to push up to 24.6GB/s of throughput (with latencies ranging from 1.09 to 202 microseconds depending on message size, as measured by OSU’s osu_bw and osu_latency benchmarks). Thus, a properly tuned application running across a cluster of DGX systems could push upwards of 200 gigabytes per second to the fabric!
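
As a quick sanity check of that aggregate figure (our own back-of-the-envelope arithmetic, not an NVIDIA specification), the measured per-port throughput can be compared to the 200Gbps line rate and summed across the eight GPU-adjacent adapters:

# Back-of-the-envelope check of aggregate fabric throughput (assumptions are ours)
line_rate_gbps = 200                    # HDR InfiniBand signaling rate per port
line_rate_gbytes = line_rate_gbps / 8   # = 25 GB/s theoretical per port
measured_per_port = 24.6                # GB/s, from osu_bw on this system
gpu_adjacent_ports = 8                  # one adapter per GPU

print(f"Per-port efficiency: {measured_per_port / line_rate_gbytes:.1%}")   # ~98%
print(f"Aggregate throughput: {measured_per_port * gpu_adjacent_ports:.0f} GB/s")  # ~197 GB/s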

GPU-to-GPU Transfers with NVSwitch and NVLink

NVIDIA built a new generation of NVIDIA NVLink into the NVIDIA A100 GPUs, which provides double the throughput of NVLink in the previous “Volta” generation. Each NVIDIA A100 GPU supports up to 300GB/s throughput (600GB/s bidirectional). Combined with NVSwitch, which connects each GPU to all other GPUs, the DGX A100 provides full connectivity between all eight GPUs.

Running NVIDIA’s p2pBandwidthLatencyTest utility, we can examine the transfer speeds between each set of GPUs:

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1180.14 254.47 258.80 254.13 257.67 247.62 257.21 251.53
     1 255.35 1173.05 261.04 243.97 257.09 247.20 258.64 257.51
     2 253.79 260.46 1155.70 241.66 260.23 245.54 259.49 255.91
     3 256.19 261.29 253.87 1142.18 257.59 248.81 250.10 259.44
     4 252.35 260.44 256.82 249.11 1169.54 252.46 257.75 255.62
     5 256.82 257.64 256.37 249.76 255.33 1142.18 259.72 259.95
     6 261.78 260.25 261.81 249.77 258.47 248.63 1173.05 255.47
     7 259.47 261.96 253.61 251.00 259.67 252.21 254.58 1169.54

The above values show GPU-to-GPU transfer throughput ranging from roughly 242GB/s to 262GB/s. Running the same test in bidirectional mode shows results between 473GB/s and 508GB/s. Execution within the same GPU (running down the diagonal) shows data rates around 1,150GB/s.

Turning to latencies, we see fairly uniform communication times between GPUs at ~3 microseconds:

P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.63   2.98   2.99   2.96   3.01   2.96   2.96   3.00
     1   3.02   2.59   2.96   3.00   3.03   2.96   2.96   3.03
     2   3.02   2.95   2.51   2.97   3.03   3.04   3.02   2.96
     3   3.05   3.01   2.99   2.49   2.99   2.98   3.06   2.97
     4   2.88   2.88   2.95   2.87   2.39   2.87   2.90   2.88
     5   2.87   2.95   2.89   2.87   2.94   2.49   2.87   2.87
     6   2.89   2.86   2.86   2.88   2.93   2.93   2.53   2.88
     7   2.90   2.90   2.94   2.89   2.87   2.87   2.87   2.54

   CPU     0      1      2      3      4      5      6      7
     0   4.54   3.86   3.94   4.10   3.92   3.93   4.07   3.92
     1   3.99   4.52   4.00   3.96   3.98   4.05   3.92   3.93
     2   4.09   3.99   4.65   4.01   4.00   4.01   4.00   3.97
     3   4.10   4.01   4.03   4.59   4.02   4.03   4.04   3.95
     4   3.89   3.91   3.83   3.88   4.29   3.77   3.76   3.77
     5   4.20   3.87   3.83   3.83   3.89   4.31   3.89   3.84
     6   3.76   3.72   3.77   3.71   3.78   3.77   4.19   3.77
     7   3.86   3.79   3.78   3.78   3.79   3.83   3.81   4.27

As with the bandwidths, the values down the diagonal show execution within that particular GPU. Latencies are lower when executing within a single GPU, as there’s no need to hop across the bus to NVSwitch or another GPU. These values show that same-device latencies are 0.3 to 0.5 microseconds lower than when communicating with a different GPU via NVSwitch.

Finally, we want to share the full DGX A100 topology as reported by the nvidia-smi topo --matrix utility. While a lot to digest, the main takeaways from this connectivity matrix are the following:

  • all GPUs have full NVLink connectivity (12 links each)
  • each pair of GPUs is connected to a pair of Mellanox adapters via a PXB PCI-E switch
  • each pair of GPUs is closest to a particular set of CPU cores (CPU and NUMA affinity)
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
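
One practical use of this matrix is keeping RDMA traffic on the adapter pair that shares a PCI-E switch with each GPU. Here is a minimal sketch built from the mapping shown above; the UCX_NET_DEVICES variable is one common way to express that choice for UCX-based MPI stacks, though your communication library may use a different knob, and the helper itself is our own illustration.

import os

# HCA pairs sharing a PCIe switch with each GPU, per `nvidia-smi topo -m` above.
GPU_TO_HCA = {
    0: ("mlx5_0", "mlx5_1"), 1: ("mlx5_0", "mlx5_1"),
    2: ("mlx5_2", "mlx5_3"), 3: ("mlx5_2", "mlx5_3"),
    4: ("mlx5_4", "mlx5_5"), 5: ("mlx5_4", "mlx5_5"),
    6: ("mlx5_6", "mlx5_7"), 7: ("mlx5_6", "mlx5_7"),
}

def local_hca_env(gpu_id):
    """Return environment variables steering a rank's traffic to its GPU-local HCAs."""
    hcas = GPU_TO_HCA[gpu_id]
    return {
        "CUDA_VISIBLE_DEVICES": str(gpu_id),
        # Restrict UCX (and thus a UCX-based MPI) to the adapters nearest this GPU.
        "UCX_NET_DEVICES": ",".join(f"{hca}:1" for hca in hcas),
    }

# Example: environment for a rank driving GPU 6
# os.environ.update(local_hca_env(6))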

Host-to-Device Transfer Speeds with PCI-Express generation 4.0

Just as it’s important for the GPUs to be able to communicate with each other, the CPUs must be able to communicate with the GPUs. A100 is the first NVIDIA GPU to support the new PCI-E gen4 bus speed, which doubles the transfer speeds of generation 3. True to expectations, NVIDIA bandwidthTest demonstrates 2X speedups on transfer speeds from the system to each GPU and from each GPU to the system:

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			24.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			26.1

As you might notice, these performance values are right in line with the throughput of each Mellanox 200Gbps adapter. Having eight network adapters, each with the same bandwidth as a GPU’s host connection, allows for perfect balance. Data can stream into each GPU from the fabric at line rate (and vice versa).

Diving into the NVIDIA A100 SXM4 GPUs

The DGX A100 is unique in leveraging NVSwitch to provide the full 300GB/s NVLink bandwidth (600GB/s bidirectional) between all GPUs in the system. Although it’s possible to examine a single GPU within this platform, it’s important to keep in mind the context that the GPUs are tightly connected to each other (as well as their linkage to the EPYC CPUs and the Mellanox adapters). The single-GPU information we share below will likely match that shown for A100 SXM4 GPUs in other non-DGX systems. However, their overall performance will depend on the complete system architecture.

To start, here is the ‘brief’ dump of GPU information as provided by nvidia-smi on DGX A100:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    60W / 400W |      0MiB / 40537MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 40537MiB |     14%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      0MiB / 40537MiB |     24%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    58W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    62W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   33C    P0    60W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    65W / 400W |      0MiB / 40537MiB |     22%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    63W / 400W |      0MiB / 40537MiB |     21%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The clock speed and power consumption of each GPU will vary depending upon the workload (running low when idle to conserve energy and running as high as possible when executing applications). The idle, default, and max boost speeds are shown below. You will note that memory speeds are fixed at 1215 MHz.

    Clocks
        Graphics                          : 420 MHz (GPU is idle)
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1215 MHz

Those who have particularly stringent efficiency or power requirements will note that the NVIDIA A100 SXM4 GPU supports 81 different clock speeds between 210 MHz and 1410 MHz. Power caps can be set to keep each GPU within preset limits between 100 Watts and 400 Watts. Microway’s post on nvidia-smi for GPU control offers more details for those who need such capabilities.
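
As a simple illustration of how such limits might be scripted (a sketch wrapping standard nvidia-smi flags; the specific values are examples within the ranges quoted above, and root privileges are required):

import subprocess

def set_gpu_limits(gpu_id, power_watts=300, mem_mhz=1215, gfx_mhz=1095):
    """Apply a power cap and application clocks to one GPU via nvidia-smi (requires root)."""
    # Power cap: any value within the supported 100W-400W window for A100 SXM4
    subprocess.run(["nvidia-smi", "-i", str(gpu_id), "-pl", str(power_watts)], check=True)
    # Application clocks are given as <memory>,<graphics>; memory is fixed at 1215 MHz,
    # graphics may be any of the supported steps between 210 MHz and 1410 MHz
    subprocess.run(["nvidia-smi", "-i", str(gpu_id), "-ac", f"{mem_mhz},{gfx_mhz}"], check=True)

# Example: cap every GPU at 300W and the 1095 MHz default application clock
# for gpu in range(8):
#     set_gpu_limits(gpu, power_watts=300, gfx_mhz=1095)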

Each new generation of NVIDIA GPUs introduces new architecture capabilities and adjustments to existing features (such as resized caches). Some details can be found through the deviceQuery utility, which reports the CUDA capabilities of each NVIDIA A100 GPU device:

  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40537 MBytes (42506321920 bytes)
  (108) Multiprocessors, ( 64) CUDA Cores/MP:     6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes

In the NVIDIA A100 GPU, NVIDIA increased cache & global memory size, introduced new instruction types, enabled new asynchronous data copy capabilities, and more. More complete information is available in our Knowledge Center article which summarizes the features of the Ampere GPU architecture. However, it could be argued that the biggest architecture change is the introduction of MIG.

Multi-Instance GPU (MIG)

For years, virtualization has allowed CPUs to be virtually broken into chunks and shared between a wide group of users and/or applications. One physical CPU device might be simultaneously running jobs for a dozen different users. The flexibility and security offered by virtualization has spawned billion dollar businesses and whole new industries.

NVIDIA GPUs have supported multiple users and virtualization for a couple of generations, but NVIDIA A100 GPUs with MIG are the first to support physical separation of those tasks. In essence, one GPU can now be sliced into up to seven distinct hardware instances. Each instance then runs its own completely independent applications with no interruption or “noise” from other applications running on the GPU:

Diagram of NVIDIA Multi-Instance GPU demonstrating seven separate user instances on one GPU
NVIDIA Multi-Instance GPU supports seven separate user instances on one GPU

The MIG capabilities are significant enough that we won’t attempt to address them all here. Instead, we’ll highlight the most important aspects of MIG. Readers needing complete implementation details are encouraged to reference NVIDIA’s MIG documentation.

Each GPU can have MIG enabled or disabled (which means a DGX A100 system might have some shared GPUs and some dedicated GPUs). Enabling MIG on a GPU has the following effects:

  • One NVIDIA A100 GPU may be split into anywhere between 2 and 7 GPU Instances
  • Each of the GPU Instances receives a dedicated set of hardware units: GPU compute resources (including streaming multiprocessors/SMs, and GPU engines such as copy engines or NVDEC video decoders), and isolated paths through the entire memory system (L2 cache, memory controllers, and DRAM address busses, etc)
  • Each of the GPU Instances can be further divided into Compute Instances, if desired. Each Compute Instance is provided a set of dedicated compute resources (SMs), but all the Compute Instances within the GPU Instance share the memory and GPU engines (such as the video decoders)
  • A unique CUDA_VISIBLE_DEVICES identifier will be created for each Compute Instance and the corresponding parent GPU Instance. The identifier follows this convention:
    MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>
  • Graphics API support (e.g. OpenGL etc.) is disabled
  • GPU to GPU P2P (either PCI-Express or NVLink) is disabled
  • CUDA IPC across GPU instances is not supported (though IPC across the Compute Instances within one GPU Instance is supported)

Though the above caveats are important to note, they are not expected to be significant pain points in practice. Applications which require NVLink will be workloads that require significant performance and should not be run on a shared GPU. Applications which need to virtualize GPUs for graphical applications are likely to use a different type of NVIDIA GPU.

Also note that the caveats don’t extend all the way through the CUDA capabilities and software stack. The following features are supported when MIG is enabled:

  • MIG is transparent to CUDA and existing CUDA programs can run under MIG unchanged
  • CUDA MPS is supported on top of MIG
  • GPUDirect RDMA is supported when used from GPU Instances
  • CUDA debugging (e.g. using cuda-gdb) and memory/race checking (e.g. using cuda-memcheck or compute-sanitizer) is supported

When MIG is fully-enabled on the DGX A100 system, up to 56 separate GPU Instances can be executed simultaneously. That could be 56 unique workloads, 56 separate users each running a Jupyter notebook, or some other combination of users and applications. And if some of the users/workloads have more demanding needs than others, MIG can be reconfigured to issue larger slices of the GPU to those particular applications.
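
As a rough illustration of what such a reconfiguration could look like (a sketch wrapping the nvidia-smi MIG commands from NVIDIA’s documentation; the profile names are examples for the A100 40GB and should be confirmed against the listing step, the GPU must be idle, and root privileges are required):

import subprocess

def sh(*args):
    """Run an nvidia-smi command, echoing it first."""
    print("+", " ".join(args))
    subprocess.run(args, check=True)

gpu = "0"

# 1. Enable MIG mode on the chosen GPU (a GPU reset may be needed before it takes effect).
sh("nvidia-smi", "-i", gpu, "-mig", "1")

# 2. List the available GPU instance profiles (e.g. 1g.5gb, 2g.10gb, 3g.20gb on A100 40GB).
sh("nvidia-smi", "mig", "-i", gpu, "-lgip")

# 3. Carve the GPU into instances -- here two 3g.20gb slices for two heavier users --
#    and create a default compute instance inside each (-C).
sh("nvidia-smi", "mig", "-i", gpu, "-cgi", "3g.20gb,3g.20gb", "-C")

# 4. Confirm what CUDA applications will see.
sh("nvidia-smi", "-L")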

DGX A100 Review Summary

DGX-POD with DGX A100

As mentioned at the top, this new hardware is quite impressive, but is only one part of the DGX story. NVIDIA has multiple software stacks to suit the broad range of possible uses for this system. If you’re just getting started, there’s a lot left to learn. Depending upon what you need next, there are a few different directions to explore – from the NGC software catalog to a conversation with Microway’s DGX experts.

Multi-GPU Scaling of MLPerf Benchmarks on NVIDIA DGX-1
https://www.microway.com/hpc-tech-tips/multi-gpu-scaling-of-mlperf-benchmarks-on-nvidia-dgx-1/
Fri, 23 Aug 2019

In this post, we discuss how the training of deep neural networks scales on DGX-1. Considering 6 models across 4 out of the 5 popular domains covered in the MLPerf v0.5 benchmarking suite, we discuss the time to state-of-the-art accuracy as set by MLPerf. We also highlight the models that scale well and should be trained on larger numbers of GPUs. Models with poor scalability should be trained on fewer GPUs, which allows for resource sharing among multiple users. As such, we provide insight into common deep learning workloads and how to best leverage the multi-GPU DGX-1 deep learning system for training the models.

MLPerf – a benchmarking suite for deep learning applications

Just as HPC system design is evolving to achieve good performance for Deep Learning applications, there is also an ever-increasing need to have a good set of benchmarks to quantify this performance. Many benchmarking tools have been proposed. For example, Baidu Research released DeepBench, which focuses on basic operations involved in neural networks like convolution, GEMM, Recurrent Layers, and All Reduce – yet it offers no provision to compare different systems/workstations or even software frameworks. TensorFlow introduced TF_CNN_BENCH, which is single-domain and benchmarks only convolutional network-based deep learning workloads. With a diversity of workloads and a variety of different hardware configurations, we need a more general approach to benchmarking deep learning applications.

With support from both industry and universities, and inspired by the SPEC and TPC standards, MLPerf is a leading choice as a set of benchmarks covering different areas of Machine Learning. The goals are multi-fold: a fair comparison of different hardware configurations and software frameworks, while encouraging innovation and easy reproducibility of results.

MLPerf suite includes Image Classification, Object Detection (light and heavy), Language Translation (Recurrent and Non-Recurrent), Recommendation Systems, and Reinforcement Learning benchmarks. The suite is divided into two divisions: Closed and Open. In the Closed division the data preprocessing, training method, and model must be the same as the MLPerf reference implementation. Only very limited changes to hyperparameters are allowed. This aims for fair comparison of different deep learning hardware platforms. In the Open division any model, preprocessing, or training method can be used.

Version v0.5 received no submissions to the Open division. However, Google, NVIDIA, and Intel made submissions to the Closed division. Only Google (on a cloud instance) and NVIDIA submitted GPU-accelerated results. No GPU submissions were made for the reinforcement learning benchmark, but Intel did submit a CPU-only result on Skylake processors. Software frameworks varied from TensorFlow v1.12, to MXNet for image classification, to PyTorch for the rest of the domains.

The results discussed in this post largely replicate NVIDIA’s submission in the Closed Model Division of MLPerf v0.5.0 for training. This division places restrictions on modifying hyperparameters like learning rate and batch size to provide a fair comparison of hardware/software systems. However, minor changes were required to successfully train on small numbers of GPUs. All our changes are reflected in the below log files for interested folks who want to dive deeper. We performed scaling analysis on 1, 4, and 8 GPUs on DGX-1. Our findings help deep learning practitioners and researchers determine the best options for their deep learning problem/application(s).

Training Deep Neural Networks

Training deep neural networks can be a formidable task. With millions of parameters, the model risks overfitting the training data. The deep layers in the model can have extreme gradients that lead to vanishing/exploding gradient problems. Even after accounting for all these pitfalls, the training of a network can be really slow. As a non-convex optimization problem, there can be multiple solutions, and training neural networks boils down to finding the right selection of hyperparameters in order to achieve a certain threshold of accuracy. This can be done by manually tuning parameters, observing a low generalization error, and reiterating with a different combination of values until reaching the desired accuracy. When there are only a few hyperparameters, a grid search can be applied, though it is more computationally intensive. A range of discrete values for each parameter is selected and the model is trained on every combination of parameters as described by the Cartesian product (grid) of the values chosen, as sketched below.
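
As a simple illustration of that grid idea (a generic sketch, not the MLPerf tuning procedure; the hyperparameter values and the train_and_validate placeholder are ours):

import itertools

# Discrete values chosen for each hyperparameter (illustrative values only)
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size":    [256, 512, 1024],
    "momentum":      [0.9, 0.95],
}

def train_and_validate(**hparams):
    # Placeholder: run a real training job and return its validation accuracy.
    # A constant is returned here only so the sketch executes end to end.
    return 0.0

best_score, best_hparams = -1.0, None
for values in itertools.product(*grid.values()):     # Cartesian product = the "grid"
    hparams = dict(zip(grid.keys(), values))
    score = train_and_validate(**hparams)
    if score > best_score:
        best_score, best_hparams = score, hparams

print("Best configuration found:", best_hparams, "accuracy:", best_score)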

The following is a brief description of each model being used in the MLPerf benchmarks:

  1. Convolutional Neural Networks (CNN):  Most widely used for image processing and pattern recognition applications like object detection/localization, human pose estimation, and scene recognition; also for certain non-image workflows (e.g., processing acoustic, seismic, radio, or radar signals). In general, any data that has a grid-like topology can be processed using CNNs. Typical CNNs consist of convolutional layers, pooling layers, and fully connected layers. The convolution operation involves convolving a filter on the image, which extracts features in a local region of the image. In any image, pixels that are far apart are only weakly related, while nearby pixels are strongly correlated. The size of the filter, stride, and padding are some of the hyperparameters that need proper tuning. Pooling layers are used to reduce the number of parameters in the network, in turn reducing the number of computations. Fully connected layers help in classifying images based on the features extracted by the convolution layers. The MLPerf benchmarks Image Classification, Single Stage Detector, and Object Detection make use of a special type of CNN called ResNet. Introduced by Microsoft, ResNet [1] won the ILSVRC 2015 challenge and continues to lead. ResNets consist of residual blocks which ease the process of training extremely deep networks. A residual connection is a shortcut from one layer to another, usually skipping a few layers – the output from one layer is copied and added to a later layer just before applying the non-linearity (see the minimal residual-block sketch after this list). The MLPerf benchmarks Image Classification and Object Detection use ResNet-50 (50 layers) while the Single-Stage Detector uses ResNet-34 (34 layers) as the backbone.
  2. Recurrent Neural Network (RNN): RNNs are interesting neural networks that offer a lot of flexibility in designing the model. They let you operate on sequenced data at the input, the output, or both. For example, in image captioning the input is a fixed-size image and the RNN model generates a sequence of words describing the contents of the image. In the case of sentiment analysis, the input is a sequence of words and the output is the sentiment of the sentence: whether it is good (positive) or bad (negative). The MLPerf RNN benchmark uses the sequenced input and sequenced output model, similar to Google’s Neural Machine Translation (GNMT). GNMT has 3 components: an encoder, a decoder, and an attention network. The encoder modifies the input sequence into a list of vectors and the decoder decodes the vector into another sequence of words as an output. The encoder and decoder are connected via an attention network that allows for giving attention to different parts of the input sentence/sequence while decoding. For a more detailed description of the model, read the GNMT [2] paper.
  3. Transformers: A Transformer is a newer type of sequence-to-sequence architecture for machine translation that uses both an encoder and a decoder, but does not use Recurrent layers like LSTMs or GRUs. Transformers are a recent advancement in NLP which perform better than RNNs. A typical Transformer model has an encoder and a decoder, with both containing modules like ‘Multi-Head Attention’ and ‘Feed Forward’ layers. Since there is no RNN, there is no way of knowing the order of the words fed to the network. Therefore, part of the model must provide a positional encoding of the words in the sequence. The source language sequence is fed to the encoder and the corresponding target language sequence is fed into the decoder, but shifted by a position. The model tries to predict the next word in the target sequence while having seen only the words prior to that position, and avoids simply copying the decoder sequence as the output. For a more detailed model description, read the Attention Is All You Need [3] paper.
  4. Neural Collaborative Filtering (NCF) : Many online services (e.g., e-commerce, social networking) provide their customers with millions of options to choose from. With digital transformation resulting in huge amounts of data overload, it’s almost impossible to browse through an entire online collection. Recommender systems are needed to filter these options and help users make selections. Collaborative Filtering models the past interactions between the user and the collection. This essentially boils down to a Matrix Factorization problem where the user and collection are projected onto a latent space and the similarity (using the inner product) between the latent vectors is computed. The predictions are based on similarities. However, ‘Inner Product‘ is not a good choice of function to model complex interactions and an alternate approach of using a neural architecture to learn the arbitrary function from the data was devised. This approach is known as Neural Collaborative Filtering (NCF) [4]. Both the user and collection are represented as one-hot encoded in the input layer (sparse). A fully-connected (Embedding) layer projects this sparse representation to a dense vector. The output of the embedding layer is then fed into the Neural CF layers where each layer can learn certain structure among the interactions.
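
To make the residual connection from item 1 concrete, here is a minimal PyTorch-style sketch of a basic residual block (a simplified illustration, not the exact ResNet-50/34 blocks used in the MLPerf reference implementations):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions with a shortcut that adds the input back before the final ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                          # the shortcut: keep a copy of the input
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # add it back just before the non-linearity
        return F.relu(out)

# A 56x56 feature map with 64 channels passes through with its shape unchanged:
# y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))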

MLPerf Scaling on NVIDIA DGX-1

The MLPerf results submitted by NVIDIA make use of single-node and multi-node DGX-1 and DGX-2 systems, utilizing the entirety of the systems to train a single network. Our post discusses how performance scales when using a single DGX-1 (using 1, 4, or all 8 NVIDIA Tesla GPUs). This is important to understand how a single DGX-1 system can be used as a shared resource among multiple users, or to be used to run multiple cases of the same problem. It also helps establish which deep learning domains require the training to be done on a large scale.

Image Classification

Trained on the ILSVRC2012 dataset with 1.2 million images, this benchmark scales well. It achieves better than linear speedups going from 1 to 4 (~5x) and 1 to 8 GPUs (~10x). DGX users will achieve better throughput if they use the full system for each job.

Figure 1. Evaluation accuracy vs Epochs for Image Classification.

# GPUs    Batch Size    Average Time per Epoch (min)    Number of Epochs    Precision
1         512           16.21                           83                  fp-16
4         1664          4.56                            63                  fp-16
8         1664          2.20                            63                  fp-16

Table 1. Synopsis of Image Classification benchmarks

Figure 1 shows the validation accuracy versus the number of epochs it took to reach that accuracy. The accuracy target set by MLPerf for this benchmark is 74.9%. The 4- and 8-GPU runs achieve this accuracy in the same number of epochs; however, the average time for each epoch is different, as reported in Table 1. For the single-GPU run, the batch size needed to be reduced in order to avoid “Out of Memory (OOM)” errors. With the smaller batch size (and the correspondingly adjusted hyperparameters) on a single GPU, it took more epochs to train the model to the same accuracy.

Object Detection – Heavy

This is the heaviest workload among all the benchmarks considered in MLPerf. Utilizing the full DGX-1, it takes ~325 minutes to train on the COCO2014 dataset. The model used is the same ResNet-50 as the Image-Classification benchmark. The speedup obtained is ~2.5x going from 1 to 4 GPUs and ~6x when going from 1 to 8 GPUs (which is sub-linear).

Figure 2. Mask mAP and Bounding Box mAP vs Epochs for heavy Object Detection.

# GPUs    Batch Size    Average Time per Epoch (min)    Number of Epochs    Precision
1         2             179.183                         11                  fp-16
4         4             44.315                          18                  fp-16
8         4             24.9895                         13                  fp-16

Table 2. Synopsis of Object Detection (heavy) benchmarks

Figures 2a and 2b (click on the tabs to toggle between figures) show the accuracy plots for the heavy object detection benchmark. There are two different accuracy thresholds here: BBOX (Fig. 2b), which stands for Bounding Box accuracy, and SEGM (Fig. 2a), which stands for Instance Segmentation. Simply put, an object detection problem requires that the object be correctly located within the image and also that the object be correctly identified/categorized. Instance segmentation refers to labeling each pixel in the image with the object instance it belongs to.

Object Detection – Light

The lightweight object detection benchmark makes use of the COCO2017 dataset and scales with close to linear speedups: about ~3.7x going from 1 to 4 GPUs, and ~7.3x going from 1 to 8 GPUs. Total runtime varies from more than 3 hours on a single GPU to less than half an hour on 8 GPUs.

Figure 3. Accuracy vs Epoch for Single Stage Detector

# GPUs    Batch Size    Average Time per Epoch (min)    Number of Epochs    Precision
1         152           4.080                           49                  fp-16
4         152           1.115                           49                  fp-16
8         152           0.5628                          49                  fp-16

Table 3. Synopsis of SSD benchmark.

Figure 3 shows the accuracy plot for the single stage detector benchmark. The evaluation of the model occurs only at epochs 32, 43, and 48 – hence the 3 data points in the plot. This, of course, can be modified to evaluate more often and produce more data points for the plot; however, we stuck to the default values.

Language Translation – Recurrent (GNMT) and Non-Recurrent (Transformer)

The Recurrent model is trained on the WMT16 English-German dataset and the Transformer model is trained on the WMT17 EN-DE dataset. Both language translation models scale well; however, the Transformer not only scales better but also achieves a higher accuracy target, while requiring more total training time.

Figure 4. BLEU score vs Epochs for Google’s NMT and Transformer Translation models.

# GPUs    Batch Size    Average Time per Epoch (min)    Number of Epochs    Precision
1         512           39.98                           5                   fp-16
4         512           12.31                           5                   fp-16
8         1024          6.4                             3                   fp-16

Table 4. Synopsis of RNN benchmark for Language Translation (GNMT)

# GPUs    Batch Size    Average Time per Epoch (min)    Number of Epochs    Precision
1         5120          60.34                           8                   fp-16
4         5120          22.38                           4                   fp-16
8         5120          7.648                           4                   fp-16

Table 5. Synopsis of Non-Recurrent benchmark for Language Translation (Transformer)

Figures 4a and 4b (click on the tabs to toggle between images) show the validation accuracy plots vs epochs for the language translation models. Google’s NMT uses a Recurrent Neural Network based model and achieves an accuracy of 21.80 BLEU. The Transformer model is a newer approach to language translation which does not use a recurrent neural network and performs better, achieving a higher quality target of 25.00 BLEU.

Tables 4 and 5 show the synopses for these benchmarks. The length of the sequence is a key parameter for a Recurrent model and does affect the scaling.

Recommendation Systems

This is the quickest benchmark to run. Even on a single GPU, it only takes a little over a minute to train to the desired accuracy of 0.635. The speedups are ~1.8x and ~2.8x when going from 1 to 4 and 1 to 8 GPUs, respectively.

Figure 5. Evaluation accuracy vs Epochs of the Neural Collaborative Filtering model for Recommendation Systems.

# GPUs    Batch Size    Average Time per Epoch (min)    Number of Epochs    Precision
1         1048576       0.1354                          13                  fp-16
4         1048576       0.0769                          13                  fp-16
8         1048576       0.0485                          13                  fp-16

Table 6. Synopsis of Recommendation Systems benchmark

Figure 5 shows the accuracy plots for the recommendation benchmark. All the plots in the figure are quite close to each other. This suggests that it’s not a cost-effective strategy to use multiple GPUs for this type of workload. The benefit of using a machine like DGX-1 for such workloads is to run multiple cases, each on a single GPU. Dedicating an entire DGX-1 to a single training run will reduce the training time, but is not as efficient if overall throughput is the goal.

MLPerf Scaling Results

This section summarizes the scaling results and discusses the speedups. Figure 6 (click to enlarge) shows the scaling analysis of six MLPerf benchmarks on 1, 4, and 8 GPUs on an NVIDIA DGX-1 (with Tesla V100 32GB GPUs). A general conclusion to draw from the figure is that not all the models scale the same way. Most of the models scale well. The better a model scales, the more efficiently you can train networks on large resources (an entire DGX or a cluster of DGX systems).

Figure 6. Scaling plots on 1-4-8 GPUs for MLPerf v0.5 Closed Model Division Benchmarks submitted by NVIDIA. The X-axis shows the number of GPUs and the Y-axis shows the training time to desired accuracy in minutes (the metric set by MLPerf). The inset axis shows a zoomed in view of the plot.

We see substantial speedups for the Image Classification and Transformer Translation benchmarks (both are super-linear, running more quickly as more GPUs are added). The Single-Stage Detector and Mask R-CNN Object Detection benchmarks remain close to linear, while the RNN benchmark goes from near-linear speedup on 4 GPUs to super-linear speedup on 8 GPUs (which indicates that all of the above will scale efficiently). The Recommendation benchmark scales poorly, with fairly insignificant time savings when run on many GPUs. Table 7 lists the speedups for all benchmarks, with each speed-up calculated as the ratio of total training time on a single GPU to the total training time on multiple GPUs.
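
For reference, each speed-up follows directly from the per-epoch times and epoch counts in Tables 1 through 6. A short worked example using the Image Classification numbers from Table 1 (our own arithmetic; small differences versus Table 7 come from rounding of the per-epoch averages):

# Speed-up = (total single-GPU training time) / (total multi-GPU training time).
# Per-epoch times and epoch counts are taken from Table 1 (Image Classification).
def total_minutes(minutes_per_epoch, epochs):
    return minutes_per_epoch * epochs

t1 = total_minutes(16.21, 83)   # 1 GPU
t4 = total_minutes(4.56, 63)    # 4 GPUs
t8 = total_minutes(2.20, 63)    # 8 GPUs

print(f"1 -> 4 GPUs: {t1 / t4:.2f}x")   # ~4.7x (Table 7 reports 4.76)
print(f"1 -> 8 GPUs: {t1 / t8:.2f}x")   # ~9.7x (Table 7 reports 9.70)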

For a more detailed understanding of hyperparameters used to train these models, please reference the log files below [10].

Benchmark                       Speed Up (1-4 GPUs)    Speed Up (1-8 GPUs)
Image Classification            4.76                   9.70
Single Stage Detector           3.66                   7.25
Object Detection                2.47                   6.066
RNN GNMT                        3.24                   10.411
Transformer Translation         5.392                  15.778
Recommendation Systems (NCF)    1.76*                  2.789*

(*) The Recommendation Systems benchmark is not a good candidate for studying the scaling of deep learning workloads, since it is the quickest of the bunch and the absolute time saved by additional GPUs is on the order of seconds.

Table 7. Speed Ups for all the benchmarks going from 1 to 4 to 8 GPUs

Based on the results, a general takeaway message would be to select systems based on the type of deep learning application one is trying to build. We see that the recommendation systems benchmark doesn’t scale well, which suggests that such projects should limit multi-GPU training and instead share the resources (either shared between multiple users or between multiple models).  On the other hand, if your team trains neural networks on large image sets (image classification, object localization, object detection, instance segmentation), using multi-GPU systems is crucial for quick results.

Next Steps for Successful Deep Learning Deployment

Of course, a powerful compute resource is just one part of successful deep learning implementation. Depending upon your project needs and the anticipated growth of your datasets, storage requirements may eclipse compute requirements. Connectivity also becomes critical, as neural network training stresses system and network I/O.

Whether you are planning a new project or looking to improve your existing deep learning practice, Microway’s team would be happy to help you define the requirements and deliver a successful solution. With experience in everything from GPU workstations to DGX-2 SuperPODS, our experts can ensure the deployment meets your needs. Contact an AI expert today!

References

[1] Deep Residual Learning for Image Recognition
[2] Google’s Neural Machine Translation System
[3] Attention Is All You Need
[4] Neural Collaborative Filtering
[5] Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
[6] Demystifying Hardware Infrastructure Choices for Deep Learning Using MLPerf
[7] MLPerf v0.5 Training Results
[8] Mask R-CNN for Object Detection
[9] Single Shot MultiBox Detector
[10] Training results log files

Improvements in scaling of Bowtie2 alignment software and implications for RNA-Seq pipelines
https://www.microway.com/hpc-tech-tips/improvements-in-scaling-of-bowtie2-alignment-software-and-implications-for-rna-seq-pipelines/
Fri, 28 Jun 2019

This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

What is Bowtie2?

Bowtie2 is a commonly used, open-source, fast, and memory-efficient application used as part of a Next Generation Sequencing (NGS) workflow. It aligns the sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq Sequencer, to a reference genome. Applications like Bowtie2 are used as the first step in pipelines such as those for variant determination and – an area of continuously growing research interest – RNA-Seq.

What is RNA-Seq?

RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.

There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.

Plot of the number of RNA-Seq research publications accepted each year
Figure 1. RNA-Seq research publications published per year as of April 2019. Note the continuous growth. At the current rate, there will be 60% more publications in 2019 as compared to 2018. Source: NCBI PubMed

RNA-Seq is being applied to many research areas and diseases, and a few notable examples of using the technology include:

  • Oral Cancer: Researchers used an RNA-Seq approach to identify differences in gene expression between oral cancer and normal tissue samples.
  • Alzheimer’s Disease: Researchers compared the gene expression in different lobes of deceased Alzheimer’s Disease patients’ brains with the brains of healthy individuals. They were able to identify genomic differences between the diseased and unaffected individuals.
  • Diabetes: Researchers identified novel gene expression information from pancreatic beta-cells, which are cells critical for glycemic control.

Compute Infrastructure for aligning with Bowtie2

Designing a compute resource to meet the sequence analysis needs of Bioinformatics researchers can be a daunting task for IT staff. Limited information is available about multithreading and performance increases in the diverse portfolio of software related to NGS analysis. To further complicate things, processors are now available in a variety of models, with a large range of core counts and clock speeds, from both AMD and Intel. See, for example, the latest Intel Xeon “Cascade Lake” CPUs: Intel Xeon Scalable “Cascade Lake SP” Processor Review

Though many sequence analysis tools have multithreading options, the ability to scale is often limited, and rarely linear. In some cases, performance can decrease as more threads are added. Multithreading an application does not guarantee a performance improvement.

Threads    Run Time (seconds)
8          620
16         340
32         260
48         385
64         530

Table 1. Research data showing a previous version of Bowtie2 scaling with thread count. Performance would decrease above 32 threads.

Plot of Bowtie2 run time as the number of threads increases
Figure 2. Plot of thread scaling for the previous version of Bowtie2. Performance decreases after 32 threads due to a variety of factors. Non-linear scaling and performance decreases with core count have been shown in other scientific applications as well.
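
One way to quantify that behavior (our own calculation from the run times in Table 1, not data from the Bowtie2 authors) is to compute speed-up and parallel efficiency relative to the 8-thread run:

# Run times (in seconds) from Table 1: thread scaling of the previous Bowtie2 version
run_times = {8: 620, 16: 340, 32: 260, 48: 385, 64: 530}

base_threads, base_time = 8, run_times[8]
for threads, seconds in run_times.items():
    speedup = base_time / seconds                   # relative to the 8-thread run
    efficiency = speedup / (threads / base_threads)
    print(f"{threads:2d} threads: {speedup:4.2f}x speed-up, {efficiency:6.1%} parallel efficiency")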

However, researchers have recently made great improvements to the thread scaling of Bowtie2. Original versions of this tool did not scale linearly, and demonstrated reduced performance when using more than 32 threads. Aware of these problems, the developers of Bowtie2 implemented superior multithreaded scaling in their applications. Depending on processor type, their results show:

  • Removal of performance decreases over 32 threads
  • An increase in read throughput of up to 44%
  • Reduced memory usage with thread scaling
  • Up to a 4 hour reduction in time to align a 40x coverage human genome

This new version of the software is open-source and available for download.

Right Sizing your NGS Cluster

With the recent release of Intel’s Cascade Lake-AP Xeons providing up to 112 threads per socket, as well as high density AMD EPYC processors, it can be tempting to assume that more cores will result in more performance for NGS applications. However, this is not always the case, and some applications will show reduced performance with higher thread counts.

When selecting compute systems for NGS analysis, researchers and IT staff need to evaluate which software products will be used, and how they scale with threads. Depending on the use cases, more nodes with fewer, faster threads could provide better performance than high thread density nodes. Unfortunately there is no “one size fits all” solution, and applications are in constant development, so research into the most recent versions of analysis software is always required.

References

[1] https://www.ncbi.nlm.nih.gov/pubmed/
[2] https://doi.org/10.1371/journal.pone.0016266
[3] https://doi.org/10.1101/205328
[4] https://link.springer.com/article/10.1007/s10586-017-1015-0


If you are interested in testing your NGS workloads on the latest Intel and AMD HPC systems, please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

CryoEM takes center stage: how compute, storage, and networking needs are growing with CryoEM research
https://www.microway.com/hpc-tech-tips/cryoem-takes-center-stage-how-compute-storage-networking-needs-growing/
Thu, 11 Apr 2019

This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

Background and history

Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques: X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR has been that structures could be determined at very high resolution, several-fold better than historical CryoEM results.

However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray Crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of Cryo-EM over X-ray Crystallography and NMR are:

  • Much larger structures can be determined than by X-ray or NMR
  • Structures can be determined in a more native state than by using X-ray

The ability to generate these high resolution, large molecular structures through CryoEM enables better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful that its developers won the 2017 Nobel Prize in Chemistry.

CryoEM structure and publication growth

While the number of molecular structures determined by CryoEM is much lower than those determined by X-ray crystallography and NMR, the rate at which these structures are released has greatly increased in the past decade. In 2016, the number of CryoEM structures deposited in the Protein Data Bank (PDB) exceeded those of NMR for the first time.

The importance of CryoEM in research is growing, as shown by the steady increase in publications over the past decade (Figure 2) and in deposited structures (Table 1, Figure 1). Interestingly, though there are ~10 times as many X-ray crystallography publications per year, the number of such publications per year has decreased consistently since 2013.

Experimental Structure Type | Approximate total number of structures as of March 2019
X-ray Crystallography       | 134,000
NMR                         | 12,500
CryoEM                      | 3,000

Table 1. Total number of structures available in the publicly accessible Protein Data Bank (PDB), by experimental technique, as of March 2019. Source: RCSB

Figure 1. Total number of structures available in the Protein Data Bank (PDB) by year. Note the rapid and consistent growth. Source: RCSB

Figure 2. Number of CryoEM publications accepted each year. Note the rapid increase in publications. Source: NCBI PubMed

Figure 3. Number of X-ray crystallography publications per year. Note the steady decline in publications. While publications related to X-ray crystallography may be decreasing, opportunities exist for integrating both CryoEM and X-ray crystallography data to further our understanding of molecular structure. Source: NCBI PubMed

CryoEM is part of our research – do I need to add GPUs to my infrastructure?

A major challenge facing researchers and IT staff is how to appropriately build out infrastructure for CryoEM demands. There are several software products used for CryoEM analysis, with RELION being one of the most widely used open source packages. While GPUs can greatly accelerate RELION workflows, support for them has only existed since Version 2 (released in 2016). Worldwide, the vast majority of individual servers and centralized resources available to researchers are not GPU accelerated. Those systems that do have professional-grade GPUs are often oversubscribed and can have considerable queue wait times. The relatively high cost of server-grade GPU systems can put those devices out of the reach of many individual research labs.

While advanced GPU hardware like the DGX-1 continues to deliver the best analysis times, not every GPU system provides the same throughput. Large datasets can create issues with consumer-grade GPUs, because a dataset must fit within GPU memory to take full advantage of the acceleration. Though RELION can parallelize work across a dataset, GPU memory is still limited compared to the large amounts of system memory available to the CPUs in a single server (DGX-1 provides 256GB of total GPU memory; DGX-2 provides 512GB). This problem is amplified if a researcher has access to only a single consumer-grade graphics card (e.g., an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of memory).

With the Version 3 release of the software (late 2018), the RELION authors have implemented CPU acceleration to broaden the range of hardware usable for efficient CryoEM reconstruction. The authors have shown a 1.5x improvement on Broadwell processors and a 2.5x improvement on Skylake processors over the previous code. Taking advantage of AVX instructions during compilation further improves performance, with the authors demonstrating a 5.4x improvement on Skylake processors. This improvement approaches the performance increases of professional-grade GPUs without the additional cost.
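
As a rough illustration of what "appropriate compilation flags" means in practice, the sketch below builds RELION with its CPU-accelerated kernels and AVX-512 code generation for Skylake-SP. The ALTCPU/CUDA options and the -march value reflect our reading of the RELION 3 build documentation and may differ for your version or compiler, so verify them against the official instructions before relying on them.

git clone https://github.com/3dem/relion.git
cd relion && mkdir build && cd build
# Build the vectorized CPU kernels instead of the CUDA kernels, and emit AVX-512 instructions
cmake -DALTCPU=ON -DCUDA=OFF -DCMAKE_CXX_FLAGS="-O3 -march=skylake-avx512" ..
make -j 16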

Additional infrastructure considerations

CryoEM datasets are being generated at a higher rate and with larger data sizes than ever before. Currently, the largest raw dataset in the Electron Microscopy Public Image Archive (EMPIAR) is 12.4TB, with a median dataset size of approximately 2TB. Researchers and IT staff can expect datasets of this order of magnitude to become the norm as CryoEM continues to grow as an experimental resource in the life sciences space.

Many CryoEM labs function as microscopy cores: they provide the service of generating 2D datasets for different researchers, and those datasets are then analyzed by the individual labs. Given the high cost of professional GPUs compared to the ubiquitous availability of multicore CPU systems, researchers may consider modern multicore servers or centralized clusters to meet their CryoEM analysis needs, with the caveat that they use Version 3 of the RELION software compiled with the appropriate flags.

Dataset transfer is also a concern. Organizations that run a centralized CryoEM core would greatly benefit from upgraded networking (10Gbps or faster) from the core location to centralized compute resources, or to individual labs.
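
To put a number on that recommendation, here is a back-of-the-envelope estimate of how long it takes to move the 12.4TB dataset referenced in this article. The 80% effective-throughput factor is an assumption, not a measurement; real transfers depend on storage and protocol overheads.

# Hours to transfer 12.4TB = (TB x 8 x 1000 Gb) / (link Gbps x efficiency) / 3600 seconds
echo "scale=1; (12.4 * 8 * 1000) / (10 * 0.8) / 3600" | bc   # 10Gbps link: ~3.4 hours
echo "scale=1; (12.4 * 8 * 1000) / (1 * 0.8) / 3600" | bc    # 1Gbps link: ~34.4 hours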

Figure 4. A 2.2 angstrom resolution CryoEM structure of beta-galactosidase. This is currently the largest dataset in the EMPIAR database, totaling 12.4 TB. Source: EMPIAR

CryoEM takes center stage

The increase in capabilities, interest, and research related to CryoEM shows it is now a mainstream experimental technique. IT staff and scientists alike are rapidly becoming aware of this fact as they face the data analysis, transfer, and storage challenges associated with the technique. Careful consideration must be given to the infrastructure of an organization that is engaging in CryoEM research.

In an organization performing exclusively CryoEM experiments, a GPU cluster would be the most cost-effective solution for rapid analysis. Researchers with access to advanced professional-grade GPU systems, such as a DGX-1, will see analysis times that are even faster than modern CPU-optimized RELION. While these professional GPUs can greatly accelerate CryoEM analysis, it is unlikely in the short term that every researcher working with CryoEM data will have access to such high-spec GPU hardware; mixed-use commodity clusters are far more ubiquitous at life science organizations. A large multicore CPU machine, when properly configured, can give better performance than a low-core workstation or server with a single consumer-grade GPU (e.g., an NVIDIA GeForce GPU).

IT departments and researchers must work together to define the expected turnaround time, analysis workflow requirements, budget, and configuration of existing hardware. In doing so, researcher needs will be met and IT can implement the most effective architecture for CryoEM.

References

[1] https://doi.org/10.7554/eLife.42166.001
[2] https://febs.onlinelibrary.wiley.com/doi/10.1111/febs.12796
[3] https://www.ncbi.nlm.nih.gov/pubmed/
[4] https://www.ebi.ac.uk/pdbe/emdb/empiar/
[5] https://www.rcsb.org/


If you are interested in trying out RELION performance on some of the latest CPU and GPU-accelerated systems (including NVIDIA DGX-1), please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

The post CryoEM takes center stage: how compute, storage, and networking needs are growing with CryoEM research appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/cryoem-takes-center-stage-how-compute-storage-networking-needs-growing/feed/ 0
Intel Xeon Scalable “Cascade Lake SP” Processor Review https://www.microway.com/hpc-tech-tips/intel-xeon-scalable-cascade-lake-sp-processor-review/ https://www.microway.com/hpc-tech-tips/intel-xeon-scalable-cascade-lake-sp-processor-review/#comments Tue, 02 Apr 2019 17:00:45 +0000 https://www.microway.com/?p=11305 With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. […]

The post Intel Xeon Scalable “Cascade Lake SP” Processor Review appeared first on Microway.

]]>
With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. Starting today, Microway is shipping these new CPUs across our entire line of turn-key Xeon workstations, systems, and clusters.

Important changes in Intel Xeon Scalable “Cascade Lake SP” Processors include:

  • Higher CPU core counts for many SKUs in the product stack
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Introduction of the new AVX-512 VNNI instructions for Intel Deep Learning Boost (VNNI),
    which provide significantly more efficient deep learning inference acceleration
  • Higher memory capacity & performance:
    • Most CPU models provide increased memory speeds
    • Support for DDR4 memory speeds up to 2933MHz
    • Large-memory capabilities with Intel Optane DC Persistent Memory
    • Support for up to 4.5TB-per-socket system memory
  • Integrated hardware-based security mitigations against side-channel attacks

More for Your Dollar: performance uplift

With an increase in core counts, clock speeds, and memory speeds, applications will achieve better performance across the board. Particularly in the lower-end Xeon 4200- and 5200-series CPUs, the cost-effectiveness of the processors has increased considerably. The plot below compares the price of each processor against its performance. Both the current “Cascade Lake SP” and previous-generation “Skylake-SP” CPUs are shown:

Comparison chart of Intel Xeon Cascade Lake SP cost-effectiveness vs Skylake-SP for applications with AVX-512 instructions
In the diagram above, the wide colored bars indicate the price performance of these new Xeon CPUs. The dots indicate the price performance of the previous generation, which allows us to compare the two generations SKU by SKU (though a few of the newer models do not have previous-generation counterparts). In this comparison, lower values are better and indicate a higher quantity of computation per dollar spent.
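
For reference, the underlying metric is easy to compute yourself. The sketch below uses a hypothetical 16-core, 2.3GHz SKU with two AVX-512 FMA units and a $1,500 list price; all of these numbers are made up for illustration and are not actual Intel specifications or pricing.

# Peak FP64 GFLOPS = cores x base clock (GHz) x 32 FLOP/cycle (two 512-bit FMA units)
echo "scale=1; 16 * 2.3 * 32" | bc           # ~1177.6 GFLOPS peak
# Price-performance (lower is better) = list price / peak GFLOPS
echo "scale=2; 1500 / (16 * 2.3 * 32)" | bc  # ~$1.27 per GFLOPS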

Same SKU – More Performance

As shown above, many models offer more performance than their previous-generation counterpart. Here we highlight models which are showing particularly substantial improvements:

  • Xeon 4210 is 34% more price-performant than Xeon 4110
  • Xeon 4214 is 30% more price-performant than Xeon 4114
  • Xeon 4216 is 25% more price-performant than Xeon 4116
  • Xeon 5218 is 40% more price-performant than Xeon 5118
  • Xeon 5220 is 34% more price-performant than Xeon 5120
  • Xeon 6242 saw an 8% increase in clock speed and ~10% reduction in price
  • Xeon 8270 is 28% more price-performant than Xeon 8170

To summarize: this latest generation will provide more performance for the same cost if you stick with the model numbers you’ve been using. In the next section, we’ll review opportunities for cost reduction.

More for Less: Select a more modest Cascade Lake SKU for the same core count or performance

With generational improvements, it’s not unusual for a new CPU to replace a higher-end version of the older generation. There are many cases where this is true in the Cascade Lake Xeon CPUs, so be sure to consider if you can leverage such savings.

Guaranteed savings

  • Xeon 4208 replaces the Xeon 4110: providing the same 8 cores for a lower price
  • Xeon 4210 replaces the Xeon 4114: providing the same 10 cores for a lower price
  • Xeon 4214 surpasses the Xeon 4116: providing the same 12 cores at higher clock speeds
  • Xeon 5218 surpasses the Xeon 5120: providing more cores, higher clock speeds, and faster memory speeds

Worthy of consideration

  • Xeon 4216 may replace most of the 5100-series: Xeon 5115, 5118 and 5120
    Nearly all specifications are equivalent, but the UPI speed of the Xeon 4216 is 9.6GT/s rather than 10.4GT/s
  • Xeon 6230 likely replaces the Xeon 6130, 6138, 6140: providing the same or more cores for a lower price
  • Xeon 6240 competes with every Xeon 6100-series model
    with the exception that it does not provide 3+GHz processor frequencies

Greater Memory Bandwidth

For computationally-intensive applications, rapid access to data is critical. Thus, memory speed increases are valuable improvements. This generation of CPUs brings a 10% improvement to the Xeon 5200-series (2666MHz; up from 2400MHz) and the Xeon 6200-/8200-series (2933MHz; up from 2666MHz). This means that the Xeon 5200-series CPUs are more competitive (they’re running memory at the same speed as last generation’s Xeon 6100- and 8100-series processors). And the higher-end Xeon 6200-/8200-series CPUs have a 10% memory performance advantage over all others.
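
Those speed bumps translate directly into peak memory bandwidth. Assuming all six memory channels per socket are populated, the arithmetic works out as shown below (theoretical peaks; sustained bandwidth will be lower).

# Peak bandwidth per socket = channels x transfer rate (MT/s) x 8 bytes per transfer
echo "scale=2; 6 * 2933 * 8 / 1000" | bc   # ~140.8 GB/s at DDR4-2933 (Xeon 6200-/8200-series)
echo "scale=2; 6 * 2666 * 8 / 1000" | bc   # ~128 GB/s at DDR4-2666 (Xeon 5200-series)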

While a 10% improvement may seem to be only a modest improvement, keep in mind that it’s essentially a free upgrade. Combined with the other features and improvements discussed above, you can be confident you’re making the right choice by upgrading to these newest Intel Xeon Scalable CPUs.

Enabling Very Large Memory Capacity

With the official launch of Intel Optane DC Persistent Memory, it is now possible to deploy systems with multiple terabytes of system memory. Well-equipped systems provide each Xeon CPU with six Optane memory modules (alongside six standard memory modules). This results in up to 3TB of Optane memory and 1.5TB of standard DRAM per CPU! Look for more information on these possibilities as HPC sites begin adopting and exploring this new technology.
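
The capacity figures above come from fully populating both module types; the module sizes implied by those totals are 512GB Optane DCPMM modules and 256GB DDR4 DIMMs.

echo "6 * 512" | bc   # 3072GB (~3TB) of Optane persistent memory per socket
echo "6 * 256" | bc   # 1536GB (~1.5TB) of DRAM per socket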

Transitioning from the “Skylake-SP” Intel Xeon Scalable CPUs

Because the new “Cascade Lake SP” CPUs are socket-compatible with the previous-generation “Skylake SP” CPUs, the upgrade path is simple. All existing platforms that support the earlier CPUs can also accept these new CPUs. This also simplifies the choice for those considering a new system: the new CPUs use existing, proven platforms. There’s little risk in selecting the latest and highest-performance components. HPC sites adding to existing clusters will find they have a choice: spend the same for increased performance or spend less for the same performance. Below are peak performance comparisons of the previous generation CPUs with the new generation:

The wider/colored bars indicate peak performance for the new Xeon CPUs. The slim grey bars indicate peak performance for the previous-generation Xeon CPUs. Without exception, the new CPUs are expected to outperform their predecessors. The widest margins of improvement are in the lower-end Xeon 4200- and 5200-series.

Standout performance in a single socket

This generation introduces three CPU models designed for single-socket systems (providing very high throughput at relatively low-cost). They provide 20+ CPU cores at prices as much as $2,000 less than their multi-socket counterparts. If your workload performs well with a single CPU, these SKUs will be incredibly valuable:

  • Xeon 6209U outperforms nearly all of last generation’s Xeon Gold 6100-series CPUs
  • Xeon 6210U outperforms all Xeon 6100-series and many 6200-series CPUs
  • Xeon 6212U outperforms several of the Xeon 8100-series CPUs

The only exception to the above would be for applications which require very high clock speeds, as these single-socket CPU models do not provide base processor frequencies higher than 2.5GHz. The strength of these single-socket processors is in high throughput (via high core count) and decent clock speeds.

Next Steps: get started today!

Read More

If you’d like to read more about these new processors, check out our article with detailed specifications of the Intel Xeon “Cascade Lake SP” CPUs. We summarize and compare the specifications of each model, and provide guidance on which models are likely to be best suited to computationally-intensive HPC & Deep Learning applications.

Try Intel Xeon Scalable CPUs for Yourself

Groups that prefer to verify performance before finalizing a design are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with Intel Xeon Scalable CPUs, large memory, and more.

Speak with an Expert

If you’re expecting to be upgrading or deploying new systems in the coming months, our experts would be happy to help you consider your options and design a custom cluster optimized to your workloads. We also help groups writing budget proposals to ensure they’re requesting the correct resources. Please get in touch!

The post Intel Xeon Scalable “Cascade Lake SP” Processor Review appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/intel-xeon-scalable-cascade-lake-sp-processor-review/feed/ 1
NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/ https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/#respond Fri, 15 Mar 2019 17:06:57 +0000 https://www.microway.com/?p=11118 Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no […]

The post NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks appeared first on Microway.

]]>
Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision, and with and without PCI-E transfer overhead included. This means there can be up to four results for each benchmark. The benchmarks are organized into three levels and can be run individually or all together.
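
For those who want to reproduce this kind of comparison, the general workflow is to build SHOC against CUDA and then launch the bundled driver script. The exact configure options and driver flags below are from our recollection of the SHOC repository's README and may differ between versions, so treat this as a sketch and consult the project documentation.

# Fetch and build SHOC (configure may need platform-specific options on your system)
git clone https://github.com/vetter/shoc.git && cd shoc
./configure
make -j 8
# Run the CUDA benchmarks at the largest problem size and collect the results
perl tools/driver.pl -cuda -s 4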

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

GPU                             | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)              | 253.38   | 7072.86    | 4736.76
Fast Fourier Transform (GFLOPS) | 132.60   | 1148.75    | 756.29
Matrix Multiplication (GFLOPS)  | 249.57   | 5920.01    | 4256.08
Molecular Dynamics (GFLOPS)     | 105.26   | 908.62     | 402.96
S3D (GFLOPS)                    | 59.97    | 227.85     | 161.54

 

Single Precision Results

GPU                             | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)              | 8073.26  | 14016.50   | 9322.46
Fast Fourier Transform (GFLOPS) | 660.05   | 2301.32    | 1510.49
Matrix Multiplication (GFLOPS)  | 3290.94  | 13480.40   | 8793.33
Molecular Dynamics (GFLOPS)     | 572.91   | 997.61     | 480.02
S3D (GFLOPS)                    | 99.42    | 434.78     | 295.20

 

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.
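
To make that context concrete, here is the arithmetic using the single-precision Matrix Multiplication numbers from the tables above and the approximate prices and TDPs quoted in this article (T4: ~$2.5k and 70W; V100: ~$10k and 250~300W).

echo "scale=2; 3290.94 / 13480.40" | bc   # ~0.24 of V100's single-precision GEMM throughput
echo "scale=2; 2500 / 10000" | bc         # 0.25 of the price
echo "scale=2; 70 / 300" | bc             # ~0.23 of the power draw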

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 delivers substantial single- and mixed-precision performance focused on machine learning, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. Its single-precision performance caters strongly to machine learning algorithms, many of which can also be adapted to mixed precision. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithm adaptation, the T4 may be a strong contender for scientific applications that also want to use machine learning capabilities to analyze results, or that run a mix of machine learning and scientific computing algorithms on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

Next Steps for leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload, but you'd like to benchmark it first, please sign up for a Test Drive and see for yourself. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on the best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

The post NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/feed/ 0
NVIDIA Tesla V100 Price Analysis https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/#respond Wed, 09 May 2018 00:52:23 +0000 https://www.microway.com/?p=10150 Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?” Tesla V100 32GB GPUs are shipping in volume, and our full line of […]

The post NVIDIA Tesla V100 Price Analysis appeared first on Microway.

]]>
Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

Tesla GPU model               | Price                        | Double-Precision Performance (FP64) | Dollars per TFLOPS       | Deep Learning Performance (TensorFLOPS or 1/2 Precision) | Dollars per DL TFLOPS
Tesla V100 PCI-E 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7 TFLOPS                            | $1,523 ($1,637 for 32GB) | 112 TFLOPS                                                | $95.21 ($102.30 for 32GB)
Tesla P100 PCI-E 16GB         | $7,374*                      | 4.7 TFLOPS                          | $1,569                   | 18.7 TFLOPS                                               | $394.33
Tesla V100 SXM 16GB or 32GB   | $10,664* ($11,458* for 32GB) | 7.8 TFLOPS                          | $1,367 ($1,469 for 32GB) | 125 TFLOPS                                                | $85.31 ($91.66 for 32GB)
Tesla P100 SXM2 16GB          | $9,428*                      | 5.3 TFLOPS                          | $1,779                   | 21.2 TFLOPS                                               | $444.72

* single-unit list price before any applicable discounts (ex: EDU, volume)
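
The cost-effectiveness columns are simply the list price divided by the relevant throughput figure. For example, for the Tesla V100 PCI-E 16GB:

echo "scale=0; 10664 / 7" | bc     # ~$1,523 per FP64 TFLOPS
echo "scale=2; 10664 / 112" | bc   # ~$95.21 per TensorFLOPS (deep learning)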

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration

Tesla V100 Double Precision HPC: Pay More for the GPU, Get More Performance

VMD visualization of a nucleosome

You’ll notice that Tesla V100 delivers an almost 50% increase in double precision performance. This is crucial for many HPC codes. A variety of applications have been shown to mirror this performance boost. In addition, Tesla V100 now offers the option of 2X the memory of Tesla P100 16GB for memory bound workloads.

Tesla V100 is a compelling choice for HPC workloads: it will almost always deliver the greatest absolute performance. However, in the right situation, a Tesla P100 can still deliver reasonable price/performance as well.

Both Tesla P100 and V100 GPUs should be considered for GPU-accelerated HPC clusters and servers. A Microway expert can help you evaluate what's best for your needs and applications and/or provide you with remote benchmarking resources.

Tesla V100 for Deep Learning: Enormous Advancement & Value - The New Standard


If your goal is maximum Deep Learning performance, Tesla V100 is an enormous on-paper leap in performance. The dedicated TensorCores have huge performance potential for deep learning applications. NVIDIA has even coined a new term, “TensorFLOPS,” to measure this gain. Tesla V100 delivers a 6X on-paper advancement.

If your budget allows you to purchase at least 1 Tesla V100, it’s the right GPU to invest in for deep learning performance. For the first time, the beefy Tesla V100 GPU is compelling for not just AI Training, but AI Inference as well (unlike Tesla P100).

Moreover, only a selection of Deep Learning frameworks are fully taking advantage of the TensorCores today. As more and more DL frameworks are optimized to use these new TensorCores and their instructions, the gains will grow. Even before many of these optimizations, numerous workloads have already advanced 3X-4X.

Finally, there is no longer an SXM cost premium for Tesla V100 GPUs (and only a modest premium for SXM-enabled host servers). Nearly all DL applications benefit greatly from GPU-to-GPU NVLink connectivity; a selection of HPC applications (e.g., AMBER) already do as well.

If you’re running DL frameworks, select Tesla V100 and if possible the SXM-enabled GPUs and servers.

FLOPS vs Real Application Performance

Unless you know for certain that your workload correlates with raw FLOPS, we strongly discourage anyone from making purchasing decisions based strictly upon raw $/FLOPS calculations.

While the generalizations above are useful, application performance can differ dramatically from any simplistic FLOPS calculation. Device-to-device bandwidth, host-to-device bandwidth, GPU memory bandwidth, and code maturity are all levers on realized application performance that matter just as much as FLOPS.

Here is some of NVIDIA's own application performance testing across real applications:


You’ll see that some codes scale similarly to the on-paper FLOPS gains, and others are frankly far more removed.

At most, use such simplistic FLOPS and price/performance calculations to guide higher-level decision-making: to predict how new hardware will compare against your prior testing of FLOPS vs. actual performance, to narrow down which GPUs to consider, to decide what to purchase for a POC, or to identify the appropriate GPUs to test remotely in order to validate actual application performance.

No one should buy based upon price/performance per FLOP; most should buy based upon price/performance per workload (or basket of workloads).

When Paper Performance + Intuition Collide with Reality

While the above guidelines are helpful, there are still a wide diversity of workloads out there in the field. Apart from testing that steers you to one GPU or another, here’s some good reasons we’ve seen or advised customers to use to make other selections:

Tesla V100 SXM 2.0 GPU
  • Your application has shown diminishing returns to advances in GPU performance in the past (Tesla P100 might be a price/performance choice)
  • Your budget doesn’t allow for even a single Tesla V100 (pick Tesla P100, still great speedups)
  • Your budget allows for a server with 2 Tesla P100s, but not 2 Tesla V100s (Pick 2 Tesla P100s vs 1 Tesla V100)
  • Your application is GPU memory capacity-bound (pick Tesla V100 32GB)
  • There are workload sharing considerations (ex: preferred scheduler only allocates whole GPUs)
  • Your application isn’t multi-GPU enabled (pick Tesla V100, the most powerful single GPU)
  • Your application is GPU memory bandwidth limited (test it, but potential case for Tesla P100)

Further Resources

You may wish to reference our Comparison of Tesla “Volta” GPUs, which summarizes the technical improvements made in these new GPUs, or our Tesla V100 GPU Review for a more extended discussion.

If you’re looking to see how these GPUs will be deployed in production, read our NVIDIA GPU Clusters page. As always, please feel free to reach out to us if you’d like to get a better understanding of these latest HPC systems and what they can do for you.

The post NVIDIA Tesla V100 Price Analysis appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/feed/ 0
NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management https://www.microway.com/hpc-tech-tips/nvidia-dcgm-for-gpu-management/ https://www.microway.com/hpc-tech-tips/nvidia-dcgm-for-gpu-management/#respond Mon, 02 Apr 2018 03:58:47 +0000 https://www.microway.com/?p=10643 Managing an HPC server can be a tricky job, and managing multiple servers even more complex. Adding GPUs adds even more power yet new levels of granularity. Luckily, there’s a powerful, and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager. Executing hardware or health checks DCGM’s power […]

The post NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management appeared first on Microway.

]]>
Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings even more power, yet also new levels of granularity to manage. Luckily, there's a powerful and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager (DCGM).

Executing hardware or health checks

DCGM’s power comes from its ability to access all kinds of low level data from the GPUs in your system. Much of this data is reported by NVML (NVIDIA Management Library), and it may be accessible via IPMI on your system. But DCGM helps make it far easier to access and use the following:

Report what GPUs are installed, in which slots and PCI-E trees and make a group

Build a group of GPUs once you know which slots your GPUs are installed in and which PCI-E trees and NUMA nodes they sit on. This is great for binding jobs to the right devices and for tracking available capabilities.
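
As a quick illustration, the commands below list the GPUs DCGM can see and collect a subset of them into a named group. The flag spellings are from the dcgmi help output as we recall it, and the group ID (2) is just an example of what the create step might return; verify with dcgmi group --help on your system.

dcgmi discovery -l            # list the GPUs DCGM can see
dcgmi topo                    # show PCI-E/NVLink topology and CPU affinity
dcgmi group -c ml_gpus        # create an empty group named "ml_gpus"; note the group ID it returns
dcgmi group -g 2 -a 0,1,2,3   # add GPUs 0-3 to that group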

Determine GPU link states, bandwidths

Provide a report of the PCI-Express link speed each GPU is running at. You may also perform D2D and H2D bandwidth tests inside your system (to take action on the reports)

Read temps, boost states, power consumption, or utilization

Deliver data on the energy usage and utilization of your GPUs. This data can be used to control the cluster
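
For example, dcgmi dmon can stream these metrics for a group at a fixed interval. The field IDs below (GPU temperature, power draw, and GPU utilization) are the values we recall from the DCGM field identifier list; double-check them with dcgmi dmon --help before scripting against them.

# Sample temperature (150), power draw (155), and GPU utilization (203)
# for group 1 every 5 seconds, 12 samples total
dcgmi dmon -g 1 -e 150,155,203 -d 5000 -c 12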

Driver versions and CUDA versions

Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system

Run sample jobs and integrated validation

Run basic diagnostics and sample jobs that are built into the DCGM package.

Set policies

DCGM provides a mechanism for setting policies on a group of GPUs.

Policy driven management: elevating from “what’s happening” to “what can I do”

Simply accessing data about your GPUs is only of modest use. The power of DCGM is in how it arms you to act upon that data. DCGM allows administrators to take programmatic or preventative action when something isn't right.

Here are a few scenarios where data provided by DCGM allows for both powerful control of your hardware and concrete action:

Scenario 1: Healthchecks – periodic or before the job

Run a check before each job, after a job, or daily/hourly to ensure a cluster is performing optimally.

This allows you to preemptively stop a run if diagnostics fail or move GPUs/nodes out of the scheduling queue for the next job.

Scenario 2: Resource Allocation

Jobs often need a certain class of node (ex: with >4 GPUs or with IB & GPUs on the same PCI-E tree). DCGM can be used to report on the capabilities of a node and help identify appropriate resources.

Users/schedulers can subsequently send jobs only where they are capable of being executed

Scenario 3: “Personalities”

Some codes request specific CUDA or NVIDIA driver versions. DCGM can be used to probe the CUDA version/NVIDIA GPU driver version on a compute node.

Users can then script the deployment of alternate versions or launch containerized apps to support non-standard versions.

Scenario 4: Stress tests

Periodically stress test GPUs in a cluster with integrated functions

Stress tests like Microway GPU Checker can tease out failing GPUs, and reading data via DCGM during or after a test can identify bad nodes to be sidelined.

Scenario 5: Power Management

Programmatically set GPU Boost or max TDP levels for an application or run. This allows you to eke out extra performance.

Alternatively, set your GPUs to stay within a certain power band to reduce electricity costs when rates are high, or to lower total cluster consumption when there is insufficient generation capacity.
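
A minimal sketch of what that looks like in practice is below. The wattage and clock values are examples only (query the supported ranges first), and the dcgmi config option spelling is an assumption to verify against dcgmi config --help.

nvidia-smi -q -d POWER              # check the supported power limit range for your GPUs
nvidia-smi -pl 200                  # cap each GPU at 200W (example value)
nvidia-smi -q -d SUPPORTED_CLOCKS   # list valid memory,graphics clock pairs
nvidia-smi -ac 877,1380             # pin application clocks to one supported pair (example values)
dcgmi config -g 1 --set -P 200      # apply an equivalent power cap to a whole DCGM group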

Scenario 6: Logging for Validation

Script the pull of error logs and take action with that data.

You can accumulate error logs over time and determine trends across your cluster. For example, a group of GPUs with consistently high temperatures may indicate a hotspot in your datacenter.

Getting Started with DCGM: Starting a Health Check

DCGM can be used in many ways. We won’t explore them all here, but it’s important to understand the ease of use of these capabilities.

Here’s the code for a simple health check and also for a basic diagnostic:

dcgmi health --check -g 1
dcgmi diag -g 1 -r 1

The syntax is very standard and includes dcgmi, the command, and the group of GPUs (you must set a group first). In the diagnostic, you include the level of diagnostics requested (-r 1, or lowest level here).
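
Putting it together, a pre-job health gate might look like the sketch below. The group ID, the -s a watch flag (enable all background health watches), and the assumption that dcgmi returns a non-zero exit code on failure should all be verified against dcgmi --help in your environment; the Slurm drain command is just one example of how a scheduler could react.

dcgmi group -c prejob_gpus    # create a group; assume it returns group ID 2
dcgmi group -g 2 -a 0,1,2,3   # add the GPUs the job will use
dcgmi health -g 2 -s a        # enable all background health watches for the group
dcgmi health -g 2 --check     # quick health report before launching the job
dcgmi diag -g 2 -r 2 || scontrol update NodeName=$(hostname) State=DRAIN Reason="DCGM diag failed"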

DCGM and Cluster Management

While the Microway team loves advanced scripting, you may prefer integrating DCGM or its capabilities in with your existing schedulers or cluster managers. The following are supported or already leverage DCGM today:

What’s Next

What will you do with DCGM or DCGM-enabled tools? We’ve only scratched the surface. There are extensive resources on how to use DCGM and/or how it is integrated with other tools. We recommend this blog post.

The post NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/nvidia-dcgm-for-gpu-management/feed/ 0
Designing A Production-Class AI Cluster https://www.microway.com/hpc-tech-tips/designing-a-production-class-ai-cluster/ https://www.microway.com/hpc-tech-tips/designing-a-production-class-ai-cluster/#respond Fri, 27 Oct 2017 14:49:50 +0000 https://www.microway.com/?p=9997 Artificial Intelligence (AI) and, more specifically, Deep Learning (DL) are revolutionizing the way businesses utilize the vast amounts of data they collect and how researchers accelerate their time to discovery. Some of the most significant examples come from the way AI has already impacted life as we know it such as smartphone speech recognition, search […]

The post Designing A Production-Class AI Cluster appeared first on Microway.

]]>
Artificial Intelligence (AI) and, more specifically, Deep Learning (DL) are revolutionizing the way businesses utilize the vast amounts of data they collect and how researchers accelerate their time to discovery. Some of the most significant examples come from the way AI has already impacted life as we know it such as smartphone speech recognition, search engine image classification, and cancer detection in biomedical imaging. Most businesses have collected troves of data or incorporated new avenues to collect data in recent years. Through the innovations of deep learning, that same data can be used to gain insight, make accurate predictions, and pave the path to discovery.

Developing a plan to integrate AI workloads into an existing business infrastructure or research group presents many challenges. However, there are two key elements that will drive the decisions when customizing an AI cluster. First, understanding the types and volumes of data is paramount to understanding the computational requirements of training the neural network. Second, understanding the business expectation for time-to-result is equally important. Each of these factors influences the first and second stages of the AI workload, respectively. Underestimating the data characteristics will result in insufficient computational and infrastructure resources to train the networks in a reasonable timeframe. Likewise, underestimating the value of time-to-results can fail to deliver ROI to the business or hamper research results.

Below are summaries of the different features of system design that must be evaluated when configuring an AI cluster in 2017.

System Architectures

AI workloads are very similar to HPC workloads in that they require massive computational resources combined with fast and efficient access to giant datasets. Today, there are systems designed to serve the workload of an AI cluster. The systems outlined in the sections below generally share similar characteristics: high-performance CPU cores, large-capacity system memory, multiple NVLink-connected GPUs per node, 10G Ethernet, and EDR InfiniBand. However, there are nuanced differences with each platform. Read below for more information about each.

Microway GPU-Accelerated NumberSmashers

Microway demonstrates the value of experience with every GPU cluster deployment. The company’s long history of designing and deploying state of the art GPU clusters for HPC makes our expertise invaluable when custom configuring full-scale, production-ready AI clusters. One of the most common GPU nodes used in our AI offerings is the NumberSmasher 1U with NVLink. The system features dense compute performance in a small footprint, making it a building block for scale-out cluster design. Alternatively, the Octoputer with Single Root Complex offers the most GPUs per system to maximize the total throughput of a single system.

To ensure maximum performance and field reliability, our system integrators test and tune every node built. Clusters, once integrated, undergo complete system testing to assure peak operability. We offer AI integration services for installation and testing of AI frameworks in addition to the full suite of cluster management utilities and software. Additionally, all Microway systems come complete with Lifetime Technical Support.

To learn more about Microway’s GPU clusters and systems, please visit Tesla GPU clusters.

NVIDIA DGX Systems

NVIDIA’s DGX-1 and DGX Station systems deliver not only dense computational power per system, they also include access to the NVIDIA GPU Cloud and Container Registry. These NVIDIA resources provide optimized container environments for the host of libraries and frameworks typically running on an AI cluster. This allows researchers and data scientists to focus on delivering results instead of worrying about software maintenance and tuning. As an Elite Solutions Provider of NVIDIA products, Microway offers DGX systems as either a full system solution or as part of a custom cluster design.

IBM Power Systems with PowerAI

IBM’s commitment to innovative chip and system design for HPC and AI workloads has created a platform for next-generation computing. Through collaboration with NVIDIA, the IBM Power Systems are the only available GPU platforms that integrate NVLink connectivity between the CPU and GPU. IBM’s latest AC922 Power System release delivers 10x the throughput over traditional x86 systems. Additionally, Microway integrates IBM PowerAI to provide faster time to deployment with their optimized software distribution.

Professional vs. Consumer GPUs

NVIDIA GPUs are the primary element to designing a world class AI deployment. In fact, NVIDIA’s commitment to delivering AI to everyone has led them to produce a multi-tiered array of GPU accelerators. Microway’s engineers often face questions about the difference between NVIDIA’s consumer GeForce and professional Tesla GPU accelerators. Although at first glance the higher-end GeForce GPUs seem to mimic the computational capabilities of the professional Tesla products, this is not always the case. Upon further inspection, the differences become quite evident.

When determining which GPU to use, raw performance numbers are typically the first technical specifications to review. In specific regard to AI workloads, a Tesla GPU has up to 1000X the performance of a high end GeForce card running half precision floating point calculations (FP16). The GeForce cards also do not support INT8 instructions used in Deep Learning inferencing. Although it is possible to use consumer GPUs for AI work, it is not recommended for large-scale production deployments. Aside from raw throughput, there are many other features that we outline in our article at the link below.

The price of the consumer cards allows businesses and researchers to understand the potential impact of AI and develop code on single systems without investing in a larger infrastructure. Microway recommends that the use of consumer cards be limited to development workstations during the investigatory and development process.

Our knowledge center provides a detailed article on the differences between Tesla and GeForce.

Training and Inferencing

There is a stark contrast between the resources needed for efficient training versus efficient inferencing. Training neural networks requires significant GPU resources for computation, host system resources for data passing, reliable and fast access to entire datasets, and a network architecture to support it all. The resource requirement for inferencing, however, depends on how the new data will be inferenced in production. Real-time inferencing has a far lower computational requirement because the data is fed to the neural network as it occurs in real time. This is very different from bulk inference where entire new data sets are fed into the neural network at the same time. Also, going back to the beginning, understanding the expectation for time-to-result will likely impact the overall cluster design regardless of inference workload.

Storage Architecture

The type of storage architecture used with an AI cluster can and will have a significant impact on efficiency of the cluster. Although storage can seem a rather nebulous topic, the demands of an AI workload are a mostly known factor. During training, the nodes of the cluster will need access to entire data sets because the data will be accessed often and in succession throughout the training process. Many commercial AI appliances, such as the DGX-1, leverage large high-speed cache volumes in each node for efficiency.

Standard and High-Performance Network File Systems are sufficient for small to medium sized AI cluster deployments. If the nodes have been configured properly to each have sufficient cache space, the file system itself does not need to be exceptionally performant as it is simply there for long-term storage. However, if the nodes do not have enough local cache space for the dataset, the need for performant storage increases. There are component features that can increase the performance of an NFS without moving to a parallel file system, but this is not a common scenario for this workload. The goal should always be to have enough local cache space for optimal performance.

Parallel File Systems are known for their performance and sometimes price. These storage systems should be reserved for larger cluster deployments where it will provide the best benefit per dollar spent.

Network Infrastructure

Deploying the right kind of network infrastructure will reduce bottlenecks and improve the performance of the AI cluster. The guidelines for networking will change depending on the size/type of data passing through the network as well as the nature of the computation. For instance, small text files will not need as much bandwidth as 4K video files, but Deep Learning training requires access to the entire data pool which can saturate the network. Going back to the beginning of this article, understanding data sets will help identify and prevent system bottlenecks. Our experts can help walk you through that analysis.

All GPU cluster deployments, regardless of workload, should utilize a tiered networking system that includes a management network and data traffic network. Management networks are typically a single Gigabit or 10Gb Ethernet link to support system management and IPMI. Data traffic networks, however, can require more network bandwidth to accommodate the increased amount of traffic as well as lower latency for increased efficiency.

Common data networks use either Ethernet (10G/25G/40G/50G) or InfiniBand (200Gb or 100Gb). There are many cases where 10G~50G Ethernet will be sufficient for the file sizes and volume of data passing through the network at the same time. These types of networks are often used in workloads with smaller files sizes such as still images or where computation happens within a single node. They can also be a cost-effective network for a cluster with a small number of nodes.

However, for larger files and/or multi-node GPU computation such as DL training, 100Gb EDR InfiniBand is the network fabric of choice for increased bandwidth and lower latency. InfiniBand enables Peer-to-Peer GPU communication between nodes via Remote Direct Memory Access (RDMA) which can increase the efficiency of the overall system.
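
Before assuming peer-to-peer or GPUDirect RDMA traffic will flow efficiently, it is worth checking how the GPUs and InfiniBand adapters are actually attached. The two commands below are standard NVIDIA and InfiniBand utilities; interpreting the topology matrix (e.g., devices sharing a PCI-E switch versus crossing CPU sockets) is covered in the nvidia-smi documentation.

nvidia-smi topo -m   # show how GPUs and NICs connect: NVLink, shared PCI-E switch, or across sockets
ibstat               # confirm the InfiniBand port is Active and running at the expected rate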

To compare network speeds and latencies, please visit Performance Characteristics of Common Network Fabrics

The post Designing A Production-Class AI Cluster appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/designing-a-production-class-ai-cluster/feed/ 0
Tesla V100 “Volta” GPU Review https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/ https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/#respond Thu, 28 Sep 2017 13:50:32 +0000 https://www.microway.com/?p=9401 The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built. Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization […]

The post Tesla V100 “Volta” GPU Review appeared first on Microway.

]]>

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here’s a summary of the key enhancements vs Tesla P100 GPUs

                             | Tesla V100 with NVLink | Tesla V100 PCI-E | Tesla P100 with NVLink   | Tesla P100 PCI-E         | Ratio Tesla V100:P100
DP TFLOPS                    | 7.8 TFLOPS             | 7.0 TFLOPS       | 5.3 TFLOPS               | 4.7 TFLOPS               | ~1.4-1.5X
SP TFLOPS                    | 15.7 TFLOPS            | 14 TFLOPS        | 9.3 TFLOPS               | 8.74 TFLOPS              | ~1.4-1.5X
TensorFLOPS                  | 125 TFLOPS             | 112 TFLOPS       | 21.2 TFLOPS (1/2 prec.)  | 18.7 TFLOPS (1/2 prec.)  | ~6X
Interface (bidirectional BW) | 300GB/sec              | 32GB/sec         | 160GB/sec                | 32GB/sec                 | 1.88X NVLink, 9.38X vs PCI-E
Memory Bandwidth             | 900GB/sec              | 900GB/sec        | 720GB/sec                | 720GB/sec                | 1.25X
CUDA Cores (Tensor Cores)    | 5120 (640)             | 5120 (640)       | 3584                     | 3584                     |
Performance of Tesla GPUs, Generation to Generation

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium to the NVLink-enabled SXM2.0 form factor GPUs. We’re excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:

Performance Summary

Increases in relative performance are heavily workload dependent. But early testing demonstrates HPC performance advancing approximately 50% in just a 12-month period.
Tesla V100 HPC Performance
If you haven’t made the jump to Tesla P100 yet, Tesla V100 is an even more compelling proposition.

For Deep Learning, Tesla V100 delivers a massive leap in performance. This extends to training time, and it also brings unique advancements for inference as well.
Deep Learning Performance Summary - Tesla V100

Enhancements for Every Workload

Now that we’ve learned a bit about each GPU, let’s review the breakthrough advancements for Tesla V100.

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

NVLink helps you to transcend the bottlenecks in PCI-Express based systems and dramatically improve the data flow to GPU accelerators.

The total data pipe for each Tesla V100 with NVLink GPUs is now 300GB/sec of bidirectional bandwidth—that’s nearly 10X the data flow of PCI-E x16 3.0 interfaces!

Bandwidth

NVIDIA has enhanced the bandwidth of each individual NVLink brick by 25%: from 40GB/sec (20+20GB/sec in each direction) to 50GB/sec (25+25GB/sec) of bidirectional bandwidth.

This improved signaling helps carry the load of data intensive applications.

Number of Bricks

But enhanced NVLink isn't just about simple signaling improvements. Point-to-point NVLink connections are divided into “bricks” or links. Each brick now delivers 50GB/sec of bidirectional bandwidth.

From 4 bricks to 6

Over and above the signaling improvements, Tesla V100 with NVLink increases the number of NVLink “bricks” that are embedded into each GPU: from 4 bricks to 6.

This 50% increase in brick quantity delivers a large increase in bandwidth for the world’s most data intensive workloads. It also allows for more diverse set of system designs and configurations.
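
The headline bandwidth figure follows directly from these two changes; here is the arithmetic behind the numbers quoted in this article.

echo "6 * 50" | bc              # 6 NVLink bricks x 50GB/sec per brick = 300GB/sec bidirectional
echo "scale=3; 300 / 160" | bc  # ~1.88x the NVLink bandwidth of Tesla P100
echo "scale=3; 300 / 32" | bc   # ~9.38x the 32GB/sec of a PCI-E 3.0 x16 link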

The NVLink Bank Account

NVIDIA NVLink technology continues to be a breakthrough for both HPC and Deep Learning workloads. But as with previous generations of NVLink, there’s a design choice related to a simple question:

Where do you want to spend your NVLink bandwidth?

Think about NVLink bricks as a “spending” or “bank” account. Each NVLink system design strikes a different balance in where it “spends the funds.” You may wish to:

  • Spend links on GPU:GPU communication
  • Focus on increasing the number of GPUs
  • Broaden CPU:GPU bandwidth to overcome the PCI-E bottleneck
  • Create a balanced system, or prioritize design choices solely for a single workload

There are good reasons for each of these choices, or for combinations of them. DGX-1V, the NumberSmasher 1U GPU Server with NVLink, and future IBM Power Systems products all set different priorities.

We’ll dive more deeply into these choices in a future post.

Net-net: the NVLink bandwidth increases from 160GB/sec to 300GB/sec (bidirectional), and a number of new, diverse HPC and AI hardware designs are now possible.

Programming Improvements

Tesla V100 and CUDA 9 bring a host of improvements for GPU programmers. These include:

  • Cooperative Groups
  • A new combined L1 cache + shared memory design that simplifies programming
  • A new SIMT model that relieves the need to program to fit 32-thread warps

We won’t explore these in detail in this post, but we encourage you to read the following resources for more:

What does Tesla V100 mean for me?

Tesla V100 GPUs mean a massive change for many workloads. But each workload will see different impacts. Here’s a short summary of what we see:

  1. An increase in performance on HPC applications (on paper FLOPS increase of 50%, diverse real-world impacts)
  2. A massive leap for Deep Learning Training
  3. 1 GPU, many Deep Learning workloads
  4. New system designs, better tuned to your applications
  5. Radical new, and radically simpler, programming paradigms (coherence)

The post Tesla V100 “Volta” GPU Review appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/feed/ 0