DGX A100 review: Throughput and Hardware Summary
When NVIDIA launched the Ampere GPU architecture, they also launched their new flagship system for HPC and deep learning – the DGX A100. This system offers exceptional performance, but also new capabilities. We’ve seen immediate interest and have already shipped to some of the first adopters. Given our early access, we wanted to share a deeper dive into this impressive new system.

Photo of NVIDIA DGX A100 packaged, being lifted out of packaging, and being tested

The focus of this NVIDIA DGX™ A100 review is on the hardware inside the system – the server includes a number of features and improvements not available in any other type of server at the moment. DGX A100 will be the “go-to” server for 2020. But hardware only tells part of the story, particularly for NVIDIA’s DGX products. NVIDIA employs more software engineers than hardware engineers, so you can be certain that application and GPU library performance will continue to improve through updates to the DGX Operating System and to the whole catalog of software containers provided through the NGC hub. Expect more details as the year continues.

Overall DGX A100 System Architecture

This new DGX system offers top-bin parts across-the-board. Here’s the high-level overview:

  • Dual 64-core AMD EPYC 7742 CPUs
  • 1TB DDR4 system memory (upgradeable to 2TB)
  • Eight NVIDIA A100 SXM4 GPUs with NVLink
  • NVIDIA NVSwitch connectivity between all GPUs
  • 15TB high-speed NVMe SSD Scratch Space (upgradeable to 30TB)
  • Eight Mellanox 200Gbps HDR InfiniBand/Ethernet Single-Port Adapters
  • One or Two Mellanox 200Gbps Ethernet Dual-Port Adapter(s)

As you’ll see from the block diagram, there is a lot to break down within such a complex system. Though it’s a very busy diagram, it becomes apparent that the design is balanced and well laid out. Breaking down the connectivity within DGX A100 we see:

  • The eight NVIDIA A100 GPUs are depicted at the bottom of the diagram, with each GPU fully linked to all other GPUs via six NVSwitches
  • Above the GPUs are four PCI-Express switches which act as nexuses between the GPUs and the rest of the system devices
  • Linking into the PCI-E switch nexuses, there are eight 200Gbps network adapters and eight high-speed SSD devices – one for each GPU
  • The devices are broken into pairs, with 2 GPUs, 2 network adapters, and 2 SSDs per PCI-E nexus
  • Each of the AMD EPYC CPUs connects to two of the PCI-E switch nexuses
  • At the top of the diagram, each EPYC CPU is shown with a link to system memory and a link to a 200Gbps network adapter

We’ll dig into each aspect of the system in turn, starting with the CPUs and making our way down to the new NVIDIA A100 GPUs. Readers should note that throughput and performance numbers are only useful when put into context. You are encouraged to run the same tests on your existing systems/servers to better understand how the performance of DGX A100 will compare to your existing resources. And as always, reach out to Microway’s DGX experts for additional discussion, review, and design of a holistic solution.


AMD EPYC CPUs and System Memory

Diagram depicting the CPU cores, cache, and memory in the NVIDIA DGX A100
DGX A100 CPU/Memory topology (Click to expand)

With two 64-core EPYC CPUs and 1TB or 2TB of system memory, the DGX A100 boasts respectable performance even before the GPUs are considered. The architecture of the AMD EPYC “Rome” CPUs is outside the scope of this article, but offers an elegant design of its own. Each CPU provides 64 processor cores (supporting up to 128 threads), 256MB L3 cache, and eight channels of DDR4-3200 memory (which provides the highest memory throughput of any mainstream x86 CPU).

Most users need not dive further, but experts will note that each EPYC 7742 CPU has four NUMA nodes (for a total of eight nodes). This allows best performance for parallelized applications and can also reduce the impact of noisy neighbors. Pairs of GPUs are connected to NUMA nodes 1, 3, 5, and 7. Here’s a snapshot of CPU capabilities from the lscpu utility:

Architecture:        x86_64
CPU(s):              256
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
CPU MHz:             3332.691
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
NUMA node0 CPU(s):   0-15,128-143
NUMA node1 CPU(s):   16-31,144-159
NUMA node2 CPU(s):   32-47,160-175
NUMA node3 CPU(s):   48-63,176-191
NUMA node4 CPU(s):   64-79,192-207
NUMA node5 CPU(s):   80-95,208-223
NUMA node6 CPU(s):   96-111,224-239
NUMA node7 CPU(s):   112-127,240-255
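Since each pair of GPUs hangs off a particular NUMA node, parallel applications may benefit from pinning each process to the cores and memory nearest its GPU. As a minimal sketch (assuming GPU0, which the topology matrix later in this review places on NUMA node 3; the application name is a placeholder), a CUDA application could be launched through numactl:

# Bind execution and memory allocation to NUMA node 3 (nearest GPU0)
numactl --cpunodebind=3 --membind=3 ./my_cuda_app

The correct node for each GPU can be read from the nvidia-smi topo --matrix output shown later in this article.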

High-speed NVMe Storage

Although DGX A100 is designed to support extremely high-speed connectivity to network/cluster storage, it also provides internal flash storage drives. Redundant 2TB NVMe SSDs are provided to host the Operating System. Four non-redundant striped NVMe SSDs provide a 14TB space for scratch storage (which is most frequently used to cache data coming from a centralized storage system).

Here’s how the filesystems look on a fresh DGX A100:

Filesystem      Size  Used Avail Use%    Mounted on
/dev/md0        1.8T   14G  1.7T   1%    /
/dev/md1         14T   25M   14T   1%    /raid

The industry is trending towards Linux software RAID rather than hardware controllers for NVMe SSDs (as such controllers present too many performance bottlenecks). Here’s what the above md0 and md1 arrays look like when healthy:

md0 : active raid1 nvme1n1p2[0] nvme2n1p2[1]
      1874716672 blocks super 1.2 [2/2] [UU]
      bitmap: 1/14 pages [4KB], 65536KB chunk

md1 : active raid0 nvme5n1[2] nvme3n1[1] nvme4n1[3] nvme0n1[0]
      15002423296 blocks super 1.2 512k chunks
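Should you want to check these arrays on your own system, the standard Linux software RAID tools apply (a quick sketch; device names may differ):

cat /proc/mdstat              # summary of all md arrays and their sync state
mdadm --detail /dev/md1       # member devices, health, and layout of the scratch array

Note that /dev/md1 is a RAID0 stripe for scratch data (no redundancy), while /dev/md0 is the mirrored pair hosting the Operating System.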

It’s worth noting that although all the internal storage devices are high-performance, the scratch drives making up the /raid filesystem support the newer PCI-E generation 4.0 bus which doubles I/O throughput. NVIDIA leads the pack here, as they’re the first we’ve seen to be shipping these new super-fast SSDs.

High-Throughput and Low-Latency Communications with Mellanox 200Gbps

Photo of the internal system sled of DGX A100 with CPUs, Memory, and HCAs
Sled from DGX A100 showing ten 200Gbps adapters

Depending upon the deployment, nine or ten Mellanox 200Gbps adapters are present in each DGX A100. These adapters support Mellanox VPI, which enables each port to be configured for 200G Ethernet or HDR InfiniBand. Though Ethernet is prevalent in certain sectors (healthcare and other industry verticals), InfiniBand tends to be the fabric of choice when the highest performance is required.

In practice, a common configuration is for the GPU-adjacent adapters to be connected to an InfiniBand fabric (which allows for high-performance RDMA GPU-Direct and Magnum IO communications). The adapter(s) attached to the CPUs are then used for Ethernet connectivity (often matching the speed of the existing facility Ethernet, which might be any one of 10GbE, 25GbE, 40GbE, 50GbE, 100GbE, or 200GbE).

Leveraging the fast PCI-E 4.0 bus available in DGX A100, each 200Gbps port is able to push up to 24.6GB/s of throughput (with latencies typically ranging from 1.09 to 2.02 microseconds, as measured by OSU’s osu_bw and osu_latency benchmarks). Thus, a properly tuned application running across a cluster of DGX systems could push upwards of 200 gigabytes per second to the fabric!
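For those who wish to reproduce such measurements, the OSU micro-benchmarks are compiled against an MPI library and run between two hosts. A minimal sketch (host names are placeholders, and the path to the benchmark binaries depends on your build):

mpirun -np 2 -host dgx01,dgx02 ./osu_bw        # point-to-point bandwidth
mpirun -np 2 -host dgx01,dgx02 ./osu_latency   # point-to-point latency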

GPU-to-GPU Transfers with NVSwitch and NVLink

NVIDIA built a new generation of NVIDIA NVLink into the NVIDIA A100 GPUs, which provides double the throughput of NVLink in the previous “Volta” generation. Each NVIDIA A100 GPU supports up to 300GB/s throughput (600GB/s bidirectional). Combined with NVSwitch, which connects each GPU to all other GPUs, the DGX A100 provides full connectivity between all eight GPUs.

Running NVIDIA’s p2pBandwidthLatencyTest utility, we can examine the transfer speeds between each set of GPUs:

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1180.14 254.47 258.80 254.13 257.67 247.62 257.21 251.53
     1 255.35 1173.05 261.04 243.97 257.09 247.20 258.64 257.51
     2 253.79 260.46 1155.70 241.66 260.23 245.54 259.49 255.91
     3 256.19 261.29 253.87 1142.18 257.59 248.81 250.10 259.44
     4 252.35 260.44 256.82 249.11 1169.54 252.46 257.75 255.62
     5 256.82 257.64 256.37 249.76 255.33 1142.18 259.72 259.95
     6 261.78 260.25 261.81 249.77 258.47 248.63 1173.05 255.47
     7 259.47 261.96 253.61 251.00 259.67 252.21 254.58 1169.54

The above values show GPU-to-GPU transfer throughput ranging from 247GB/s to 262GB/s. Running the same test in bidirectional mode shows results between 473GB/s and 508GB/s. Execution within the same GPU (running down the diagonal) shows data rates around 1,150GB/s.

Turning to latencies, we see fairly uniform communication times between GPUs at ~3 microseconds:

P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.63   2.98   2.99   2.96   3.01   2.96   2.96   3.00
     1   3.02   2.59   2.96   3.00   3.03   2.96   2.96   3.03
     2   3.02   2.95   2.51   2.97   3.03   3.04   3.02   2.96
     3   3.05   3.01   2.99   2.49   2.99   2.98   3.06   2.97
     4   2.88   2.88   2.95   2.87   2.39   2.87   2.90   2.88
     5   2.87   2.95   2.89   2.87   2.94   2.49   2.87   2.87
     6   2.89   2.86   2.86   2.88   2.93   2.93   2.53   2.88
     7   2.90   2.90   2.94   2.89   2.87   2.87   2.87   2.54

   CPU     0      1      2      3      4      5      6      7
     0   4.54   3.86   3.94   4.10   3.92   3.93   4.07   3.92
     1   3.99   4.52   4.00   3.96   3.98   4.05   3.92   3.93
     2   4.09   3.99   4.65   4.01   4.00   4.01   4.00   3.97
     3   4.10   4.01   4.03   4.59   4.02   4.03   4.04   3.95
     4   3.89   3.91   3.83   3.88   4.29   3.77   3.76   3.77
     5   4.20   3.87   3.83   3.83   3.89   4.31   3.89   3.84
     6   3.76   3.72   3.77   3.71   3.78   3.77   4.19   3.77
     7   3.86   3.79   3.78   3.78   3.79   3.83   3.81   4.27

As with the bandwidths, the values down the diagonal show execution within that particular GPU. Latencies are lower when executing within a single GPU as there’s no need to hop across the bus to NVSwitch or another GPU. These values show that the same-device latencies are 0.3~0.5 microseconds faster than when communicating with a different GPU via NVSwitch.
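The p2pBandwidthLatencyTest utility used above ships with the CUDA samples, so these measurements are straightforward to repeat on your own systems. A sketch, assuming the samples are installed alongside the CUDA toolkit in the default location (paths vary by CUDA version):

# Build and run the peer-to-peer bandwidth/latency test from the CUDA samples
cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest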

Finally, we want to share the full DGX A100 topology as reported by the nvidia-smi topo --matrix utility. While a lot to digest, the main takeaways from this connectivity matrix are the following:

  • all GPUs have full NVLink connectivity (12 links each)
  • each pair of GPUs is connected to a pair of Mellanox adapters via a PXB PCI-E switch
  • each pair of GPUs is closest to a particular set of CPU cores (CPU and NUMA affinity)
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Host-to-Device Transfer Speeds with PCI-Express generation 4.0

Just as it’s important for the GPUs to be able to communicate with each other, the CPUs must be able to communicate with the GPUs. A100 is the first NVIDIA GPU to support the new PCI-E gen4 bus speed, which doubles the transfer speeds of generation 3. True to expectations, NVIDIA bandwidthTest demonstrates 2X speedups on transfer speeds from the system to each GPU and from each GPU to the system:

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			24.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			26.1

As you might notice, these performance values are right in line with the throughput of each Mellanox 200Gbps adapter. Having eight network adapters with the exact same bandwidth as each of the eight GPUs allows for perfect balance. Data can stream into each GPU from the fabric at line rate (and vice versa).
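The bandwidthTest utility also comes from the CUDA samples. For reference, here is a sketch of the kind of invocations that produce numbers like those above:

./bandwidthTest --memory=pinned --device=0     # pinned-memory transfers to/from GPU 0
./bandwidthTest --memory=pinned --device=all   # sweep every GPU in the system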

Diving into the NVIDIA A100 SXM4 GPUs

The DGX A100 is unique in leveraging NVSwitch to provide the full 300GB/s NVLink bandwidth (600GB/s bidirectional) between all GPUs in the system. Although it’s possible to examine a single GPU within this platform, it’s important to keep in mind the context that the GPUs are tightly connected to each other (as well as their linkage to the EPYC CPUs and the Mellanox adapters). The single-GPU information we share below will likely match that shown for A100 SXM4 GPUs in other non-DGX systems. However, their overall performance will depend on the complete system architecture.

To start, here is the ‘brief’ dump of GPU information as provided by nvidia-smi on DGX A100:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    60W / 400W |      0MiB / 40537MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 40537MiB |     14%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      0MiB / 40537MiB |     24%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    58W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    62W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   33C    P0    60W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    65W / 400W |      0MiB / 40537MiB |     22%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    63W / 400W |      0MiB / 40537MiB |     21%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The clock speed and power consumption of each GPU will vary depending upon the workload (running low when idle to conserve energy and running as high as possible when executing applications). The idle, default, and max boost speeds are shown below. You will note that memory speeds are fixed at 1215 MHz.

    Clocks
        Graphics                          : 420 MHz (GPU is idle)
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1215 MHz

Those who have particularly stringent efficiency or power requirements will note that the NVIDIA A100 SXM4 GPU supports 81 different clock speeds between 210 MHz and 1410MHz. Power caps can be set to keep each GPU within preset limits between 100 Watts and 400 Watts. Microway’s post on nvidia-smi for GPU control offers more details for those who need such capabilities.
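For reference, such tuning is done through nvidia-smi. A brief sketch (clock values must be chosen from the supported list reported by your own GPUs):

nvidia-smi -q -d SUPPORTED_CLOCKS    # list valid memory,graphics clock pairs
nvidia-smi -i 0 -ac 1215,1410        # pin GPU 0 to fixed application clocks (memory,graphics)
nvidia-smi -i 0 -pl 300              # cap GPU 0 at 300 Watts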

Each new generation of NVIDIA GPUs introduces new architecture capabilities and adjustments to existing features (such as resized caches). Some details can be found through the deviceQuery utility, which reports the CUDA capabilities of each NVIDIA A100 GPU device:

  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40537 MBytes (42506321920 bytes)
  (108) Multiprocessors, ( 64) CUDA Cores/MP:     6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes

In the NVIDIA A100 GPU, NVIDIA increased cache & global memory size, introduced new instruction types, enabled new asynchronous data copy capabilities, and more. More complete information is available in our Knowledge Center article which summarizes the features of the Ampere GPU architecture. However, it could be argued that the biggest architecture change is the introduction of MIG.

Multi-Instance GPU (MIG)

For years, virtualization has allowed CPUs to be virtually broken into chunks and shared between a wide group of users and/or applications. One physical CPU device might be simultaneously running jobs for a dozen different users. The flexibility and security offered by virtualization has spawned billion dollar businesses and whole new industries.

NVIDIA GPUs have supported multiple users and virtualization for a couple of generations, but NVIDIA A100 GPUs with MIG are the first to support physical separation of those tasks. In essence, one GPU can now be sliced into up to seven distinct hardware instances. Each instance then runs its own completely independent applications with no interruption or “noise” from other applications running on the GPU:

Diagram of NVIDIA Multi-Instance GPU demonstrating seven separate user instances on one GPU
NVIDIA Multi-Instance GPU supports seven separate user instances on one GPU

The MIG capabilities are significant enough that we won’t attempt to address them all here. Instead, we’ll highlight the most important aspects of MIG. Readers needing complete implementation details are encouraged to reference NVIDIA’s MIG documentation.

Each GPU can have MIG enabled or disabled (which means a DGX A100 system might have some shared GPUs and some dedicated GPUs). Enabling MIG on a GPU has the following effects:

  • One NVIDIA A100 GPU may be split into anywhere between 2 and 7 GPU Instances
  • Each of the GPU Instances receives a dedicated set of hardware units: GPU compute resources (including streaming multiprocessors/SMs, and GPU engines such as copy engines or NVDEC video decoders), and isolated paths through the entire memory system (L2 cache, memory controllers, and DRAM address busses, etc)
  • Each of the GPU Instances can be further divided into Compute Instances, if desired. Each Compute Instance is provided a set of dedicated compute resources (SMs), but all the Compute Instances within the GPU Instance share the memory and GPU engines (such as the video decoders)
  • A unique CUDA_VISIBLE_DEVICES identifier will be created for each Compute Instance and the corresponding parent GPU Instance. The identifier follows this convention:
    MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>
  • Graphics API support (e.g. OpenGL etc.) is disabled
  • GPU to GPU P2P (either PCI-Express or NVLink) is disabled
  • CUDA IPC across GPU instances is not supported (though IPC across the Compute Instances within one GPU Instance is supported)

Though the above caveats are important to note, they are not expected to be significant pain points in practice. Applications which require NVLink will be workloads that require significant performance and should not be run on a shared GPU. Applications which need to virtualize GPUs for graphical applications are likely to use a different type of NVIDIA GPU.

Also note that the caveats don’t extend all the way through the CUDA capabilities and software stack. The following features are supported when MIG is enabled:

  • MIG is transparent to CUDA and existing CUDA programs can run under MIG unchanged
  • CUDA MPS is supported on top of MIG
  • GPUDirect RDMA is supported when used from GPU Instances
  • CUDA debugging (e.g. using cuda-gdb) and memory/race checking (e.g. using cuda-memcheck or compute-sanitizer) is supported

When MIG is fully-enabled on the DGX A100 system, up to 56 separate GPU Instances can be executed simultaneously. That could be 56 unique workloads, 56 separate users each running a Jupyter notebook, or some other combination of users and applications. And if some of the users/workloads have more demanding needs than others, MIG can be reconfigured to issue larger slices of the GPU to those particular applications.
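To make the workflow concrete, here is a sketch of managing MIG with nvidia-smi (the profile ID below is a placeholder to be chosen from the list your system reports; enabling MIG requires the GPU to be idle):

nvidia-smi -i 0 -mig 1                       # enable MIG mode on GPU 0
nvidia-smi mig -lgip                         # list the supported GPU instance profiles
nvidia-smi mig -i 0 -cgi <profile-id> -C     # create a GPU instance plus its default Compute Instance
nvidia-smi -L                                # list the resulting MIG devices and their UUIDs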

DGX A100 Review Summary

DGX-POD with DGX A100

As mentioned at the top, this new hardware is quite impressive, but is only one part of the DGX story. NVIDIA has multiple software stacks to suit the broad range of possible uses for this system. If you’re just getting started, there’s a lot left to learn. Depending upon what you need next, Microway’s DGX experts can point you in the right direction.

In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators
This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by end of 2020).

Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second version dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below – speak with one of our GPU experts for a personalized summary of the options best suited to your needs.

Computational “Ampere” GPU architecture – important features and changes:

  • Exceptional HPC performance:
    • 9.7 TFLOPS FP64 double-precision floating-point performance
    • Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
    • 19.5 TFLOPS FP32 single-precision floating-point performance
  • Exceptional AI deep learning training and inference performance:
    • TensorFloat 32 (TF32) instructions improve performance without loss of accuracy
    • Sparse matrix optimizations potentially double training and inference performance
    • Speedups of 3x~20x for network training, with sparse TF32 TensorCores (vs Tesla V100)
    • Speedups of 7x~20x for inference, with sparse INT8 TensorCores (vs Tesla V100)
    • Tensor Cores support many instruction types: FP64, TF32, BF16, FP16, I8, I4, B1
  • High-speed HBM2 Memory delivers 40GB or 80GB capacity at 1.6TB/s or 2TB/s throughput
  • Multi-Instance GPU allows each A100 GPU to run seven separate/isolated applications
  • 3rd-generation NVLink doubles transfer speeds between GPUs
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
  • Larger and Faster L1 Cache and Shared Memory for improved performance
  • Improved L2 Cache is twice as fast and nearly seven times as large as L2 on Tesla V100
  • Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x increase in L2 capacity.

Visualization “Ampere” GPU architecture – important features and changes:

  • Double FP32 processing throughput with upgraded Streaming Multiprocessors (SM) that support FP32 computation on both datapaths
    (previous generations provided one dedicated FP32 path and one dedicated Integer path)
  • 2nd-generation RT cores provide up to a 2x increase in raytracing performance
  • 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
  • 3rd-generation NVLink provides up to 56.25 GB/sec bandwidth between pairs of GPUs in each direction
  • GDDR6X memory providing up to 768 GB/s of GPU memory throughput
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU

As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared in the tabs below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.

NVIDIA “Ampere” GPU Specifications

High Performance Computing & Deep Learning GPUs
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express version of the NVIDIA A100 GPU features a much lower TDP than the SXM4 version of the A100 GPU (250W vs 400W). For this reason, the PCI-Express GPU is not able to sustain peak performance in the same way as the higher-power part. Thus, the performance values of the PCI-E A100 GPU are shown as a range and actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.

| Feature | NVIDIA A30 PCI-E | NVIDIA A100 40GB PCI-E | NVIDIA A100 80GB PCI-E | NVIDIA A100 SXM4 |
|---|---|---|---|---|
| GPU Chip | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100 |
| TensorCore FP64 Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS |
| TensorCore TF32 Performance*† | 82 TFLOPS | 140 ~ 156 TFLOPS | 140 ~ 156 TFLOPS | 156 TFLOPS |
| TensorCore FP16/BF16 Performance*† | 165 TFLOPS | 281 ~ 312 TFLOPS | 281 ~ 312 TFLOPS | 312 TFLOPS |
| TensorCore INT8 Performance*† | 330 TOPS | 562 ~ 624 TOPS | 562 ~ 624 TOPS | 624 TOPS |
| TensorCore INT4 Performance*† | 661 TOPS | 1,123 ~ 1,248 TOPS | 1,123 ~ 1,248 TOPS | 1,248 TOPS |
| Double Precision (FP64) Performance* | 5.2 TFLOPS | 8.7 ~ 9.7 TFLOPS | 8.7 ~ 9.7 TFLOPS | 9.7 TFLOPS |
| Single Precision (FP32) Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS |
| Half Precision (FP16) Performance* | 41 TFLOPS | 70 ~ 78 TFLOPS | 70 ~ 78 TFLOPS | 78 TFLOPS |
| Brain Floating Point (BF16) Performance* | 20 TFLOPS | 35 ~ 39 TFLOPS | 35 ~ 39 TFLOPS | 39 TFLOPS |
| On-die Memory | 24GB HBM2 | 40GB HBM2 | 80GB HBM2 | 40GB HBM2 or 80GB HBM2e |
| Memory Bandwidth | 933 GB/s | 1,555 GB/s | 1,940 GB/s | 1,555 GB/s (40GB) or 2,039 GB/s (80GB) |
| L2 Cache | 40MB | 40MB | 40MB | 40MB |
| Interconnect | NVLink 3.0 (4 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0 |
| GPU-to-GPU transfer bandwidth (bidirectional) | 200 GB/s | 600 GB/s | 600 GB/s | 600 GB/s |
| Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s | 64 GB/s |
| # of MIG instances supported | up to 4 | up to 7 | up to 7 | up to 7 |
| # of SM Units | 56 | 108 | 108 | 108 |
| # of Tensor Cores | 224 | 432 | 432 | 432 |
| # of integer INT32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912 |
| # of single-precision FP32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912 |
| # of double-precision FP64 CUDA Cores | 1,792 | 3,456 | 3,456 | 3,456 |
| GPU Base Clock | 930 MHz | 765 MHz | 1065 MHz | 1095 MHz |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | 1440 MHz | 1410 MHz | 1410 MHz | 1410 MHz |
| Compute Capability | 8.0 | 8.0 | 8.0 | 8.0 |
| Workstation Support | no | no | no | no |
| Server Support | yes | yes | yes | yes |
| Cooling Type | Passive | Passive | Passive | Passive |
| Wattage (TDP) | 165W | 250W | 300W | 400W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature

Visualization & Ray Tracing GPUs
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for visualization and ray tracing. Note that these GPUs would not necessarily be connecting directly to a display device, but might be performing remote rendering from a datacenter.

To learn more about these GPUs and to review which are the best options for you, please speak with a GPU expert.

| Feature | NVIDIA RTX A5000 | NVIDIA RTX A6000 | NVIDIA A40 |
|---|---|---|---|
| GPU Chip | Ampere GA102 | Ampere GA102 | Ampere GA102 |
| TensorCore TF32 Performance*† | 55.6 TFLOPS | 77.4 TFLOPS | 74.8 TFLOPS |
| TensorCore FP16/BF16 Performance*† | 111.1 TFLOPS | 154.8 TFLOPS | 149.7 TFLOPS |
| TensorCore INT8 Performance*† | 222.2 TOPS | 309.7 TOPS | 299.3 TOPS |
| TensorCore INT4 Performance*† | 444.4 TOPS | 619.3 TOPS | 598.7 TOPS |
| Double Precision (FP64) Performance* | 0.4 TFLOPS | 0.6 TFLOPS | 0.6 TFLOPS |
| Single Precision (FP32) Performance* | 27.8 TFLOPS | 38.7 TFLOPS | 37.4 TFLOPS |
| Integer (INT32) Performance* | 13.9 TOPS | 19.4 TOPS | 18.7 TOPS |
| GPU Memory | 24GB | 48GB | 48GB |
| Memory Bandwidth | 768 GB/s | 768 GB/s | 696 GB/s |
| L2 Cache | 6MB | 6MB | 6MB |
| Interconnect | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink limited to pairs of directly-linked cards |
| GPU-to-GPU transfer bandwidth (bidirectional) | 112.5 GB/s | 112.5 GB/s | 112.5 GB/s |
| Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s |
| # of MIG instances supported | N/A | N/A | N/A |
| # of SM Units | 64 | 84 | 84 |
| # of RT Cores | 64 | 84 | 84 |
| # of Tensor Cores | 256 | 336 | 336 |
| # of integer INT32 CUDA Cores | 8,192 | 10,752 | 10,752 |
| # of single-precision FP32 CUDA Cores | 8,192 | 10,752 | 10,752 |
| # of double-precision FP64 CUDA Cores | 128 | 168 | 168 |
| GPU Base Clock | not published | not published | not published |
| GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic |
| GPU Boost Clock | not published | not published | not published |
| Compute Capability | 8.6 | 8.6 | 8.6 |
| Workstation Support | yes | yes | no |
| Server Support | no | no | yes |
| Cooling Type | Active | Active | Passive |
| Wattage (TDP) | 230W | 300W | 300W |

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature

Several lower-end graphics cards and datacenter GPUs are also available including RTX A2000, RTX A4000, A10, and A16. These GPUs offer similar capabilities, but with lower levels of performance and available at lower price points.

Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures

| Feature | Pascal GP100 | Volta GV100 | Ampere GA100 |
|---|---|---|---|
| Compute Capability* | 6.0 | 7.0 | 8.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 32 | 32 | 32 |
| Max Concurrent Kernels | 128 | 128 | 128 |
| 32-bit Registers per SM | 64 K | 64 K | 64 K |
| Max Registers per Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | 24KB dedicated cache | 32KB ~ 128KB, dynamic with shared memory | 28KB ~ 192KB, dynamic with shared memory |
| Shared Memory Configurations | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total) | configurable up to 164KB; remainder for L1 Cache (192KB total) |
| Max Shared Memory per SM | 64KB | 96KB | 164KB |
| Max Shared Memory per Thread Block | 48KB | 96KB | 160KB |
| Max X Grid Dimension | 2^32-1 | 2^32-1 | 2^32-1 |
| Tensor Cores | No | Yes | Yes |
| Mixed Precision Warp-Matrix Functions | No | Yes | Yes |
| Hardware-accelerated async-copy | No | No | Yes |
| L2 Cache Residency Management | No | No | Yes |
| Dynamic Parallelism | Yes | Yes | Yes |
| Unified Memory | Yes | Yes | Yes |
| Preemption | Yes | Yes | Yes |

* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation

Hardware-accelerated raytracing, video encoding, video decoding, and image decoding

The NVIDIA “Ampere” Datacenter GPUs that are designed for computational workloads do not include graphics acceleration features such as RT cores and hardware-accelerated video encoders. For example, RT cores for accelerated raytracing are not included in the A30 and A100 GPUs. Similarly, video encoding units (NVENC) are not included in these GPUs.

To accelerate computational workloads that require processing of image or video files, five JPEG decoding (NVJPG) units and five video decoding units (NVDEC) are included in A100. Details are described on NVIDIA’s A100 for computer vision blog post.

For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks
Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.
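As a rough sketch of how such a run is launched (assuming SHOC was built with CUDA support; the exact options vary by version), the suite’s included driver script executes all benchmark levels and collects the results:

cd tools
perl driver.pl -s 4 -cuda    # -s selects the problem-size class; -cuda selects the CUDA benchmarks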

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

| GPU | Tesla T4 | Tesla V100 | Tesla P100 |
|---|---|---|---|
| Max Flops (GFLOPS) | 253.38 | 7072.86 | 4736.76 |
| Fast Fourier Transform (GFLOPS) | 132.60 | 1148.75 | 756.29 |
| Matrix Multiplication (GFLOPS) | 249.57 | 5920.01 | 4256.08 |
| Molecular Dynamics (GFLOPS) | 105.26 | 908.62 | 402.96 |
| S3D (GFLOPS) | 59.97 | 227.85 | 161.54 |

Single Precision Results

| GPU | Tesla T4 | Tesla V100 | Tesla P100 |
|---|---|---|---|
| Max Flops (GFLOPS) | 8073.26 | 14016.50 | 9322.46 |
| Fast Fourier Transform (GFLOPS) | 660.05 | 2301.32 | 1510.49 |
| Matrix Multiplication (GFLOPS) | 3290.94 | 13480.40 | 8793.33 |
| Molecular Dynamics (GFLOPS) | 572.91 | 997.61 | 480.02 |
| S3D (GFLOPS) | 99.42 | 434.78 | 295.20 |

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 offers substantial single/mixed-precision, machine-learning-focused performance, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. That single-precision performance caters strongly to machine-learning algorithms, with further potential when mixed precision can be applied. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithm adaptation, the T4 may also be a strong contender for scientific applications that want to use machine learning to analyze results, or to run a variety of algorithms from both machine learning and scientific computing on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

Next Steps for leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload, but you’d like to benchmark it first, please sign up for a Test Drive to see for yourself. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on the best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

Check for memory errors on NVIDIA GPUs
Professional NVIDIA GPUs (the Tesla and Quadro products) are equipped with error-correcting code (ECC) memory, which allows the system to detect when memory errors occur. Smaller “single-bit” errors are transparently corrected. Larger “double-bit” memory errors will cause applications to crash, but are at least detected (GPUs without ECC memory would continue operating on the corrupted data).

There are conditions under which GPU events are reported to the Linux kernel, in which case you will see such errors in the system logs. However, the GPUs themselves will also store the type and date of the event.

It’s important to note that not all ECC errors are due to hardware failures. Stray cosmic rays are known to cause bit flips. For this reason, memory is not considered “bad” when a single error occurs (or even when a number of errors occurs). If you have a device reporting tens or hundreds of Double Bit errors, please contact Microway tech support for review. You may also wish to review the NVIDIA documentation.

To review the current health of the GPUs in a system, use the nvidia-smi utility:

[root@node7 ~]# nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                           : Thu Feb 14 10:58:34 2019
Driver Version                      : 410.48

Attached GPUs                       : 4
GPU 00000000:18:00.0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

GPU 00000000:3B:00.0
    Retired Pages
        Single Bit ECC              : 15
        Double Bit ECC              : 0
        Pending                     : No

The output above shows one card with no issues and one card with a minor quantity of single-bit errors (the card is still functional and in operation).

If the above report indicates that memory pages have been retired, then you may wish to see additional details (including when the pages were retired). If nvidia-smi reports Pending: Yes, then memory errors have occurred since the last time the system rebooted. In either case, there may be older page retirements that took place.

To review a complete listing of the GPU memory pages which have been retired (including the unique ID of each GPU), run:

[root@node7 ~]# nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv

gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c72e, Single Bit ECC
...

A different type of output must be selected in order to read the timestamps of page retirements. The output is in XML format and may require a bit more effort to parse. In short, try running a report such as shown below:

[root@node7 ~]# nvidia-smi -i 1 -q -x | grep -i -A1 retired_page_addr

<retired_page_address>0x000000000005c05e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005ca0d</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005c72e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:31 2017</retired_page_timestamp>
...
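Beyond retired pages, each GPU also tracks raw ECC error counts, both since the last reboot (volatile) and over the life of the board (aggregate). These counters offer a quick health check without parsing XML:

[root@node7 ~]# nvidia-smi -q -d ECC

Review the Single Bit and Double Bit totals in the resulting report; as noted above, a handful of single-bit corrections is not cause for alarm.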

In-Depth Comparison of NVIDIA Quadro “Turing” GPU Accelerators
This article provides in-depth details of the NVIDIA Quadro RTX “Turing” GPUs. NVIDIA “Turing” GPUs bring an evolved core architecture and add dedicated ray tracing units to the previous-generation “Volta” architecture. Turing GPUs began shipping in late 2018.

Important features available in the “Turing” GPU architecture include:

  • New RT Cores deliver real-time ray tracing performance for the first time
  • Evolved Deep Learning performance with over 130 Tensor TFLOPS (training) and 500 TOPS INT4 (inference) throughput
  • NVLink 2.0 between GPUs—when optional NVLink bridges are added—supporting up to 2 bricks and up to 100GB/sec bidirectional bandwidth
  • New GDDR6 Memory with a substantial improvement in memory performance compared to previous-generation GPUs.

Quadro “Turing” GPU Specifications

The table below summarizes the features of the available Quadro Turing GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

* FLOPS and TOPS calculations are presented at Max Boost
† Passively-cooled models are available with slightly reduced clock speeds

In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators
This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed “Volta”). “Volta” GPUs improve upon the previous-generation “Pascal” architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018; Tesla V100S was released in late 2019. Note: these have since been superseded by the NVIDIA Ampere GPU architecture.

This page is intended to be a fast and easy reference of key specs for these GPUs. You may wish to browse our Tesla V100 Price Analysis and Tesla V100 GPU Review for more extended discussion.

Important features available in the “Volta” GPU architecture include:

  • Exceptional HPC performance with up to 8.2 TFLOPS double- and 16.4 TFLOPS single-precision floating-point performance.
  • Deep Learning training performance with up to 130 TFLOPS FP16 half-precision floating-point performance.
  • Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
  • Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU
  • NVLink enables an 8~10X increase in bandwidth between the Tesla GPUs and from GPUs to supported system CPUs (compared with PCI-E).
  • High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
  • Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
  • Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
  • Cooperative Groups – a new programming model introduced in CUDA 9 for organizing groups of communicating threads

Tesla “Volta” GPU Specifications

The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures

| Feature | Kepler GK210 | Pascal GP100 | Volta GV100 |
|---|---|---|---|
| Compute Capability ^ | 3.7 | 6.0 | 7.0 |
| Threads per Warp | 32 | 32 | 32 |
| Max Warps per SM | 64 | 64 | 64 |
| Max Threads per SM | 2048 | 2048 | 2048 |
| Max Thread Blocks per SM | 16 | 32 | 32 |
| Max Concurrent Kernels | 32 | 128 | 128 |
| 32-bit Registers per SM | 128 K | 64 K | 64 K |
| Max Registers per Thread Block | 64 K | 64 K | 64 K |
| Max Registers per Thread | 255 | 255 | 255 |
| Max Threads per Thread Block | 1024 | 1024 | 1024 |
| L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 32KB ~ 128KB, dynamic with shared memory |
| Shared Memory Configurations | 16KB + 112KB L1 Cache, 32KB + 96KB L1 Cache, or 48KB + 80KB L1 Cache (128KB total) | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total) |
| Max Shared Memory per Thread Block | 48KB | 48KB | 96KB* |
| Max X Grid Dimension | 2^32-1 | 2^32-1 | 2^32-1 |
| Hyper-Q | Yes | Yes | Yes |
| Dynamic Parallelism | Yes | Yes | Yes |
| Unified Memory | No | Yes | Yes |
| Pre-Emption | No | Yes | Yes |

^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48 KB requires dynamic shared memory

Hardware-accelerated video encoding and decoding

All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.

The post In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators appeared first on Microway.

]]>
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-volta-gpu-accelerators/feed/ 0
Tesla V100 “Volta” GPU Review https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/ https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/#respond Thu, 28 Sep 2017 13:50:32 +0000 https://www.microway.com/?p=9401 The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built. Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization […]


The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, One Giant Leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here’s a summary of the key enhancements vs. the Tesla P100 GPUs:

Feature | Tesla V100 with NVLink | Tesla V100 PCI-E | Tesla P100 with NVLink | Tesla P100 PCI-E | Ratio Tesla V100:P100
DP TFLOPS | 7.8 TFLOPS | 7.0 TFLOPS | 5.3 TFLOPS | 4.7 TFLOPS | ~1.4-1.5X
SP TFLOPS | 15.7 TFLOPS | 14 TFLOPS | 9.3 TFLOPS | 8.74 TFLOPS | ~1.4-1.5X
TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 21.2 TFLOPS (FP16 half-precision) | 18.7 TFLOPS (FP16 half-precision) | ~6X
Interface (bidirectional BW) | 300GB/sec | 32GB/sec | 160GB/sec | 32GB/sec | 1.88X (NVLink), 9.38X (PCI-E)
Memory Bandwidth | 900GB/sec | 900GB/sec | 720GB/sec | 720GB/sec | 1.25X
CUDA Cores (Tensor Cores) | 5120 (640) | 5120 (640) | 3584 | 3584 | -

Performance of Tesla GPUs, Generation to Generation

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium for the NVLink-enabled SXM2.0 form factor GPUs. We’re excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:

Performance Summary

Increases in relative performance vary widely by workload. But early testing demonstrates HPC performance advancing approximately 50% in just a 12-month period.
Figure: Tesla V100 HPC Performance
If you haven’t made the jump to Tesla P100 yet, Tesla V100 is an even more compelling proposition.

For Deep Learning, Tesla V100 delivers a massive leap in performance. This extends to training time, and it also brings unique advancements for inference as well.
Figure: Deep Learning Performance Summary, Tesla V100

Enhancements for Every Workload

Now that we’ve learned a bit about each GPU, let’s review the breakthrough advancements for Tesla V100.

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

NVLink helps you to transcend the bottlenecks in PCI-Express based systems and dramatically improve the data flow to GPU accelerators.

The total data pipe for each Tesla V100 with NVLink GPUs is now 300GB/sec of bidirectional bandwidth—that’s nearly 10X the data flow of PCI-E x16 3.0 interfaces!

Bandwidth

NVIDIA has enhanced the bandwidth of each individual NVLink brick by 25%: from 40GB/sec (20+20GB/sec in each direction) to 50GB/sec (25+25GB/sec) of bidirectional bandwidth.

This improved signaling helps carry the load of data-intensive applications.

Number of Bricks

But enhanced NVLink isn’t just about simple signaling improvements. Point-to-point NVLink connections are divided into “bricks,” or links. Each brick delivers 50GB/sec of bidirectional bandwidth.

From 4 bricks to 6

Over and above the signaling improvements, Tesla V100 with NVLink increases the number of NVLink “bricks” that are embedded into each GPU: from 4 bricks to 6.

This 50% increase in brick quantity delivers a large increase in bandwidth for the world’s most data-intensive workloads. It also allows for a more diverse set of system designs and configurations.
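
To make the arithmetic concrete, here is a quick Python check of the numbers above (purely illustrative):

# NVLink bandwidth arithmetic for P100 vs. V100 (all figures are bidirectional GB/sec)
p100_per_brick, v100_per_brick = 40, 50   # 20+20 vs. 25+25 GB/sec per brick
p100_bricks, v100_bricks = 4, 6

p100_total = p100_per_brick * p100_bricks   # 160 GB/sec
v100_total = v100_per_brick * v100_bricks   # 300 GB/sec
pcie_gen3_x16 = 32                          # ~16+16 GB/sec bidirectional

print(v100_total / p100_total)      # 1.875 -> the ~1.88X generation-to-generation gain
print(v100_total / pcie_gen3_x16)   # 9.375 -> the "nearly 10X" advantage over PCI-E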

The NVLink Bank Account

NVIDIA NVLink technology continues to be a breakthrough for both HPC and Deep Learning workloads. But as with previous generations of NVLink, there’s a design choice related to a simple question:

Where do you want to spend your NVLink bandwidth?

Think of NVLink bricks as a “spending” or “bank account.” Each NVLink system design strikes a different balance in where it “spends the funds.” You may wish to:

  • Spend links on GPU:GPU communication
  • Focus on increasing the number of GPUs
  • Broaden CPU:GPU bandwidth to overcome the PCI-E bottleneck
  • Create a balanced system, or prioritize design choices solely for a single workload

There are good reasons for each of these choices, or for combinations of them. DGX-1V, the NumberSmasher 1U GPU Server with NVLink, and future IBM Power Systems products all set different priorities.

We’ll dive more deeply into these choices in a future post.

Net-net: the NVLink bandwidth increases from 160GB/sec to 300GB/sec (bidirectional), and a number of new, diverse HPC and AI hardware designs are now possible.

Programming Improvements

Tesla V100 and CUDA 9 bring a host of improvements for GPU programmers. These include:

  • Cooperative Groups
  • A new combined L1 cache and shared memory that simplifies programming
  • A new SIMT model that relieves the need to program to fit 32-thread warps

We won’t explore these in detail in this post, but we encourage you to dig into NVIDIA’s CUDA 9 and Volta architecture materials for more.

What does Tesla V100 mean for me?

Tesla V100 GPUs mean a massive change for many workloads. But each workload will see different impacts. Here’s a short summary of what we see:

  1. An increase in performance on HPC applications (an on-paper FLOPS increase of 50%, with diverse real-world impacts)
  2. A massive leap for Deep Learning Training
  3. 1 GPU, many Deep Learning workloads
  4. New system designs, better tuned to your applications
  5. Radically new, and radically simpler, programming paradigms (coherence)

DeepChem – a Deep Learning Framework for Drug Discovery
https://www.microway.com/hpc-tech-tips/deepchem-deep-learning-framework-for-drug-discovery/
Fri, 28 Apr 2017 19:02:51 +0000

A powerful new open source deep learning framework for drug discovery is now available for public download on GitHub. This new framework, called DeepChem, is Python-based, and offers a feature-rich set of functionality for applying deep learning to problems in drug discovery and cheminformatics. Previous machine learning frameworks, such as scikit-learn, have been applied to cheminformatics, but DeepChem is the first to accelerate computation with NVIDIA GPUs.

The framework uses Google TensorFlow, along with scikit-learn, for expressing neural networks for deep learning. It also makes use of the RDKit Python framework for performing more basic operations on molecular data, such as converting SMILES strings into molecular graphs. The framework is now in the alpha stage, at version 0.1. As the framework develops, it will move toward implementing more models in TensorFlow, which use GPUs for training and inference. This new open source framework is poised to become an accelerating factor for innovation in drug discovery across industry and academia.

Another unique aspect of DeepChem is that it incorporates a large number of publicly available chemical assay datasets, which are described in Table 1.

DeepChem Assay Datasets

Dataset | Category | Description | Task Type | Compounds
QM7 | Quantum Mechanics | orbital energies, atomization energies | Regression | 7,165
QM7b | Quantum Mechanics | orbital energies | Regression | 7,211
ESOL | Physical Chemistry | solubility | Regression | 1,128
FreeSolv | Physical Chemistry | solvation energy | Regression | 643
PCBA | Biophysics | bioactivity | Classification | 439,863
MUV | Biophysics | bioactivity | Classification | 93,127
HIV | Biophysics | bioactivity | Classification | 41,913
PDBBind | Biophysics | binding activity | Regression | 11,908
Tox21 | Physiology | toxicity | Classification | 8,014
ToxCast | Physiology | toxicity | Classification | 8,615
SIDER | Physiology | side reactions | Classification | 1,427
ClinTox | Physiology | clinical toxicity | Classification | 1,491

Table 1: The current v0.1 DeepChem framework includes the datasets in this table, along with others which will be added in future versions.

Metrics

The squared Pearson correlation coefficient is used to quantify the performance of a model trained on any of these regression datasets. Models trained on classification datasets have their predictive quality measured by the area under curve (AUC) for receiver operating characteristic (ROC) curves (AUC-ROC). Some datasets have more than one task, in which case the mean over all tasks is reported by the framework.
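
As a hedged illustration of these metrics, computed directly with NumPy and scikit-learn rather than through DeepChem’s own metric wrappers, consider:

import numpy as np
from sklearn.metrics import roc_auc_score

# Regression quality: squared Pearson correlation coefficient
y_true = np.array([0.1, 0.5, 0.9, 1.3])
y_pred = np.array([0.2, 0.4, 1.0, 1.2])
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
print("Pearson r^2:", pearson_r ** 2)

# Classification quality: AUC-ROC, averaged over tasks for multitask datasets
labels = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])                  # 4 molecules x 2 tasks
scores = np.array([[0.2, 0.8], [0.7, 0.3], [0.9, 0.6], [0.1, 0.4]])
task_aucs = [roc_auc_score(labels[:, t], scores[:, t]) for t in range(labels.shape[1])]
print("mean AUC-ROC:", np.mean(task_aucs))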

Data Splitting

DeepChem provides a number of methods for splitting or reordering datasets so that the training and validation sets are thoroughly randomized and free of sampling bias. These methods are summarized in Table 2.

DeepChem Dataset Splitting Methods

Split Type | Use Cases
Index Split | the default index is sufficient, as long as it contains no built-in bias
Random Split | if there is some bias to the default index
Scaffold Split | if chemical properties of the dataset depend on the molecular scaffold
Stratified Random Split | where one needs to ensure that each dataset split contains a full range of some real-valued property

Table 2: Various methods are available for splitting the dataset in order to avoid sampling bias.
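
To see why a scaffold split differs from a random split, here is a minimal sketch using RDKit’s Bemis-Murcko scaffolds; this approximates the idea behind DeepChem’s scaffold splitter rather than reproducing its actual implementation:

from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = [
    "c1ccccc1O",    # phenol (benzene scaffold)
    "c1ccccc1N",    # aniline (same benzene scaffold)
    "C1CCCCC1O",    # cyclohexanol (different scaffold)
]

# Group molecules by scaffold so whole scaffold families land in the same
# split, instead of leaking similar chemistry across train and test sets.
groups = defaultdict(list)
for smi in smiles:
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
    groups[scaffold].append(smi)

for scaffold, members in groups.items():
    print(scaffold, "->", members)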

Featurizations

DeepChem offers a number of featurization methods, summarized in Table 3. SMILES strings are unique representations of molecules, and can themselves be used as a molecular feature. The use of SMILES strings has been explored in recent work, and SMILES featurization will likely become a part of future versions of DeepChem.

Most machine learning methods, however, require more feature information than can be extracted from a SMILES string alone.

DeepChem Featurizers

Featurizer | Use Cases
Extended-Connectivity Fingerprints (ECFP) | for molecular datasets not containing large numbers of non-bonded interactions
Graph Convolutions | Like ECFP, graph convolution produces granular representations of molecular topology. Instead of applying fixed hash functions, as with ECFP, graph convolution uses a set of parameters which can be learned by training a neural network associated with a molecular graph structure.
Coulomb Matrix | Coulomb matrix featurization captures information about the nuclear charge state and internuclear electric repulsion. This featurization is less granular than ECFP or graph convolutions, and may perform better where intramolecular electrical potential plays an important role in chemical activity.
Grid Featurization | for datasets containing molecules interacting through non-bonded forces, such as docked protein-ligand complexes

Table 3: Featurization methods available in DeepChem v0.1.
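
ECFP fingerprints correspond to RDKit’s Morgan fingerprints, so a minimal featurization sketch (outside of DeepChem’s own featurizer classes) might look like this:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp_features(smiles, radius=2, n_bits=1024):
    """ECFP-style bit vector; radius=2 corresponds roughly to ECFP4."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.vstack([ecfp_features(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]])
print(X.shape)   # (3, 1024) -- ready for scikit-learn or TensorFlow models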

Supported Models

Supported Models as of v0.1

Model Type | Possible Use Case
Logistic Regression | Classification
Random Forest | Classification or Regression
Multitask Network | If various prediction types are required, a multitask network is a good choice; for example, a continuous real-valued prediction along with one or more categorical predictions as outcomes.
Bypass Network | Classification and Regression
Graph Convolution Model | same as Multitask Networks

Table 4: Model types supported by DeepChem v0.1.

A Glimpse into the Tox21 Dataset and Deep Learning

The Toxicology in the 21st Century (Tox21) research initiative led to the creation of a public dataset which includes measurements of activation of stress response and nuclear receptor response pathways by 8,014 distinct molecules. Twelve response pathways were observed in total, with each having some association with toxicity. Table 5 summarizes the pathways investigated in the study.

Tox21 Assay Descriptions

Biological Assay | Description
NR-AR | Nuclear Receptor Panel, Androgen Receptor
NR-AR-LBD | Nuclear Receptor Panel, Androgen Receptor, luciferase
NR-AhR | Nuclear Receptor Panel, aryl hydrocarbon receptor
NR-Aromatase | Nuclear Receptor Panel, aromatase
NR-ER | Nuclear Receptor Panel, Estrogen Receptor alpha
NR-ER-LBD | Nuclear Receptor Panel, Estrogen Receptor alpha, luciferase
NR-PPAR-gamma | Nuclear Receptor Panel, peroxisome proliferator-activated receptor gamma
SR-ARE | Stress Response Panel, nuclear factor (erythroid-derived 2)-like 2 antioxidant responsive element
SR-ATAD5 | Stress Response Panel, genotoxicity indicated by ATAD5
SR-HSE | Stress Response Panel, heat shock factor response element
SR-MMP | Stress Response Panel, mitochondrial membrane potential
SR-p53 | Stress Response Panel, DNA damage p53 pathway

Table 5: Biological pathway responses investigated in the Tox21 Machine Learning Challenge.

We used the Tox21 dataset to make predictions on molecular toxicity in DeepChem using the variations shown in Table 6.

Model Construction Parameter Variations Used

Parameter | Variations
Dataset Splitting | Index, Scaffold
Featurization | ECFP, Molecular Graph Convolution

Table 6: Model construction parameter variations used in generating our predictions, as shown in Figure 1.

A .csv file containing SMILES strings for 8,014 molecules was used to first featurize each molecule, using either ECFP or molecular graph convolution. IUPAC names for each molecule were queried from NIH Cactus, and toxicity predictions were made, using a trained model, on a set of nine molecules randomly selected from the full Tox21 dataset. Nine results showing molecular structure (rendered by RDKit), IUPAC names, and predicted toxicity scores across all 12 biochemical response pathways described in Table 5 are shown in Figure 1.

Figure 1: Tox21 predictions for nine randomly selected molecules from the Tox21 dataset.
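
For readers who want to reproduce the general shape of this workflow without DeepChem, a heavily simplified stand-in using RDKit fingerprints and a scikit-learn random forest (one independent model per assay, rather than DeepChem’s multitask networks; the file and column names here are hypothetical) could look like:

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

# Hypothetical CSV with a "smiles" column plus one 0/1 label column per assay
df = pd.read_csv("tox21.csv").dropna()
assays = ["NR-AR", "SR-p53"]          # a subset of the 12 pathways in Table 5

def featurize(smi):
    mol = Chem.MolFromSmiles(smi)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024))

X = np.vstack([featurize(s) for s in df["smiles"]])

# One classifier per pathway; DeepChem's multitask networks share layers instead
for assay in assays:
    clf = RandomForestClassifier(n_estimators=100).fit(X, df[assay].values)
    print(assay, "predicted toxicity probability:", clf.predict_proba(X[:1])[0, 1])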

Expect more from DeepChem in the Future

The DeepChem framework is undergoing rapid development, and is currently at the 0.1 release version. New models and features will be added, along with more datasets, in the future. You can download the DeepChem framework from GitHub. There is also a website for framework documentation at deepchem.io.

Microway offers DeepChem pre-installed on our line of WhisperStation products for Deep Learning. Researchers interested in exploring deep learning applications with chemistry and drug discovery can browse our line of WhisperStation products.

References

1.) Subramanian, Govindan, et al. “Computational Modeling of β-secretase 1 (BACE-1) Inhibitors using Ligand Based Approaches.” Journal of Chemical Information and Modeling 56.10 (2016): 1936-1949.
2.) Altae-Tran, Han, et al. “Low Data Drug Discovery with One-shot Learning.” arXiv preprint arXiv:1611.03199 (2016).
3.) Wu, Zhenqin, et al. “MoleculeNet: A Benchmark for Molecular Machine Learning.” arXiv preprint arXiv:1703.00564 (2017).
4.) Gomes, Joseph, et al. “Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity.” arXiv preprint arXiv:1703.10603 (2017).
5.) Gómez-Bombarelli, Rafael, et al. “Automatic chemical design using a data-driven continuous representation of molecules.” arXiv preprint arXiv:1610.02415 (2016).
6.) Mayr, Andreas, et al. “DeepTox: toxicity prediction using deep learning.” Frontiers in Environmental Science 3 (2016): 80.

GPU-accelerated HPC Containers with Singularity
https://www.microway.com/hpc-tech-tips/gpu-accelerated-hpc-containers-singularity/
Tue, 11 Apr 2017 16:44:44 +0000

Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow.

This is a problem that containers can help to solve. HPC groups have had some success with more traditional containers (e.g., Docker), but there are security concerns that have made them difficult to use on HPC systems. Singularity, the new tool from the creator of CentOS and Warewulf, aims to resolve these issues.

Singularity helps you to step away from the complex dependencies of your software apps. It enables you to assemble these complex toolchains into a single unified tool that you can use just as simply as you’d use any built-in Linux command. A tool that can be moved from system to system without effort.

Surprising Simplicity

Of course, HPC tools are traditionally quite complex, so users seem to expect Singularity containers to also be complex. Just as virtualization is hard for novices to wrap their heads around, the operation of Singularity containers can be disorienting. For that reason, I encourage you to think of your Singularity containers as a single file; a single tool. It’s an executable that you can use just like any other program. It just happens to have all its dependencies built in.

This means it’s not doing anything tricky with your data files. It’s not doing anything tricky with the network. It’s just a program that you’ll be running like any other. Just like any other program, it can read data from any of your files; it can write data to any local directory you specify. It can download data from the network; it can accept connections from the network. InfiniBand, Omni-Path and/or MPI are fully supported. Once you’ve created it, you really don’t think of it as a container anymore.

GPU-accelerated HPC Containers

When it comes to utilizing the GPUs, Singularity will see the same GPU devices as the host system. It will respect any device selections or restrictions put in place by the workload manager (e.g., SLURM). You can package your applications into GPU-accelerated HPC containers and leverage the flexibilities provided by Singularity. For example, run Ubuntu containers on an HPC cluster that uses CentOS Linux; run binaries built for CentOS on your Ubuntu system.

As part of this effort, we have contributed a Singularity image for TensorFlow back to the Singularity community. This image is available pre-built for all users on our GPU Test Drive cluster. It’s a fantastically easy way to compare the performance of CPU-only and GPU-accelerated versions of TensorFlow. All one needs to do is switch between executables:

Executing the pre-built TensorFlow for CPUs

[eliot@node2 ~]$ tensorflow_cpu ./hello_world.py
Hello, TensorFlow!
42

Executing the pre-built TensorFlow with GPU acceleration

[eliot@node2 ~]$ tensorflow_gpu ./hello_world.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:06:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:07:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB

[...]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
Hello, TensorFlow!
42

As shown above, the tensorflow_cpu and tensorflow_gpu executables include everything that’s needed for TensorFlow. You can just think of them as ready-to-run applications that have all their dependencies built in. All you need to know is where the Singularity container image is stored on the filesystem.

Caveats of GPU-accelerated HPC containers with Singularity

In earlier versions of Singularity, the nature of NVIDIA GPU drivers required a couple extra steps during the configurations of GPU-accelerated containers. Although GPU support is still listed as experimental, Singularity now offers a --nv flag which passes through the appropriate driver/library files. In most cases, you will find that no additional steps are needed to access NVIDIA GPUs with a Singularity container. Give it a try!

Taking the next step on GPU-accelerated HPC containers

There are still many use cases left to be discovered. Singularity containers open up a lot of exciting capabilities. As an example, we are leveraging Singularity on our OpenPower systems (which provide full NVLink connectivity between CPUs and GPUs). All the benefits of Singularity are just as relevant on these platforms. Singularity images cannot be directly transferred between x86 and POWER8 CPUs, but the same style of Singularity recipe may be used. Users can run a pre-built TensorFlow image on x86 nodes and a complementary image on POWER8 nodes. They don’t have to keep all the internals and dependencies in mind as they build their workflows.

Generating reproducible results is another anticipated benefit of Singularity. Groups can publish complete and ready-to-run containers alongside their results. Singularity’s flexibility will allow those containers to continue operating flawlessly for years to come – even if they move to newer hardware or different operating system versions.

If you’d like to see Singularity in action for yourself, request an account on our GPU Test Drive cluster. For those looking to deploy systems and clusters leveraging Singularity, we provide fully-integrated HPC clusters with Singularity ready-to-run. We can also assist by building optimized libraries, applications, and containers. Contact an HPC expert.

This post was updated 2017-06-02 to reflect recent changes in GPU support.

NVIDIA Tesla P40 GPU Accelerator (Pascal GP102) Up Close
https://www.microway.com/hpc-tech-tips/nvidia-tesla-p40-gpu-accelerator-pascal-gp102-up-close/
Tue, 07 Feb 2017 15:58:14 +0000

As NVIDIA’s GPUs become increasingly vital to the fields of AI and intelligent machines, NVIDIA has produced GPU models specifically targeted to these applications. The new Tesla P40 GPU is NVIDIA’s premier product for deep learning deployments. It is specifically designed for high-speed inference workloads, which means running data through pre-trained neural networks. However, it also offers significant processing performance for projects which do not require 64-bit double-precision floating point capability (many neural networks can be trained using the 32-bit single-precision floating point on the Tesla P40). For those cases, these GPUs can be used to accelerate both the neural network training and the inference.

Highlights of the new Tesla P40 GPU include:

  • Up to 12 TFLOPS single-precision floating-point performance
  • Support for INT8 operations with up to 47 TOPS (ideal for high-speed/high-volume inference)
  • 24GB of GDDR5 GPU memory, with bandwidth up to 346GB/s (see the quick arithmetic check below)
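
Both headline numbers can be sanity-checked with quick arithmetic against the nvidia-smi and deviceQuery output shown later in this post:

# Peak single-precision FLOPS: CUDA cores x boost clock x 2 (an FMA counts as 2 FLOPs)
cuda_cores = 3840           # from deviceQuery output below
boost_clock_ghz = 1.531     # max graphics clock from nvidia-smi output below
print(cuda_cores * boost_clock_ghz * 2 / 1000)   # ~11.8 TFLOPS -> "up to 12 TFLOPS"

# Peak memory bandwidth: GDDR5 transfers data at twice the clock nvidia-smi reports
mem_clock_ghz = 3.615       # from nvidia-smi output below
bus_width_bytes = 384 / 8   # 384-bit memory bus
print(mem_clock_ghz * 2 * bus_width_bytes)       # ~347 GB/s -> "up to 346GB/s"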

PCI-Express Data Transfer Speeds

The Tesla P40 GPUs use the same generation 3.0 PCI-E connectivity as other recent GPUs (such as the Maxwell generation), so you should expect to achieve similar transfer speeds. As shown below, we’re able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node4 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11842.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12899.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			240357.5

Result = PASS

Technical Details of the Tesla P40 GPU

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node4 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Feb  6 12:30:52 2017
Driver Version                      : 367.57

Attached GPUs                       : 4
GPU 0000:02:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0324416xxxxxx
    GPU UUID                        : GPU-16254654-0bd3-8d18-e8fe-d53865xxxxxx
    Minor Number                    : 0
    VBIOS Version                   : 86.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2G610-0000-000
    Inforom Version
        Image Version               : G610.0200.00.03
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B3810DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x11D910DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 22912 MiB
        Used                        : 0 MiB
        Free                        : 22912 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 2 MiB
        Free                        : 32766 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 27 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 12.34 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 544 MHz
        SM                          : 544 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1531 MHz
        Memory                      : 3615 MHz
    Default Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Max Clocks
        Graphics                    : 1531 MHz
        SM                          : 1531 MHz
        Memory                      : 3615 MHz
        Video                       : 1379 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support a large number of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power consumption limits and monitor the clock speeds (including explanations for any reasons the clocks are running at a lower speed). The list below shows the available clock speeds for the Tesla P40 GPU:

[root@node4 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Feb  6 12:31:56 2017
Driver Version                      : 367.57

Attached GPUs                       : 4
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 3615 MHz
            Graphics                : 1531 MHz
            Graphics                : 1518 MHz
            Graphics                : 1506 MHz
            Graphics                : 1493 MHz
            Graphics                : 1480 MHz
            Graphics                : 1468 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz
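
If you’d rather collect these clock and power readings programmatically than parse nvidia-smi output, NVIDIA’s NVML library exposes the same counters. Here’s a minimal sketch using the pynvml Python bindings (assuming the package is installed):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0         # NVML reports milliwatts
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
mem_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)

print("power: %.1fW of %.0fW; SM clock: %dMHz; memory clock: %dMHz"
      % (power_w, limit_w, sm_clock, mem_clock))
pynvml.nvmlShutdown()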

NVIDIA deviceQuery on Tesla P40 GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P40 GPU accelerators. Take note of the new Compute Capability 6.1, which is what you’ll want to target if you’re compiling your own CUDA code.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P40"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 22913 MBytes (24025956352 bytes)
  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             3615 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P40, Device1 = Tesla P40, Device2 = Tesla P40, Device3 = Tesla P40
Result = PASS
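
From Python, a quick way to confirm the compute capability before targeting sm_61 in your builds is Numba’s CUDA bindings (a minimal sketch, assuming Numba is installed):

from numba import cuda

dev = cuda.get_current_device()
major, minor = dev.compute_capability
print(dev.name.decode(), "-> compute capability %d.%d" % (major, minor))  # dev.name is bytes
# On a Tesla P40 this reports 6.1, the compute capability to target for your own CUDA builds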

Additional Information on Tesla P40 GPUs

To learn more about the NVIDIA “Pascal” GPU architecture and to compare Tesla P40 with other models in the Tesla product line, read our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

If you’re an existing GPU user considering a new deployment, review our Tesla GPU clusters page and our list of GPU servers.
