Benchmarking Archives - Microway
https://www.microway.com/category/benchmarking/

DGX A100 review: Throughput and Hardware Summary

When NVIDIA launched the Ampere GPU architecture, they also launched their new flagship system for HPC and deep learning – the DGX A100. This system offers exceptional performance, but also new capabilities. We’ve seen immediate interest and have already shipped to some of the first adopters. Given our early access, we wanted to share a deeper dive into this impressive new system.

Photo of NVIDIA DGX A100 packaged, being lifted out of packaging, and being tested

The focus of this NVIDIA DGX™ A100 review is on the hardware inside the system – the server includes a number of features & improvements not available in any other server on the market today. DGX will be the “go-to” server for 2020. But hardware only tells part of the story, particularly for NVIDIA’s DGX products. NVIDIA employs more software engineers than hardware engineers, so be certain that application and GPU library performance will continue to improve through updates to the DGX Operating System and to the whole catalog of software containers provided through the NGC hub. Expect more details as the year continues.

Overall DGX A100 System Architecture

This new DGX system offers top-bin parts across-the-board. Here’s the high-level overview:

  • Dual 64-core AMD EPYC 7742 CPUs
  • 1TB DDR4 system memory (upgradeable to 2TB)
  • Eight NVIDIA A100 SXM4 GPUs with NVLink
  • NVIDIA NVSwitch connectivity between all GPUs
  • 15TB high-speed NVMe SSD Scratch Space (upgradeable to 30TB)
  • Eight Mellanox 200Gbps HDR InfiniBand/Ethernet Single-Port Adapters
  • One or Two Mellanox 200Gbps Ethernet Dual-Port Adapter(s)

As you’ll see from the block diagram, there is a lot to break down within such a complex system. Though it’s a very busy diagram, it becomes apparent that the design is balanced and well laid out. Breaking down the connectivity within DGX A100 we see:

  • The eight NVIDIA A100 GPUs are depicted at the bottom of the diagram, with each GPU fully linked to all other GPUs via six NVSwitches
  • Above the GPUs are four PCI-Express switches which act as nexuses between the GPUs and the rest of the system devices
  • Linking into the PCI-E switch nexuses, there are eight 200Gbps network adapters and eight high-speed SSD devices – one for each GPU
  • The devices are broken into pairs, with 2 GPUs, 2 network adapters, and 2 SSDs per PCI-E nexus
  • Each of the AMD EPYC CPUs connects to two of the PCI-E switch nexuses
  • At the top of the diagram, each EPYC CPU is shown with a link to system memory and a link to a 200Gbps network adapter

We’ll dig into each aspect of the system in turn, starting with the CPUs and making our way down to the new NVIDIA A100 GPUs. Readers should note that throughput and performance numbers are only useful when put into context. You are encouraged to run the same tests on your existing systems/servers to better understand how the performance of DGX A100 will compare to your existing resources. And as always, reach out to Microway’s DGX experts for additional discussion, review, and design of a holistic solution.


AMD EPYC CPUs and System Memory

Diagram depicting the CPU cores, cache, and memory in the NVIDIA DGX A100
DGX A100 CPU/Memory topology

With two 64-core EPYC CPUs and 1TB or 2TB of system memory, the DGX A100 boasts respectable performance even before the GPUs are considered. The architecture of the AMD EPYC “Rome” CPUs is outside the scope of this article, but offers an elegant design of its own. Each CPU provides 64 processor cores (supporting up to 128 threads), 256MB L3 cache, and eight channels of DDR4-3200 memory (which provides the highest memory throughput of any mainstream x86 CPU).

Most users need not dive further, but experts will note that each EPYC 7742 CPU has four NUMA nodes (for a total of eight nodes). This allows best performance for parallelized applications and can also reduce the impact of noisy neighbors. Pairs of GPUs are connected to NUMA nodes 1, 3, 5, and 7. Here’s a snapshot of CPU capabilities from the lscpu utility:

Architecture:        x86_64
CPU(s):              256
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
CPU MHz:             3332.691
CPU max MHz:         2250.0000
CPU min MHz:         1500.0000
NUMA node0 CPU(s):   0-15,128-143
NUMA node1 CPU(s):   16-31,144-159
NUMA node2 CPU(s):   32-47,160-175
NUMA node3 CPU(s):   48-63,176-191
NUMA node4 CPU(s):   64-79,192-207
NUMA node5 CPU(s):   80-95,208-223
NUMA node6 CPU(s):   96-111,224-239
NUMA node7 CPU(s):   112-127,240-255
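
To make use of this NUMA layout, you generally want each GPU-bound process pinned to the cores local to that GPU’s NUMA node. Below is a minimal sketch (standard Python library only, Linux-specific) of launching a job restricted to one GPU and its local cores; the GPU-to-core mapping comes from the lscpu output above and the topology matrix shown later in this article, and train.py is just a placeholder workload.

import os
import subprocess

# Illustrative GPU-to-core mapping taken from the lscpu/topology output in this article
# (GPU0/GPU1 are local to NUMA node 3, GPU2/GPU3 to NUMA node 1); adjust for your system.
GPU_TO_CPUS = {
    0: set(range(48, 64)) | set(range(176, 192)),   # NUMA node 3
    2: set(range(16, 32)) | set(range(144, 160)),   # NUMA node 1
}

def launch_on_gpu(gpu_index, command):
    """Run `command` with visibility of a single GPU and affinity to its NUMA-local cores."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(
        command,
        env=env,
        # Pin the child process to the NUMA-local cores before it starts executing
        preexec_fn=lambda: os.sched_setaffinity(0, GPU_TO_CPUS[gpu_index]),
    )

if __name__ == "__main__":
    proc = launch_on_gpu(0, ["python", "train.py"])   # train.py is a placeholder
    proc.wait()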

High-speed NVMe Storage

Although DGX A100 is designed to support extremely high-speed connectivity to network/cluster storage, it also provides internal flash storage drives. Redundant 2TB NVMe SSDs are provided to host the Operating System. Four non-redundant striped NVMe SSDs provide a 14TB space for scratch storage (which is most frequently used to cache data coming from a centralized storage system).

Here’s how the filesystems look on a fresh DGX A100:

Filesystem      Size  Used Avail Use%    Mounted on
/dev/md0        1.8T   14G  1.7T   1%    /
/dev/md1         14T   25M   14T   1%    /raid

The industry is trending towards Linux software RAID rather than hardware controllers for NVMe SSDs (as such controllers present too many performance bottlenecks). Here’s what the above md0 and md1 arrays look like when healthy:

md0 : active raid1 nvme1n1p2[0] nvme2n1p2[1]
      1874716672 blocks super 1.2 [2/2] [UU]
      bitmap: 1/14 pages [4KB], 65536KB chunk

md1 : active raid0 nvme5n1[2] nvme3n1[1] nvme4n1[3] nvme0n1[0]
      15002423296 blocks super 1.2 512k chunks

It’s worth noting that although all the internal storage devices are high-performance, the scratch drives making up the /raid filesystem support the newer PCI-E generation 4.0 bus which doubles I/O throughput. NVIDIA leads the pack here, as they’re the first we’ve seen to be shipping these new super-fast SSDs.
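
If you want a quick sanity check of the scratch array’s sequential write speed, a few lines of Python are enough to get a ballpark number. This is only a rough sketch (a proper test would use a tool such as fio with direct I/O and deeper queues); the file path and sizes below are arbitrary choices.

import os
import time

PATH = "/raid/throughput_test.bin"     # assumes /raid is mounted as shown above
BLOCK = 64 * 1024 * 1024               # write in 64 MiB chunks
BLOCKS = 64                            # 4 GiB total

buf = os.urandom(BLOCK)
start = time.time()
with open(PATH, "wb") as f:
    for _ in range(BLOCKS):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())               # make sure the data reaches the SSDs before timing stops
elapsed = time.time() - start

os.remove(PATH)
print(f"Sequential write: {BLOCK * BLOCKS / elapsed / 1e9:.1f} GB/s")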

High-Throughput and Low-Latency Communications with Mellanox 200Gbps

Photo of the internal system sled of DGX A100 with CPUs, Memory, and HCAs
Sled from DGX A100 showing ten 200Gbps adapters

Depending upon the deployment, nine or ten Mellanox 200Gbps adapters are present in each DGX A100. These adapters support Mellanox VPI, which enables each port to be configured for 200G Ethernet or HDR InfiniBand. Though Ethernet is prevalent in certain sectors (healthcare and other industry verticals), InfiniBand tends to be the mode of choice when the highest performance is required.

In practice, a common configuration is for the GPU-adjacent adapters to be connected to an InfiniBand fabric (which allows for high-performance RDMA GPU-Direct and Magnum IO communications). The adapter(s) attached to the CPUs are then used for Ethernet connectivity (often matching the speed of the existing facility Ethernet, which might be any one of 10GbE, 25GbE, 40GbE, 50GbE, 100GbE, or 200GbE).

Leveraging the fast PCI-E 4.0 bus available in DGX A100, each 200Gbps port is able to push up to 24.6GB/s of throughput (with latencies typically ranging from 1.09 to 202 microseconds, depending upon message size, as measured by OSU’s osu_bw and osu_latency benchmarks). Thus, a properly tuned application running across a cluster of DGX systems could push upwards of 200 gigabytes per second to the fabric!
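
For readers who want a rough version of this type of measurement without the OSU suite, the sketch below times a simple two-rank ping-pong with mpi4py (assuming an MPI library and mpi4py are installed, launched as something like mpirun -np 2 python pingpong.py across two hosts). It is far less careful than osu_bw/osu_latency, but it gives a first-order view of link throughput.

from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

SIZE = 64 * 1024 * 1024                  # 64 MiB message
ITERS = 50
buf = np.zeros(SIZE, dtype=np.uint8)

comm.Barrier()
start = time.time()
for _ in range(ITERS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # rank 0 sends, then waits for the echo
        comm.Recv(buf, source=1, tag=1)
    else:
        comm.Recv(buf, source=0, tag=0)  # rank 1 echoes the message back
        comm.Send(buf, dest=0, tag=1)
elapsed = time.time() - start

if rank == 0:
    total_bytes = 2 * ITERS * SIZE       # each iteration moves the message twice
    print(f"Approximate link throughput: {total_bytes / elapsed / 1e9:.1f} GB/s")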

GPU-to-GPU Transfers with NVSwitch and NVLink

NVIDIA built a new generation of NVIDIA NVLink into the NVIDIA A100 GPUs, which provides double the throughput of NVLink in the previous “Volta” generation. Each NVIDIA A100 GPU supports up to 300GB/s throughput (600GB/s bidirectional). Combined with NVSwitch, which connects each GPU to all other GPUs, the DGX A100 provides full connectivity between all eight GPUs.

Running NVIDIA’s p2pBandwidthLatencyTest utility, we can examine the transfer speeds between each set of GPUs:

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1180.14 254.47 258.80 254.13 257.67 247.62 257.21 251.53
     1 255.35 1173.05 261.04 243.97 257.09 247.20 258.64 257.51
     2 253.79 260.46 1155.70 241.66 260.23 245.54 259.49 255.91
     3 256.19 261.29 253.87 1142.18 257.59 248.81 250.10 259.44
     4 252.35 260.44 256.82 249.11 1169.54 252.46 257.75 255.62
     5 256.82 257.64 256.37 249.76 255.33 1142.18 259.72 259.95
     6 261.78 260.25 261.81 249.77 258.47 248.63 1173.05 255.47
     7 259.47 261.96 253.61 251.00 259.67 252.21 254.58 1169.54

The above values show GPU-to-GPU transfer throughput ranging from roughly 242GB/s to 262GB/s. Running the same test in bidirectional mode shows results between 473GB/s and 508GB/s. Execution within the same GPU (running down the diagonal) shows data rates between roughly 1,140GB/s and 1,180GB/s.
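
A quick way to cross-check numbers like these outside of the CUDA samples is to time a large device-to-device copy from a framework such as PyTorch. The sketch below assumes PyTorch with CUDA support is installed; whether the copy actually travels over NVLink/NVSwitch depends on peer access being available, so treat the result as a rough comparison rather than a replacement for p2pBandwidthLatencyTest.

import time
import torch

assert torch.cuda.device_count() >= 2, "this sketch needs at least two GPUs"

N_BYTES = 1 << 30                                    # 1 GiB payload
src = torch.empty(N_BYTES, dtype=torch.uint8, device="cuda:0")
dst = src.to("cuda:1")                               # warm-up copy (allocates dst, sets up peer access)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

ITERS = 20
start = time.time()
for _ in range(ITERS):
    dst.copy_(src, non_blocking=True)                # repeated GPU0 -> GPU1 copies
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.time() - start

print(f"GPU0 -> GPU1: {ITERS * N_BYTES / elapsed / 1e9:.0f} GB/s")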

Turning to latencies, we see fairly uniform communication times between GPUs at ~3 microseconds:

P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.63   2.98   2.99   2.96   3.01   2.96   2.96   3.00
     1   3.02   2.59   2.96   3.00   3.03   2.96   2.96   3.03
     2   3.02   2.95   2.51   2.97   3.03   3.04   3.02   2.96
     3   3.05   3.01   2.99   2.49   2.99   2.98   3.06   2.97
     4   2.88   2.88   2.95   2.87   2.39   2.87   2.90   2.88
     5   2.87   2.95   2.89   2.87   2.94   2.49   2.87   2.87
     6   2.89   2.86   2.86   2.88   2.93   2.93   2.53   2.88
     7   2.90   2.90   2.94   2.89   2.87   2.87   2.87   2.54

   CPU     0      1      2      3      4      5      6      7
     0   4.54   3.86   3.94   4.10   3.92   3.93   4.07   3.92
     1   3.99   4.52   4.00   3.96   3.98   4.05   3.92   3.93
     2   4.09   3.99   4.65   4.01   4.00   4.01   4.00   3.97
     3   4.10   4.01   4.03   4.59   4.02   4.03   4.04   3.95
     4   3.89   3.91   3.83   3.88   4.29   3.77   3.76   3.77
     5   4.20   3.87   3.83   3.83   3.89   4.31   3.89   3.84
     6   3.76   3.72   3.77   3.71   3.78   3.77   4.19   3.77
     7   3.86   3.79   3.78   3.78   3.79   3.83   3.81   4.27

As with the bandwidths, the values down the diagonal show execution within that particular GPU. Latencies are lower when executing within a single GPU as there’s no need to hop across the bus to NVSwitch or another GPU. These values show that the same-device latencies are 0.3 to 0.5 microseconds faster than when communicating with a different GPU via NVSwitch.

Finally, we want to share the full DGX A100 topology as reported by the nvidia-smi topo --matrix utility. While there is a lot to digest, the main takeaways from this connectivity matrix are the following:

  • all GPUs have full NVLink connectivity (12 links each)
  • each pair of GPUs is connected to a pair of Mellanox adapters via a PXB PCI-E switch
  • each pair of GPUs is closest to a particular set of CPU cores (CPU and NUMA affinity)
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	112-127,240-255	7
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	80-95,208-223	5

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Host-to-Device Transfer Speeds with PCI-Express generation 4.0

Just as it’s important for the GPUs to be able to communicate with each other, the CPUs must be able to communicate with the GPUs. A100 is the first NVIDIA GPU to support the new PCI-E gen4 bus speed, which doubles the transfer speeds of generation 3. True to expectations, NVIDIA bandwidthTest demonstrates 2X speedups on transfer speeds from the system to each GPU and from each GPU to the system:

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			24.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			26.1

As you might notice, these performance values are right in line with the throughput of each Mellanox 200Gbps adapter. Having eight network adapters with the exact same bandwidth as each of the eight GPUs allows for perfect balance. Data can stream into each GPU from the fabric at line rate (and vice versa).

Diving into the NVIDIA A100 SXM4 GPUs

The DGX A100 is unique in leveraging NVSwitch to provide the full 300GB/s NVLink bandwidth (600GB/s bidirectional) between all GPUs in the system. Although it’s possible to examine a single GPU within this platform, it’s important to keep in mind the context that the GPUs are tightly connected to each other (as well as their linkage to the EPYC CPUs and the Mellanox adapters). The single-GPU information we share below will likely match that shown for A100 SXM4 GPUs in other non-DGX systems. However, their overall performance will depend on the complete system architecture.

To start, here is the ‘brief’ dump of GPU information as provided by nvidia-smi on DGX A100:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    60W / 400W |      0MiB / 40537MiB |      7%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |      0MiB / 40537MiB |     14%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                    0 |
| N/A   30C    P0    62W / 400W |      0MiB / 40537MiB |     24%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    58W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    62W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                    0 |
| N/A   33C    P0    60W / 400W |      0MiB / 40537MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                    0 |
| N/A   34C    P0    65W / 400W |      0MiB / 40537MiB |     22%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0    63W / 400W |      0MiB / 40537MiB |     21%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The clock speed and power consumption of each GPU will vary depending upon the workload (running low when idle to conserve energy and running as high as possible when executing applications). The idle, default, and max boost speeds are shown below. You will note that memory speeds are fixed at 1215 MHz.

    Clocks
        Graphics                          : 420 MHz (GPU is idle)
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1215 MHz

Those who have particularly stringent efficiency or power requirements will note that the NVIDIA A100 SXM4 GPU supports 81 different clock speeds between 210 MHz and 1410MHz. Power caps can be set to keep each GPU within preset limits between 100 Watts and 400 Watts. Microway’s post on nvidia-smi for GPU control offers more details for those who need such capabilities.
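
As a hedged example of what such tuning looks like in practice, the sketch below wraps a few standard nvidia-smi options (querying supported clocks, setting a power limit, and setting application clocks). The specific values are illustrative choices within the ranges listed above, and the commands require administrative privileges.

import subprocess

GPU = "0"

# List the supported <memory, graphics> clock pairs for this GPU
subprocess.run(["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS", "-i", GPU], check=True)

# Cap the board at 300 W instead of the default 400 W
subprocess.run(["nvidia-smi", "-i", GPU, "-pl", "300"], check=True)

# Lock application clocks to 1215 MHz memory / 1095 MHz graphics (the default application clock)
subprocess.run(["nvidia-smi", "-i", GPU, "-ac", "1215,1095"], check=True)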

Each new generation of NVIDIA GPUs introduces new architecture capabilities and adjustments to existing features (such as resizing cache). Some details can be found through the deviceQuery utility, which reports the CUDA capabilities of each NVIDIA A100 GPU device:

  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40537 MBytes (42506321920 bytes)
  (108) Multiprocessors, ( 64) CUDA Cores/MP:     6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes

In the NVIDIA A100 GPU, NVIDIA increased cache & global memory size, introduced new instruction types, enabled new asynchronous data copy capabilities, and more. More complete information is available in our Knowledge Center article which summarizes the features of the Ampere GPU architecture. However, it could be argued that the biggest architecture change is the introduction of MIG.

Multi-Instance GPU (MIG)

For years, virtualization has allowed CPUs to be virtually broken into chunks and shared between a wide group of users and/or applications. One physical CPU device might be simultaneously running jobs for a dozen different users. The flexibility and security offered by virtualization has spawned billion dollar businesses and whole new industries.

NVIDIA GPUs have supported multiple users and virtualization for a couple of generations, but NVIDIA A100 GPUs with MIG are the first to support physical separation of those tasks. In essence, one GPU can now be sliced into up to seven distinct hardware instances. Each instance then runs its own completely independent applications with no interruption or “noise” from other applications running on the GPU:

Diagram of NVIDIA Multi-Instance GPU demonstrating seven separate user instances on one GPU
NVIDIA Multi-Instance GPU supports seven separate user instances on one GPU

The MIG capabilities are significant enough that we won’t attempt to address them all here. Instead, we’ll highlight the most important aspects of MIG. Readers needing complete implementation details are encouraged to reference NVIDIA’s MIG documentation.

Each GPU can have MIG enabled or disabled (which means a DGX A100 system might have some shared GPUs and some dedicated GPUs). Enabling MIG on a GPU has the following effects:

  • One NVIDIA A100 GPU may be split into anywhere between 2 and 7 GPU Instances
  • Each of the GPU Instances receives a dedicated set of hardware units: GPU compute resources (including streaming multiprocessors/SMs, and GPU engines such as copy engines or NVDEC video decoders), and isolated paths through the entire memory system (L2 cache, memory controllers, and DRAM address busses, etc)
  • Each of the GPU Instances can be further divided into Compute Instances, if desired. Each Compute Instance is provided a set of dedicated compute resources (SMs), but all the Compute Instances within the GPU Instance share the memory and GPU engines (such as the video decoders)
  • A unique CUDA_VISIBLE_DEVICES identifier will be created for each Compute Instance and the corresponding parent GPU Instance. The identifier follows this convention:
    MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>
  • Graphics API support (e.g. OpenGL etc.) is disabled
  • GPU to GPU P2P (either PCI-Express or NVLink) is disabled
  • CUDA IPC across GPU instances is not supported (though IPC across the Compute Instances within one GPU Instance is supported)

Though the above caveats are important to note, they are not expected to be significant pain points in practice. Applications which require NVLink will be workloads that require significant performance and should not be run on a shared GPU. Applications which need to virtualize GPUs for graphical applications are likely to use a different type of NVIDIA GPU.

Also note that the caveats don’t extend all the way through the CUDA capabilities and software stack. The following features are supported when MIG is enabled:

  • MIG is transparent to CUDA and existing CUDA programs can run under MIG unchanged
  • CUDA MPS is supported on top of MIG
  • GPUDirect RDMA is supported when used from GPU Instances
  • CUDA debugging (e.g. using cuda-gdb) and memory/race checking (e.g. using cuda-memcheck or compute-sanitizer) is supported

When MIG is fully-enabled on the DGX A100 system, up to 56 separate GPU Instances can be executed simultaneously. That could be 56 unique workloads, 56 separate users each running a Jupyter notebook, or some other combination of users and applications. And if some of the users/workloads have more demanding needs than others, MIG can be reconfigured to issue larger slices of the GPU to those particular applications.
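
As a minimal sketch of how jobs get steered to individual slices, the snippet below launches one worker per Compute Instance by exporting CUDA_VISIBLE_DEVICES with the identifier convention shown above. The UUID and instance IDs are placeholders (on a real system you would list them with nvidia-smi -L after creating the instances), and worker.py is a stand-in for whatever each user runs.

import os
import subprocess

# Placeholder identifiers following the MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID> convention
mig_devices = [
    "MIG-<GPU-UUID>/1/0",
    "MIG-<GPU-UUID>/2/0",
]

procs = []
for dev in mig_devices:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=dev)
    # Each worker only sees (and can only touch) its own slice of the GPU
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))

for p in procs:
    p.wait()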

DGX A100 Review Summary

DGX-POD with DGX A100

As mentioned at the top, this new hardware is quite impressive, but is only one part of the DGX story. NVIDIA has multiple software stacks to suit the broad range of possible uses for this system. If you’re just getting started, there’s a lot left to learn. Depending upon what you need next, I’d suggest reaching out to Microway’s DGX experts to map out the right direction.

What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science

NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive Pubmed literature search project on a Data Science WhisperStation in hours versus weeks.

Kyle Gallatin, an engineer at Pfizer, has deep data science credentials; he’s been working on such projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:

When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

Experimentation, Performance, and GPU Accelerated Data Science Tooling

Gallatin used the Data Science WhisperStation to rapidly create an accelerated data science workflow for a healthcare use case, and he told us about his experience. It was a remarkable one.

Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!

The Data Science Workstation allowed him to design a Pubmed healthcare article search engine where he:

  1. Ingested a larger database than ever imagined (30,000,000 research article abstracts!)
  2. Didn’t require massive code changes to GPU accelerate the algorithm
  3. Used familiar looking tools for his workflow
  4. Had unsurpassed agility: he could search large portions of the abstract database in 0.1 seconds! (A minimal sketch of this kind of GPU-accelerated search appears after this list.)
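
To make the workflow concrete, here is a hedged sketch of the kind of GPU-accelerated filtering described above, using the RAPIDS cuDF library. The file name and column name are hypothetical, and the real project used a more sophisticated search pipeline; this only illustrates how familiar, pandas-like code runs on the GPU.

import cudf

# Load a (hypothetical) table of article abstracts directly into GPU memory
df = cudf.read_parquet("abstracts.parquet")

# String matching runs on the GPU; even on tens of millions of rows it returns in a fraction of a second
hits = df[df["abstract"].str.contains("sepsis")]
print(f"{len(hits)} matching abstracts out of {len(df)}")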

That last point about agility is really critical and shows why we believe the NVIDIA Data Science Workstation Platform and its RAPIDS tools are so special. As Kyle put it:

Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally-intensive process, only to be blindsided by an unforeseen error literal hours into running it. Then you have to correct it and start all over again.

Walkthrough with Step-by-Step Instructions

The new article is available on Medium. It provides a complete step-by-step walkthrough of how NVIDIA RAPIDS tools and the NVIDIA Quadro RTX 6000 with NVLink were utilized to revolutionize this process.

A short set of Kyle’s key findings about the environment and the hardware are below. We’re excited about how this kind of rapid development could change healthcare:

Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run

Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDS, TensorFlow, PyTorch, and everything else installed and configured out-of-the-box saved hours in setup time

…

With these general-purpose data science libraries offering massive computational enhancements for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc…), the path is paved to an entirely new frontier of data science.

Read on at Medium.com

Multi-GPU Scaling of MLPerf Benchmarks on NVIDIA DGX-1

In this post, we discuss how the training of deep neural networks scales on DGX-1. Considering 6 models across 4 out of 5 popular domains covered in the MLPerf v0.5 benchmarking suite, we discuss the time to state-of-the-art accuracy as set by MLPerf. We also highlight the models that scale well and should be trained on larger numbers of GPUs. Models with poor scalability should be trained on fewer GPUs, which allows for resource sharing among multiple users. As such, we provide insight into common deep learning workloads and how to best leverage the multi-GPU DGX-1 deep learning system for training the models.

MLPerf – a benchmarking suite for deep learning applications

Just as HPC system design is evolving to achieve good performance for Deep Learning applications, there is also an ever-increasing need for a good set of benchmarks to quantify this performance. Many benchmarking tools have been proposed. For example, Baidu Research released DeepBench, which focuses on basic operations involved in neural networks like convolution, GEMM, recurrent layers, and all-reduce, yet it provides no way to compare different systems/workstations or even software frameworks. TensorFlow introduced TF_CNN_BENCH, which is single-domain and benchmarks only convolutional network-based deep-learning workloads. With a diversity of workloads and a variety of different hardware configurations, we need a more general approach to benchmarking deep learning applications.

With support from both industry and universities, and inspired by the SPEC and TPC standards, MLPerf is a leading choice as a set of benchmarks covering different areas of Machine Learning. The goals are multi-fold: fair comparison of different hardware configurations and software frameworks, while encouraging innovation and easy reproducibility of results.

MLPerf suite includes Image Classification, Object Detection (light and heavy), Language Translation (Recurrent and Non-Recurrent), Recommendation Systems, and Reinforcement Learning benchmarks. The suite is divided into two divisions: Closed and Open. In the Closed division the data preprocessing, training method, and model must be the same as the MLPerf reference implementation. Only very limited changes to hyperparameters are allowed. This aims for fair comparison of different deep learning hardware platforms. In the Open division any model, preprocessing, or training method can be used.

Version v0.5 received no submissions to the Open division. However, Google, NVIDIA, and Intel made submissions to the Closed division. Only Google (on a cloud instance) and NVIDIA submitted GPU-accelerated results. No GPU submissions were made for the reinforcement learning benchmark, but Intel did submit a CPU-only result on Skylake processors. Software frameworks varied from TensorFlow v1.12 to MXNet (for image classification) and PyTorch (for the rest of the domains).

The results discussed in this post largely replicate NVIDIA’s submission in the Closed Model Division of MLPerf v0.5.0 for training. This division places restrictions on modifying hyperparameters like learning rate and batch size to provide a fair comparison of hardware/software systems. However, minor changes were required to successfully train on small numbers of GPUs. All our changes are reflected in the below log files for interested folks who want to dive deeper. We performed scaling analysis on 1, 4, and 8 GPUs on DGX-1. Our findings help deep learning practitioners and researchers determine the best options for their deep learning problem/application(s).

Training Deep Neural Networks

Training deep neural networks can be a formidable task. With millions of parameters, the model risks overfitting the training data. The deep layers in the model can have extreme gradients that lead to vanishing/exploding gradient problems. Even after accounting for all these pitfalls, the training of a network can be really slow. As a non-convex optimization problem, there can be multiple solutions and training neural networks boils down to finding a right selection of hyperparameters in order to achieve a certain threshold of accuracy. This can be done by manually tuning parameters, observing a low generalization error, and reiterating with a different combination of values until reaching the desired accuracy. When there are only a few hyperparameters, a grid search can be applied, which is more computationally intensive. A range of discrete values for each parameter is selected and the model is trained on every combination of parameters as described by the Cartesian product (grid) of the values chosen.
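
The grid search mentioned above is simple to express in code. The sketch below (standard Python library only) enumerates the Cartesian product of a few discrete hyperparameter values; train_and_score is a placeholder for an actual training-and-validation routine.

import itertools

grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [256, 512, 1024],
    "momentum": [0.9, 0.95],
}

def train_and_score(**params):
    # Placeholder: train the model with these hyperparameters and return validation accuracy
    return 0.0

best_score, best_params = None, None
for values in itertools.product(*grid.values()):         # every combination in the grid
    params = dict(zip(grid.keys(), values))
    score = train_and_score(**params)
    if best_score is None or score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)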

The following is a brief description of each model being used in the MLPerf benchmarks:

  1. Convolutional Neural Networks (CNN): Most widely used for image processing and pattern recognition applications like object detection/localization, human pose estimation, scene recognition; also for certain non-image workflows (e.g., processing acoustic, seismic, radio, or radar signals). In general, any data that has a grid-like topology can be processed using CNNs. Typical CNNs consist of convolutional layers, pooling layers, and fully connected layers. The convolution operation involves convolving a filter on the image, which extracts features in a local region of the image. In any image, pixels far apart from each other are essentially unrelated, while nearby pixels are correlated. The size of the filter, stride, and padding are some of the hyperparameters that need proper tuning. Pooling layers are used to reduce the number of parameters in the network, in turn reducing the number of computations. Fully connected layers help in classifying images based on the features extracted by the convolution layers. The MLPerf benchmarks Image Classification, Single Stage Detector, and Object Detection make use of a special type of CNN called ResNet. Introduced by Microsoft, ResNet [1] won the ILSVRC 2015 challenge and continues to lead. ResNets consist of residual blocks which ease the process of training extremely deep networks. A residual connection is a shortcut from one layer to another usually after skipping a few layers, basically copying the output from one layer and adding it to another layer just before applying non-linearity. MLPerf benchmarks Image Classification and Object Detection use ResNet-50 (50 layers) while the Single-Stage detector uses ResNet-34 (34 layers) as the backbone. (A minimal sketch of a residual block appears after this list.)
  2. Recurrent Neural Network (RNN): RNNs are interesting neural networks that offer a lot of flexibility in designing the model. It lets you operate with sequenced data at input, output, or both.  For example, in image captioning with a fixed-size image input, where the RNN model generates a sequence of words describing the contents of the image. In the case of sentiment analysis, the input is a sequence of words and the output is the sentiment of the sentence: whether it is good (positive) or bad (negative).  The MLPerf RNN benchmark uses the sequenced input and sequenced output model, similar to Google’s Neural Machine Translation (GNMT). GNMT has 3 components: an encoder, a decoder, and an attention network. The encoder modifies the input sequence into a list of vectors and the decoder decodes the vector into another sequence of words as an output. The encoder and decoder are connected via an attention network that allows for giving attention to different parts of the input sentence/sequence while decoding. For a more detailed description of the model, read the GNMT [2] paper.
  3. Transformers : A Transformer is a new type of sequence-to-sequence architecture for machine translation that uses both an encoder and a decoder, but does not use Recurrent layers like LSTMs or GRUs. Transformers are a new advancement in NLP which perform better than RNNs. A typical Transformer model would have an encoder and a decoder, with both containing modules like ‘Multi-Head Attention’ and ‘Feed Forward layers’. Since there is no RNN, there is no way of knowing the order of the words fed to the network. Therefore, we need part of the model to have a positional encoding of the words in the sequence. The source language sequence is fed to the encoder and the corresponding target language sequence is fed into the decoder, but shifted by a position. The model tries to predict the next word in the target sequence while having seen only the words prior to that position, and avoids simply copying the decoder sequence as the output. For more detailed model description, read the Attention is all you need [3] paper.
  4. Neural Collaborative Filtering (NCF): Many online services (e.g., e-commerce, social networking) provide their customers with millions of options to choose from. With digital transformation resulting in huge amounts of data overload, it’s almost impossible to browse through an entire online collection. Recommender systems are needed to filter these options and help users make selections. Collaborative Filtering models the past interactions between the user and the collection. This essentially boils down to a Matrix Factorization problem where the user and collection are projected onto a latent space and the similarity (using the inner product) between the latent vectors is computed. The predictions are based on similarities. However, the inner product is not a good choice of function to model complex interactions, so an alternate approach was devised: using a neural architecture to learn an arbitrary function from the data. This approach is known as Neural Collaborative Filtering (NCF) [4]. Both the user and collection are represented as one-hot encoded vectors in the input layer (sparse). A fully-connected (Embedding) layer projects this sparse representation to a dense vector. The output of the embedding layer is then fed into the Neural CF layers where each layer can learn certain structure among the interactions.
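
Item 1 above describes the residual connection at the heart of ResNet. A minimal PyTorch sketch is shown below, assuming PyTorch is installed; it is an illustrative basic block, not the exact ResNet-50 bottleneck block used in the MLPerf reference implementations.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)       # shortcut: add the input back just before the non-linearity

x = torch.randn(8, 64, 56, 56)
print(ResidualBlock(64)(x).shape)       # torch.Size([8, 64, 56, 56])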

MLPerf Scaling on NVIDIA DGX-1

The MLPerf results submitted by NVIDIA make use of single-node and multi-node DGX-1 and DGX-2 systems, utilizing the entirety of the systems to train a single network. Our post discusses how performance scales when using a single DGX-1 (using 1, 4, or all 8 NVIDIA Tesla GPUs). This is important to understand how a single DGX-1 system can be used as a shared resource among multiple users, or to be used to run multiple cases of the same problem. It also helps establish which deep learning domains require the training to be done on a large scale.

Image Classification

Trained on the ILSVRC2012 dataset with 1.2 million images, this benchmark scales well. It achieves better than linear speedups going from 1 to 4 (~5x) and 1 to 8 GPUs (~10x). DGX users will achieve better throughput if they use the full system for each job.

Figure 1. Evaluation accuracy vs Epochs for Image Classification.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 512        | 16.21                        | 83               | fp-16
4      | 1664       | 4.56                         | 63               | fp-16
8      | 1664       | 2.2002                       | 63               | fp-16

Table 1. Synopsis of Image Classification benchmarks

Figure 1 shows the validation accuracy versus the number of epochs it took to reach that accuracy. The accuracy set by MLPerf for this benchmark is 74.9%. The 4- and 8-GPU runs achieve this accuracy in the same number of epochs; however, the average time for each epoch is different, as reported in Table 1. For the single-GPU run, the batch size needed to be reduced in order to avoid “Out of Memory (OOM)” errors. With less data being processed per epoch on a single GPU compared to 4 and 8 GPUs, it took more epochs to train the model to the same accuracy.

Object Detection – Heavy

This is the heaviest workload among all the benchmarks considered in MLPerf. Utilizing the full DGX-1, it takes ~325 minutes to train on the COCO2014 dataset. The model used is the same ResNet-50 as the Image-Classification benchmark. The speedup obtained is ~2.5x going from 1 to 4 GPUs and ~6x when going from 1 to 8 GPUs (which is sub-linear).

Figure 2. Mask mAP and Bounding Box mAP vs Epochs for heavy Object Detection.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 2          | 179.183                      | 11               | fp-16
4      | 4          | 44.315                       | 18               | fp-16
8      | 4          | 24.9895                      | 13               | fp-16

Table 2. Synopsis of Object Detection (heavy) benchmarks

Figures 2a and 2b show the accuracy plots for the heavy object detection benchmark. There are two different accuracy thresholds here: BBOX (Fig. 2b), which stands for Bounding Box accuracy, and SEGM (Fig. 2a), which stands for Instance Segmentation. Simply put, an object detection problem requires that the object be correctly located within the image and also that the object be correctly identified/categorized. Instance segmentation refers to labeling each pixel in the image with the object instance it belongs to.

Object Detection – Light

The lightweight object detection benchmark makes use of the COCO2017 dataset and scales with close to linear speedups: about ~3.7x going from 1 to 4 GPUs, and ~7.3x going from 1 to 8 GPUs. Total runtime varies from more than 3 hours on a single GPU to less than half an hour on 8 GPUs.

Figure 3. Accuracy vs Epoch for Single Stage Detector

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 152        | 4.080                        | 49               | fp-16
4      | 152        | 1.115                        | 49               | fp-16
8      | 152        | 0.5628                       | 49               | fp-16

Table 3. Synopsis of SSD benchmark.

Figure 3 shows the accuracy plots for the single stage detector benchmark. The evaluation of the model occurs only at epochs 32, 43, and 48 – hence the 3 data points in the plot. This, of course, can be modified to evaluate more often and produce more data points for the plot; however, we stuck to the default values.

Language Translation – Recurrent (GNMT) and Non-Recurrent (Transformer)

The Recurrent model is trained on the WMT16 English-German dataset and the Transformer model is trained on the WMT17 EN-DE dataset. Both language translation models scale well; however, the Transformer not only scales better but also achieves a higher accuracy target, though it requires more total training time.

Figure 4. BLEU score vs Epochs for Google’s NMT and Transformer Translation models.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 512        | 39.98                        | 5                | fp-16
4      | 512        | 12.31                        | 5                | fp-16
8      | 1024       | 6.4                          | 3                | fp-16

Table 4. Synopsis of RNN benchmark for Language Translation (GNMT)

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 5120       | 60.34                        | 8                | fp-16
4      | 5120       | 22.38                        | 4                | fp-16
8      | 5120       | 7.648                        | 4                | fp-16

Table 5. Synopsis of Non-Recurrent benchmark for Language Translation (Transformer)

Figures 4a and 4b show the validation accuracy plots vs epochs for the language translation models. Google’s NMT uses a Recurrent Neural Network based model and achieves an accuracy of 21.80 BLEU. The Transformer model, a newer approach to language translation which does not use a Recurrent Neural Network, performs better, achieving a higher quality target of 25.00 BLEU.

Tables 4 and 5 show the synopses for these benchmarks. The length of the sequence is a key parameter for a Recurrent model and does affect the scaling.

Recommendation Systems

This is the quickest benchmark to run. Even on a single GPU, it only takes a little over a minute to train to the desired accuracy of 0.635. The speedups are ~1.8x and ~2.8x when going from 1 to 4 and 1 to 8 GPUs, respectively.

Figure 5. Evaluation accuracy vs Epochs of the Neural Collaborative Filtering model for Recommendation Systems.

# GPUs | Batch Size | Average Time per Epoch (min) | Number of Epochs | Precision
1      | 1048576    | 0.135384615                  | 13               | fp-16
4      | 1048576    | 0.076923077                  | 13               | fp-16
8      | 1048576    | 0.048461538                  | 13               | fp-16

Table 6. Synopsis of Recommendation Systems benchmark

Figure 5 shows the accuracy plots for the recommendation benchmark. All the plots in the figure are quite close to each other, which suggests that it’s not a cost-effective strategy to use multiple GPUs for this type of workload. The benefit of using a machine like DGX-1 for such workloads is to run multiple cases, each on a single GPU. Dedicating an entire DGX-1 to a single training will reduce the training time, but is not as efficient if overall throughput is the goal.

MLPerf Scaling Results

This section summarizes the scaling results and discusses the speedups. Figure 6 shows the scaling analysis of six MLPerf benchmarks on 1, 4, and 8 GPUs on an NVIDIA DGX-1 (with Tesla V100 32GB GPUs). A general conclusion to draw from the figure is that not all of the models scale the same way, although most of them scale well. The better a model scales, the more efficiently you can train networks on large resources (an entire DGX or a cluster of DGX systems).

Figure 6. Scaling plots on 1-4-8 GPUs for MLPerf v0.5 Closed Model Division Benchmarks submitted by NVIDIA. The X-axis shows the number of GPUs and the Y-axis shows the training time to desired accuracy in minutes (the metric set by MLPerf). The inset axis shows a zoomed in view of the plot.

We see substantial speedups for the Image Classification and Transformer Translation benchmarks (both are super-linear, running more quickly the more GPUs are added). The Single-Stage Detector and Mask R-CNN Object Detection benchmarks remain close to linear, while the RNN benchmark goes from a linear speedup on 4 GPUs to a super-linear speedup on 8 GPUs (which indicates that all of the above will scale efficiently). The Recommendation benchmark scales poorly, with fairly insignificant time savings when run on many GPUs. Table 7 lists the speedups for all benchmarks, with each speed-up calculated as the ratio of total training time on a single GPU to the total training time on multiple GPUs.
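
As a worked example of that calculation, the snippet below recomputes the Object Detection (heavy) speedups from the per-epoch times and epoch counts in Table 2, matching the corresponding row of Table 7.

def speedup(t_epoch_1, epochs_1, t_epoch_n, epochs_n):
    # total training time = average time per epoch x number of epochs
    return (t_epoch_1 * epochs_1) / (t_epoch_n * epochs_n)

print(f"1 -> 4 GPUs: {speedup(179.183, 11, 44.315, 18):.2f}x")   # ~2.47x
print(f"1 -> 8 GPUs: {speedup(179.183, 11, 24.9895, 13):.2f}x")  # ~6.07x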

For a more detailed understanding of hyperparameters used to train these models, please reference the log files below [10].

Benchmark                     | Speed Up (1-4 GPU) | Speed Up (1-8 GPU)
Image Classification          | 4.76               | 9.70
Single Stage Detector         | 3.66               | 7.25
Object Detection              | 2.47               | 6.066
RNN GNMT                      | 3.24               | 10.411
Transformer Translation       | 5.392              | 15.778
Recommendation Systems (NCF)  | 1.76*              | 2.789*

(*) Recommendation systems is not a good benchmark for studying scaling analysis of deep learning workloads, since it is the quickest of the bunch and the achieved speedup is on the order of seconds.

Table 7. Speed Ups for all the benchmarks going from 1 to 4 to 8 GPUs

Based on the results, a general takeaway message would be to select systems based on the type of deep learning application one is trying to build. We see that the recommendation systems benchmark doesn’t scale well, which suggests that such projects should limit multi-GPU training and instead share the resources (either shared between multiple users or between multiple models).  On the other hand, if your team trains neural networks on large image sets (image classification, object localization, object detection, instance segmentation), using multi-GPU systems is crucial for quick results.

Next Steps for Successful Deep Learning Deployment

Of course, a powerful compute resource is just one part of successful deep learning implementation. Depending upon your project needs and the anticipated growth of your datasets, storage requirements may eclipse compute requirements. Connectivity also becomes critical, as neural network training stresses system and network I/O.

Whether you are planning a new project or looking to improve your existing deep learning practice, Microway’s team would be happy to help you define the requirements and deliver a successful solution. With experience in everything from GPU workstations to DGX-2 SuperPODS, our experts can ensure the deployment meets your needs. Contact an AI expert today!

References

[1] Deep Residual Learning for Image Recognition
[2] Google’s Neural Machine Translation System
[3] Attention Is All You Need
[4] Neural Collaborative Filtering
[5] Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
[6] Demystifying Hardware Infrastructure Choices for Deep Learning Using MLPerf
[7] MLPerf v0.5 Training Results
[8] Mask R-CNN for Object Detection
[9] Single Shot MultiBox Detector
[10] Training results log files

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks

Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision and with and without PCIE transfer consideration. This means that for each test there are up to four results for each benchmark. These benchmarks are organized into three levels and can be run individually or all together.

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

GPU                              | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)               | 253.38   | 7072.86    | 4736.76
Fast Fourier Transform (GFLOPS)  | 132.60   | 1148.75    | 756.29
Matrix Multiplication (GFLOPS)   | 249.57   | 5920.01    | 4256.08
Molecular Dynamics (GFLOPS)      | 105.26   | 908.62     | 402.96
S3D (GFLOPS)                     | 59.97    | 227.85     | 161.54

Single Precision Results

GPU                              | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS)               | 8073.26  | 14016.50   | 9322.46
Fast Fourier Transform (GFLOPS)  | 660.05   | 2301.32    | 1510.49
Matrix Multiplication (GFLOPS)   | 3290.94  | 13480.40   | 8793.33
Molecular Dynamics (GFLOPS)      | 572.91   | 997.61     | 480.02
S3D (GFLOPS)                     | 99.42    | 434.78     | 295.20

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls behind by a few additional percent.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.
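
To make that context concrete, below is a quick back-of-the-envelope calculation of performance per watt and performance per dollar for the single-precision matrix multiplication results above. The TDP and price figures are the approximate values quoted earlier in this article (taking 250W for the P100 and 300W for the V100; street prices will vary), and the script is just an illustrative sketch:

# Rough performance-per-watt and performance-per-dollar comparison for the
# single-precision matrix multiplication results above. TDP and price figures
# are the approximate values quoted in this article; exact street prices vary.
gpus = {
    #              SGEMM GFLOPS   TDP (W)   approx. price (USD)
    "Tesla T4":   ( 3290.94,       70,       2500),
    "Tesla P100": ( 8793.33,      250,      10000),
    "Tesla V100": (13480.40,      300,      10000),
}

for name, (gflops, watts, price) in gpus.items():
    print("%-10s  %6.1f GFLOPS/W   %5.2f GFLOPS/$" % (name, gflops / watts, gflops / price))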

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 offers substantial machine-learning-focused performance in single and mixed precision, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. That single-precision performance will strongly benefit machine learning algorithms, with the potential to be extended to mixed precision. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithmic adaptation, the T4 may be a strong contender for scientific applications that also want to use machine learning capabilities to analyze results, or that run a variety of algorithms from both machine learning and scientific computing on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Between the lower purchase price and the roughly one-quarter power draw, it is both cheaper to buy and cheaper to operate.

Next Steps for leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload, but you'd like to benchmark, please sign up for a Test Drive to benchmark it for yourself. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on the best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

The post NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks appeared first on Microway.

NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management https://www.microway.com/hpc-tech-tips/nvidia-dcgm-for-gpu-management/ Mon, 02 Apr 2018 03:58:47 +0000

Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs to those servers brings even more capability, but also new levels of granularity. Luckily, there's a powerful and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager (DCGM).

Executing hardware or health checks

DCGM's power comes from its ability to access all kinds of low-level data from the GPUs in your system. Much of this data is reported by NVML (NVIDIA Management Library), and some of it may be accessible via IPMI on your system. But DCGM makes it far easier to access and use the following:

Report what GPUs are installed, in which slots and PCI-E trees and make a group

Build a group of GPUs once you know which slots your GPUs are installed in, and which PCI-E trees and NUMA nodes they are on. This is great for binding jobs to the right devices and for linking jobs to the capabilities available on a node.

Determine GPU link states, bandwidths

Provide a report of the PCI-Express link speed each GPU is running at. You may also perform device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system, and then take action on those reports.

Read temps, boost states, power consumption, or utilization

Deliver data on the energy usage and utilization of your GPUs. This data can be used to help control the cluster.

Driver versions and CUDA versions

Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system

Run sample jobs and integrated validation

Run basic diagnostics and sample jobs that are built into the DCGM package.

Set policies

DCGM provides a mechanism to set policies for a group of GPUs.

Policy driven management: elevating from “what’s happening” to “what can I do”

Simply accessing data about your GPUs is only of modest use. The power of DCGM is in how it arms you to act upon that data. DCGM allows administrators to take programmatic or preventative action when something isn't right.

Here are a few scenarios where data provided by DCGM allows for both powerful control of your hardware and concrete action:

Scenario 1: Healthchecks – periodic or before the job

Run a check before each job, after a job, or daily/hourly to ensure a cluster is performing optimally.

This allows you to preemptively stop a run if diagnostics fail or move GPUs/nodes out of the scheduling queue for the next job.

Scenario 2: Resource Allocation

Jobs often need a certain class of node (ex: with >4 GPUs or with IB & GPUs on the same PCI-E tree). DCGM can be used to report on the capabilities of a node and help identify appropriate resources.

Users/schedulers can subsequently send jobs only to nodes where they are capable of being executed.

Scenario 3: “Personalities”

Some codes request specific CUDA or NVIDIA driver versions. DCGM can be used to probe the CUDA version/NVIDIA GPU driver version on a compute node.

Users can then script the deployment of alternate versions or launch containerized apps to support non-standard versions.

Scenario 4: Stress tests

Periodically stress test the GPUs in a cluster with integrated functions.

Stress tests like Microway GPU Checker can tease out failing GPUs, and reading data via DCGM during or after can identify bad nodes to be sidelined.

Scenario 5: Power Management

Programmatically set GPU Boost or max TDP levels for an application or run. This allows you to eke out extra performance.

Alternatively, set your GPUs to stay within a certain power band to reduce electricity costs when rates are high, or to lower total cluster consumption when there is insufficient generation capacity.
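
As a minimal sketch of what this might look like in practice, the snippet below caps each GPU in a node at a chosen power limit by shelling out to nvidia-smi (DCGM's policy and configuration interfaces can drive the same settings). The 200W limit and 4-GPU count here are illustrative assumptions, and administrative privileges are required:

import subprocess

# Minimal sketch (assumes root privileges and that nvidia-smi is in PATH):
# cap each GPU in the node at a chosen power limit, e.g. during peak
# electricity rates. DCGM can drive the same settings programmatically.
POWER_LIMIT_WATTS = 200   # example value -- choose within the GPU's supported range

def set_power_limit(gpu_index, watts):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )

for gpu in range(4):   # adjust to the number of GPUs in the node
    set_power_limit(gpu, POWER_LIMIT_WATTS)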

Scenario 6: Logging for Validation

Script the pull of error logs and take action with that data.

You can accumulate error logging over time and determine tendencies of your cluster. For example, a group of GPUs with consistently high temperatures may indicate a hotspot in your datacenter.
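
A minimal logging sketch along these lines is shown below. It samples temperature, power draw, and utilization once per minute by querying nvidia-smi and appends the readings to a CSV file; the same counters are exposed through DCGM/NVML, and the one-minute interval and file name are arbitrary choices for illustration:

import csv, subprocess, time

# Sample temperature, power draw, and utilization for every GPU once per
# minute and append the readings to a CSV file. The same counters are
# available through DCGM/NVML; nvidia-smi keeps this example short.
QUERY = "index,temperature.gpu,power.draw,utilization.gpu"

def sample():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return [row.split(", ") for row in out.strip().splitlines()]

with open("gpu_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for row in sample():
            writer.writerow([timestamp] + row)
        f.flush()
        time.sleep(60)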

Getting Started with DCGM: Starting a Health Check

DCGM can be used in many ways. We won’t explore them all here, but it’s important to understand the ease of use of these capabilities.

Here’s the code for a simple health check and also for a basic diagnostic:

dcgmi health --check -g 1
dcgmi diag -g 1 -r 1

The syntax is very standard: dcgmi, the command, and the group of GPUs to operate on (you must set up a group first). For the diagnostic, you also include the level of diagnostics requested (-r 1, the lowest level, here).
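
As a sketch of how these commands might be wired into a scheduler prolog, the script below runs the same two dcgmi commands and exits non-zero if either reports a problem, so the scheduler can drain the node or skip the job. It assumes a GPU group with ID 1 has already been created (see dcgmi group --help) and that dcgmi returns a non-zero exit code on failure:

import subprocess, sys

# Prolog-style wrapper around the dcgmi commands shown above: run a health
# check and a quick diagnostic on GPU group 1, and exit non-zero if either
# fails so the scheduler can drain the node or skip the job.
CHECKS = [
    ["dcgmi", "health", "--check", "-g", "1"],
    ["dcgmi", "diag", "-g", "1", "-r", "1"],
]

for cmd in CHECKS:
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print("GPU health check failed: %s" % " ".join(cmd), file=sys.stderr)
        sys.exit(1)

print("GPU health checks passed")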

DCGM and Cluster Management

While the Microway team loves advanced scripting, you may prefer integrating DCGM or its capabilities with your existing schedulers or cluster managers; a number of popular job schedulers and cluster management tools already support or leverage DCGM today.

What’s Next

What will you do with DCGM or DCGM-enabled tools? We've only scratched the surface. There are extensive resources on how to use DCGM and/or how it is integrated with other tools; NVIDIA's DCGM documentation and blog posts are a good place to start.

The post NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management appeared first on Microway.

One-shot Learning Methods Applied to Drug Discovery with DeepChem https://www.microway.com/hpc-tech-tips/one-shot-learning-methods-applied-drug-discovery-deepchem/ Wed, 26 Jul 2017 14:01:15 +0000

Experimental data sets for drug discovery are sometimes limited in size, due to the difficulty of gathering this type of data. Drug discovery data sets are expensive to obtain, and some are the result of clinical trials, which might not be repeatable for ethical reasons. The ClinTox data set, for example, is comprised of data from FDA clinical trials of drug candidates, where some data sets are derived from failures, due to toxic side effects [2]. For cases where training data is scarce, application of one-shot learning methods has demonstrated significantly improved performance over methods consisting only of graphical convolution networks. The performance of one-shot network architectures will be discussed here for several drug discovery data sets, which are described in Table 1.

These data sets, along with one-shot learning methods, have been integrated into the DeepChem deep learning framework, as a result of research published by Altae-Tran, et al. [1]. While data remains scarce for some problem domains, such as drug discovery, one-shot learning methods could provide an important alternative network architecture, which can possibly far outperform methods which use only graphical convolution.

Dataset | Category | Description | Network Type | Number of Tasks | Compounds
Tox21 | Physiology | toxicity | Classification | 12 | 8,014
SIDER | Physiology | side reactions | Classification | 27 | 1,427
MUV | Biophysics | bioactivity | Classification | 17 | 93,127

Table 1. DeepChem drug discovery data sets investigated with one-shot learning.

One-Shot Network Architectures Produce Most Accurate Results When Applied to Small Population Training Sets

The original motivation for investigating one-shot neural networks arose from the fact that humans can learn sufficient representations, given small amounts of data, and then later apply a learned representation to correctly distinguish between objects which have been observed only once. The one-shot network architecture has previously been developed, and applied to image data, with this motivational context in mind [3, 5].

The question arose as to whether an artificial neural network, given a small data set, could similarly learn a sufficient number of features through training, and perform at a satisfactory level. After some period of development, one-shot learning methods have emerged to demonstrate good success [3,4,5,6].

The description provided here of the one-shot approach focuses mostly on methodology, and less on the theoretical and experimental results which support the method. The simplest one-shot method computes a distance weighted combination of support set labels. The distance metric can be defined using a structure called a Siamese network, where two identical networks are used. The first twin produces a vector output for the molecule being queried, while the other twin produces a vector representing an element of the support set. Any difference between the outputs can be interpreted as a dissimilarity measure between the query structure and any particular structure in the support set. A separate dissimilarity measure can be computed for each element in the support set, and then a normalized, weighted representation of the query structure can be determined. For example, if the query structure is significantly less dissimilar to two structures, out of, say, twenty, in the support set, then the weighted representation will be nearly the average of the vectors which represent the two support structures which most resemble the queried structure.
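
A minimal numerical sketch of this simplest approach is shown below. The embedding vectors would normally be produced by the Siamese twins; here they are stand-in arrays, and cosine similarity with a softmax weighting is just one reasonable choice for turning dissimilarities into weights:

import numpy as np

# Sketch of the simplest one-shot prediction: a distance-weighted combination
# of support-set labels. The embeddings would normally come from the twin
# networks; here they are just placeholder vectors.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(query_emb, support_embs, support_labels):
    sims = np.array([cosine_similarity(query_emb, s) for s in support_embs])
    weights = np.exp(sims) / np.exp(sims).sum()    # softmax-normalized similarities
    return float(np.dot(weights, support_labels))  # weighted combination of labels

# Toy example: 20-element support set with binary labels, 8-dimensional embeddings
rng = np.random.RandomState(0)
support_embs = rng.randn(20, 8)
support_labels = np.array([1] * 10 + [0] * 10)
query_emb = rng.randn(8)
print(predict(query_emb, support_embs, support_labels))   # value between 0 and 1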

There are two other one-shot methods which take more complex considerations into account. In the Siamese network one-shot approach, the vector embeddings of both the query structure and each individual support structure are computed independently of the support set. However, it has been shown empirically that, by taking into account the context of all support set elements when computing the vector embeddings of the query and each individual support structure, better one-shot network performance can be realized. This approach is called full context embedding, since the full context of the support set is taken into account when computing every vector embedding. In the full context embedding approach, the embeddings for every support structure are allowed to influence the embedding of the query structure.

The full context embedding approach uses Siamese, i.e. matching, networks like before, but once the embeddings are computed, they are then further processed by Long Short-Term Memory (LSTM) structures. The embeddings, before processing by LSTM structures, will be referred to here as pre-contextualized vectors. The full contextual embeddings for the support structures are produced using an LSTM structure called a bidirectional LSTM (biLSTM), while the full contextual embedding for the query structure is produced by an LSTM structure called an attentional LSTM (attLSTM). An LSTM is a type of recurrent neural network, which can process sequences of input. With the biLSTM, the support set is viewed as a sequence of vectors. A bidirectional LSTM is used, instead of just an LSTM, in order to reduce dependence on the sequence order. This improves model performance because the support set has no natural order. However, not all dependence on sequence order is removed with the biLSTM.

The attLSTM constructs an order-independent full contextual embedded vector of the query structure. The full details of the attLSTM will not be discussed here, beyond saying that both the biLSTM and attLSTM are network elements which interpret some set of structures as a sequence of pre-contextualized vectors, and convert that sequence into a single full context embedded vector. One full context embedded vector is produced for the support set of structures, and one is produced for the query structure.

A further improvement has been made to the one-shot model described here. As mentioned, the biLSTM does not produce an entirely order-independent full context embedding for each pre-contextualized vector corresponding to a support structure. Since the support set does not contain any natural order, any sequence order dependence present in the full context embedded support vector is an unwanted artifact, and will lead to reduced model performance. There is another problem, which is that, in the way they have been defined, the full context embedded vectors of the support structures depend only on the pre-contextualized support vectors, and not on the pre-contextualized query vector. On the other hand, the full context embedded vector of the query structure depends on both its own pre-contextualized vector, and the pre-contextualized vectors of the support set. This asymmetry indicates that some additional information is not accounted for in the model, and that performance could be improved if this asymmetry could be removed, and if the order dependence of the full context embedded support vectors could also be removed.

To address this problem, a new LSTM model was developed by Altae-Tran, et al., called the Iteratively Refined LSTM (IterRefLSTM). The full details of how the IterRefLSTM model operates are beyond the scope of this discussion. A full explanation can be found in Altae-Tran, et al. Put briefly, the full contextual embedded vectors of the support and query structures are co-evolved, in an iterative process which uses an attLSTM element, and results in removal of order-dependence in the full contextual embedding for the support, as well as removal of the asymmetry in dependency between the full context embedded vectors of the support and query structures.

A brief summary of the one-shot network architectures discussed is presented in Table 2.

Architecture | Description
Siamese Networks | score comparison, dissimilarity measure
Attention LSTM (attLSTM) | better extraction of prior data, contains order-dependence of input data
Iterative Refinement LSTMs (IterRefLSTM) | similar to attLSTM, but removes all order dependence of data by iteratively evolving the query and support embeddings simultaneously in an iterative loop

Table 2. One-shot networks used for investigating low-population biological assay data sets.

Computed One-Shot Performance Metrics Compared to Published Values

A comparison of independently computed values is made here with published values from Altae-Tran, et al. [1]. Quantitative results for classification tasks associated with the Tox21, SIDER, and MUV datasets were obtained by evaluating the area under the receiver operating characteristic curve (read more on AUROC). For datasets having more than one task, the median of the performance metric over all tasks in the held-out data sets is reported. A k-fold cross-validation was then done, with k=4. The mean of performances across all cross-validations was then taken, and reported as the performance measure for the data set. A discussion of the standard deviation is given further below.

Since the tasks for Tox21, SIDER, and MUV are all classification tasks for binary assay data, with positive and negative results from a clinical trial, for example, the performance values, as mentioned, are reported with the AUROC metric. With AUROC, a value of 0.5 indicates no predictive power, while a result of 1.0 indicates that every outcome in the held-out data set has been predicted correctly [Kennis Research, 9]. A value less than 0.5 can be interpreted as a value of 1.0 minus the metric value. This operation corresponds to inverting the model, where True is now False, and vice versa. This way, a metric value between 0.5 and 1.0 can always be realized. Each data set performance is reported with a standard deviation, containing dependence on the dispersion of metric values across classification tasks, and then k cross-validations.
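
A short sketch of this bookkeeping, using scikit-learn's roc_auc_score, is shown below. The per-task labels and model scores are placeholders; the point is the inversion of any model scoring below 0.5 and the median taken across tasks:

import numpy as np
from sklearn.metrics import roc_auc_score

# Compute AUROC for each task, invert any model scoring below 0.5 (1 - AUROC),
# and report the median and dispersion across tasks. y_true/y_score are
# placeholders for the held-out labels and model outputs of each task.
def dataset_auroc(y_true_per_task, y_score_per_task):
    aurocs = []
    for y_true, y_score in zip(y_true_per_task, y_score_per_task):
        auc = roc_auc_score(y_true, y_score)
        aurocs.append(max(auc, 1.0 - auc))   # inverted model if below 0.5
    return np.median(aurocs), np.std(aurocs)

# Toy example with two tasks of random data
rng = np.random.RandomState(0)
tasks_true = [rng.randint(0, 2, 100) for _ in range(2)]
tasks_score = [rng.rand(100) for _ in range(2)]
print(dataset_auroc(tasks_true, tasks_score))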

Our computed values match well with those published by Altae-Tran, et al. [1], and essentially confirm their published performance metric values from their ACS Central Science publication. The first and second rows in Table 3 show classification performance for RF and GC, respectively, as computed by Altae-Tran, et al. Single-task GC and RF results are presented as a baseline of comparison to one-shot methods.

The use of k-fold cross-validation improves the estimate of how the model would perform if trained on all of the data, and not just a training subset with a portion reserved for testing. Since we cannot directly measure the performance of a model trained on the full data set (no testing data would remain), the k-fold cross-validation is used to provide a best guess of a performance estimate we cannot see (until final deployment), where a deployed network would be trained on all of the data.

 | Tox21 | SIDER | MUV
Random Forests‡,⁑ | 0.539 ± 0.049 | 0.557 ± 0.059 | 0.751 ± 0.062Ω
Graphical Convolution‡,⁑ | 0.625 ± 0.036 | 0.482 ± 0.038 | 0.583 ± 0.061
Siamese Networks | 0.783 ± 0.009 | 0.660 ± 0.088 | 0.500 ± 0.043
AttLSTM | 0.759 ± 0.007 | 0.607 ± 0.080 | 0.500 ± 0.058
IterRefLSTM | 0.807 ± 0.003Ω | 0.751 ± 0.002Ω | 0.533 ± 0.051

Table 3. AUROC performance metric values for each one-shot method, plus the random forests (RF) and graphical convolution (GC) methods. Metric values were measured across Tox21, SIDER, and MUV test data sets, using a trained modelΦ. Randomness arises from using a trained model to evaluate the AUROC metric on a test set. First a support setΨ, S, of 20 data points is chosen from the set of data points for a test task. The metric is then evaluated over the remaining points in a test task data set. This process is repeated 20 times for every test task in the data set. The mean and standard deviation for all AUROC measures generated in this way are computed.

Finally, for each data set (Tox21, SIDER, and MUV), the reported performance result is actually the median performance value across all test tasks for a data set. This indirectly implies that the individual metric performances on individual tasks are unimportant, and that they more or less tend to all do well or poorly together, without too much variance across tasks. However, a median measure can mask outliers, where performance on one task might be very bad. If outliers can be removed for rational reasons, then using the median across task performance can be an effective way of removing the influence of outliers.


‡ The performance measures for RF and GC were computed with one-fold cross-validation (i.e. no cross-validation). This is because the RF and GC scripts available with our current version of DeepChem (July, 2017) are written for performing only one-fold validation with these models.

⁑ The variances of k-fold cross-validation performance estimates were determined by pooling all performance values, and then finding the median variance of the entire pool. More complex techniques exist for estimating the variance from a cross-validated set, and the reader is invited to investigate other methods [Nadeau, et al.].

Ω This performance measure by IterRefLSTM on the Tox21 data set is the only performance which rates as good. IterRefLSTM performance on the SIDER dataset rates as fair, while RF on MUV also rates as only fair.

Φ Since network inference (predicting outcomes) can be done much faster than network training, due to the computationally expensive backpropagation algorithm, only a batch, B, of data points, and not the entire training data (excluding support data), is selected for training. A support set, S, of 20 data points, along with a batch of queries, B, of 128 data points, is selected for each training set task, in each of the held-out training sets, for a given episode of training.

A number of training episodes equal to 2000 * ntrain is performed, with one step of minimization performed by the ADAM optimizer per episode [11]. ntrain is the number of test tasks in a test set. After the total number of training episodes has been computed, an intermediate information structure for the attLSTM and IterRefLSTM models, called the embedding vector set, described earlier, is produced. It should be noted that the same size of support set, S, is also used during model testing on the held-out testing tasks.

Ψ Every support set, S, whether selected during training or testing, was chosen so that it contained 10 positive and 10 negative samples for the task in question. In the full study done in [1], however, variations on the number of positive and negative samples are explored for the support set, S. Investigators found that by sampling more data points in S, rather than increasing the number of backpropagation iterations, better model performance resulted.


It should be noted that, for a support set of 10 positive and 10 negative assay results, our computed results for the Siamese method on MUV do not show any predictive performance. The results published by Altae-Tran, however, indicate marginal, but poor, predictive ability, with an AUROC metric value of 0.601 ± 0.041.

Our metric was computed several times on both a Tesla P100 16GB GPU and a Tesla M40 GPU, but we found, with this particular support set, that the Siamese model has no predictive power, with a metric value of 0.500 ± 0.043 (see Table 3). Our other computed results for the AttLSTM and IterRefLSTM concur with published results, which show that neither one-shot learning method has predictive power on MUV data, with a support set containing 10 positive and 10 negative assay results.

The Iterative Refinement LSTM shows a narrower dispersion of scores than other one-shot learning models. This result agrees with published standard deviation values for LSTM in Altae-Tran, et al. [1].

Speedup factors are determined by comparing runtimes on the NVIDIA Tesla P100 GPU to runtimes on the Tesla M40 GPU, and are presented in Tables 4 and 5. Speedup factors are found to be not as pronounced for one-shot methods, and an explanation of the speedup results is presented. The approach for training and testing one-shot methods is described, as they involve some extra considerations which do not apply to graphical convolution.

 | Tesla P100: Tox21 | SIDER | MUV | Tesla M40: Tox21 | SIDER | MUV
Random Forests (CPU) | 25 | 37 | 84 | 24 | 37 | 83
Graphical Convolution | 38 | 79 | 64 | 41 | 100 | 720
Siamese | 857 | 2,180 | 1,464 | 956 | 2,407 | 1,617
AttLSTM | 933 | 2,405 | 1,591 | 1,041 | 2,581 | 1,725
IterRefLSTM | 1,006 | 2,511 | 1,680 | 1,101 | 2,721 | 1,834

Table 4. Runtimes for each model on the NVIDIA Tesla M40 and Tesla P100 16GB PCIe GPUs. All runtimes are in seconds.

Random Forests runs entirely on the CPU, so its times reflect CPU runtimes (marked "(CPU)" in Table 4 above). They are listed for reference only and are not to be considered when determining GPU speedup factors.

A quick inspection of the results in Table 3 shows that the one-shot methods perform better on the Tox21 and SIDER data sets, but not on the MUV data. A reason for the poor performance of one-shot methods on MUV data is proposed below.

Limitations of One-Shot Networks

Compared to previous methods, one-shot networks demonstrate extraction of more information from the prior (support) data than RF or GC, but with a limitation. One-shot methods are only successful when data in the held-out testing set is sufficiently similar to data seen during training. Networks trained using one-shot methods do not perform well when trying to classify data that is too dissimilar from the sample data used for training. In the context of drug discovery, this problem is encountered when trying to apply one-shot learning to the Maximum Unbiased Validation (MUV) dataset, for example [10]. Benchmark results show that all three one-shot learning methods explored here do little better than pure chance when making classification predictions with MUV data (see Table 3).

The MUV dataset contains around 93,000 compounds, and represents a diverse collection of molecular scaffolds, compared to Tox21 and SIDER. One-shot methods do not perform as well on this data set, probably because there is less structural similarity between the elements of the MUV dataset, compared to Tox21 and SIDER. One-shot networks require some amount of structural similarity within the data set in order to extrapolate from limited data, and correctly classify new, but similar, compounds.

A metric of self-similarity within a data set could be computed as a data set size-independent measure, where every element is compared to every other element, and some similarity measure is evaluated, such as a cosine distance. The similarity measure can be summed over all unique comparisons, and then normalized by dividing by the number of unique comparisons between the N elements in the set.
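
A sketch of such a measure is shown below, using cosine similarity averaged over all unique pairs. The feature matrix is a placeholder for whatever vectorized representation (fingerprints or embeddings) one chooses:

import numpy as np

# Sketch of the self-similarity measure described above: average cosine
# similarity over all unique pairs of elements in a data set. `features` is
# a placeholder N x d matrix, one row per compound fingerprint or embedding.
def self_similarity(features):
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T                  # cosine similarity matrix
    n = len(features)
    upper = sims[np.triu_indices(n, k=1)]     # unique pairwise comparisons only
    return upper.mean()                       # normalized by number of unique pairs

rng = np.random.RandomState(0)
print(self_similarity(rng.rand(100, 32)))     # toy example with random features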

 | Tox21 | SIDER | MUV
GC | 1.079 | 1.266 | 11.25α
Siamese | 1.116λ | 1.104 | 1.105
AttLSTM | 1.116λ | 1.116 | 1.084
IterRefLSTM | 1.094λ | 1.084 | 1.092

Table 5. Speedup factors, comparing the Tesla P100 16GB GPU to the Tesla M40. All speedups are Tesla M40 runtimes divided by Tesla P100 runtimes.


α The greatest speedup is observed with GC on the MUV data set (Table 5).

GC also exhibits the most precipitous drop in speedup when transitioning to the one-shot models. Table 4 indicates that the GC model runs faster across all data sets, compared to one-shot methods. This is not surprising, because the purely graphical model is more amenable to GPU acceleration. However, it is crucial to note that GC models perform worse than one-shot models on Tox21 and SIDER, but not MUV. On MUV, GC has nearly no predictive ability (Table 3), compared to the one-shot models, which have absolutely no predictability with MUV.

λ The one-shot networks, while providing substantial improvement in predictive performance, do not seem to show a significant speedup, observing the values for the data sets in the rows for Siamese, attLSTM, or IterRefLSTM. The nearly absent speedup could arise from frequent GPU-to-system memory transfers. Note, however, that although small, there is a slight but consistent improvement in speedup for the one-shot networks on the Tox21 set. The Tox21 data set may therefore require fewer transfers to system memory. The generally flat speedup for one-shot methods may stem from their LSTM elements.


Generally, deep convolutional network models, such as GC, or models which benefit from having a large data set containing structurally diverse groups, such as RF and GC, perform better on the MUV data. RF, for example, shows the best performance, even if very poor. Deep networks have demonstrated that, provided enough layers, they have the information-holding capacity required in order to learn and retain representations for the MUV data. Their information-holding capacity is what enables them to classify between the large number of structurally diverse classes in MUV. It may be the case that the hyperparameters for the graphical convolutional network were not set such that the GC model would yield a poor to fair level of performance on MUV. In their paper, Altae-Tran, et al. stated that hyperparameters for the convolutional networks were not optimized, and that there may be an opportunity to improve performance there [1].

Remarks on Neural Network Information Structure, and How One-Shot Networks are Different

All neural networks require training data in order to develop structure under training pressure. Feature complexity, in image classification networks, becomes stratified, under training pressure, through the network's layers. The lowest layers emerge as edge detectors, with successive layers building upon features from previous layers. The second layer, for example, can build corner detectors, or curved edge detectors, by detecting combinations of simpler edges. Through a buildup of feature complexity, eventually, higher layers can emerge which can detect complex, high-level features such as faces. The combinatorial size of the detectable feature space grows with the number of viable filters (kernels) connecting each layer to the preceding layer. With Natural Language Processing (NLP) networks, layer complexity progresses from sentence features, to paragraphs, then chapters, and finally whole-book vector representations, which consist of succinct thematic summaries of written works.

To reiterate, all networks require information structure, acquired under training pressure, to develop some inner representation, or "belief", about data. Deep networks allow for more diverse structures to be learned, compared to one-shot networks, which are limited in their ability to learn diverse representations. One-shot structural features are designed to improve extraction of information from support data, in order to learn a representation which can be used to extrapolate from a smaller group of similar classes. One-shot methods do not perform as well with MUV, compared to RF, for the reason that they are not designed to produce a useful network from a data set having the level of dissimilarity between molecular scaffolds that is found in MUV.

Transfer Learning with One-Shot Learning Network Architecture

A network trained on the Tox21 data set was evaluated on the SIDER data set. The results, given by the performance metric values shown in Table 6, indicate that the network trained on Tox21 has nearly no predictive power on the SIDER data. This indicates that the performance does not generalize well to new chemical scaffolds, which supports the explanation for why one-shot methods do poorly at predicting the results for the MUV dataset.

 | Siamese | AttLSTM | IterRefLSTM
To SIDER from Tox21 | 0.505 | 0.502 | 0.504

Table 6. Transfer Learning to SIDER from Tox21. These results agree with the performance metric values reported for transfer learning in [1], and support the conclusion that transfer learning between data sets will result in no predictive capability, unless the data sets are significantly similar.

Conclusion

For binary classification tasks associated with small population data sources, one-shot learning methods may provide significantly better results compared to baseline performances of graphical convolution and random forests. The results show that the performance of one-shot learning methods may depend on the diversity of molecular scaffolds in a data set. With MUV, for example, one-shot methods did not extrapolate well to unseen molecular scaffolds. The failure of transfer learning from the Tox21 network to correctly predict SIDER assay outcomes also indicates that training may not be easily generalized across data sets with one-shot networks.

The Iterative Refinement LSTM method developed in [1] demonstrates that LSTMs can generalize to similar experimental assays which are not identical to assays in the data set, but which have some common relation.

References

1.) Altae-Tran, Han, Ramsundar, Bharath, Pappu, Aneesh S., and Pande, Vijay. "Low Data Drug Discovery with One-Shot Learning." ACS Central Science 3.4 (2017): 283-293.
2.) Wu, Zhenqin, et al. "MoleculeNet: A Benchmark for Molecular Machine Learning." arXiv preprint arXiv:1703.00564 (2017).
3.) Hariharan, Bharath, and Ross Girshick. "Low-shot visual object recognition." arXiv preprint arXiv:1606.02819 (2016).
4.) Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop. Vol. 2. 2015.
5.) Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
6.) Wang, Peilu, et al. "A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding." arXiv preprint arXiv:1511.00215 (2015).
7.) Duvenaud, David K., et al. "Convolutional networks on graphs for learning molecular fingerprints." Advances in Neural Information Processing Systems. 2015.
8.) Lusci, Alessandro, Gianluca Pollastri, and Pierre Baldi. "Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules." Journal of Chemical Information and Modeling 53.7 (2013): 1563-1575.
9.) Receiver Operating Curves Applet, Kennis Research, 2016.
10.) Maximum Unbiased Validation (MUV) Chemical Data Set
11.) Kingma, D. and Ba, J. "Adam: a Method for Stochastic Optimization." arXiv preprint: arxiv.org/pdf/1412.6980v8.pdf
12.) University of Nebraska Medical Center online information resource, AUROC
13.) Nadeau, C. and Bengio, Y. "Inference for the Generalization Error."

The post One-shot Learning Methods Applied to Drug Discovery with DeepChem appeared first on Microway.

NVIDIA Tesla P40 GPU Accelerator (Pascal GP102) Up Close https://www.microway.com/hpc-tech-tips/nvidia-tesla-p40-gpu-accelerator-pascal-gp102-up-close/ Tue, 07 Feb 2017 15:58:14 +0000

As NVIDIA's GPUs become increasingly vital to the fields of AI and intelligent machines, NVIDIA has produced GPU models specifically targeted to these applications. The new Tesla P40 GPU is NVIDIA's premier product for deep learning deployments. It is specifically designed for high-speed inference workloads, which means running data through pre-trained neural networks. However, it also offers significant processing performance for projects which do not require 64-bit double-precision floating point capability (many neural networks can be trained using the 32-bit single-precision floating point on the Tesla P40). For those cases, these GPUs can be used to accelerate both the neural network training and the inference.

Highlights of the new Tesla P40 GPU include:

  • Up to 12 TFLOPS single-precision floating-point performance
  • Support for INT8 operations with up to 47 TOPS (ideal for high-speed/high-volume inference)
  • 24GB of GDDR5 GPU memory, with bandwidths up to 346GB/s

PCI-Express Data Transfer Speeds

The Tesla P40 GPUs use the same generation 3.0 PCI-E connectivity as other recent GPUs (such as the Maxwell generation), so you should expect to achieve similar transfer speeds. As shown below, we’re able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node4 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11842.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12899.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			240357.5

Result = PASS

Technical Details of the Tesla P40 GPU

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node4 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Feb  6 12:30:52 2017
Driver Version                      : 367.57

Attached GPUs                       : 4
GPU 0000:02:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0324416xxxxxx
    GPU UUID                        : GPU-16254654-0bd3-8d18-e8fe-d53865xxxxxx
    Minor Number                    : 0
    VBIOS Version                   : 86.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2G610-0000-000
    Inforom Version
        Image Version               : G610.0200.00.03
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B3810DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x11D910DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 22912 MiB
        Used                        : 0 MiB
        Free                        : 22912 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 2 MiB
        Free                        : 32766 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 27 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 12.34 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 544 MHz
        SM                          : 544 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1531 MHz
        Memory                      : 3615 MHz
    Default Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Max Clocks
        Graphics                    : 1531 MHz
        SM                          : 1531 MHz
        Memory                      : 3615 MHz
        Video                       : 1379 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support large numbers of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power consumption limits and monitor the clock speeds (including explanations for any reasons the clocks are running at a lower speed). The list below shows the available clock speeds for the Tesla P40 GPU:

[root@node4 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Feb  6 12:31:56 2017
Driver Version                      : 367.57

Attached GPUs                       : 4
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 3615 MHz
            Graphics                : 1531 MHz
            Graphics                : 1518 MHz
            Graphics                : 1506 MHz
            Graphics                : 1493 MHz
            Graphics                : 1480 MHz
            Graphics                : 1468 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz

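As a brief example, an administrator could lock a GPU to the top supported memory/graphics pair from the list above. The snippet below is a minimal sketch using nvidia-smi (administrative privileges are required, and nvidia-smi -rac restores the default application clocks):

import subprocess

# Lock the application clocks of GPU 0 to one of the supported memory/graphics
# pairs listed above (3615 MHz memory, 1531 MHz graphics). Requires admin
# privileges; `nvidia-smi -rac` restores the default application clocks.
subprocess.run(["nvidia-smi", "-i", "0", "-ac", "3615,1531"], check=True)
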
NVIDIA deviceQuery on Tesla P40 GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P40 GPU accelerators. Take note of the new Compute Capability 6.1, which is what you’ll want to target if you’re compiling your own CUDA code.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P40"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 22913 MBytes (24025956352 bytes)
  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             3615 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P40, Device1 = Tesla P40, Device2 = Tesla P40, Device3 = Tesla P40
Result = PASS

Additional Information on Tesla P40 GPUs

To learn more about the NVIDIA “Pascal” GPU architecture and to compare Tesla P40 with other models in the Tesla product line, read our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

If you're an existing GPU user considering a new deployment, review our Tesla GPU clusters page and our list of GPU servers.

The post NVIDIA Tesla P40 GPU Accelerator (Pascal GP102) Up Close appeared first on Microway.

Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs https://www.microway.com/hpc-tech-tips/deep-learning-benchmarks-nvidia-tesla-p100-16gb-pcie-tesla-k80-tesla-m40-gpus/ Fri, 27 Jan 2017 21:14:49 +0000

Sources of CPU benchmarks, used for estimating performance on similar workloads, have been available throughout the course of CPU development. For example, the Standard Performance Evaluation Corporation has compiled a large set of applications benchmarks, running on a variety of CPUs, across a multitude of systems. There are certainly benchmarks for GPUs, but only during the past year has an organized set of deep learning benchmarks been published. Called DeepMarks, these deep learning benchmarks are available to all developers who want to get a sense of how their application might perform across various deep learning frameworks.

The benchmarking scripts used for the DeepMarks study are published at GitHub. The original DeepMarks study was run on a Titan X GPU (Maxwell microarchitecture), having 12GB of onboard video memory. Here we will examine the performance of several deep learning frameworks on a variety of Tesla GPUs, including the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 12GB GPUs.

Data from Deep Learning Benchmarks

The deep learning frameworks covered in this benchmark study are TensorFlow, Caffe, Torch, and Theano. All deep learning benchmarks were single-GPU runs. The benchmarking scripts used in this study are the same as those found at DeepMarks. DeepMarks runs a series of benchmarking scripts which report the time required for a framework to process one forward propagation step, plus one backpropagation step. The sum of both comprises one training iteration. The times reported are the times required for one training iteration per batch, in milliseconds.
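
The sketch below illustrates that timing methodology in generic terms. The train_step callable is a placeholder for whichever framework's forward+backward update is being measured; the warm-up and iteration counts are arbitrary choices for illustration, not the exact values used by the DeepMarks scripts:

import time

# Generic sketch of the timing methodology: measure the time for one forward
# plus one backpropagation step ("one training iteration") per batch, reported
# in milliseconds. `train_step(batch)` is a placeholder for whichever
# framework's parameter-update call is being benchmarked.
def time_training_iterations(train_step, batches, warmup=10, iterations=100):
    for batch in batches[:warmup]:
        train_step(batch)                        # warm-up, excluded from timing
    start = time.time()
    for i in range(iterations):
        train_step(batches[i % len(batches)])
    elapsed = time.time() - start
    return elapsed / iterations * 1000.0         # msec per training iteration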

To start, we ran CPU-only trainings of each neural network. We then ran the same trainings on each type of GPU. The plot below depicts the ranges of speedup that were obtained via GPU acceleration.

Plot of deep learning benchmark results across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 1. GPU speedup ranges over CPU-only trainings – geometrically averaged across all four framework types and all four neural network types.

If we expand the plot and show the speedups for the different types of neural networks, we see that some types of networks undergo a larger speedup than others.

Plot of deep learning benchmark speedups (with geometric averages) for each network on Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 2. GPU speedups over CPU-only trainings – geometrically averaged across all four deep learning frameworks. The speedup ranges from Figure 1 are uncollapsed into values for each neural network architecture.

If we take a step back and look at the ranges of speedups the GPUs provide, there is a fairly wide range of speedup. The plot below shows the full range of speedups measured (without geometrically averaging across the various deep learning frameworks). Note that the ranges are widened and become overlapped.

Plot of deep learning benchmark results (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 3. Speedup factor ranges without geometric averaging across frameworks. Range is taken across set of runtimes for all framework/network pairs.

We believe that geometric averaging across frameworks (as shown in Figure 1) results in narrower distributions and appears to be a more accurate quality measure than what is shown in Figure 3. However, it is instructive to expand the plot from Figure 3 to show each deep learning framework. Those ranges, as shown below, demonstrate that your neural network training time will strongly depend upon which deep learning framework you select.

Plot of deep learning benchmark results for each framework (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 4. GPU speedups over CPU-only trainings – showing the range of speedups when training four neural network types. The speedup ranges from Figure 3 are uncollapsed into values for each deep learning framework.

As shown in all four plots above, the Tesla P100 PCIe GPU provides the fastest speedups for neural network training. With that in mind, the plot below shows the raw training times for each type of neural network on each of the four deep learning frameworks.

Plot of deep learning benchmark training iteration times for each framework on Tesla P100 16GB PCIe GPUs
Figure 5. Training iteration times (in milliseconds) for each deep learning framework and neural network architecture (as measured on the Tesla P100 16GB PCIe GPU).

We provide more discussion below. For reference, we have listed the measurements from each set of tests.

Tesla P100 16GB PCIe Benchmark Results

 | AlexNet | Overfeat | GoogLeNet | VGG (ver.a) | Speedup Over CPU
Caffe | 80 | 288 | 279 | 393 | (35x ~ 70x speedups)
TensorFlow | 46 | 144 | 253 | 277 | (16x ~ 40x speedups)
Theano | 161 | 482 | 624 | 2,075 | (19x ~ 43x speedups)
cuDNN-fp32 (Torch) | 44 | 107 | 247 | 222 | (33x ~ 41x speedups)
geometric average over frameworks | 71 | 215 | 331 | 473 | (29x ~ 42x speedups)

Table 1: Benchmarks were run on a single Tesla P100 16GB PCIe GPU. Times reported are in msec per batch. The batch size for all training iterations measured for runtime in this study is 128, except for VGG net, which uses a batch size of 64.

Tesla K80 Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)   Speedup Over CPU
Caffe                                   365      1,187       1,236          1,747   9x ~ 15x
TensorFlow                              181        622         979          1,104   4x ~ 10x
Theano                                  515      1,716       1,793            n/a   8x ~ 16x
cuDNN-fp32 (Torch)                      171        379         914            743   9x ~ 12x
geometric average over frameworks       276        832       1,187          1,127   9x ~ 11x

Table 2: Benchmarks were run on a single Tesla K80 GPU chip. Times reported are in msec per batch.

Tesla M40 Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)   Speedup Over CPU
Caffe                                   128        448         468            637   22x ~ 53x
TensorFlow                               82        273         418            498   10x ~ 22x
Theano                                  245        786         963            n/a   17x ~ 28x
cuDNN-fp32 (Torch)                       79        182         433            400   19x ~ 22x
geometric average over frameworks       119        364         534            506   20x ~ 27x

Table 3: Benchmarks were run on a single Tesla M40 GPU. Times reported are in msec per batch.

CPU-only Benchmark Results

Framework                           AlexNet   Overfeat   GoogLeNet   VGG (ver. a)
Caffe                                 4,529     10,350      18,545         14,010
TensorFlow                            1,823      5,275       4,018          7,341
Theano                                5,275     13,579      26,829         38,687
cuDNN-fp32 (Torch)                    1,838      3,604       8,234          9,166
geometric average over frameworks     2,991      7,190      11,326         13,819

Table 4: Benchmarks were run on dual Xeon E5-2690v4 processors in a system with 256GB RAM. Times reported are in msec per batch.

Discussion

When geometric averaging is applied across framework runtimes, a range of speedup values is derived for each GPU, as shown in Figure 1. CPU times are also averaged geometrically across framework type. These results indicate that the greatest speedups are realized with the Tesla P100, with the Tesla M40 ranking second, and the Tesla K80 yielding the lowest speedup factors. Figure 2 shows the range of speedup values by network architecture, uncollapsed from the ranges shown in Figure 1.
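
As an illustration of that averaging, the short sketch below reproduces the AlexNet column from Tables 1 and 4: it computes each framework's speedup of the Tesla P100 over the CPU-only runtime and then takes the geometric mean across frameworks. The values are copied from the tables above; the code is only a worked example, not the script used to generate the plots.

from math import prod

# AlexNet times in msec per batch, copied from Table 4 (CPU-only) and Table 1 (Tesla P100)
cpu_times = {"Caffe": 4529, "TensorFlow": 1823, "Theano": 5275, "Torch": 1838}
gpu_times = {"Caffe": 80,   "TensorFlow": 46,   "Theano": 161,  "Torch": 44}

# Per-framework speedup of the GPU over the CPU
speedups = {fw: cpu_times[fw] / gpu_times[fw] for fw in cpu_times}

# Geometric mean of the speedups across the four frameworks
geo_mean = prod(speedups.values()) ** (1.0 / len(speedups))

for fw, s in speedups.items():
    print("%-10s  %5.1fx" % (fw, s))
print("geometric average: %.0fx" % geo_mean)  # ~42x, the upper end of the P100 range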

The speedup ranges for runtimes not geometrically averaged across frameworks are shown in Figure 3. Here the set of all runtimes corresponding to each framework/network pair is considered when determining the range of speedups for each GPU type. Figure 4 shows the speedup ranges by framework, uncollapsed from the ranges shown in Figure 3. The degree of overlap in Figure 3 suggests that geometric averaging across framework type yields a better measure of GPU performance, with narrower and more distinct ranges resulting for each GPU type, as shown in Figure 1.

The greatest speedups were observed when comparing Caffe forward+backpropagation runtime to CPU runtime when solving the GoogLeNet network model. Caffe generally showed larger speedups than any other framework for this comparison, ranging from 35x to ~70x (see Figure 4 and Table 1). Despite the higher speedups, Caffe does not turn out to be the best-performing framework on these benchmarks (see Figure 5). When comparing runtimes on the Tesla P100, Torch performs best and has the shortest runtimes (see Figure 5). Note that although VGG net tends to be the slowest of all, it does train faster than GoogLeNet when run on the Torch framework (see Figure 5).

The data show that Theano and TensorFlow display similar speedups on GPUs (see Figure 4). Despite the fact that Theano sometimes has larger speedups than Torch, Torch and TensorFlow outperform Theano. While Torch and TensorFlow yield similar performance, Torch performs slightly better with most network/GPU combinations. However, TensorFlow outperforms Torch in most cases for CPU-only training (see Table 4).

Theano is outperformed by all other frameworks, across all benchmark measurements and devices (see Tables 1 – 4). Figure 5 shows the large runtimes for Theano compared to other frameworks run on the Tesla P100. It should be noted that since VGG net was run with a batch size of only 64, compared to 128 with all other network architectures, the runtimes can sometimes be faster with VGG net than with GoogLeNet. See, for example, the runtimes for Torch on GoogLeNet compared to VGG net, across all GPU devices (Tables 1 – 3).

Deep Learning Benchmark Conclusions

The single-GPU benchmark results show that speedups over CPU increase from Tesla K80, to Tesla M40, and finally to Tesla P100, which yields the greatest speedups (Table 5, Figure 1) and fastest runtimes (Table 6).

Range of Speedups, by GPU type

Tesla P100 16GB PCIe   Tesla M40 12GB   Tesla K80
19x ~ 70x              10x ~ 53x        4x ~ 16x

Table 5: Measured speedups for running various deep learning frameworks on GPUs (see Tables 1 – 3)

Fastest Runtime for VGG net, by GPU type

Tesla P100 16GB PCIe   Tesla M40 12GB   Tesla K80
222                    408              743

Table 6: Absolute best runtimes (msec / batch) across all frameworks for VGG net (ver. a). The Torch framework provides the best VGG runtimes, across all GPU types.

The results show that of the tested GPUs, Tesla P100 16GB PCIe yields the absolute best runtime, and also offers the best speedup over CPU-only runs. Regardless of which deep learning framework you prefer, these GPUs offer valuable performance boosts.

Benchmark Setup

Microway’s GPU Test Drive compute nodes were used in this study. Each is configured with 256GB of system memory and dual 14-core Intel Xeon E5-2690v4 processors (with a base frequency of 2.6GHz and a Turbo Boost frequency of 3.5GHz). Identical benchmark workloads were run on the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 GPUs. The batch size is 128 for all runtimes reported, except for VGG net (which uses a batch size of 64). All deep learning frameworks were linked to the NVIDIA cuDNN library (v5.1), instead of their own native deep network libraries. This is because linking to cuDNN yields better performance than using the native library of each framework.

When running the Theano benchmarks, slightly better runtimes were obtained when CNMeM, a CUDA memory manager, was used to manage the GPU's memory. By setting lib.cnmem=0.95, CNMeM is allowed to manage 95% of the GPU's memory:
THEANO_FLAGS='floatX=float32,device=gpu0,lib.cnmem=0.95,allow_gc=True' python ...

Notes on Tesla M40 versus Tesla K80

The data demonstrate that Tesla M40 outperforms Tesla K80. When geometrically averaging runtimes across frameworks, the speedup of the Tesla K80 ranges from 9x to 11x, while for the Tesla M40, speedups range from 20x to 27x. The same relationship exists when comparing ranges without geometric averaging. This result is expected, considering that the Tesla K80 card consists of two separate GK210 GPU chips (connected by a PCIe switch on the GPU card). Since the benchmarks here were run on single GPU chips, the benchmarks reflect only half the throughput possible on a Tesla K80 GPU. If running a perfectly parallel job, or two separate jobs, the Tesla K80 should be expected to approach the throughput of a Tesla M40.

Singularity Containers

Singularity is a new type of container designed specifically for HPC environments. Singularity enables the user to define an environment within the container, which might include customized deep learning frameworks, NVIDIA device drivers, and the CUDA 8.0 toolkit. The user can copy and transport this container as a single file, bringing their customized environment to a different machine where the host OS and base hardware may be completely different. The workflow inside the container executes in the host's OS environment just as it does in the container's internal environment. The workflow is pre-defined inside of the container, including any necessary library files, packages, configuration files, environment variables, and so on.

In order to facilitate benchmarking of four different deep learning frameworks, Singularity containers were created separately for Caffe, TensorFlow, Theano, and Torch. Given its simplicity and powerful capabilities, you should expect to hear more about Singularity soon.
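
As a rough sketch of how such a container might be driven from a job script, the snippet below launches a benchmark inside a Singularity image using singularity exec. The image name and benchmark command are hypothetical placeholders; substitute your own container and script.

import subprocess

# Hypothetical image and benchmark command; substitute your own container
# and the benchmarking script you wish to run inside it.
image = "tensorflow-cudnn.img"
command = ["python", "benchmark_alexnet.py", "--batch-size", "128"]

# 'singularity exec <image> <command>' runs the command inside the container,
# using the frameworks and libraries packaged within the image.
result = subprocess.run(["singularity", "exec", image] + command,
                        capture_output=True, text=True)

print(result.stdout)
if result.returncode != 0:
    print("Benchmark failed:", result.stderr)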

References

DeepMarks
Deep Learning Benchmarks published on GitHub

Singularity
Containers for Full User Control of Environment

AlexNet
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

Overfeat
Sermanet, Pierre, et al. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).

GoogLeNet
Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

VGG Net
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

The post Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs appeared first on Microway.

Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/ https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/#comments Thu, 26 Jan 2017 14:41:45 +0000 https://www.microway.com/?p=8492 The new NVIDIA Tesla P100 GPUs are available with both PCI-Express and NVLink connectivity. How do these two types of connectivity compare? This post provides a rundown of NVLink vs PCI-E and explores the benefits of NVIDIA’s new NVLink technology. Considering the variety of options for Tesla P100 GPUs, you may wish to review our […]

The post Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers appeared first on Microway.

The new NVIDIA Tesla P100 GPUs are available with both PCI-Express and NVLink connectivity. How do these two types of connectivity compare? This post provides a rundown of NVLink vs PCI-E and explores the benefits of NVIDIA’s new NVLink technology.

[Photo: NVIDIA Tesla P100 NVLink GPUs in an OpenPOWER server]

Considering the variety of options for Tesla P100 GPUs, you may wish to review our other recent posts:

Primary considerations when comparing NVLink vs PCI-E

On systems with x86 CPUs (such as Intel Xeon), the connectivity to the GPU is only through PCI-Express (although the GPUs connect to each other through NVLink). On systems with POWER8 CPUs, the connectivity to the GPU is through NVLink (in addition to the NVLink between GPUs).

Nevertheless, the performance characteristics of the GPU itself (GPU cores, GPU memory, etc) do not vary. The Tesla P100 GPU itself will be performing at the same level. It’s the data flow and total system throughput that will determine the final performance for your workload. To review:

  • Full NVLink connectivity is only available with IBM POWER8 CPUs (not x86 CPUs)
  • GPU-to-GPU NVLink connectivity (without CPU-to-GPU) is available with x86 CPUs
  • Internal performance of an NVIDIA Tesla P100 SXM2 GPU will not vary between x86 and POWER8

With that in mind, let’s compare their throughput.

Tesla P100 with NVLink on OpenPOWER

The NVLink connections on Tesla P100 GPUs provide a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, those links are made up of several bricks, which can be split up to connect to a number of other devices. For example, one GPU might dedicate 40GB/s for a link to a CPU and 40GB/s for a link to a nearby GPU.

Device <-> Device NVLink Performance

Below is the output from NVIDIA's GPU peer-to-peer (P2P) utility, which is included with CUDA 8.0. The results summarize the throughput (in gigabytes per second) and latency (in microseconds) when sending messages between pairs of Tesla P100 GPUs in our OpenPOWER system.

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided in half. One link goes to a POWER8 CPU and one link goes to the adjacent P100 GPU (see diagram below).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:2
Device: 1, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:3
Device: 2, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:a
Device: 3, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:b

...

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 457.93  35.30  20.37  20.40
     1  35.30 454.78  20.16  20.14
     2  20.19  20.16 454.56  35.29
     3  18.36  18.42  35.29 454.07

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.99   7.92  15.56  15.43
     1   8.06   5.00  15.40  15.40
     2  15.47  15.52   5.04   8.07
     3  15.43  15.49   8.04   4.97

As the results show, each 40GB/s Tesla P100 NVLink provides ~35GB/s in practice. Communications between GPUs attached to different CPUs achieve ~20GB/s. Latency between GPUs is 8~16 microseconds. The results were gathered on our 2U OpenPOWER GPU server with Tesla P100 NVLink GPUs, which is available to benchmark in our Test Drive cluster. The architectural design of this particular platform is:

[Block diagram of the 2U Microway OpenPOWER GPU server with Tesla P100 NVLink GPUs]

Device <-> Device PCI-E Performance

A similar test, run on GPUs connected by standard PCI-Express, will result in the following performance:

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 452.19  10.19  10.73  10.74
     1  10.19 450.04  10.76  10.75
     2  10.91  10.90 450.94  10.21
     3  10.90  10.91  10.18 450.95

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.22   7.86  16.90  17.05
     1   7.85   3.21  17.08  17.22
     2  16.32  16.37   3.07   7.85
     3  16.26  16.35   7.84   3.07

The latencies between GPUs are about the same (although latency is higher when reaching GPUs attached to the remote CPU). However, transfer bandwidth is significantly higher for NVLink vs PCI-E (two to three times higher). This increased throughput gives NVLink an advantage for fine-grained applications and others which send data between GPUs.
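
To put the two sets of measurements side by side, the short sketch below compares the measured peer-to-peer throughput against the theoretical per-link figures. The measured values are copied from the matrices above; the PCI-E peak of ~15.75GB/s for an x16 gen3 link is an assumed reference figure, and the calculation is only illustrative.

# Measured unidirectional P2P bandwidth (GB/s), copied from the matrices above
nvlink_measured = 35.3   # two P100s joined by a direct 40GB/s NVLink
pcie_measured   = 10.7   # two P100s communicating over PCI-Express

nvlink_peak = 40.0       # one 40GB/s NVLink (theoretical)
pcie_peak   = 15.75      # PCI-E x16 gen3 (assumed theoretical reference)

print("NVLink efficiency: %.0f%% of peak" % (100 * nvlink_measured / nvlink_peak))
print("PCI-E efficiency:  %.0f%% of peak" % (100 * pcie_measured / pcie_peak))
print("Direct NVLink vs PCI-E: %.1fx higher GPU-to-GPU throughput"
      % (nvlink_measured / pcie_measured))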

NVLink vs PCI-E: Host <-> Device Performance

CPU-to-GPU data transfers occur whenever data must be transferred into or out of the GPU. These are typically called host-to-device and device-to-host transfers. Traditional systems with x86 CPUs are only able to communicate with the GPUs over PCI-Express, which provides lower throughput. Our OpenPOWER systems provide full NVLink connectivity to the GPUs. Here’s the achieved performance:

Host <-> Device across NVLink

[root@openpower8 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			33236.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			32322.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			448515.9

Result = PASS

Host <-> Device across PCI-E

A similar test, run on an x86 system with GPUs connected by PCI-Express, will result in the following performance:

...

 Device 0: Tesla P100-PCIE-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11658.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12882.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			446125.2

Just as with the GPU-to-GPU transfers, we see that NVLink enables much faster performance. Remember that this type of transfer occurs whenever you load a dataset into main memory and then process some of that data on the GPUs. These transfer times are often a factor in overall application performance, so a 3X speedup is welcome. This increased performance may also enable applications which were previously too data-movement-intensive.

Finally, consider that NVIDIA CUDA 8.0 (together with the Tesla P100 GPUs) provides full support for Unified Memory. You will be able to load datasets larger than GPU memory and let the system automatically manage data movement. In other words, the size of GPU memory no longer limits the size of your jobs. On such runs, having a wider pipe between CPU and GPU memory is of immense importance.
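
A back-of-the-envelope estimate using the measured host-to-device rates above shows what that wider pipe means when staging a large dataset. This is only an estimate derived from the bandwidthTest numbers, not a measured training run, and the dataset size is a hypothetical example:

# Host-to-device bandwidth measured by bandwidthTest above (MB/s converted to GB/s)
nvlink_h2d_gbs = 33236.9 / 1000.0   # OpenPOWER system with NVLink to the GPU
pcie_h2d_gbs   = 11658.4 / 1000.0   # x86 system with PCI-E x16 gen3 to the GPU

dataset_gb = 64                     # hypothetical dataset larger than GPU memory

print("NVLink: %.1f seconds to stage %d GB" % (dataset_gb / nvlink_h2d_gbs, dataset_gb))
print("PCI-E:  %.1f seconds to stage %d GB" % (dataset_gb / pcie_h2d_gbs, dataset_gb))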

NVIDIA deviceQuery on OpenPOWER server with Tesla P100 GPUs and NVLink

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 GPU accelerators with NVLink. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code. Also note that in this platform there are three DMA copy engines per GPU. A short sketch for querying these properties from a script follows the listing.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   2 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

  [...]

> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS
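
As referenced above, the compute capability and copy-engine count can also be confirmed from a script. The sketch below assumes the PyCUDA package is installed; it is not part of the CUDA samples shown here.

import pycuda.driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    major, minor = dev.compute_capability()
    engines = dev.get_attribute(drv.device_attribute.ASYNC_ENGINE_COUNT)
    print("%s: compute capability %d.%d (compile with -arch=sm_%d%d), %d copy engine(s)"
          % (dev.name(), major, minor, major, minor, engines))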

How to move forward – GPU systems with Host-to-Device NVLink

Due to the requirement for the new high-speed NVLink connection to the CPU, there is currently only one server on the market with both Host-to-Device and Device-to-Device NVLink connectivity. This system, leveraging IBM's POWER8 CPUs and innovation from the OpenPOWER foundation (including NVIDIA and Mellanox), began shipments in fall 2016. Please contact us to learn more, or read about this OpenPOWER server. Academic discounts are available.

To learn more about the available NVIDIA Tesla “Pascal” GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article:

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

The post Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers appeared first on Microway.

NVIDIA Tesla P100 NVLink 16GB GPU Accelerator (Pascal GP100 SXM2) Up Close https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-nvlink-16gb-gpu-accelerator-pascal-gp100-sxm2-close/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-nvlink-16gb-gpu-accelerator-pascal-gp100-sxm2-close/#comments Wed, 18 Jan 2017 13:10:46 +0000 https://www.microway.com/?p=8398 The NVIDIA Tesla P100 NVLink GPUs are a big advancement. For the first time, the GPU is stepping outside the traditional “add in card” design. No longer tied to the fixed specifications of PCI-Express cards, NVIDIA’s engineers have designed a new form factor that best suits the needs of the GPU. With their SXM2 design, […]

The post NVIDIA Tesla P100 NVLink 16GB GPU Accelerator (Pascal GP100 SXM2) Up Close appeared first on Microway.

The NVIDIA Tesla P100 NVLink GPUs are a big advancement. For the first time, the GPU is stepping outside the traditional “add in card” design. No longer tied to the fixed specifications of PCI-Express cards, NVIDIA’s engineers have designed a new form factor that best suits the needs of the GPU. With their SXM2 design, NVIDIA can run GPUs to their full potential.

One of the biggest changes this allows is the NVLink interconnect, which allows GPUs to operate beyond the restrictions of the PCI-Express bus. Instead, the GPUs communicate with one another over this high-speed link. Additionally, these new “Pascal” architecture GPUs bring improvements including higher performance, faster connectivity, and more flexibility for users & programmers.

[Photo: close-up of the NVIDIA Tesla P100 NVLink GPU]

There is variety in the new line-up of GPU products. For the Tesla P100 GPU model, there are three separate paths to be considered:

Highlights of the new Tesla P100 NVLink GPUs include:

  • Up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance
  • 16GB of on-die HBM2 CoWoS GPU memory, with bandwidths up to 732GB/s
  • 80GB/s NVLink between GPUs boosts bandwidth between the Tesla P100 GPUs
  • High-speed, on-die GPU memory provides a 3X improvement over older GPUs
  • Pascal Unified Memory allows applications to directly access the memory of all GPUs and all of system memory

Improved Data Transfer Speeds

The NVLink connection on Tesla P100 GPUs has a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, that connectivity is only between GPUs. The GPUs still communicate via PCI-Express when transferring data to and from the host (via PCI-E x16 generation 3.0). The high-speed NVLink connection is only for data transfers directly between the GPUs.

Device <-> Device Tesla P100 NVLink Performance

Below is a section of output from NVIDIA’s GPU peer-to-peer (P2P) utility, which is included with CUDA 8.0. The results summarize the throughput (in gigabytes per second) and latency (in microseconds) when sending messages between any pair of GPUs.

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided up so that connections could be made to the three other GPUs. The links are divided such that each GPU has two 20GB/s links and one 40GB/s link (see diagram below).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla P100-SXM2-16GB, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla P100-SXM2-16GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla P100-SXM2-16GB, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla P100-SXM2-16GB, pciBusID: 85, pciDeviceID: 0, pciDomainID:0

...

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 449.69  18.45  18.45  36.72
     1  18.44 450.92  36.70  18.44
     2  18.45  36.70 450.37  18.44
     3  36.71  18.44  18.44 447.34

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.66   9.25   9.31   9.67
     1   9.49   3.65  10.04   9.05
     2   9.85  10.13   3.13   9.79
     3  10.06  11.41   9.97   3.54

As the results show, a 20GB/s Tesla P100 NVLink will provide ~18GB/s in practice. A 40GB/s Tesla P100 NVLink will provide ~36GB/s. Latency between GPUs is 9~10 microseconds. The results were gathered on our 1U NumberSmasher Server with four Tesla P100 NVLink GPUs, which is also available in our Test Drive cluster. The architectural design of this particular platform is:

[Block diagram: NumberSmasher 1U NVLink server with Tesla P100 GPUs (SYS-1028GQ-TXR)]

Host <-> Device Performance

Transfers between system memory and the GPU are still via PCI-Express and will perform similarly to previous-generation “Kepler” and “Maxwell” GPUs. With Tesla P100, you will be able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node2 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11463.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12868.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			446271.0

Result = PASS

Technical Details

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla P100 GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node2 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Tue Dec  6 16:30:58 2016
Driver Version                      : 367.48

Attached GPUs                       : 1
GPU 0000:06:00.0
    Product Name                    : Tesla P100-SXM2-16GB
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032311609xxxx
    GPU UUID                        : GPU-70ba5857-9613-1213-c5f5-3b201233xxxx
    Minor Number                    : 0
    VBIOS Version                   : 86.00.26.00.02
    MultiGPU Board                  : No
    Board ID                        : 0x600
    GPU Part Number                 : 900-2H403-0000-000
    Inforom Version
        Image Version               : H403.0201.00.04
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x06
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x15F910DE
        Bus Id                      : 0000:06:00.0
        Sub System Id               : 0x116B10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 16276 MiB
        Used                        : 0 MiB
        Free                        : 16276 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 40 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 34.89 W
        Power Limit                 : 300.00 W
        Default Power Limit         : 300.00 W
        Enforced Power Limit        : 300.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 405 MHz
        SM                          : 405 MHz
        Memory                      : 715 MHz
        Video                       : 835 MHz
    Applications Clocks
        Graphics                    : 1480 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1328 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1480 MHz
        SM                          : 1480 MHz
        Memory                      : 715 MHz
        Video                       : 1480 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support a large number of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power consumption limits and monitor the clock speeds (including explanations for any reasons the clocks are running at a lower speed). A small scripted example of these controls follows the supported-clocks listing below.

[root@node2 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Tue Dec  6 16:39:20 2016
Driver Version                      : 367.48

Attached GPUs                       : 4
GPU 0000:06:00.0
    Supported Clocks
        Memory                      : 715 MHz
            Graphics                : 1480 MHz
            Graphics                : 1468 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz
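
As referenced above, here is a minimal sketch of scripting those administrative controls. The clock pair and power cap are example values taken from the listings above (715MHz memory with 1328MHz graphics, and a 250W cap within the 150W to 300W limits); applying them requires root privileges, and the supported values may differ on your GPUs.

import subprocess

def run(cmd):
    # Run an nvidia-smi command and print its output (requires root privileges)
    print("$ " + " ".join(cmd))
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

gpu = "0"

# Pin the application clocks to a supported memory,graphics pair (in MHz)
run(["nvidia-smi", "-i", gpu, "-ac", "715,1328"])

# Cap power draw at 250W (must fall between the Min and Max Power Limits)
run(["nvidia-smi", "-i", gpu, "-pl", "250"])

# Review the resulting clocks, power settings, and any throttle reasons
run(["nvidia-smi", "-i", gpu, "-q", "-d", "CLOCK,POWER,PERFORMANCE"])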

NVIDIA deviceQuery on Tesla P100 NVLink 16GB GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 NVLink GPU accelerators. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            405 MHz (0.41 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

  [...]

> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, 
Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS

Additional Information on Tesla P100 NVLink GPUs

To learn more about the available P100 GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article:

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

Due to their novel design, Tesla P100 NVLink GPUs cannot be installed into existing GPU systems. Platforms with the NVLink-connected SXM2 sockets are required. For several options, have a look at our list of P100 GPU-accelerated systems. You may also wish to review our post on PCI-Express connected Tesla P100 GPUs.

[Photo: back side of the NVIDIA Tesla P100 NVLink GPU]

The post NVIDIA Tesla P100 NVLink 16GB GPU Accelerator (Pascal GP100 SXM2) Up Close appeared first on Microway.
