Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs

Sources of CPU benchmarks, used for estimating performance on similar workloads, have been available throughout the course of CPU development. For example, the Standard Performance Evaluation Corporation (SPEC) has compiled a large set of application benchmarks, running on a variety of CPUs, across a multitude of systems. There are certainly benchmarks for GPUs, but only during the past year has an organized set of deep learning benchmarks been published. Called DeepMarks, these deep learning benchmarks are available to all developers who want to get a sense of how their application might perform across various deep learning frameworks.

The benchmarking scripts used for the DeepMarks study are published at GitHub. The original DeepMarks study was run on a Titan X GPU (Maxwell microarchitecture), having 12GB of onboard video memory. Here we will examine the performance of several deep learning frameworks on a variety of Tesla GPUs, including the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 12GB GPUs.

Data from Deep Learning Benchmarks

The deep learning frameworks covered in this benchmark study are TensorFlow, Caffe, Torch, and Theano. All deep learning benchmarks were single-GPU runs. The benchmarking scripts used in this study are the same as those found at DeepMarks. DeepMarks runs a series of benchmarking scripts which report the time required for a framework to process one forward propagation step, plus one backpropagation step. The sum of both comprises one training iteration. The times reported are the times required for one training iteration per batch, in milliseconds.
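
As a rough sketch of how such a measurement works, each framework’s training step can be timed with a simple wall-clock loop. In the Python sketch below, train_step is a placeholder for one forward+backward pass on a batch in any of the frameworks:

import time

def time_training_iteration(train_step, n_warmup=10, n_iters=100):
    """Return the mean time (msec) for one forward+backward pass over a batch."""
    for _ in range(n_warmup):   # warm-up excludes one-time compilation/allocation costs
        train_step()
    start = time.time()
    for _ in range(n_iters):
        train_step()            # one training iteration: forward + backward on a batch
    return (time.time() - start) / n_iters * 1000.0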

To start, we ran CPU-only trainings of each neural network. We then ran the same trainings on each type of GPU. The plot below depicts the ranges of speedup that were obtained via GPU acceleration.

Plot of deep learning benchmark results across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 1. GPU speedup ranges over CPU-only trainings – geometrically averaged across all four framework types and all four neural network types.

If we expand the plot and show the speedups for the different types of neural networks, we see that some types of networks undergo a larger speedup than others.

Plot of deep learning benchmark speedups (with geometric averages) for each network on Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 2. GPU speedups over CPU-only trainings – geometrically averaged across all four deep learning frameworks. The speedup ranges from Figure 1 are uncollapsed into values for each neural network architecture.

If we take a step back and consider all of the speedups the GPUs provide, we see fairly wide variation. The plot below shows the full range of speedups measured (without geometrically averaging across the various deep learning frameworks). Note that the ranges widen and begin to overlap.

Plot of deep learning benchmark results (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 3. Speedup factor ranges without geometric averaging across frameworks. Range is taken across set of runtimes for all framework/network pairs.

We believe the ranges resulting from geometric averaging across frameworks (as shown in Figure 1) are narrower and provide a more accurate quality measure than the raw ranges shown in Figure 3. However, it is instructive to expand the plot from Figure 3 to show each deep learning framework. Those ranges, as shown below, demonstrate that your neural network training time will strongly depend upon which deep learning framework you select.

Plot of deep learning benchmark results for each framework (without geometric averages) across Tesla K80, Tesla M40, and Tesla P100 16GB PCIe GPUs
Figure 4. GPU speedups over CPU-only trainings – showing the range of speedups when training four neural network types. The speedup ranges from Figure 3 are uncollapsed into values for each deep learning framework.

As shown in all four plots above, the Tesla P100 PCIe GPU provides the greatest speedups for neural network training. With that in mind, the plot below shows the raw training times for each type of neural network on each of the four deep learning frameworks.

Plot of deep learning benchmark training iteration times for each framework on Tesla P100 16GB PCIe GPUs
Figure 5. Training iteration times (in milliseconds) for each deep learning framework and neural network architecture (as measured on the Tesla P100 16GB PCIe GPU).

We provide more discussion below. For reference, we have listed the measurements from each set of tests.

Tesla P100 16GB PCIe Benchmark Results

| Framework | AlexNet | Overfeat | GoogLeNet | VGG (ver. a) | Speedup Over CPU |
|---|---|---|---|---|---|
| Caffe | 80 | 288 | 279 | 393 | 35x ~ 70x |
| TensorFlow | 46 | 144 | 253 | 277 | 16x ~ 40x |
| Theano | 161 | 482 | 624 | 2,075 | 19x ~ 43x |
| cuDNN-fp32 (Torch) | 44 | 107 | 247 | 222 | 33x ~ 41x |
| geometric average over frameworks | 71 | 215 | 331 | 473 | 29x ~ 42x |

Table 1: Benchmarks were run on a single Tesla P100 16GB PCIe GPU. Times reported are in msec per batch. The batch size is 128 for all training iterations in this study, except for VGG net, which uses a batch size of 64.

Tesla K80 Benchmark Results

| Framework | AlexNet | Overfeat | GoogLeNet | VGG (ver. a) | Speedup Over CPU |
|---|---|---|---|---|---|
| Caffe | 365 | 1,187 | 1,236 | 1,747 | 9x ~ 15x |
| TensorFlow | 181 | 622 | 979 | 1,104 | 4x ~ 10x |
| Theano | 515 | 1,716 | 1,793 | n/a | 8x ~ 16x |
| cuDNN-fp32 (Torch) | 171 | 379 | 914 | 743 | 9x ~ 12x |
| geometric average over frameworks | 276 | 832 | 1,187 | 1,127 | 9x ~ 11x |

Table 2: Benchmarks were run on a single Tesla K80 GPU chip. Times reported are in msec per batch. (No Theano result is included for VGG net.)

Tesla M40 Benchmark Results

| Framework | AlexNet | Overfeat | GoogLeNet | VGG (ver. a) | Speedup Over CPU |
|---|---|---|---|---|---|
| Caffe | 128 | 448 | 468 | 637 | 22x ~ 53x |
| TensorFlow | 82 | 273 | 418 | 498 | 10x ~ 22x |
| Theano | 245 | 786 | 963 | n/a | 17x ~ 28x |
| cuDNN-fp32 (Torch) | 79 | 182 | 433 | 400 | 19x ~ 22x |
| geometric average over frameworks | 119 | 364 | 534 | 506 | 20x ~ 27x |

Table 3: Benchmarks were run on a single Tesla M40 GPU. Times reported are in msec per batch. (No Theano result is included for VGG net.)

CPU-only Benchmark Results

| Framework | AlexNet | Overfeat | GoogLeNet | VGG (ver. a) |
|---|---|---|---|---|
| Caffe | 4,529 | 10,350 | 18,545 | 14,010 |
| TensorFlow | 1,823 | 5,275 | 4,018 | 7,341 |
| Theano | 5,275 | 13,579 | 26,829 | 38,687 |
| cuDNN-fp32 (Torch) | 1,838 | 3,604 | 8,234 | 9,166 |
| geometric average over frameworks | 2,991 | 7,190 | 11,326 | 13,819 |

Table 4: Benchmarks were run on dual Xeon E5-2690v4 processors in a system with 256GB RAM. Times reported are in msec per batch.

Discussion

When geometric averaging is applied across framework runtimes, a range of speedup values is derived for each GPU, as shown in Figure 1. CPU times are also averaged geometrically across framework type. These results indicate that the greatest speedups are realized with the Tesla P100, with the Tesla M40 ranking second, and the Tesla K80 yielding the lowest speedup factors. Figure 2 shows the range of speedup values by network architecture, uncollapsed from the ranges shown in Figure 1.
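
As a concrete sketch of this averaging, the AlexNet columns of Tables 1 and 4 reproduce the upper end of the P100 speedup range shown in Figure 1 (math.prod requires Python 3.8 or later):

from math import prod

def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))

# AlexNet msec/batch from Table 1 (Tesla P100) and Table 4 (CPU-only),
# ordered Caffe, TensorFlow, Theano, Torch
p100_ms = [80, 46, 161, 44]
cpu_ms = [4529, 1823, 5275, 1838]

speedup = geometric_mean(cpu_ms) / geometric_mean(p100_ms)
print(round(speedup))   # ~42x, the upper end of the P100 range in Figure 1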

The speedup ranges for runtimes not geometrically averaged across frameworks are shown in Figure 3. Here the set of all runtimes corresponding to each framework/network pair is considered when determining the range of speedups for each GPU type. Figure 4 shows the speedup ranges by framework, uncollapsed from the ranges shown in Figure 3. The degree of overlap in Figure 3 suggests that geometric averaging across framework type yields a better measure of GPU performance, with narrower and more distinct ranges resulting for each GPU type, as shown in Figure 1.

The greatest speedups were observed when comparing Caffe forward+backpropagation runtime to CPU runtime while solving the GoogLeNet network model. Caffe generally showed speedups larger than any other framework for this comparison, ranging from 35x to ~70x (see Figure 4 and Table 1). Despite the higher speedups, Caffe does not turn out to be the best performing framework on these benchmarks (see Figure 5). When comparing runtimes on the Tesla P100, Torch performs best and has the shortest runtimes (see Figure 5). Note that although VGG net tends to be the slowest of all, it does train faster than GoogLeNet when run on the Torch framework (see Figure 5).

The data show that Theano and TensorFlow display similar speedups on GPUs (see Figure 4). Although Theano sometimes has larger speedups than Torch, Torch and TensorFlow outperform Theano in absolute runtime. While Torch and TensorFlow yield similar performance, Torch performs slightly better with most network/GPU combinations. However, TensorFlow outperforms Torch in most cases for CPU-only training (see Table 4).

Theano is outperformed by all other frameworks, across all benchmark measurements and devices (see Tables 1 – 4). Figure 5 shows the large runtimes for Theano compared to the other frameworks run on the Tesla P100. It should be noted that since VGG net was run with a batch size of only 64 (compared to 128 for all other network architectures), its per-batch runtimes can sometimes be shorter than those for GoogLeNet. See, for example, the runtimes for Torch on GoogLeNet compared to VGG net, across all GPU devices (Tables 1 – 3).
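
A quick per-image calculation makes this batch-size caveat concrete (the values below are the Torch-on-P100 entries from Table 1):

# Torch on Tesla P100: msec per batch (Table 1) divided by batch size
googlenet_ms_per_image = 247 / 128   # batch size 128 -> ~1.9 msec per image
vgg_ms_per_image = 222 / 64          # batch size 64  -> ~3.5 msec per image
# VGG's smaller batches make its per-batch times look competitive,
# but per image it remains the slower of the two networks.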

Deep Learning Benchmark Conclusions

The single-GPU benchmark results show that speedups over CPU increase from Tesla K80, to Tesla M40, and finally to Tesla P100, which yields the greatest speedups (Table 5, Figure 1) and fastest runtimes (Table 6).

Range of Speedups, by GPU type

| Tesla P100 16GB PCIe | Tesla M40 12GB | Tesla K80 |
|---|---|---|
| 19x ~ 70x | 10x ~ 53x | 4x ~ 16x |

Table 5: Measured speedups for running various deep learning frameworks on GPUs (see Tables 1 – 3)

Fastest Runtime for VGG net, by GPU type

| Tesla P100 16GB PCIe | Tesla M40 12GB | Tesla K80 |
|---|---|---|
| 222 | 408 | 743 |

Table 6: Absolute best runtimes (msec / batch) across all frameworks for VGG net (ver. a). The Torch framework provides the best VGG runtimes, across all GPU types.

The results show that of the tested GPUs, Tesla P100 16GB PCIe yields the absolute best runtime, and also offers the best speedup over CPU-only runs. Regardless of which deep learning framework you prefer, these GPUs offer valuable performance boosts.

Benchmark Setup

Microway’s GPU Test Drive compute nodes were used in this study. Each is configured with 256GB of system memory and dual 14-core Intel Xeon E5-2690v4 processors (with a base frequency of 2.6GHz and a Turbo Boost frequency of 3.5GHz). Identical benchmark workloads were run on the Tesla P100 16GB PCIe, Tesla K80, and Tesla M40 GPUs. The batch size is 128 for all runtimes reported, except for VGG net (which uses a batch size of 64). All deep learning frameworks were linked to the NVIDIA cuDNN library (v5.1) instead of their own native deep network libraries, because linking to cuDNN yields better performance than using each framework’s native library.

When running the Theano benchmarks, slightly better runtimes resulted when CNMeM, a CUDA memory manager, was used to manage the GPU’s memory. Setting lib.cnmem=0.95 has CNMeM manage 95% of the GPU’s memory:
THEANO_FLAGS='floatX=float32,device=gpu0,lib.cnmem=0.95,allow_gc=True' python ...

Notes on Tesla M40 versus Tesla K80

The data demonstrate that the Tesla M40 outperforms the Tesla K80. When geometrically averaging runtimes across frameworks, the speedup of the Tesla K80 ranges from 9x to 11x, while for the Tesla M40, speedups range from 20x to 27x. The same relationship exists when comparing ranges without geometric averaging. This result is expected, considering that the Tesla K80 card consists of two separate GK210 GPU chips (connected by a PCIe switch on the GPU card). Since the benchmarks here were run on single GPU chips, they reflect only half the throughput possible on a Tesla K80 GPU. If running a perfectly parallel job, or two separate jobs, the Tesla K80 should be expected to approach the throughput of a Tesla M40.
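
A minimal sketch of that usage pattern, assuming the two GK210 chips appear as CUDA devices 0 and 1 and that train.py stands in for your own training script:

import os
import subprocess

# Launch one independent training job per GK210 chip. Each process sees only
# its own CUDA device, so the two jobs run side by side on the one K80 card.
jobs = [subprocess.Popen(["python", "train.py"],
                         env={**os.environ, "CUDA_VISIBLE_DEVICES": dev})
        for dev in ("0", "1")]
for job in jobs:
    job.wait()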

Singularity Containers

Singularity is a new type of container designed specifically for HPC environments. Singularity enables the user to define an environment within the container, which might include customized deep learning frameworks, NVIDIA device drivers, and the CUDA 8.0 toolkit. The user can copy and transport this container as a single file, bringing their customized environment to a different machine where the host OS and base hardware may be completely different. The workflow inside the container executes in the host’s OS environment just as it does in the container’s internal environment. That workflow is pre-defined inside the container, including any necessary library files, packages, configuration files, environment variables, and so on.

In order to facilitate benchmarking of four different deep learning frameworks, Singularity containers were created separately for Caffe, TensorFlow, Theano, and Torch. Given its simplicity and powerful capabilities, you should expect to hear more about Singularity soon.
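
As a rough sketch of how such a container is used (the image and script names below are placeholders), each framework’s benchmark can be launched from inside its container:

import subprocess

# Run a benchmark inside the Caffe container; Singularity executes the
# command within the container's environment on the host OS.
subprocess.run(["singularity", "exec", "caffe.img",
                "python", "run_benchmark.py"], check=True)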

References

DeepMarks
Deep Learning Benchmarks published on GitHub

Singularity
Containers for Full User Control of Environment

AlexNet
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems. 2012.

Overfeat
Sermanet, Pierre, et al. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).

GoogLeNet
Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

VGG Net
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers

The new NVIDIA Tesla P100 GPUs are available with both PCI-Express and NVLink connectivity. How do these two types of connectivity compare? This post provides a rundown of NVLink vs PCI-E and explores the benefits of NVIDIA’s new NVLink technology.

Photo of NVIDIA Tesla P100 NVLink GPUs in an OpenPOWER server

Considering the variety of options for Tesla P100 GPUs, you may wish to review our other recent posts.

Primary considerations when comparing NVLink vs PCI-E

On systems with x86 CPUs (such as Intel Xeon), the connectivity to the GPU is only through PCI-Express (although the GPUs connect to each other through NVLink). On systems with POWER8 CPUs, the connectivity to the GPU is through NVLink (in addition to the NVLink between GPUs).

Nevertheless, the performance characteristics of the GPU itself (GPU cores, GPU memory, etc) do not vary. The Tesla P100 GPU itself will be performing at the same level. It’s the data flow and total system throughput that will determine the final performance for your workload. To review:

  • Full NVLink connectivity is only available with IBM POWER8 CPUs (not x86 CPUs)
  • GPU-to-GPU NVLink connectivity (without CPU-to-GPU) is available with x86 CPUs
  • Internal performance of an NVIDIA Tesla P100 SXM2 GPU will not vary between x86 and POWER8

With that in mind, let’s compare their throughput.

Tesla P100 with NVLink on OpenPOWER

The NVLink connections on Tesla P100 GPUs provide a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, those links are made up of several bricks, which can be split up to connect to a number of other devices. For example, one GPU might dedicate 40GB/s for a link to a CPU and 40GB/s for a link to a nearby GPU.

Device <-> Device NVLink Performance

Below is the output from NVIDIA’s GPU peer-to-peer (P2P) utility, which is included with CUDA 8.0. The results summarize the throughput (in gigabytes per second) and latency (in microseconds) when sending messages between pairs of Tesla P100 GPUs in our OpenPOWER system.

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided in half. One link goes to a POWER8 CPU and one link goes to the adjacent P100 GPU (see diagram below).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:2
Device: 1, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:3
Device: 2, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:a
Device: 3, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:b

...

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 457.93  35.30  20.37  20.40
     1  35.30 454.78  20.16  20.14
     2  20.19  20.16 454.56  35.29
     3  18.36  18.42  35.29 454.07

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.99   7.92  15.56  15.43
     1   8.06   5.00  15.40  15.40
     2  15.47  15.52   5.04   8.07
     3  15.43  15.49   8.04   4.97

As the results show, each 40GB/s Tesla P100 NVLink will provide ~35GB/s in practice. Communications between GPUs on a remote CPU offer throughput of ~20GB/s. Latency between GPUs is 8~16 microseconds. The results were gathered on our 2U OpenPOWER GPU server with Tesla P100 NVLink GPUs, which is available to benchmark in our Test Drive cluster. The architectural design of this particular platform is:

Block diagram drawing of the Microway OpenPOWER GPU Server with NVLink GPUs
Block diagram of the 2U Microway OpenPOWER GPU server with Tesla P100 NVLink GPUs

Device <-> Device PCI-E Performance

A similar test, run on GPUs connected by standard PCI-Express, will result in the following performance:

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 452.19  10.19  10.73  10.74
     1  10.19 450.04  10.76  10.75
     2  10.91  10.90 450.94  10.21
     3  10.90  10.91  10.18 450.95

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.22   7.86  16.90  17.05
     1   7.85   3.21  17.08  17.22
     2  16.32  16.37   3.07   7.85
     3  16.26  16.35   7.84   3.07

The latencies between GPUs are about the same (although latency is higher when traveling to GPUs on a remote CPU). However, transfer bandwidth is significantly higher for NVLink vs PCI-E (two to three times higher). This increased throughput gives NVLink an advantage for fine-grained applications and others which send data between GPUs.

NVLink vs PCI-E: Host <-> Device Performance

CPU-to-GPU data transfers occur whenever data must be transferred into or out of the GPU. These are typically called host-to-device and device-to-host transfers. Traditional systems with x86 CPUs are only able to communicate with the GPUs over PCI-Express, which provides lower throughput. Our OpenPOWER systems provide full NVLink connectivity to the GPUs. Here’s the achieved performance:

Host <-> Device across NVLink

[root@openpower8 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			33236.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			32322.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			448515.9

Result = PASS

Host <-> Device across PCI-E

A similar test, run on an x86 system with GPUs connected by PCI-Express, will result in the following performance:

...

 Device 0: Tesla P100-PCIE-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11658.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12882.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			446125.2

Just as with the GPU-to-GPU transfers, we see that NVLink enables much faster performance. Remember that this type of transfer occurs whenever you load a dataset into main memory and then process some of that data on the GPUs. These transfer times are often a factor in overall application performance, so a 3X speedup is welcome. This increased performance may also enable applications which were previously too data-movement-intensive.
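
To put those numbers in perspective, here is a back-of-the-envelope sketch using the bandwidths measured above (~12.8GB/s over PCI-E, ~33GB/s over NVLink); the 4GB dataset size is hypothetical:

def transfer_ms(num_bytes, gb_per_sec):
    """Time (in msec) to move num_bytes at a sustained bandwidth of gb_per_sec."""
    return num_bytes / (gb_per_sec * 1e9) * 1000.0

dataset_bytes = 4 * 1024**3              # a hypothetical 4GB batch of input data
print(transfer_ms(dataset_bytes, 12.8))  # PCI-E x16 gen3: ~336 msec
print(transfer_ms(dataset_bytes, 33.0))  # NVLink to POWER8: ~130 msec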

Finally, consider that NVIDIA CUDA 8.0 (together with the Tesla P100 GPUs) provides fully unified memory. You will be able to load datasets larger than GPU memory and let the system automatically manage data movement. In other words, the size of GPU memory no longer limits the size of your jobs. On such runs, having a wider pipe between CPU and GPU memory is of immense importance.

NVIDIA deviceQuery on OpenPOWER server with Tesla P100 GPUs and NVLink

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 GPU accelerators with NVLink. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code. Also note that in this platform there are three DMA copy engines per GPU.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   2 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

  [...]

> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS
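
The deviceQuery output above reports Compute Capability 6.0. Here is a minimal sketch of targeting it when compiling your own CUDA code (the source file name is a placeholder, and nvcc must be on your PATH):

import subprocess

# Compile a CUDA source file for the P100's Compute Capability 6.0 (sm_60).
subprocess.run(["nvcc", "-gencode", "arch=compute_60,code=sm_60",
                "my_kernel.cu", "-o", "my_kernel"], check=True)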

How to move forward – GPU systems with Host-to-Device NVLink

Because Host-to-Device NVLink requires this new high-speed connection to the CPU, there is currently only one server on the market with both Host-to-Device and Device-to-Device NVLink connectivity. This system, leveraging IBM’s POWER8 CPUs and innovation from the OpenPOWER Foundation (including NVIDIA and Mellanox), began shipments in fall 2016. Please contact us to learn more, or read about this OpenPOWER server. Academic discounts are available.

To learn more about the available NVIDIA Tesla “Pascal” GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

NVIDIA Tesla P100 NVLink 16GB GPU Accelerator (Pascal GP100 SXM2) Up Close

The NVIDIA Tesla P100 NVLink GPUs are a big advancement. For the first time, the GPU is stepping outside the traditional “add in card” design. No longer tied to the fixed specifications of PCI-Express cards, NVIDIA’s engineers have designed a new form factor that best suits the needs of the GPU. With their SXM2 design, NVIDIA can run GPUs to their full potential.

One of the biggest changes this allows is the NVLink interconnect, which allows GPUs to operate beyond the restrictions of the PCI-Express bus. Instead, the GPUs communicate with one another over this high-speed link. Additionally, these new “Pascal” architecture GPUs bring improvements including higher performance, faster connectivity, and more flexibility for users & programmers.

Close-Up Photo of the NVIDIA Tesla P100 NVLink GPU

There is variety in the new line-up of GPU products. For the Tesla P100 GPU model, there are three separate paths to be considered.

Highlights of the new Tesla P100 NVLink GPUs include:

  • Up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance
  • 16GB of on-package HBM2 CoWoS GPU memory, with bandwidths up to 732GB/s
  • 80GB/s NVLink between GPUs boosts bandwidth between the Tesla P100 GPUs
  • High-speed, on-package GPU memory provides a 3X improvement over older GPUs
  • Pascal Unified Memory allows applications to directly access the memory of all GPUs and all of system memory

Improved Data Transfer Speeds

The NVLink connection on Tesla P100 GPUs has a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, that connectivity is only between GPUs. The GPUs still communicate via PCI-Express when transferring data to and from the host (via PCI-E x16 generation 3.0). The high-speed NVLink connection is only for data transfers directly between the GPUs.

Device <-> Device Tesla P100 NVLink Performance

Below is a section of output from NVIDIA’s GPU peer-to-peer (P2P) utility, which is included with CUDA 8.0. The results summarize the throughput (in gigabytes per second) and latency (in microseconds) when sending messages between any pair of GPUs.

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided up so that connections could be made to the three other GPUs. The links are divided such that each GPU has two 20GB/s links and one 40GB/s link (see diagram below).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla P100-SXM2-16GB, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla P100-SXM2-16GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla P100-SXM2-16GB, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla P100-SXM2-16GB, pciBusID: 85, pciDeviceID: 0, pciDomainID:0

...

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 449.69  18.45  18.45  36.72
     1  18.44 450.92  36.70  18.44
     2  18.45  36.70 450.37  18.44
     3  36.71  18.44  18.44 447.34

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.66   9.25   9.31   9.67
     1   9.49   3.65  10.04   9.05
     2   9.85  10.13   3.13   9.79
     3  10.06  11.41   9.97   3.54

As the results show, a 20GB/s Tesla P100 NVLink will provide ~18GB/s in practice. A 40GB/s Tesla P100 NVLink will provide ~36GB/s. Latency between GPUs is 9~10 microseconds. The results were gathered on our 1U NumberSmasher Server with four Tesla P100 NVLink GPUs, which is also available in our Test Drive cluster. The architectural design of this particular platform is:

Block diagram of the NumberSmasher 1U NVLink server with four Tesla P100 GPUs (SYS-1028GQ-TXR)
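
A short sketch of reading the bandwidth matrix above: pair bandwidths near 36GB/s share the 40GB/s link, while those near 18GB/s use 20GB/s links (the values below are copied from the test output):

# Unidirectional P2P bandwidths (GB/s) from the matrix above, keyed by GPU pair
measured = {(0, 1): 18.45, (0, 2): 18.45, (0, 3): 36.72,
            (1, 2): 36.70, (1, 3): 18.44, (2, 3): 18.44}

# Classify each pair by the NVLink width it implies (~90% link efficiency)
links = {pair: "40GB/s link" if gbps > 27.0 else "20GB/s link"
         for pair, gbps in measured.items()}
print(links)   # each GPU ends up with two 20GB/s links and one 40GB/s link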

Host <-> Device Performance

Transfers between system memory and the GPU are still via PCI-Express and will perform similarly to previous-generation “Kepler” and “Maxwell” GPUs. With Tesla P100, you will be able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node2 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11463.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12868.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			446271.0

Result = PASS
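
For reference, a quick sketch of how the ~12.8GB/s measured above compares with the theoretical limit of a PCI-E 3.0 x16 slot (8 GT/s per lane with 128b/130b encoding):

# Theoretical peak of PCI-E 3.0 x16, per direction
lanes = 16
transfers_per_sec = 8e9       # 8 GT/s per lane
encoding = 128.0 / 130.0      # 128b/130b line encoding overhead
peak_gb_per_sec = lanes * transfers_per_sec * encoding / 8 / 1e9
print(peak_gb_per_sec)        # ~15.75 GB/s; ~12.8 GB/s is typical in practice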

Technical Details

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla P100 GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node2 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Tue Dec  6 16:30:58 2016
Driver Version                      : 367.48

Attached GPUs                       : 1
GPU 0000:06:00.0
    Product Name                    : Tesla P100-SXM2-16GB
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032311609xxxx
    GPU UUID                        : GPU-70ba5857-9613-1213-c5f5-3b201233xxxx
    Minor Number                    : 0
    VBIOS Version                   : 86.00.26.00.02
    MultiGPU Board                  : No
    Board ID                        : 0x600
    GPU Part Number                 : 900-2H403-0000-000
    Inforom Version
        Image Version               : H403.0201.00.04
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x06
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x15F910DE
        Bus Id                      : 0000:06:00.0
        Sub System Id               : 0x116B10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 16276 MiB
        Used                        : 0 MiB
        Free                        : 16276 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 40 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 34.89 W
        Power Limit                 : 300.00 W
        Default Power Limit         : 300.00 W
        Enforced Power Limit        : 300.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 405 MHz
        SM                          : 405 MHz
        Memory                      : 715 MHz
        Video                       : 835 MHz
    Applications Clocks
        Graphics                    : 1480 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1328 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1480 MHz
        SM                          : 1480 MHz
        Memory                      : 715 MHz
        Video                       : 1480 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support large numbers of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power consumption limits and monitor the clock speeds (including explanations for any reasons the clocks are running at a lower speed).
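
As a sketch of the corresponding administrator commands (run as root; the -pl value must fall within the power limits shown above, and the -ac clock pair must come from the supported clocks queried below):

import subprocess

# Cap GPU 0 at 250W, then pin application clocks to a supported
# (memory, graphics) pair -- e.g. 715 MHz memory, 1328 MHz graphics.
subprocess.run(["nvidia-smi", "-i", "0", "-pl", "250"], check=True)
subprocess.run(["nvidia-smi", "-i", "0", "-ac", "715,1328"], check=True)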

[root@node2 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Tue Dec  6 16:39:20 2016
Driver Version                      : 367.48

Attached GPUs                       : 4
GPU 0000:06:00.0
    Supported Clocks
        Memory                      : 715 MHz
            Graphics                : 1480 MHz
            Graphics                : 1468 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz

NVIDIA deviceQuery on Tesla P100 NVLink 16GB GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 NVLink GPU accelerators. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            405 MHz (0.41 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

  [...]

> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, 
Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS

Additional Information on Tesla P100 NVLink GPUs

To learn more about the available P100 GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

Due to their novel design, Tesla P100 NVLink GPUs cannot be installed into existing GPU systems. Platforms with the NVLink-connected SXM2 sockets are required. For several options, have a look at our list of P100 GPU-accelerated systems. You may also wish to review our post on PCI-Express connected Tesla P100 GPUs.

Photo of the back side of the NVIDIA Tesla P100 NVLink GPU

NVIDIA Tesla P100 PCI-E 16GB GPU Accelerator (Pascal GP100) Up Close

NVIDIA’s new Tesla P100 PCI-E GPU is a big step up for HPC users, and for GPU users in general. Although other workloads have been leveraging the newer “Maxwell” architecture, HPC applications have been using “Kepler” GPUs for a couple years. The new GPUs bring many improvements, including higher performance, faster connectivity, and more flexibility for users & programmers.

Close-up photo of the NVIDIA Tesla P100 PCI-E GPU

Because GPUs have proven themselves so well, there are now GPUs optimized for particular applications. For example, a video transcoding project would be unlikely to use the same GPU as a computational chemistry project. However, the Tesla P100 serves as the best all-round choice for those who need to support a variety of applications. With that in mind, there are three separate paths to be considered.

Highlights of the new Tesla P100 PCI-E GPUs include:

  • Up to 4.7 TFLOPS double- and 9.3 TFLOPS single-precision floating-point performance
  • 16GB of on-package HBM2 CoWoS GPU memory, with bandwidths up to 732GB/s
  • High-speed, on-package GPU memory provides a 3X improvement over older GPUs
  • Pascal Unified Memory allows applications to directly access the memory of all GPUs and all of system memory

Improved Data Transfer Speeds

Although the Tesla P100 GPU uses the same generation of PCI-E connectivity, some optimizations have been made since the Kepler generation. With P100, you will be able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node6 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-PCIE-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11688.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12886.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			444927.2*

Result = PASS

* Note that there is also a 12GB version of the Tesla P100 PCI-E GPU – the memory operates 25% slower

Technical Details

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node6 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Sep 28 11:03:51 2016
Driver Version                      : 367.44

Attached GPUs                       : 1
GPU 0000:02:00.0
    Product Name                    : Tesla P100-PCIE-16GB
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032301607xxxx
    GPU UUID                        : GPU-de136156-7f6d-ced1-869c-4dc56e09xxxx
    Minor Number                    : 1
    VBIOS Version                   : 86.00.26.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2H400-0000-000
    Inforom Version
        Image Version               : H400.0201.00.06
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x15F810DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 16276 MiB
        Used                        : 0 MiB
        Free                        : 16276 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 36 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 26.39 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 405 MHz
        SM                          : 405 MHz
        Memory                      : 715 MHz
        Video                       : 835 MHz
    Applications Clocks
        Graphics                    : 1328 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1328 MHz
        SM                          : 1328 MHz
        Memory                      : 715 MHz
        Video                       : 1328 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support large numbers of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power consumption limits and monitor the clock speeds (including explanations for any reasons the clocks are running at a lower speed).

[root@node6 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Sep 28 11:05:36 2016
Driver Version                      : 367.44

Attached GPUs                       : 2
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 715 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz

NVIDIA deviceQuery on Tesla P100 PCI-E 16GB GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 PCI-E GPU accelerators. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla P100-PCIE-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            405 MHz (0.41 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
Result = PASS

Additional Information on Tesla P100 PCI-E GPUs

To learn more about the available P100 GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

If you’re thinking about upgrading to P100, have a look at our list of P100 GPU-accelerated systems. You may also wish to review our posts on NVLink-connected Tesla P100 GPUs. If you’re hoping to install Tesla P100 PCI-E GPUs in your existing systems, take note that you’ll need a compatible server platform – one of our experts can help you review.

Photo of the rear side of the NVIDIA Tesla P100 PCI-E GPU
