tesla Archives - Microway | We Speak HPC & AI | https://www.microway.com/tag/tesla/

NVIDIA “Turing” Tesla T4 HPC Performance Benchmarks
Fri, 15 Mar 2019 | https://www.microway.com/hpc-tech-tips/nvidia-turing-tesla-t4-hpc-performance-benchmarks/

Performance benchmarks are an insightful way to compare new products on the market. With so many GPUs available, it can be difficult to assess which are suitable to your needs. Various benchmarks provide information to compare performance on individual algorithms or operations. Since there are so many different algorithms to choose from, there is no shortage of benchmarking suites available.

For this comparison, the SHOC benchmark suite (https://github.com/vetter/shoc/) is used to compare the performance of the NVIDIA Tesla T4 with other GPUs commonly used for scientific computing: the NVIDIA Tesla P100 and Tesla V100.

The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. Its initial focus is on systems containing Graphics Processing Units (GPUs) and multi-core processors, and on the OpenCL programming standard. It can be used on clusters as well as individual hosts.

The SHOC benchmark suite includes options for many benchmarks relevant to a variety of scientific computations. Most of the benchmarks are provided in both single- and double-precision versions, and with and without PCI-E transfer overhead included, which means there are up to four results for each benchmark. The benchmarks are organized into three levels and can be run individually or all together.

The Tesla P100 and V100 GPUs are well-established accelerators for HPC and AI workloads. They typically offer the highest performance, consume the most power (250~300W), and have the highest price tag (~$10k). The Tesla T4 is a new product based on the latest “Turing” architecture, delivering increased efficiency along with new features. However, it is not a replacement for the bigger/more power-hungry GPUs. Instead, it offers good performance while consuming far less power (70W) at a lower price (~$2.5k). You’ll want to use the right tool for the job, which will depend upon your workload(s). A summary of each Tesla GPU is shown below.

In our testing, both single- and double-precision SHOC benchmarks were run, which allows us to make a direct comparison of the capabilities of each GPU. A few HPC-relevant benchmarks were selected to compare the T4 to the P100 and V100. Tesla P100 is based on the “Pascal” architecture, which provides standard CUDA cores. Tesla V100 features the “Volta” architecture, which introduced deep-learning specific TensorCores to complement CUDA cores. Tesla T4 has NVIDIA’s “Turing” architecture, which includes TensorCores and CUDA cores (weighted towards single-precision). This product was designed primarily with machine learning in mind, which results in higher single-precision performance and relatively low double-precision performance. Below, some of the commonly-used HPC benchmarks are compared side-by-side for the three GPUs.

Double Precision Results

Benchmark | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS) | 253.38 | 7072.86 | 4736.76
Fast Fourier Transform (GFLOPS) | 132.60 | 1148.75 | 756.29
Matrix Multiplication (GFLOPS) | 249.57 | 5920.01 | 4256.08
Molecular Dynamics (GFLOPS) | 105.26 | 908.62 | 402.96
S3D (GFLOPS) | 59.97 | 227.85 | 161.54

 

Single Precision Results

Benchmark | Tesla T4 | Tesla V100 | Tesla P100
Max Flops (GFLOPS) | 8073.26 | 14016.50 | 9322.46
Fast Fourier Transform (GFLOPS) | 660.05 | 2301.32 | 1510.49
Matrix Multiplication (GFLOPS) | 3290.94 | 13480.40 | 8793.33
Molecular Dynamics (GFLOPS) | 572.91 | 997.61 | 480.02
S3D (GFLOPS) | 99.42 | 434.78 | 295.20

 

What Do These Results Mean?

The single-precision results show Tesla T4 performing well for its size, though it falls short in double precision compared to the NVIDIA Tesla V100 and Tesla P100 GPUs. Applications that require double-precision accuracy are not suited to the Tesla T4. However, the single precision performance is impressive and bodes well for the performance of applications that are optimized for lower or mixed precision.

[Figure: Plot comparing the performance of Tesla T4 with the Tesla P100 and Tesla V100 GPUs]

To explain the single-precision benchmarks shown above:

  • The Max Flops for the T4 are good compared to V100 and competitive with P100. Tesla T4 provides more than half as many FLOPS as V100 and more than 80% of P100.
  • The T4 shows impressive performance in the Molecular Dynamics benchmark (an n-body pairwise computation using the Lennard-Jones potential). It again offers more than half the performance of Tesla V100, while beating the Tesla P100.
  • In the Fast Fourier Transform (FFT) and Matrix Multiplication benchmarks, the performance of Tesla T4 is on par for both price/performance and power/performance (one fourth the performance of V100 for one fourth the price and one fourth the wattage). This reflects how the T4 will perform in a large number of HPC applications.
  • For S3D, the T4 falls a few percentage points further behind the larger GPUs.

Looking at these results, it’s important to remember the context. Tesla T4 consumes only ~25% the wattage of the larger Tesla GPUs and costs only ~25% as much. It is also a physically smaller GPU that can be installed in a wider variety of servers and compute nodes. In that context, the Tesla T4 holds its own as a powerful option for a reasonable price when compared to the larger NVIDIA Tesla GPUs.
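To make that context concrete, here is a minimal sketch (assuming the approximate prices and board wattages quoted in this article, which vary by configuration and reseller) that turns the single-precision matrix multiplication results above into performance-per-watt and performance-per-dollar:

import math  # not strictly needed; kept minimal on purpose

# Hypothetical sanity check using this article's approximate figures.
gpus = {
    #              GFLOPS    watts  price ($)
    "Tesla T4":   ( 3290.94,   70,   2500),
    "Tesla V100": (13480.40,  300,  10000),
    "Tesla P100": ( 8793.33,  250,  10000),
}
for name, (gflops, watts, price) in gpus.items():
    print(f"{name}: {gflops / watts:5.1f} GFLOPS/W, {gflops / price:5.2f} GFLOPS/$")

On these assumptions the T4 actually leads on GFLOPS per watt, which is the heart of its value proposition.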

What to Expect from the NVIDIA Tesla T4

Cost-Effective Machine Learning

The T4 offers substantial machine-learning-focused single/mixed precision performance, with a price tag significantly lower than the larger Tesla GPUs. What the T4 lacks in double precision, it makes up for with impressive single-precision results. That single-precision performance caters strongly to machine learning algorithms, with further potential where mixed precision can be applied. Future work will examine this aspect more closely, but Tesla T4 is expected to be of high interest for deep learning inference and to have specific use-cases for deep learning training.

Impressive Single-Precision HPC Performance

In the molecular dynamics benchmark, the T4 outperforms the Tesla P100 GPU. This is extremely impressive, and for those interested in single- or mixed-precision calculations involving similar algorithms, the T4 could provide an excellent solution. With some algorithmic adaptation, the T4 may be a strong contender for scientific applications that also want to use machine learning capabilities to analyze results, or to run a variety of algorithms from both machine learning and scientific computing on an easily accessible GPU.

In addition to the outright lower price tag, the T4 also operates at 70 Watts, in comparison to the 250+ Watts required for the Tesla P100 / V100 GPUs. Running on one quarter of the power means that it is both cheaper to purchase and cheaper to operate.

Next Steps for Leveraging Tesla T4

If it appears the new Tesla T4 will accelerate your workload, but you'd like to verify with benchmarks, please sign up for a Test Drive. We also invite you to contact one of our experts to discuss your needs further. Our goal is to understand your requirements, provide guidance on the best options, and see the project through to successful system/cluster deployment.

Full SHOC Benchmark Results

NVIDIA Tesla V100 Price Analysis
Wed, 09 May 2018 | https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/

Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

Tesla GPU model | Price | Double-Precision Performance (FP64) | Dollars per TFLOPS | Deep Learning Performance (TensorFLOPS or 1/2 Precision) | Dollars per DL TFLOPS
Tesla V100 PCI-E 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7 TFLOPS | $1,523 ($1,637 for 32GB) | 112 TFLOPS | $95.21 ($102.30 for 32GB)
Tesla P100 PCI-E 16GB | $7,374* | 4.7 TFLOPS | $1,569 | 18.7 TFLOPS | $394.33
Tesla V100 SXM 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7.8 TFLOPS | $1,367 ($1,469 for 32GB) | 125 TFLOPS | $85.31 ($91.66 for 32GB)
Tesla P100 SXM2 16GB | $9,428* | 5.3 TFLOPS | $1,779 | 21.2 TFLOPS | $444.72

* single-unit list price before any applicable discounts (ex: EDU, volume)
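
The dollars-per-TFLOPS columns are simple divisions, which you can reproduce directly from the table. A quick sketch using the list prices and performance figures above:

# Reproduce the dollars-per-TFLOPS columns from the price table.
cards = [
    # model, price ($), FP64 TFLOPS, DL TFLOPS
    ("Tesla V100 PCI-E 16GB", 10664, 7.0, 112),
    ("Tesla P100 PCI-E 16GB",  7374, 4.7, 18.7),
    ("Tesla V100 SXM 16GB",   10664, 7.8, 125),
    ("Tesla P100 SXM2 16GB",   9428, 5.3, 21.2),
]
for model, price, fp64, dl in cards:
    print(f"{model}: ${price / fp64:,.0f} per FP64 TFLOPS, "
          f"${price / dl:,.2f} per DL TFLOPS")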

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration

Tesla V100 Double Precision HPC: Pay More for the GPU, Get More Performance

[Figure: VMD visualization of a nucleosome]

You’ll notice that Tesla V100 delivers an almost 50% increase in double precision performance. This is crucial for many HPC codes. A variety of applications have been shown to mirror this performance boost. In addition, Tesla V100 now offers the option of 2X the memory of Tesla P100 16GB for memory bound workloads.

Tesla V100 is a compelling choice for HPC workloads: it will almost always deliver the greatest absolute performance. However, in the right situation a Tesla P100 can still deliver reasonable price/performance as well.

Both Tesla P100 and V100 GPUs should be considered for GPU-accelerated HPC clusters and servers. A Microway expert can help you evaluate what's best for your needs and applications, and/or provide you with remote benchmarking resources.

Tesla V100 for Deep Learning: Enormous Advancement & Value, the New Standard


If your goal is maximum Deep Learning performance, Tesla V100 is an enormous on-paper leap in performance. The dedicated TensorCores have huge performance potential for deep learning applications. NVIDIA has even coined a new term, the “TensorFLOP,” to measure this gain. Tesla V100 delivers a 6X on-paper advancement.

If your budget allows you to purchase at least 1 Tesla V100, it’s the right GPU to invest in for deep learning performance. For the first time, the beefy Tesla V100 GPU is compelling for not just AI Training, but AI Inference as well (unlike Tesla P100).

Moreover, only a selection of Deep Learning frameworks fully take advantage of the TensorCores today. As more and more DL frameworks are optimized to use these new TensorCores and their instructions, the gains will grow. Even before many major optimizations, workloads have already advanced 3X-4X.

Finally, there is no longer an SXM cost premium for Tesla V100 GPUs (and only a modest premium for SXM-enabled host servers). Nearly all DL applications benefit greatly from GPU-to-GPU NVLink; a selection of HPC applications (ex: AMBER) do as well today.

If you’re running DL frameworks, select Tesla V100, and if possible the SXM-enabled GPUs and servers.

FLOPS vs Real Application Performance

Unless you know for certain that your workload correlates with raw FLOPS, we strongly discourage anyone from making purchasing decisions strictly based upon raw $/FLOP calculations.

While the generalizations above are useful, application performance differs dramatically from any simplistic FLOPS calculation. Device-to-device bandwidth, host-to-device bandwidth, GPU memory bandwidth, and code maturity are all levers on realized application performance just as significant as FLOPS.

Here are some of NVIDIA's own performance results across real applications:


You’ll see that some codes scale similarly to the on-paper FLOPS gains, and others are frankly far more removed.

At most, use such simplistic FLOPS and price/performance calculations to guide higher-level decision-making: to predict new hardware relative to prior testing of FLOPS vs. actual performance, to steer which GPUs to consider, to decide what to purchase for POCs, or to identify appropriate GPUs for remote testing that validates actual application performance.

No one should buy based upon price/performance per FLOP; most should buy based upon price/performance per workload (or basket of workloads).

When Paper Performance + Intuition Collide with Reality

While the above guidelines are helpful, there is still a wide diversity of workloads out there in the field. Apart from testing that steers you to one GPU or another, here are some good reasons we have seen customers use (or have advised them to use) when making other selections:

[Image: Tesla V100 SXM2 GPU]
  • Your application has shown diminishing returns to advances in GPU performance in the past (Tesla P100 might be a price/performance choice)
  • Your budget doesn’t allow for even a single Tesla V100 (pick Tesla P100, still great speedups)
  • Your budget allows for a server with 2 Tesla P100s, but not 2 Tesla V100s (Pick 2 Tesla P100s vs 1 Tesla V100)
  • Your application is GPU memory capacity-bound (pick Tesla V100 32GB)
  • There are workload sharing considerations (ex: preferred scheduler only allocates whole GPUs)
  • Your application isn’t multi-GPU enabled (pick Tesla V100, the most powerful single GPU)
  • Your application is GPU memory bandwidth limited (test it, but potential case for Tesla P100)

Further Resources

You may wish to reference our Comparison of Tesla “Volta” GPUs, which summarizes the technical improvements made in these new GPUs, or our Tesla V100 GPU Review for more extended discussion.

If you’re looking to see how these GPUs will be deployed in production, read our NVIDIA GPU Clusters page. As always, please feel free to reach out to us if you’d like to get a better understanding of these latest HPC systems and what they can do for you.

In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators
Mon, 12 Mar 2018 | https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-volta-gpu-accelerators/

This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed “Volta”). “Volta” GPUs improve upon the previous-generation “Pascal” architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018; Tesla V100S was released in late 2019. Note: these have since been superseded by the NVIDIA Ampere GPU architecture.

This page is intended to be a fast and easy reference of key specs for these GPUs. You may wish to browse our Tesla V100 Price Analysis and Tesla V100 GPU Review for more extended discussion.

Important features available in the “Volta” GPU architecture include:

  • Exceptional HPC performance with up to 8.2 TFLOPS double- and 16.4 TFLOPS single-precision floating-point performance.
  • Deep Learning training performance with up to 130 TFLOPS FP16 half-precision floating-point performance.
  • Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
  • Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU
  • NVLink enables an 8~10X increase in bandwidth between the Tesla GPUs and from GPUs to supported system CPUs (compared with PCI-E).
  • High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
  • Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
  • Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
  • Cooperative Groups – a new programming model introduced in CUDA 9 for organizing groups of communicating threads

Tesla “Volta” GPU Specifications

The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures

Feature | Kepler GK210 | Pascal GP100 | Volta GV100
Compute Capability ^ | 3.7 | 6.0 | 7.0
Threads per Warp | 32 | 32 | 32
Max Warps per SM | 64 | 64 | 64
Max Threads per SM | 2048 | 2048 | 2048
Max Thread Blocks per SM | 16 | 32 | 32
Max Concurrent Kernels | 32 | 128 | 128
32-bit Registers per SM | 128 K | 64 K | 64 K
Max Registers per Thread Block | 64 K | 64 K | 64 K
Max Registers per Thread | 255 | 255 | 255
Max Threads per Thread Block | 1024 | 1024 | 1024
L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 32KB ~ 128KB (dynamic with shared memory)
Shared Memory Configurations | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total) | 64KB | configurable up to 96KB; remainder for L1 cache (128KB total)
Max Shared Memory per Thread Block | 48KB | 48KB | 96KB*
Max X Grid Dimension | 2^32-1 | 2^32-1 | 2^32-1
Hyper-Q | Yes | Yes | Yes
Dynamic Parallelism | Yes | Yes | Yes
Unified Memory | No | Yes | Yes
Pre-Emption | No | Yes | Yes

^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48 KB requires dynamic shared memory

Hardware-accelerated video encoding and decoding

All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.
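As a concrete illustration, the sketch below drives a hardware-accelerated encode from Python; it assumes an ffmpeg binary built with NVENC support is on the PATH, and the filenames are placeholders:

import subprocess

# Hypothetical example: H.264 encode on the GPU's NVENC unit via ffmpeg.
# Assumes an ffmpeg build with NVENC support; filenames are placeholders.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",   # source clip
    "-c:v", "h264_nvenc",          # select the NVENC hardware encoder
    "output.mp4",
], check=True)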

Tesla V100 “Volta” GPU Review
Thu, 28 Sep 2017 | https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, One Giant Leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here's a summary of the key enhancements vs the Tesla P100 GPUs:

Metric | Tesla V100 with NVLink | Tesla V100 PCI-E | Tesla P100 with NVLink | Tesla P100 PCI-E | Ratio Tesla V100:P100
DP TFLOPS | 7.8 TFLOPS | 7.0 TFLOPS | 5.3 TFLOPS | 4.7 TFLOPS | ~1.4-1.5X
SP TFLOPS | 15.7 TFLOPS | 14 TFLOPS | 9.3 TFLOPS | 8.74 TFLOPS | ~1.4-1.5X
TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 21.2 TFLOPS (1/2 precision) | 18.7 TFLOPS (1/2 precision) | ~6X
Interface (bidirectional BW) | 300GB/sec | 32GB/sec | 160GB/sec | 32GB/sec | 1.88X (NVLink), 9.38X (vs PCI-E)
Memory Bandwidth | 900GB/sec | 900GB/sec | 720GB/sec | 720GB/sec | 1.25X
CUDA Cores (Tensor Cores) | 5120 (640) | 5120 (640) | 3584 | 3584 |

Performance of Tesla GPUs, Generation to Generation

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium for the NVLink-enabled SXM2 form factor GPUs. We're excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:

Performance Summary

Increases in relative performance are widely workload dependent. But early testing demonstrates HPC performance advancing approximately 50%, in just a 12-month period.
[Figure: Tesla V100 HPC Performance]
If you haven’t made the jump to Tesla P100 yet, Tesla V100 is an even more compelling proposition.

For Deep Learning, Tesla V100 delivers a massive leap in performance. This extends to training time, and it also brings unique advancements for inference as well.
[Figure: Deep Learning Performance Summary - Tesla V100]

Enhancements for Every Workload

Now that we’ve learned a bit about each GPU, let’s review the breakthrough advancements for Tesla V100.

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

NVLink helps you to transcend the bottlenecks in PCI-Express based systems and dramatically improve the data flow to GPU accelerators.

The total data pipe for each Tesla V100 with NVLink GPUs is now 300GB/sec of bidirectional bandwidth—that’s nearly 10X the data flow of PCI-E x16 3.0 interfaces!

Bandwidth

NVIDIA has enhanced the bandwidth of each individual NVLink brick by 25%: from 40GB/sec (20+20GB/sec in each direction) to 50GB/sec (25+25GB/sec) of bidirectional bandwidth.

This improved signaling helps carry the load of data intensive applications.

Number of Bricks

But enhanced NVLink isn't just about simple signaling improvements. Point-to-point NVLink connections are divided into “bricks,” or links. Each brick delivers 50GB/sec of bidirectional bandwidth.

From 4 bricks to 6

Over and above the signaling improvements, Tesla V100 with NVLink increases the number of NVLink “bricks” that are embedded into each GPU: from 4 bricks to 6.

This 50% increase in brick quantity delivers a large increase in bandwidth for the world's most data intensive workloads. It also allows for a more diverse set of system designs and configurations.
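
The arithmetic behind the aggregate numbers is straightforward; a quick check:

# Aggregate NVLink bandwidth, from the per-brick figures above
# (all values are bidirectional GB/s).
p100_total = 4 * 40   # Tesla P100: 4 bricks x 40GB/s = 160GB/s
v100_total = 6 * 50   # Tesla V100: 6 bricks x 50GB/s = 300GB/s
print(p100_total, v100_total)   # 160 300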

The NVLink Bank Account

NVIDIA NVLink technology continues to be a breakthrough for both HPC and Deep Learning workloads. But as with previous generations of NVLink, there’s a design choice related to a simple question:

Where do you want to spend your NVLink bandwidth?

Think about NVLink bricks as a “spending” or “bank account.” Each NVLink system design strikes a different balance in where it “spends the funds.” You may wish to:

  • Spend links on GPU:GPU communication
  • Focus on increasing the number of GPUs
  • Broaden CPU:GPU bandwidth to overcome the PCI-E bottleneck
  • Create a balanced system, or prioritize design choices solely for a single workload

There are good reasons for each, or combinations, of these choices. DGX-1V, NumberSmasher 1U GPU Server with NVLink, and future IBM Power Systems products all set different priorities.

We’ll dive more deeply into these choices in a future post.

Net-net: the NVLink bandwidth increases from 160GB/sec to 300GB/sec (bidirectional), and a number of new, diverse HPC and AI hardware designs are now possible.

Programming Improvements

Tesla V100 and CUDA 9 bring a host of improvements for GPU programmers. These include:

  • Cooperative Groups
  • A new L1 cache + shared memory design that simplifies programming
  • A new SIMT model that relieves the need to program to fit 32-thread warps

We won't explore these in detail in this post, but we encourage you to consult NVIDIA's CUDA 9 documentation for more.

What does Tesla V100 mean for me?

Tesla V100 GPUs mean a massive change for many workloads. But each workload will see different impacts. Here’s a short summary of what we see:

  1. An increase in performance on HPC applications (on paper FLOPS increase of 50%, diverse real-world impacts)
  2. A massive leap for Deep Learning Training
  3. 1 GPU, many Deep Learning workloads
  4. New system designs, better tuned to your applications
  5. Radical new, and radically simpler, programming paradigms (coherence)

DeepChem – a Deep Learning Framework for Drug Discovery
Fri, 28 Apr 2017 | https://www.microway.com/hpc-tech-tips/deepchem-deep-learning-framework-for-drug-discovery/

A powerful new open source deep learning framework for drug discovery is now available for public download on GitHub. This new framework, called DeepChem, is Python-based, and offers a feature-rich set of functionality for applying deep learning to problems in drug discovery and cheminformatics. Machine learning frameworks such as scikit-learn have previously been applied to cheminformatics, but DeepChem is the first to accelerate computation with NVIDIA GPUs.

The framework uses Google TensorFlow, along with scikit-learn, for expressing neural networks for deep learning. It also makes use of the RDKit Python framework for performing more basic operations on molecular data, such as converting SMILES strings into molecular graphs. The framework is now in the alpha stage, at version 0.1. As the framework develops, it will move toward implementing more models in TensorFlow, which use GPUs for training and inference. This new open source framework is poised to become an accelerating factor for innovation in drug discovery across industry and academia.

Another unique aspect of DeepChem is that it has incorporated a large amount of publicly-available chemical assay datasets, which are described in Table 1.

DeepChem Assay Datasets

Dataset | Category | Description | Classification Type | Compounds
QM7 | Quantum Mechanics | orbital energies, atomization energies | Regression | 7,165
QM7b | Quantum Mechanics | orbital energies | Regression | 7,211
ESOL | Physical Chemistry | solubility | Regression | 1,128
FreeSolv | Physical Chemistry | solvation energy | Regression | 643
PCBA | Biophysics | bioactivity | Classification | 439,863
MUV | Biophysics | bioactivity | Classification | 93,127
HIV | Biophysics | bioactivity | Classification | 41,913
PDBBind | Biophysics | binding activity | Regression | 11,908
Tox21 | Physiology | toxicity | Classification | 8,014
ToxCast | Physiology | toxicity | Classification | 8,615
SIDER | Physiology | side reactions | Classification | 1,427
ClinTox | Physiology | clinical toxicity | Classification | 1,491

Table 1: The current v0.1 DeepChem framework includes the datasets in this table, along with others which will be added in future versions.

Metrics

The squared Pearson correlation coefficient is used to quantify the quality of a model trained on any of these regression datasets. Models trained on classification datasets have their predictive quality measured by the area under the curve (AUC) of receiver operating characteristic (ROC) curves (AUC-ROC). Some datasets have more than one task, in which case the mean over all tasks is reported by the framework.
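
Both metrics are straightforward to compute outside the framework as well. Here is a minimal sketch using SciPy and scikit-learn on toy data (not DeepChem output):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Regression quality: squared Pearson correlation coefficient (toy values).
y_true = np.array([0.8, 1.4, 2.3, 3.1])
y_pred = np.array([1.0, 1.3, 2.6, 2.9])
r, _ = pearsonr(y_true, y_pred)
print("Pearson R^2:", r ** 2)

# Classification quality: ROC-AUC per task, then the mean over all tasks.
labels = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])                  # two toy tasks
scores = np.array([[0.2, 0.9], [0.7, 0.3], [0.8, 0.6], [0.1, 0.4]])
aucs = [roc_auc_score(labels[:, t], scores[:, t]) for t in range(labels.shape[1])]
print("Mean ROC-AUC:", float(np.mean(aucs)))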

Data Splitting

DeepChem uses a number of methods for randomizing or reordering datasets so that models can be trained and validated on sets which are more thoroughly randomized. These methods are summarized in Table 2.

DeepChem Dataset Splitting Methods

Split Type | Use Cases
Index Split | default index is sufficient, as long as it contains no built-in bias
Random Split | if there is some bias to the default index
Scaffold Split | if chemical properties of the dataset will depend on molecular scaffold
Stratified Random Split | where one needs to ensure that each dataset split contains a full range of some real-valued property

Table 2: Various methods are available for splitting the dataset in order to avoid sampling bias.

Featurizations

DeepChem offers a number of featurization methods, summarized in Table 3. SMILES strings are unique representations of molecules, and can themselves be used as a molecular feature. The use of SMILES strings as features has been explored in recent work. SMILES featurization will likely become a part of future versions of DeepChem.

Most machine learning methods, however, require more feature information than can be extracted from a SMILES string alone.
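
As a concrete example of fingerprint featurization, here is a minimal sketch using RDKit directly (DeepChem builds on calls like this); RDKit's Morgan fingerprint is an implementation of the ECFP family described below, and the SMILES string is just a sample molecule:

from rdkit import Chem
from rdkit.Chem import AllChem

# SMILES string -> ECFP-style circular fingerprint (Morgan, radius 2 ~ ECFP4).
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")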

DeepChem Featurizers

Featurizer | Use Cases
Extended-Connectivity Fingerprints (ECFP) | for molecular datasets not containing large numbers of non-bonded interactions
Graph Convolutions | Like ECFP, graph convolution produces granular representations of molecular topology. Instead of applying fixed hash functions, as with ECFP, graph convolution uses a set of parameters which can be learned by training a neural network associated with a molecular graph structure.
Coulomb Matrix | Coulomb matrix featurization captures information about the nuclear charge state and internuclear electric repulsion. This featurization is less granular than ECFP or graph convolutions, and may perform better where intramolecular electrical potential plays an important role in chemical activity.
Grid Featurization | for datasets containing molecules interacting through non-bonded forces, such as docked protein-ligand complexes

Table 3: Featurization methods offered by DeepChem.

Supported Models

Supported Models as of v0.1

Model Type | Possible Use Case
Logistic Regression | classification (predicting category membership)
Random Forest | classification or regression
Multitask Network | if various prediction types are required, a multitask network is a good choice; for example, a continuous real-valued prediction along with one or more categorical predictions as outcomes
Bypass Network | classification and regression
Graph Convolution Model | same as Multitask Networks

Table 4: Model types supported by DeepChem 0.1

A Glimpse into the Tox21 Dataset and Deep Learning

The Toxicology in the 21st Century (Tox21) research initiative led to the creation of a public dataset which includes measurements of the activation of stress response and nuclear receptor response pathways by 8,014 distinct molecules. Twelve response pathways were observed in total, each having some association with toxicity. Table 5 summarizes the pathways investigated in the study.

Tox21 Assay Descriptions

Biological Assay | Description
NR-AR | Nuclear Receptor Panel, Androgen Receptor
NR-AR-LBD | Nuclear Receptor Panel, Androgen Receptor, luciferase
NR-AhR | Nuclear Receptor Panel, aryl hydrocarbon receptor
NR-Aromatase | Nuclear Receptor Panel, aromatase
NR-ER | Nuclear Receptor Panel, Estrogen Receptor alpha
NR-ER-LBD | Nuclear Receptor Panel, Estrogen Receptor alpha, luciferase
NR-PPAR-gamma | Nuclear Receptor Panel, peroxisome proliferator-activated receptor gamma
SR-ARE | Stress Response Panel, nuclear factor (erythroid-derived 2)-like 2 antioxidant responsive element
SR-ATAD5 | Stress Response Panel, genotoxicity indicated by ATAD5
SR-HSE | Stress Response Panel, heat shock factor response element
SR-MMP | Stress Response Panel, mitochondrial membrane potential
SR-p53 | Stress Response Panel, DNA damage p53 pathway

Table 5: Biological pathway responses investigated in the Tox21 Machine Learning Challenge.

We used the Tox21 dataset to make predictions on molecular toxicity in DeepChem using the variations shown in Table 6.

Model Construction Parameter Variations Used

Parameter | Variation 1 | Variation 2
Dataset Splitting | Index | Scaffold
Featurization | ECFP | Molecular Graph Convolution

Table 6: Model construction parameter variations used in generating our predictions, as shown in Figure 1.

A .csv file containing SMILES strings for 8,014 molecules was used to featurize each molecule using either ECFP or molecular graph convolution. IUPAC names for each molecule were queried from NIH Cactus, and toxicity predictions were made, using a trained model, on a set of nine molecules randomly selected from the full Tox21 dataset. Nine results showing molecular structure (rendered by RDKit), IUPAC names, and predicted toxicity scores across all 12 biochemical response pathways described in Table 5 are shown in Figure 1.

Figure 1: Tox21 predictions for nine randomly selected molecules from the Tox21 dataset
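
For readers who want to try a similar workflow, the sketch below follows the general shape of a DeepChem Tox21 run. Note that it is written against the dc.molnet convenience loaders of later DeepChem releases, not the v0.1 API described in this post, so treat the exact names as assumptions:

import numpy as np
import deepchem as dc

# Load Tox21 with ECFP featurization; returns task names, train/valid/test
# splits, and data transformers (loader API from later DeepChem releases).
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="ECFP")

# Train a multitask classifier: one output head per response pathway.
model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
model.fit(train, nb_epoch=10)

# Report mean ROC-AUC over all twelve tasks on the validation split.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(valid, [metric], transformers))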

Expect more from DeepChem in the Future

The DeepChem framework is undergoing rapid development, and is currently at the 0.1 release version. New models and features will be added, along with more datasets, in the future. You can download the DeepChem framework from GitHub. There is also a website for framework documentation at deepchem.io.

Microway offers DeepChem pre-installed on our line of WhisperStation products for Deep Learning. Researchers interested in exploring deep learning applications with chemistry and drug discovery can browse our line of WhisperStation products.

References

1.) Subramanian, Govindan, et al. “Computational Modeling of β-secretase 1 (BACE-1) Inhibitors using Ligand Based Approaches.” Journal of Chemical Information and Modeling 56.10 (2016): 1936-1949.
2.) Altae-Tran, Han, et al. “Low Data Drug Discovery with One-shot Learning.” arXiv preprint arXiv:1611.03199 (2016).
3.) Wu, Zhenqin, et al. “MoleculeNet: A Benchmark for Molecular Machine Learning.” arXiv preprint arXiv:1703.00564 (2017).
4.) Gomes, Joseph, et al. “Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity.” arXiv preprint arXiv:1703.10603 (2017).
5.) Gómez-Bombarelli, Rafael, et al. “Automatic chemical design using a data-driven continuous representation of molecules.” arXiv preprint arXiv:1610.02415 (2016).
6.) Mayr, Andreas, et al. “DeepTox: toxicity prediction using deep learning.” Frontiers in Environmental Science 3 (2016): 80.

NVIDIA Tesla P40 GPU Accelerator (Pascal GP102) Up Close
Tue, 07 Feb 2017 | https://www.microway.com/hpc-tech-tips/nvidia-tesla-p40-gpu-accelerator-pascal-gp102-up-close/

As NVIDIA's GPUs become increasingly vital to the fields of AI and intelligent machines, NVIDIA has produced GPU models specifically targeted at these applications. The new Tesla P40 GPU is NVIDIA's premier product for deep learning deployments. It is specifically designed for high-speed inference workloads, which means running data through pre-trained neural networks. However, it also offers significant processing performance for projects which do not require 64-bit double-precision floating point capability (many neural networks can be trained using the 32-bit single-precision floating point on the Tesla P40). For those cases, these GPUs can be used to accelerate both the neural network training and the inference.

Highlights of the new Tesla P40 GPU include:

  • Up to 12 TFLOPS single-precision floating-point performance
  • Support for INT8 operations with up to 47 TOPS (ideal for high-speed/high-volume inference)
  • 24GB of GDDR5 GPU memory, with bandwidths up to 346GB/s

PCI-Express Data Transfer Speeds

The Tesla P40 GPUs use the same generation 3.0 PCI-E connectivity as other recent GPUs (such as the Maxwell generation), so you should expect to achieve similar transfer speeds. As shown below, we’re able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node4 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11842.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12899.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			240357.5

Result = PASS
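
For context, those measurements can be compared against the theoretical ceiling of a PCI-E 3.0 x16 link. A rough sketch (the ~985MB/s-per-lane figure is the usual rule of thumb after 128b/130b encoding):

# Rough PCI-E 3.0 x16 efficiency estimate for the transfers above.
theoretical = 16 * 0.985   # ~15.8 GB/s each direction for a x16 link

for label, measured_gb in [("host->device", 11.84), ("device->host", 12.90)]:
    print(f"{label}: {measured_gb / theoretical:.0%} of theoretical peak")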

Technical Details of the Tesla P40 GPU

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node4 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Feb  6 12:30:52 2017
Driver Version                      : 367.57

Attached GPUs                       : 4
GPU 0000:02:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0324416xxxxxx
    GPU UUID                        : GPU-16254654-0bd3-8d18-e8fe-d53865xxxxxx
    Minor Number                    : 0
    VBIOS Version                   : 86.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2G610-0000-000
    Inforom Version
        Image Version               : G610.0200.00.03
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B3810DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x11D910DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 22912 MiB
        Used                        : 0 MiB
        Free                        : 22912 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 2 MiB
        Free                        : 32766 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 27 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 92 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 12.34 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 544 MHz
        SM                          : 544 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1531 MHz
        Memory                      : 3615 MHz
    Default Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Max Clocks
        Graphics                    : 1531 MHz
        SM                          : 1531 MHz
        Memory                      : 3615 MHz
        Video                       : 1379 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support large numbers of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power consumption limits and monitor the clock speeds (including explanations for any reasons the clocks are running at a lower speed). The list below shows the available clock speeds for the Tesla P40 GPU:

[root@node4 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Feb  6 12:31:56 2017
Driver Version                      : 367.57

Attached GPUs                       : 4
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 3615 MHz
            Graphics                : 1531 MHz
            Graphics                : 1518 MHz
            Graphics                : 1506 MHz
            Graphics                : 1493 MHz
            Graphics                : 1480 MHz
            Graphics                : 1468 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz

NVIDIA deviceQuery on Tesla P40 GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P40 GPU accelerators. Take note of the new Compute Capability 6.1, which is what you’ll want to target if you’re compiling your own CUDA code.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P40"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 22913 MBytes (24025956352 bytes)
  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             3615 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P40, Device1 = Tesla P40, Device2 = Tesla P40, Device3 = Tesla P40
Result = PASS
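
When building your own CUDA code for these boards, pass that compute capability to nvcc. A minimal sketch driving the compiler from Python (kernel.cu is a placeholder source file; invoking nvcc directly or via a build system works equally well):

import subprocess

# Compile for the Tesla P40's compute capability 6.1 (Pascal GP102).
# "kernel.cu" is a placeholder for your CUDA source file.
subprocess.run([
    "nvcc", "-gencode", "arch=compute_61,code=sm_61",
    "-o", "kernel", "kernel.cu",
], check=True)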

Additional Information on Tesla P40 GPUs

To learn more about the NVIDIA “Pascal” GPU architecture and to compare Tesla P40 with other models in the Tesla product line, read our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

If you're an existing GPU user considering a new deployment, review our Tesla GPU clusters page and our list of GPU servers.

Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers
Thu, 26 Jan 2017 | https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/

The new NVIDIA Tesla P100 GPUs are available with both PCI-Express and NVLink connectivity. How do these two types of connectivity compare? This post provides a rundown of NVLink vs PCI-E and explores the benefits of NVIDIA’s new NVLink technology.

[Photo: NVIDIA Tesla P100 NVLink GPUs in an OpenPOWER server]

Considering the variety of options for Tesla P100 GPUs, you may wish to review our other recent posts on these GPUs.

Primary considerations when comparing NVLink vs PCI-E

On systems with x86 CPUs (such as Intel Xeon), the connectivity to the GPU is only through PCI-Express (although the GPUs connect to each other through NVLink). On systems with POWER8 CPUs, the connectivity to the GPU is through NVLink (in addition to the NVLink between GPUs).

Nevertheless, the performance characteristics of the GPU itself (GPU cores, GPU memory, etc) do not vary. The Tesla P100 GPU itself will be performing at the same level. It’s the data flow and total system throughput that will determine the final performance for your workload. To review:

  • Full NVLink connectivity is only available with IBM POWER8 CPUs (not x86 CPUs)
  • GPU-to-GPU NVLink connectivity (without CPU-to-GPU) is available with x86 CPUs
  • Internal performance of an NVIDIA Tesla P100 SXM2 GPU will not vary between x86 and POWER8

With that in mind, let’s compare their throughput.

Tesla P100 with NVLink on OpenPOWER

The NVLink connections on Tesla P100 GPUs provide a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, those links are made up of several bricks, which can be split up to connect to a number of other devices. For example, one GPU might dedicate 40GB/s for a link to a CPU and 40GB/s for a link to a nearby GPU.

Device <-> Device NVLink Performance

Below is the output from NVIDIA's GPU peer-to-peer (P2P) utility, which is included with CUDA 8.0. The results summarize the throughput (in gigabytes per second) and latency (in microseconds) when sending messages between pairs of Tesla P100 GPUs in our OpenPOWER system.

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided in half. One link goes to a POWER8 CPU and one link goes to the adjacent P100 GPU (see diagram below).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:2
Device: 1, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:3
Device: 2, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:a
Device: 3, Tesla P100-SXM2-16GB, pciBusID: 1, pciDeviceID: 0, pciDomainID:b

...

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 457.93  35.30  20.37  20.40
     1  35.30 454.78  20.16  20.14
     2  20.19  20.16 454.56  35.29
     3  18.36  18.42  35.29 454.07

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   4.99   7.92  15.56  15.43
     1   8.06   5.00  15.40  15.40
     2  15.47  15.52   5.04   8.07
     3  15.43  15.49   8.04   4.97

As the results show, each 40GB/s Tesla P100 NVLink will provide ~35GB/s in practice. Communications between GPUs on a remote CPU offer throughput of ~20GB/s. Latency between GPUs is 8~16 microseconds. The results were gathered on our 2U OpenPOWER GPU server with Tesla P100 NVLink GPUs, which is available to benchmark in our Test Drive cluster. The architectural design of this particular platform is:

[Block diagram of the 2U Microway OpenPOWER GPU server with Tesla P100 NVLink GPUs]
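
Reading the bandwidth matrix against the link budget gives a sense of NVLink's efficiency. A quick sketch using the figures above:

# Achieved vs theoretical NVLink throughput, from the P2P matrix above.
link_peak = 40.0      # GB/s: one NVLink connection on Tesla P100
direct_pair = 35.3    # GB/s measured between directly-linked GPUs
remote_pair = 20.2    # GB/s measured between GPUs on different CPUs

print(f"direct link: {direct_pair / link_peak:.0%} of peak")   # ~88%
print(f"remote path: {remote_pair:.1f} GB/s")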

Device <-> Device PCI-E Performance

A similar test, run on GPUs connected by standard PCI-Express, will result in the following performance:

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 452.19  10.19  10.73  10.74
     1  10.19 450.04  10.76  10.75
     2  10.91  10.90 450.94  10.21
     3  10.90  10.91  10.18 450.95

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.22   7.86  16.90  17.05
     1   7.85   3.21  17.08  17.22
     2  16.32  16.37   3.07   7.85
     3  16.26  16.35   7.84   3.07

The latencies between GPUs are about the same (although latency is higher when traveling to GPUs attached to the remote CPU). Transfer bandwidth, however, is significantly higher for NVLink vs PCI-E (two to three times higher). This increased throughput gives NVLink an advantage for fine-grained applications and others which send data between GPUs.

NVLink vs PCI-E: Host <-> Device Performance

CPU-to-GPU data transfers occur whenever data must be transferred into or out of the GPU. These are typically called host-to-device and device-to-host transfers. Traditional systems with x86 CPUs are only able to communicate with the GPUs over PCI-Express, which provides lower throughput. Our OpenPOWER systems provide full NVLink connectivity to the GPUs. Here’s the achieved performance:

Host <-> Device across NVLink

[root@openpower8 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			33236.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			32322.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			448515.9

Result = PASS
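Note the --memory=pinned flag above: page-locked host memory is what lets the GPU's DMA engines run at full speed. A minimal sketch of this style of measurement, using the same 32MB transfer size as the output above:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;    // 33554432 bytes, as in the output above

    // Pinned (page-locked) host memory enables full-speed DMA transfers
    void *hostBuf, *devBuf;
    cudaMallocHost(&hostBuf, bytes);
    cudaMalloc(&devBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to Device: %.1f MB/s\n", bytes / (ms * 1e3));

    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}

Allocating the host buffer with malloc instead of cudaMallocHost forces an extra staging copy and typically cuts the measured bandwidth substantially.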

Host <-> Device across PCI-E

A similar test, run on an x86 system with GPUs connected by PCI-Express, will result in the following performance:

...

 Device 0: Tesla P100-PCIE-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11658.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12882.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			446125.2

Just as with the GPU-to-GPU transfers, we see that NVLink enables much faster performance. Remember that this type of transfer occurs whenever you load a dataset into main memory and then process some of that data on the GPUs. These transfer times are often a factor in overall application performance, so a 3X speedup is welcome. This increased performance may also enable applications which were previously too data-movement-intensive.

Finally, consider that NVIDIA CUDA 8.0 (together with the Tesla P100 GPUs) provides full support for Unified Memory. You will be able to load datasets larger than GPU memory and let the system automatically manage data movement. In other words, the size of GPU memory no longer limits the size of your jobs. On such runs, having a wider pipe between CPU and GPU memory is of immense importance.
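As an illustration, here is a minimal sketch of a managed allocation larger than the P100's 16GB; the size and kernel are illustrative, and it assumes the host has enough RAM to back the allocation:

#include <cuda_runtime.h>

__global__ void scale(double *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    const size_t n = 3ULL << 30;    // 3Gi doubles = 24GB, more than the GPU holds
    double *data;
    cudaMallocManaged(&data, n * sizeof(double));

    for (size_t i = 0; i < n; i++)  // pages are first touched on the CPU
        data[i] = 1.0;

    // Pages migrate to the GPU on demand as the kernel faults them in
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}

On earlier GPU generations, a managed allocation beyond device memory would simply fail; on the P100, the driver pages data in and out as needed.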

NVIDIA deviceQuery on OpenPOWER server with Tesla P100 GPUs and NVLink

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 GPU accelerators with NVLink. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code. Also note that on this platform there are three DMA copy engines per GPU.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16281 MBytes (17071669248 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   2 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

  [...]

> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : No
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : No
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS
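Those copy engines, reported above as “Concurrent copy and kernel execution: Yes with 3 copy engine(s)”, are what allow uploads, downloads, and kernel execution to proceed simultaneously. A sketch of the usual multi-stream overlap pattern (the chunk size and kernel are illustrative):

#include <cuda_runtime.h>

__global__ void process(float *d, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const size_t n = 1 << 24;            // 16M floats per chunk (illustrative)
    const size_t bytes = n * sizeof(float);

    float *hIn, *hOut, *dBuf[2];
    cudaMallocHost(&hIn, 2 * bytes);     // async copies require pinned host memory
    cudaMallocHost(&hOut, 2 * bytes);

    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) {
        cudaMalloc(&dBuf[s], bytes);
        cudaStreamCreate(&stream[s]);
    }

    // With separate copy engines, the upload in one stream overlaps
    // the kernel and the download in the other
    for (int s = 0; s < 2; s++) {
        cudaMemcpyAsync(dBuf[s], hIn + s * n, bytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(dBuf[s], n);
        cudaMemcpyAsync(hOut + s * n, dBuf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    return 0;
}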

How to move forward – GPU systems with Host-to-Device NVLink

Because Host-to-Device NVLink requires support from the CPU itself, there is only one server on the market with both Host-to-Device and Device-to-Device NVLink connectivity. This system, leveraging IBM’s POWER8 CPUs and innovation from the OpenPOWER Foundation (including NVIDIA and Mellanox), began shipping in fall 2016. Please contact us to learn more, or read about this OpenPOWER server. Academic discounts are available.

To learn more about the available NVIDIA Tesla “Pascal” GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

The post Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers appeared first on Microway.

NVIDIA Tesla P100 NVLink 16GB GPU Accelerator (Pascal GP100 SXM2) Up Close https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-nvlink-16gb-gpu-accelerator-pascal-gp100-sxm2-close/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-nvlink-16gb-gpu-accelerator-pascal-gp100-sxm2-close/#comments Wed, 18 Jan 2017 13:10:46 +0000 https://www.microway.com/?p=8398 The NVIDIA Tesla P100 NVLink GPUs are a big advancement. For the first time, the GPU is stepping outside the traditional “add in card” design. No longer tied to the fixed specifications of PCI-Express cards, NVIDIA’s engineers have designed a new form factor that best suits the needs of the GPU. With their SXM2 design, […]

The NVIDIA Tesla P100 NVLink GPUs are a big advancement. For the first time, the GPU is stepping outside the traditional “add in card” design. No longer tied to the fixed specifications of PCI-Express cards, NVIDIA’s engineers have designed a new form factor that best suits the needs of the GPU. With their SXM2 design, NVIDIA can run GPUs to their full potential.

One of the biggest changes this allows is the NVLink interconnect, which allows GPUs to operate beyond the restrictions of the PCI-Express bus. Instead, the GPUs communicate with one another over this high-speed link. Additionally, these new “Pascal” architecture GPUs bring improvements including higher performance, faster connectivity, and more flexibility for users & programmers.

Close-Up Photo of the NVIDIA Tesla P100 NVLink GPU

There is variety in the new line-up of GPU products. For the Tesla P100 GPU model, there are three separate paths to be considered: the Tesla P100 PCI-E 12GB, the Tesla P100 PCI-E 16GB, and the NVLink-connected Tesla P100 SXM2 16GB.

Highlights of the new Tesla P100 NVLink GPUs include:

  • Up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance
  • 16GB of on-die HBM2 CoWoS GPU memory, with bandwidths up to 732GB/s
  • 80GB/s NVLink connectivity boosts bandwidth between the Tesla P100 GPUs
  • High-speed, on-die GPU memory provides a 3X improvement over older GPUs
  • Pascal Unified Memory allows applications to directly access the memory of all GPUs and all of system memory

Improved Data Transfer Speeds

The NVLink connection on Tesla P100 GPUs has a theoretical peak throughput of 80GB/s (160GB/s bi-directional). However, that high-speed connectivity is only for data transfers directly between the GPUs; the GPUs still communicate via PCI-Express (x16 generation 3.0) when transferring data to and from the host.

Device <-> Device Tesla P100 NVLink Performance

Below is a section of output from NVIDIA’s GPU peer-to-peer (P2P) utility, which is included with CUDA 8.0. The results summarize the throughput (in gigabytes per second) and latency (in microseconds) when sending messages between any pair of GPUs.

It’s important to understand that the test below was run on a system with four Tesla GPUs. On each GPU, the available 80GB/s bandwidth was divided up so that connections could be made to the three other GPUs. The links are divided such that each GPU has two 20GB/s links and one 40GB/s link (see diagram below).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla P100-SXM2-16GB, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla P100-SXM2-16GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla P100-SXM2-16GB, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla P100-SXM2-16GB, pciBusID: 85, pciDeviceID: 0, pciDomainID:0

...

Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 449.69  18.45  18.45  36.72
     1  18.44 450.92  36.70  18.44
     2  18.45  36.70 450.37  18.44
     3  36.71  18.44  18.44 447.34

...

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.66   9.25   9.31   9.67
     1   9.49   3.65  10.04   9.05
     2   9.85  10.13   3.13   9.79
     3  10.06  11.41   9.97   3.54

As the results show, a 20GB/s Tesla P100 NVLink will provide ~18GB/s in practice. A 40GB/s Tesla P100 NVLink will provide ~36GB/s. Latency between GPUs is 9~10 microseconds. The results were gathered on our 1U NumberSmasher Server with four Tesla P100 NVLink GPUs, which is also available in our Test Drive cluster. The architectural design of this particular platform is:

Block diagram of the NumberSmasher 1U NVLink server with four Tesla P100 GPUs (SYS-1028GQ-TXR)

Host <-> Device Performance

Transfers between system memory and the GPU are still via PCI-Express and will perform similarly to previous-generation “Kepler” and “Maxwell” GPUs. With Tesla P100, you will be able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node2 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11463.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12868.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			446271.0

Result = PASS

Technical Details

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla P100 GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node2 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Tue Dec  6 16:30:58 2016
Driver Version                      : 367.48

Attached GPUs                       : 1
GPU 0000:06:00.0
    Product Name                    : Tesla P100-SXM2-16GB
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032311609xxxx
    GPU UUID                        : GPU-70ba5857-9613-1213-c5f5-3b201233xxxx
    Minor Number                    : 0
    VBIOS Version                   : 86.00.26.00.02
    MultiGPU Board                  : No
    Board ID                        : 0x600
    GPU Part Number                 : 900-2H403-0000-000
    Inforom Version
        Image Version               : H403.0201.00.04
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x06
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x15F910DE
        Bus Id                      : 0000:06:00.0
        Sub System Id               : 0x116B10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 16276 MiB
        Used                        : 0 MiB
        Free                        : 16276 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 40 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 34.89 W
        Power Limit                 : 300.00 W
        Default Power Limit         : 300.00 W
        Enforced Power Limit        : 300.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 405 MHz
        SM                          : 405 MHz
        Memory                      : 715 MHz
        Video                       : 835 MHz
    Applications Clocks
        Graphics                    : 1480 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1328 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1480 MHz
        SM                          : 1480 MHz
        Memory                      : 715 MHz
        Video                       : 1480 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None
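The ECC mode shown above can also be read programmatically through NVML, the library underlying nvidia-smi. A small sketch, assuming the NVML development headers are installed and the program is linked with -lnvidia-ml:

#include <stdio.h>
#include <nvml.h>

int main() {
    nvmlDevice_t dev;
    nvmlEnableState_t current, pending;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Matches the "Ecc Mode" section of the nvidia-smi output
    if (nvmlDeviceGetEccMode(dev, &current, &pending) == NVML_SUCCESS)
        printf("ECC current: %s, pending: %s\n",
               current == NVML_FEATURE_ENABLED ? "Enabled" : "Disabled",
               pending == NVML_FEATURE_ENABLED ? "Enabled" : "Disabled");

    nvmlShutdown();
    return 0;
}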

The latest NVIDIA GPU architectures support a large number of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power-consumption limits and monitor the clock speeds (including the throttle reasons reported whenever the clocks run at a lower speed).

[root@node2 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Tue Dec  6 16:39:20 2016
Driver Version                      : 367.48

Attached GPUs                       : 4
GPU 0000:06:00.0
    Supported Clocks
        Memory                      : 715 MHz
            Graphics                : 1480 MHz
            Graphics                : 1468 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz
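From the command line, administrators typically pin the application clocks with nvidia-smi -ac 715,1480 and adjust the power cap with nvidia-smi -pl <watts>. The same controls are available through NVML; a minimal sketch (it requires root privileges, the NVML headers, and -lnvidia-ml, and the clock pair comes from the supported list above):

#include <stdio.h>
#include <nvml.h>

int main() {
    nvmlDevice_t dev;
    nvmlReturn_t rc;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Pin application clocks to the fastest supported pair: 715 MHz memory, 1480 MHz graphics
    rc = nvmlDeviceSetApplicationsClocks(dev, 715, 1480);
    if (rc != NVML_SUCCESS)
        printf("Setting clocks failed: %s\n", nvmlErrorString(rc));

    // Lower the enforced power limit to 250 W (the API takes milliwatts)
    rc = nvmlDeviceSetPowerManagementLimit(dev, 250000);
    if (rc != NVML_SUCCESS)
        printf("Setting power limit failed: %s\n", nvmlErrorString(rc));

    nvmlShutdown();
    return 0;
}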

NVIDIA deviceQuery on Tesla P100 NVLink 16GB GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 NVLink GPU accelerators. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            405 MHz (0.41 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

  [...]

> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU0) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU1) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU2) -> Tesla P100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla P100-SXM2-16GB (GPU3) -> Tesla P100-SXM2-16GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla P100-SXM2-16GB, Device1 = Tesla P100-SXM2-16GB, Device2 = Tesla P100-SXM2-16GB, Device3 = Tesla P100-SXM2-16GB
Result = PASS
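If you only need the peer-access topology (the Yes/No matrix above) without the rest of the deviceQuery output, a few lines of CUDA suffice:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // Print the same peer-access matrix that deviceQuery reports
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);
            printf("Peer access from GPU%d -> GPU%d : %s\n", a, b, ok ? "Yes" : "No");
        }
    return 0;
}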

Additional Information on Tesla P100 NVLink GPUs

To learn more about the available P100 GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

Due to their novel design, Tesla P100 NVLink GPUs cannot be installed into existing GPU systems. Platforms with the NVLink-connected SXM2 sockets are required. For several options, have a look at our list of P100 GPU-accelerated systems. You may also wish to review our post on PCI-Express connected Tesla P100 GPUs.

Photo of the back side of the NVIDIA Tesla P100 NVLink GPU

The post NVIDIA Tesla P100 NVLink 16GB GPU Accelerator (Pascal GP100 SXM2) Up Close appeared first on Microway.

NVIDIA Tesla P100 PCI-E 16GB GPU Accelerator (Pascal GP100) Up Close https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-pci-e-16gb-gpu-accelerator-pascal-gp100-close/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-pci-e-16gb-gpu-accelerator-pascal-gp100-close/#respond Wed, 28 Dec 2016 22:12:07 +0000 https://www.microway.com/?p=8376 NVIDIA’s new Tesla P100 PCI-E GPU is a big step up for HPC users, and for GPU users in general. Although other workloads have been leveraging the newer “Maxwell” architecture, HPC applications have been using “Kepler” GPUs for a couple years. The new GPUs bring many improvements, including higher performance, faster connectivity, and more flexibility […]

NVIDIA’s new Tesla P100 PCI-E GPU is a big step up for HPC users, and for GPU users in general. Although other workloads have been leveraging the newer “Maxwell” architecture, HPC applications have been using “Kepler” GPUs for a couple years. The new GPUs bring many improvements, including higher performance, faster connectivity, and more flexibility for users & programmers.

Close-up photo of the NVIDIA Tesla P100 PCI-E GPU

Because GPUs have proven themselves so well, there are now GPUs optimized for particular applications. For example, a video transcoding project would be unlikely to use the same GPU as a computational chemistry project. However, the Tesla P100 serves as the best all-round choice for those who need to support a variety of applications. With that in mind, there are three separate paths to be considered: the Tesla P100 PCI-E 12GB, the Tesla P100 PCI-E 16GB, and the NVLink-connected Tesla P100 SXM2 16GB.

Highlights of the new Tesla P100 PCI-E GPUs include:

  • Up to 4.7 TFLOPS double- and 9.3 TFLOPS single-precision floating-point performance
  • 16GB of on-die HBM2 CoWoS GPU memory, with bandwidths up to 732GB/s
  • High-speed, on-die GPU memory provides a 3X improvement over older GPUs
  • Pascal Unified Memory allows applications to directly access the memory of all GPUs and all of system memory

Improved Data Transfer Speeds

Although the Tesla P100 GPU uses the same generation of PCI-E connectivity, some optimizations have been made since the Kepler generation. With P100, you will be able to achieve transfers up to ~12.8GB/s between the host and the GPU:

[root@node6 ~]# ./bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-PCIE-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11688.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12886.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			444927.2*

Result = PASS

* Note that there is also a 12GB version of the Tesla P100 PCI-E GPU – the memory operates 25% slower

Technical Details

Below are the technical details reported by nvidia-smi. Note that “Pascal” Tesla GPUs now include fully integrated memory ECC support that is always enabled (memory performance in previous generations could be improved by disabling ECC).

[root@node6 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Sep 28 11:03:51 2016
Driver Version                      : 367.44

Attached GPUs                       : 1
GPU 0000:02:00.0
    Product Name                    : Tesla P100-PCIE-16GB
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032301607xxxx
    GPU UUID                        : GPU-de136156-7f6d-ced1-869c-4dc56e09xxxx
    Minor Number                    : 1
    VBIOS Version                   : 86.00.26.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2H400-0000-000
    Inforom Version
        Image Version               : H400.0201.00.06
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x15F810DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x118F10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 16276 MiB
        Used                        : 0 MiB
        Free                        : 16276 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : N/A
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 36 C
        GPU Shutdown Temp           : 85 C
        GPU Slowdown Temp           : 82 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 26.39 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 405 MHz
        SM                          : 405 MHz
        Memory                      : 715 MHz
        Video                       : 835 MHz
    Applications Clocks
        Graphics                    : 1328 MHz
        Memory                      : 715 MHz
    Default Applications Clocks
        Graphics                    : 1189 MHz
        Memory                      : 715 MHz
    Max Clocks
        Graphics                    : 1328 MHz
        SM                          : 1328 MHz
        Memory                      : 715 MHz
        Video                       : 1328 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

The latest NVIDIA GPU architectures support a large number of clock speeds, as well as automated boosting of the clock speed (when power and thermals allow). Administrators can also set specific power-consumption limits and monitor the clock speeds (including the throttle reasons reported whenever the clocks run at a lower speed).

[root@node6 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Sep 28 11:05:36 2016
Driver Version                      : 367.44

Attached GPUs                       : 2
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 715 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz

NVIDIA deviceQuery on Tesla P100 PCI-E 16GB GPU

Each new GPU generation brings tweaks to the design. The output below, from the CUDA 8.0 SDK samples, shows additional details of the architecture and capabilities of the “Pascal” Tesla P100 PCI-E GPU accelerators. Take note of the new Compute Capability 6.0, which is what you’ll want to target if you’re compiling your own CUDA code.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla P100-PCIE-16GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            405 MHz (0.41 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
Result = PASS

Additional Information on Tesla P100 PCI-E GPUs

To learn more about the available P100 GPUs and to compare with other versions of the Tesla product line, please review our “Pascal” Tesla GPU knowledge center article.

If you’re thinking about using GPUs for the first time, please consider getting in touch with us. We’ve been implementing GPU-accelerated systems for nearly a decade and have the expertise to help make your project a success!

If you’re thinking about upgrading to P100, have a look at our list of P100 GPU-accelerated systems. You may also wish to review our posts on NVLink-connected Tesla P100 GPUs. If you’re hoping to install Tesla P100 PCI-E GPUs in your existing systems, take note that you’ll need a compatible server platform – one of our experts can help you review.

Photo of the rear side of the NVIDIA Tesla P100 PCI-E GPU

The post NVIDIA Tesla P100 PCI-E 16GB GPU Accelerator (Pascal GP100) Up Close appeared first on Microway.

NVIDIA Tesla P100 Price Analysis https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-price-analysis/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-price-analysis/#respond Mon, 01 Aug 2016 14:15:02 +0000 https://www.microway.com/?p=8060 Now that NVIDIA has launched their new Pascal GPUs, the next question is “What is the Tesla P100 Price?” Although it’s still a month or two before shipments of P100 start, the specifications and pricing of Microway’s Tesla P100 GPU-accelerated systems are available. If you’re planning a new project for delivery later this year, we’d […]

Now that NVIDIA has launched their new Pascal GPUs, the next question is “What is the Tesla P100 Price?”

Although it’s still a month or two before shipments of P100 start, the specifications and pricing of Microway’s Tesla P100 GPU-accelerated systems are available. If you’re planning a new project for delivery later this year, we’d be happy to help you get on board. These new GPUs are exceptionally powerful.

Tesla P100 Price

The table below gives a quick breakdown of the Tesla P100 GPU price, performance and cost-effectiveness:

Tesla GPU model          Price      Double-Precision Performance (FP64)   Dollars per TFLOPS
Tesla P100 PCI-E 12GB    $5,899*    4.7 TFLOPS                            $1,255
Tesla P100 PCI-E 16GB    $7,374*    4.7 TFLOPS                            $1,569
Tesla P100 SXM2 16GB     $9,428*    5.3 TFLOPS                            $1,779

* single-unit price before any applicable discounts

As one would expect, the price does increase for the higher-end models with more memory and NVLink connectivity. However, the cost-effectiveness of these new P100 GPUs is quite clear: at $5,899 for 4.7 TFLOPS, the 12GB model works out to roughly $1,255 per TFLOPS, while the previous-generation Tesla K40 and K80 GPUs come in at $2,342 and $1,807 per TFLOPS (respectively). That makes any of the Tesla P100 GPUs an excellent choice. Depending upon the comparison, HPC centers should expect the new “Pascal” Tesla GPUs to be as much as twice as cost-effective as the previous generation. Additionally, the Tesla P100 GPUs provide much faster memory and include a number of powerful new features.

You may wish to reference our Comparison of Tesla “Pascal” GPUs, which summarizes the technical improvements made in these new GPUs and compares each of the new Tesla P100 GPU models. If you’re looking to see how these GPUs will be deployed in production, read our Tesla GPU Clusters page. As always, please feel free to reach out to us if you’d like to get a better understanding of these latest HPC systems and what they can do for you.

The post NVIDIA Tesla P100 Price Analysis appeared first on Microway.
