Knowledge Center Archives - Microway
https://www.microway.com/category/knowledge-center-articles/

Common Maintenance Tasks (Clusters)
https://www.microway.com/knowledge-center-articles/test-knowledge-center-article/
Tue, 05 Mar 2024

The following items should be completed to maintain the health of your Linux cluster. For servers and workstations, please see Common Maintenance Tasks (Workstations and Servers).

Backup non-replaceable data

Remember that RAID is not a replacement for backups. If your system is stolen, hacked, or catches fire, your data will be gone forever. Automate this task or you will forget.

Compute clusters are built from a large group of computers, so there are many different places for data to hide. Make users aware of your backup policies and be certain they aren’t storing vital data on the compute nodes. Let them know which areas are scratch space (for temporary files) and which areas are regularly backed up and designed for user data.

Strongly consider keeping a backup image of the entire head node installation (including a copy of the compute node software image). Bare-metal recovery software is available if you’re not certain how to do this yourself.

As for the user data:

  • For many groups, a weekly or monthly cron job is fine. Write a script calling rsync or tar which writes the files to a separate server, NAS, or SAN. Place the script in /etc/cron.weekly/ or /etc/cron.monthly/.
  • Users with more complex requirements should look at AMANDA or Bacula
  • Tape backup systems are still available for those who prefer them. Contact us.

Verify the health of your Storage

Drive sectors can go bad silently. Schedule regular verifies to catch problems before they lead to data loss. Automate them or you will forget.

  • Linux Software RAID (mdadm) arrays can easily be kicked into verify mode. Many distributions (Red Hat, CentOS, Ubuntu) ship their own utilities for scheduling this. To manually start a verify, run this line for each RAID array (as root):
    echo check > /sys/block/md#/md/sync_action
    Monitor /proc/mdstat and the output of dmesg to follow the status of each verify.
  • Hardware RAID controllers provide their own methods for automated verifies and alert notification. Reference the controller’s manual.
  • Enterprise and parallel storage systems typically provide their own management interfaces (separate from your cluster management software). Familiarize yourself with these interfaces and enable e-mail alerts.
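For the software-RAID case above, a small cron-able script can start a verify pass on every md array at once. This is a sketch; the sysfs root is parameterized only so the loop can be exercised against a mock directory tree.

```shell
#!/bin/sh
# Start a verify ("check") pass on every Linux software RAID array.
# The sysfs root defaults to /sys/block; it is a parameter only so
# this sketch can be tested without real md devices.
start_md_verifies() {
    sysfs="${1:-/sys/block}"
    for action in "$sysfs"/md*/md/sync_action; do
        [ -e "$action" ] || continue        # glob matched nothing
        echo check > "$action"
        echo "verify started on ${action%/md/sync_action}"
    done
}

start_md_verifies "$@"
# Follow progress in /proc/mdstat and dmesg.
```

Dropped into /etc/cron.monthly/ (and run as root), this keeps every array verified without anyone having to remember the md device names.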

Monitor system alarms and system health

If Microway provided you with a preconfigured cluster, then we performed the software integration before the cluster arrived at your site. The cluster can monitor its own health (via MCMS™ or Bright Cluster Manager), but you should familiarize yourself with the user interface and double-check that e-mail alerts are being sent to the correct e-mail address.

Each system in the cluster also supports traditional monitoring and management features:

  • Preferred: learn how to use the IPMI capability for remote monitoring and management. You’ll spend a lot less time trekking to the datacenter.
  • Alternative: listen for system alarms and check for warning LEDs.

Don’t ignore alarms! If you put off repairs, a second failure may compound the first, and you could find your cluster needs to be rebuilt from scratch.
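The IPMI route lends itself to automation. Below is a hypothetical cron-able filter built on the standard ipmitool utility; the '|'-separated `ipmitool sensor` output format, with the status in the fourth column, is an assumption worth verifying against your own BMCs.

```shell
#!/bin/sh
# Filter `ipmitool sensor` output down to sensors that are not healthy,
# so a cron job can e-mail only the problems. Feed it live data with:
#   ipmitool -H <bmc-address> -U <user> -P <password> sensor | flag_bad_sensors
# Assumption: column 4 of the '|'-separated output is the sensor status
# ("ok", "cr", "nr", ...); "na"/"ns" entries are absent sensors.
flag_bad_sensors() {
    awk -F'|' '{
        status = $4
        gsub(/ /, "", status)
        if (status != "" && status != "ok" && status != "na" && status != "ns")
            print $0
    }'
}
```

Run across all BMC addresses from cron; because cron mails any non-empty output, a failing fan or over-temperature CPU generates an e-mail while healthy nodes stay silent.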

Schedule and Test System Software Updates

Although modern Linux distributions have made it very easy to keep software packages up-to-date, there are some pitfalls an administrator might encounter when updating software on a compute cluster.

Cluster software packages are usually not managed from the same software repository as the standard Linux packages, so the updater may unknowingly break compatibility. In particular, upgrading or changing the Linux kernel on your cluster may require manual re-configuration – particularly for systems with large/parallel storage, InfiniBand and/or GPU compute processor components. These types of systems usually require that kernel modules or other packages be recompiled against the new kernel. Test updates on a single system before making such changes on the entire cluster!

Please keep in mind that updating the software on your cluster may break existing functionality, so don’t update just for the sake of updating! Plan an update schedule and notify users in case there is downtime from unexpected snags.

Detailed Specifications of the “Ice Lake SP” Intel Xeon Processor Scalable Family CPUs
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-ice-lake-sp-intel-xeon-processor-scalable-family-cpus-2/
Tue, 06 Apr 2021
This article provides in-depth discussion and analysis of the 10nm Intel Xeon Processor Scalable Family (formerly codenamed “Ice Lake-SP” or “Ice Lake Scalable Processor”). These processors replace the previous 14nm “Cascade Lake-SP” microarchitecture and are available for sale as of April 6, 2021.

The “Ice Lake SP” CPUs are the 3rd generation of Intel’s Xeon Scalable Processor family. This generation brings new features, increased performance, and new server/workstation platforms. The Xeon “Ice Lake SP” CPUs cannot be installed into previous-generation systems. Those considering a new deployment are encouraged to review their options with one of our experts.

Highlights of the features in Xeon Scalable Processor Family “Ice Lake SP” CPUs include:

  • Up to 40 processor cores per socket (with options for 8, 12, 16, 18, 20, 24, 26, 28, 32, 36, and 38 cores)
  • Up to 38% higher per-core performance through micro-architecture improvements (at same clock speed vs “Cascade Lake SP”)
  • Significant memory performance & capacity increases:
    • Eight-channel memory controller on each CPU (up from six)
    • Support for DDR4 memory speeds up to 3200MHz (up from 2933MHz)
    • Large-memory capacity with Intel Optane Persistent Memory
    • All CPU models support up to 6TB per socket (combined system memory and Optane persistent memory)
  • Increased link speed between CPU sockets: 11.2GT/s UPI links (up from 10.4GT/s)
  • I/O Performance Improvements – more than twice the throughput of “Cascade Lake SP”:
    • PCI-Express generation 4.0 doubles the throughput of each PCI-E lane (compared to gen 3.0)
    • Support for 64 PCI-E lanes per CPU socket (up from 48 lanes)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Two AVX-512 FMA units per CPU core (available in all Ice Lake-SP CPU SKUs)
  • Continued support for deep learning inference with AVX-512 VNNI instruction:
    • Intel Deep Learning Boost (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • Improvements to Intel Speed Select processor configurability:
    • Performance Profiles: certain processors support three distinct core count/clock speed operating points
    • Base Frequency: specific CPU cores are given higher base clock speeds; the remaining cores run at lower speeds
    • Turbo Frequency: specific CPU cores are given higher turbo-boost speeds; the remaining cores run at lower speeds
    • Core Power: each CPU core is prioritized; when surplus frequency is available, it is given to high-priority cores
  • Integrated hardware-based security improvements and total memory encryption

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Continued Specialization of Xeon CPU SKUs

Those already familiar with Intel Xeon will see this processor family is divided into familiar tiers: Silver, Gold, and Platinum. The Silver and Gold models are in the price/performance range familiar to HPC/AI teams. Platinum models are in a higher price range. The low-end Bronze tier present in previous generations has been dropped.

Further, Intel continues to add new specialized CPU models that are optimized for particular workloads and environments. Many of these specialized SKUs are not relevant to readers here, but we summarize them briefly:

  • N: network function virtualization (NFV) optimized
  • P: virtualization-optimized (with a focus on clock frequency)
  • S: max SGX enclave size
  • T: designed for higher-temperature environments (NEBS)
  • V: virtualization-optimized (with focus on high-density/low-power)

Targeting specific workloads and environments provides the best performance and efficiency for those use cases. However, using these CPUs for other workloads may reduce performance, as the CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads; running other workloads on these optimized CPUs will likely lead to CPU throttling. Considering these limitations, the workload-optimized models above are not included in our review.

Four Xeon CPU specializations relevant to HPC & AI use cases

There are several specialized Xeon CPU options which are relevant to high performance computationally-intensive workloads. Each capability is summarized below and included in our analysis.

  • Liquid-cooled – Xeon 8368Q CPU: optimized for liquid-cooled deployment, this CPU SKU offers high core counts along with higher CPU clock frequencies. The high clock frequencies are made possible only through the more effective cooling provided by liquid-cooled datacenters.
  • Media, AI, and HPC – Xeon 8352M CPU: optimized for AVX-heavy vector instruction workloads as found in media processing, AI, and HPC; this CPU SKU offers improved performance per watt.
  • Performance Profiles – Y: a set of CPU SKUs with support for Intel Speed Select Technology – Performance Profiles. These CPUs are indicated with a Y suffix in the model name (e.g., Xeon 8352Y) and provide flexibility for those with mixed workloads. Each CPU supports three different operating profiles with separate CPU core count, base clock and turbo boost frequencies, as well as operating wattages (TDP). In other words, each CPU could be thought of as three different CPUs. Administrators switch between profiles via system BIOS, or through Operating Systems with support for this capability (Intel SST-PP). Note that several of the other specialized CPU SKUs also support multiple Performance Profiles (e.g., Xeon 8352M).
  • Single Socket – U: single-socket optimized. The CPUs designed for a single socket are indicated with a U suffix in the model name (e.g., Xeon 6312U). These CPUs are more cost-effective. However, they do not include UPI links and thus can only be installed in systems with a single processor.

Summary of Xeon “Ice Lake-SP” CPU tiers

With the Bronze CPU tier no longer present, all models in this CPU family are well-suited to HPC and AI (though some will offer more performance than others). Before diving into the details, we provide a high-level summary of this Xeon processor family:

  • Intel Xeon Silver – suitable for entry-level HPC
    The Xeon Silver 4300-series CPU models provide higher core counts and increased memory throughput compared to previous generations. However, their performance is limited compared to Gold and Platinum (particularly on Core Count, Clock Speed, Memory Performance, and UPI speed).
  • Intel Xeon Gold – recommended for most HPC workloads
    Xeon Gold 5300- and 6300-series CPUs provide the best balance of performance and price. In particular, the 6300-series models should be preferred over the 5300-series models, because the 6300-series CPUs offer improved Clock Speeds and Memory Performance.
  • Intel Xeon Platinum – only for specific HPC workloads
    Although 8300-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., highest core count, large L3 cache).

Xeon “Ice Lake SP” Computational Performance

With this new family of Xeon processors, Intel once again delivers unprecedented performance. Nearly every model provides over 1 TFLOPS (one trillion double-precision 64-bit floating-point operations per second), many models exceed 2 TFLOPS, and a few approach 3 TFLOPS. These performance levels are achieved through high core counts and AVX-512 instructions with FMA (as in the first and second Xeon Scalable generations). The plots below compare the performance ranges for these new CPUs:
[Chart: theoretical GFLOPS of the Intel Xeon “Ice Lake SP” CPUs with AVX-512 instructions]

[Chart: theoretical GFLOPS of the Intel Xeon “Ice Lake SP” CPUs with AVX2 instructions]

In the charts above, the shaded/colored bars indicate the expected performance range for each CPU model. The performance is a range rather than a specific value, because CPU clock frequencies scale up and down on a second-by-second basis. The precise achieved performance depends upon a variety of factors including temperature, power envelope, type of cooling technology, the load on each CPU core, and the type(s) of CPU instructions being issued to each core.

The first chart shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes can issue exclusively AVX-512 FMA instructions (e.g., HPL LINPACK). Most applications issue a mix of instructions and will achieve lower than peak FLOPS. Further, applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and will achieve lower performance. Computational applications which do not utilize AVX-512 instructions will most likely utilize AVX2 instructions (as shown in the second chart).

Intel Xeon “Ice Lake SP” Price Ranges

The pricing of the 3rd-generation Xeon Processor Scalable Family spans a wide range, so budget must be kept in mind when selecting options. It would be frustrating to plan on 38-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Ice Lake SP” processors:

As shown in the above plot, the CPUs in this article have been sorted by tier and by price. Most HPC users are expected to select CPU models from the Gold Xeon 6300-series; these models provide close to peak performance for around $3,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8300-series.

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Recommended Xeon CPU Models for HPC & AI/Deep Learning

As stated at the top, most of this new CPU family offers excellent performance. However, it is common for HPC sites to set a floor on CPU clock speeds (usually around 2.5GHz), with the intent that no workload suffers unacceptably low performance. While there are users who demand even higher clock speeds, experience shows that most groups settle on a minimum clock speed in the 2.5GHz to 2.6GHz range. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

[Chart: core counts of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

[Chart: AVX-512 throughput of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

[Chart: AVX2 throughput of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

[Chart: cost-effectiveness of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

Detailed Specifications of the AMD EPYC “Milan” CPUs
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-amd-epyc-milan-cpus/
Mon, 15 Mar 2021
This article provides in-depth discussion and analysis of the 7nm AMD EPYC processor (codenamed “Milan” and based on AMD’s Zen3 architecture). EPYC “Milan” processors replace the previous “Rome” processors and are available for sale as of March 15th, 2021.

These new CPUs are the third iteration of AMD’s EPYC server processor family. They are compatible with existing workstation and server platforms that supported “Rome”, but include new performance and security improvements. If you’re looking to upgrade to or deploy these new CPUs, please speak with one of our experts to learn more.

Important features/changes in EPYC “Milan” CPUs include:

  • Up to 64 processor cores per socket (with options for 8, 16, 24, 28, 32, 48, and 56 cores)
  • Improved CPU clock speeds up to 3.7GHz (with Max Boost speeds up to 4.1GHz)
  • Unified 32MB L3 cache shared between each set of 8 cores (instead of two separate 16MB caches)
  • Increase in instructions completed per clock cycle (IPC)
  • IOMMU for improved IO performance in virtualized environments
  • The security/memory encryption features present in “Rome”, along with SEV-SNP support (protecting against malicious hypervisors)
  • Plus all the advantages of the previous “Rome” generation:
    • Full support for 256-bit AVX2 instructions with two 256-bit FMA units per CPU core
    • Up to 16 double-precision FLOPS per cycle per core
    • Eight-channel memory controller on each CPU
    • Support for DDR4 memory speeds up to 3200MHz
    • Up to 4TB memory per CPU socket
    • Up to 256MB L3 cache per CPU
    • Support for PCI-Express generation 4.0 (which doubles the throughput of gen 3.0)
    • 128 lanes of PCI-Express 4.0 per CPU socket

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our general guidance for selecting among the EPYC options is shown below. Note that certain applications may deviate from this general advice (e.g., software which benefits from particularly high clock speeds or larger L3 cache per core).

  • 8-core EPYC CPUs – not recommended for HPC
    While perfect for particular applications, these models are not as cost-effective as many of the higher core count options.
  • 16-core to 28-core EPYC CPUs – suitable for most HPC workloads
    While not typically offering the best cost-effectiveness, they provide excellent performance at lower price points.
  • 32-core EPYC CPUs – excellent for HPC workloads
    These models offer excellent price/performance along with higher clock speeds and core counts.
  • 48-core to 64-core EPYC CPUs – suitable for certain HPC workloads
    Although these models with high core counts may provide the highest cost-effectiveness and power efficiency, some applications exhibit diminishing returns at the highest core counts. Scalable applications that are not memory bandwidth bound will benefit the most from these EPYC CPUs.

Microway provides a Test Drive cluster to assist in evaluating and comparing products as users determine the ideal specifications for their new HPC & AI deployments. We would be happy to help you evaluate AMD EPYC processors as you plan your next deployment.

AMD EPYC “Milan” Computational Performance

This latest iteration of EPYC CPUs offers excellent performance. However, many of the on-paper comparisons between this generation and the previous generation do not demonstrate large gains; application benchmarking will be needed to demonstrate many of them (such as those provided by the larger/unified L3 cache and the IPC improvements). That being said, most models in this generation provide at least 1 TFLOPS (one trillion double-precision 64-bit floating-point operations per second), and the 64-core CPUs provide over 2 TFLOPS. The plot below shows the expected performance across this new CPU line-up:
[Chart: theoretical GFLOPS of the AMD EPYC “Milan” CPUs with AVX2 instructions]

In the chart above, shaded/colored bars indicate the expected performance ranges for each CPU model on traditional HPC applications that use double-precision 64-bit math operations. Peak performance numbers are achieved when executing 256-bit AVX2 instructions with FMA. Note that only a small set of applications are able to use exclusively AVX2 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than the peak FLOPS values shown above. Applications which have not been re-compiled in recent years (with a compiler supporting AVX2 instructions) would achieve lower performance.
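The peak numbers above follow from the per-core throughput: two 256-bit AVX2 FMA units × 8 double-precision FLOPS per cycle = 16 DP FLOPS per cycle per core. A sketch (the 2.45GHz clock used for the 64-core model is an illustrative assumption):

```shell
#!/bin/sh
# Peak double-precision GFLOPS for an EPYC "Milan" CPU:
#   cores x clock (GHz) x 16 DP FLOPS/cycle (two 256-bit FMA units)
peak_avx2_gflops() {
    awk -v cores="$1" -v ghz="$2" \
        'BEGIN { printf "%.1f\n", cores * ghz * 16 }'
}

# 64-core model at an assumed 2.45GHz all-core clock:
peak_avx2_gflops 64 2.45    # prints 2508.8 (over 2 TFLOPS, as above)
```

As with the Xeon comparison, plug in the sustained all-core frequency for vectorized code rather than the advertised boost clock.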

The dotted lines above each bar indicate the possible peak performance were all CPU cores operating at boosted clock speeds. While theoretically possible for short amounts of time, sustained performance at these increased CPU frequencies is not expected. Sections of code with dense, vectorized instructions are very demanding, and typically result in each core slightly lowering clock speeds (a behavior not unique to AMD CPUs). While AMD has not published specific clock speed expectations for such codes, Microway expects the EPYC “Milan” CPUs to operate near their standard/published clock speed values when all cores are in use.

Throughout this article, the CPU models are sorted largely by price. The lowest-priced models provide fewer CPU cores and less L3 cache; higher-end models offer higher core counts and increased performance. HPC and AI groups are generally expected to favor the processor models in the middle of the pack, as the highest-core-count CPUs are priced at a premium.

Note that those models which only support single-CPU installations are separated on the left side of each plot.

AMD “Milan” EPYC Processor Specifications

The tabs below compare the features and specifications of this 3rd iteration of the EPYC processors. Note that CPU models ending with a P suffix are designed for single-socket systems and do not operate in dual-socket systems; all other CPU models are compatible with both single- and dual-socket systems. The P-series EPYC processors tend to be priced lower and can thus be quite cost-effective.

Editor’s note: complete pricing was not available at time of publication, so additional analysis of price and cost-effectiveness of each CPU SKU will be added to this article when available.

In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-ampere-gpu-accelerators/
Sat, 20 Jun 2020
This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by end of 2020).

Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second version dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below – speak with one of our GPU experts for a personalized summary of the options best suited to your needs.

Computational “Ampere” GPU architecture – important features and changes:

  • Exceptional HPC performance:
    • 9.7 TFLOPS FP64 double-precision floating-point performance
    • Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
    • 19.5 TFLOPS FP32 single-precision floating-point performance
  • Exceptional AI deep learning training and inference performance:
    • TensorFloat 32 (TF32) instructions improve performance without loss of accuracy
    • Sparse matrix optimizations potentially double training and inference performance
    • Speedups of 3x~20x for network training, with sparse TF32 TensorCores (vs Tesla V100)
    • Speedups of 7x~20x for inference, with sparse INT8 TensorCores (vs Tesla V100)
    • Tensor Cores support many instruction types: FP64, TF32, BF16, FP16, I8, I4, B1
  • High-speed HBM2 Memory delivers 40GB or 80GB capacity at 1.6TB/s or 2TB/s throughput
  • Multi-Instance GPU (MIG) allows each A100 GPU to run up to seven separate/isolated applications
  • 3rd-generation NVLink doubles transfer speeds between GPUs
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
  • Larger and Faster L1 Cache and Shared Memory for improved performance
  • Improved L2 Cache is twice as fast and nearly seven times as large as L2 on Tesla V100
  • Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x increase in L2 capacity.
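The headline 9.7 TFLOPS FP64 figure can be reproduced from the published core count and boost clock (values taken from the specification table later in this article):

```shell
#!/bin/sh
# Peak FP64 GFLOPS from CUDA core count and clock:
#   FP64 cores x boost clock (GHz) x 2 (one FMA = 2 FLOPs)
gpu_fp64_gflops() {
    awk -v cores="$1" -v ghz="$2" \
        'BEGIN { printf "%.1f\n", cores * ghz * 2 }'
}

# A100: 3,456 FP64 CUDA cores at a 1.410GHz boost clock:
gpu_fp64_gflops 3456 1.410    # prints 9745.9 (~9.7 TFLOPS)
# Tensor Core FP64 instructions double this, to ~19.5 TFLOPS.
```

The same arithmetic applies to any of the GPUs in the tables below, given their core counts and clocks.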

Visualization “Ampere” GPU architecture – important features and changes:

  • Double FP32 processing throughput with upgraded Streaming Multiprocessors (SM) that support FP32 computation on both datapaths
    (previous generations provided one dedicated FP32 path and one dedicated Integer path)
  • 2nd-generation RT cores provide up to a 2x increase in raytracing performance
  • 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
  • 3rd-generation NVLink provides up to 56.25 GB/sec bandwidth between pairs of GPUs in each direction
  • GDDR6X memory providing up to 768 GB/s of GPU memory throughput
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
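The PCI-Express 4.0 claims above are easy to sanity-check: gen 4.0 runs at 16 GT/s per lane with 128b/130b encoding, so an x16 link carries about 31.5 GB/s in each direction (vendors often round this to 2 GB/s per lane, hence the "64 GB/s bidirectional" figures quoted for these GPUs). A sketch:

```shell
#!/bin/sh
# Usable PCI-E bandwidth per direction, in GB/s:
#   lanes x GT/s x (128/130 encoding efficiency) / 8 bits-per-byte
pcie_gbps_per_direction() {
    awk -v lanes="$1" -v gts="$2" \
        'BEGIN { printf "%.1f\n", lanes * gts * 128 / 130 / 8 }'
}

pcie_gbps_per_direction 16 16    # PCI-E 4.0 x16: prints 31.5
pcie_gbps_per_direction 16 8     # PCI-E 3.0 x16: prints 15.8
```

The two calls show the generation-over-generation doubling directly: gen 4.0 moves twice as much data per lane as gen 3.0.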

As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.

NVIDIA “Ampere” GPU Specifications

High Performance Computing & Deep Learning GPUs
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express versions of the NVIDIA A100 GPU feature lower TDPs than the SXM4 version (250W or 300W vs 400W). For this reason, the PCI-Express GPUs are not able to sustain peak performance in the same way as the higher-power part; their performance values are shown as a range, and actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.

Feature | NVIDIA A30 PCI-E | NVIDIA A100 40GB PCI-E | NVIDIA A100 80GB PCI-E | NVIDIA A100 SXM4
GPU Chip | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100
TensorCore FP64 Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS
TensorCore TF32 Performance*† | 82 TFLOPS | 140 ~ 156 TFLOPS | 140 ~ 156 TFLOPS | 156 TFLOPS
TensorCore FP16/BF16 Performance*† | 165 TFLOPS | 281 ~ 312 TFLOPS | 281 ~ 312 TFLOPS | 312 TFLOPS
TensorCore INT8 Performance*† | 330 TOPS | 562 ~ 624 TOPS | 562 ~ 624 TOPS | 624 TOPS
TensorCore INT4 Performance*† | 661 TOPS | 1,123 ~ 1,248 TOPS | 1,123 ~ 1,248 TOPS | 1,248 TOPS
Double Precision (FP64) Performance* | 5.2 TFLOPS | 8.7 ~ 9.7 TFLOPS | 8.7 ~ 9.7 TFLOPS | 9.7 TFLOPS
Single Precision (FP32) Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS
Half Precision (FP16) Performance* | 41 TFLOPS | 70 ~ 78 TFLOPS | 70 ~ 78 TFLOPS | 78 TFLOPS
Brain Floating Point (BF16) Performance* | 20 TFLOPS | 35 ~ 39 TFLOPS | 35 ~ 39 TFLOPS | 39 TFLOPS
On-die Memory | 24GB HBM2 | 40GB HBM2 | 80GB HBM2 | 40GB HBM2 or 80GB HBM2e
Memory Bandwidth | 933 GB/s | 1,555 GB/s | 1,940 GB/s | 1,555 GB/s (40GB) or 2,039 GB/s (80GB)
L2 Cache | 40MB | 40MB | 40MB | 40MB
Interconnect | NVLink 3.0 (4 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0
GPU-to-GPU transfer bandwidth (bidirectional) | 200 GB/s | 600 GB/s | 600 GB/s | 600 GB/s
Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s | 64 GB/s
# of MIG instances supported | up to 4 | up to 7 | up to 7 | up to 7
# of SM Units | 56 | 108 | 108 | 108
# of Tensor Cores | 224 | 432 | 432 | 432
# of integer INT32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912
# of single-precision FP32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912
# of double-precision FP64 CUDA Cores | 1,792 | 3,456 | 3,456 | 3,456
GPU Base Clock | 930 MHz | 765 MHz | 1065 MHz | 1095 MHz
GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic
GPU Boost Clock | 1440 MHz | 1410 MHz | 1410 MHz | 1410 MHz
Compute Capability | 8.0 | 8.0 | 8.0 | 8.0
Workstation Support | no | no | no | no
Server Support | yes | yes | yes | yes
Cooling Type | Passive | Passive | Passive | Passive
Wattage (TDP) | 165W | 250W | 300W | 400W

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
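
The peak numbers above follow directly from the core counts and boost clocks listed in the table. A quick sanity-check sketch of that arithmetic (FMA counts as two FLOPS per cycle per FP64 CUDA core):

```python
# Peak-throughput arithmetic behind the table above (a sketch; the
# core counts and boost clocks are taken from the table itself).
def peak_tflops(num_cores, flops_per_core_per_cycle, boost_clock_ghz):
    """Theoretical peak = cores x FLOPs/cycle x clock."""
    return num_cores * flops_per_core_per_cycle * boost_clock_ghz / 1000.0

# NVIDIA A100: 3,456 FP64 CUDA cores, FMA = 2 FLOPs/cycle, 1410 MHz boost
a100_fp64 = peak_tflops(3456, 2, 1.41)   # ~9.7 TFLOPS, matching the table

# NVIDIA A30: 1,792 FP64 CUDA cores at a 1440 MHz boost clock
a30_fp64 = peak_tflops(1792, 2, 1.44)    # ~5.2 TFLOPS
```

The same formula reproduces the other precisions once the per-core FLOP rate for that datatype is substituted.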

Visualization & Ray Tracing GPUs
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for visualization and ray tracing. Note that these GPUs would not necessarily be connecting directly to a display device, but might be performing remote rendering from a datacenter.

To learn more about these GPUs and to review which are the best options for you, please speak with a GPU expert.

Models compared: NVIDIA RTX A5000, NVIDIA RTX A6000, NVIDIA A40

GPU Chip: Ampere GA102 (all models)
TensorCore Performance*:
  • RTX A5000: 55.6 TFLOPS† TF32; 111.1 TFLOPS† FP16/BF16; 222.2 TOPS† INT8; 444.4 TOPS† INT4
  • RTX A6000: 77.4 TFLOPS† TF32; 154.8 TFLOPS† FP16/BF16; 309.7 TOPS† INT8; 619.3 TOPS† INT4
  • A40: 74.8 TFLOPS† TF32; 149.7 TFLOPS† FP16/BF16; 299.3 TOPS† INT8; 598.7 TOPS† INT4
Double Precision (FP64) Performance*: RTX A5000: 0.4 TFLOPS; RTX A6000: 0.6 TFLOPS; A40: 0.6 TFLOPS
Single Precision (FP32) Performance*: RTX A5000: 27.8 TFLOPS; RTX A6000: 38.7 TFLOPS; A40: 37.4 TFLOPS
Integer (INT32) Performance*: RTX A5000: 13.9 TOPS; RTX A6000: 19.4 TOPS; A40: 18.7 TOPS
GPU Memory: RTX A5000: 24GB; RTX A6000: 48GB; A40: 48GB
Memory Bandwidth: RTX A5000: 768 GB/s; RTX A6000: 768 GB/s; A40: 696 GB/s
L2 Cache: 6MB (all models)
Interconnect: NVLink 3.0 + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards
GPU-to-GPU transfer bandwidth (bidirectional): 112.5 GB/s (all models)
Host-to-GPU transfer bandwidth (bidirectional): 64 GB/s (all models)
# of MIG instances supported: N/A
# of SM Units: RTX A5000: 64; RTX A6000 & A40: 84
# of RT Cores: RTX A5000: 64; RTX A6000 & A40: 84
# of Tensor Cores: RTX A5000: 256; RTX A6000 & A40: 336
# of integer INT32 CUDA Cores: RTX A5000: 8,192; RTX A6000 & A40: 10,752
# of single-precision FP32 CUDA Cores: RTX A5000: 8,192; RTX A6000 & A40: 10,752
# of double-precision FP64 CUDA Cores: RTX A5000: 128; RTX A6000 & A40: 168
GPU Base Clock: not published
GPU Boost Support: Yes – Dynamic
GPU Boost Clock: not published
Compute Capability: 8.6 (all models)
Workstation Support: RTX A5000 & RTX A6000: yes; A40: no
Server Support: RTX A5000 & RTX A6000: no; A40: yes
Cooling Type: RTX A5000 & RTX A6000: Active; A40: Passive
Wattage (TDP): RTX A5000: 230W; RTX A6000: 300W; A40: 300W

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature

Several lower-end graphics cards and datacenter GPUs are also available, including RTX A2000, RTX A4000, A10, and A16. These GPUs offer similar capabilities, but with lower levels of performance and at lower price points.
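
Although NVIDIA does not publish boost clocks for these boards, an approximate value can be back-computed from the published FP32 numbers (a rough sketch; assumes 2 FLOPS per cycle per FP32 CUDA core with FMA):

```python
# Infer the approximate boost clock implied by a published FP32 peak.
# This is an estimate, not an NVIDIA-published specification.
def implied_boost_ghz(fp32_tflops, fp32_cores):
    # peak FLOPS = cores x 2 FLOPs/cycle x clock  =>  solve for clock
    return fp32_tflops * 1e12 / (fp32_cores * 2) / 1e9

rtx_a6000 = implied_boost_ghz(38.7, 10752)   # ~1.80 GHz
nvidia_a40 = implied_boost_ghz(37.4, 10752)  # ~1.74 GHz
rtx_a5000 = implied_boost_ghz(27.8, 8192)    # ~1.70 GHz
```

The lower implied clock of the A40 versus the RTX A6000 is consistent with its passive cooling and server orientation.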

Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures

Feature comparison across Pascal GP100, Volta GV100, and Ampere GA100:

Compute Capability*: Pascal: 6.0; Volta: 7.0; Ampere: 8.0
Threads per Warp: 32
Max Warps per SM: 64
Max Threads per SM: 2048
Max Thread Blocks per SM: 32
Max Concurrent Kernels: 128
32-bit Registers per SM: 64 K
Max Registers per Block: 64 K
Max Registers per Thread: 255
Max Threads per Block: 1024
L1 Cache Configuration: Pascal: 24KB dedicated cache; Volta: 32KB ~ 128KB, dynamic with shared memory; Ampere: 28KB ~ 192KB, dynamic with shared memory
Shared Memory Configurations: Pascal: 64KB; Volta: configurable up to 96KB, remainder for L1 Cache (128KB total); Ampere: configurable up to 164KB, remainder for L1 Cache (192KB total)
Max Shared Memory per SM: Pascal: 64KB; Volta: 96KB; Ampere: 164KB
Max Shared Memory per Thread Block: Pascal: 48KB; Volta: 96KB; Ampere: 160KB
Max X Grid Dimension: 2^31 - 1
Tensor Cores: Pascal: No; Volta & Ampere: Yes
Mixed Precision Warp-Matrix Functions: Pascal: No; Volta & Ampere: Yes
Hardware-accelerated async-copy: Pascal & Volta: No; Ampere: Yes
L2 Cache Residency Management: Pascal & Volta: No; Ampere: Yes
Dynamic Parallelism: Yes (all)
Unified Memory: Yes (all)
Preemption: Yes (all)

* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
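
The per-SM limits in the table interact with one another. As an illustration (using the limits shared by all three architectures above), the number of thread blocks resident on one SM is capped by whichever limit is reached first:

```python
# Residency arithmetic from the per-SM limits in the table above
# (a sketch; occupancy in practice is further limited by registers
# and shared memory per block).
def resident_blocks_per_sm(threads_per_block,
                           max_threads_per_sm=2048,
                           max_blocks_per_sm=32,
                           threads_per_warp=32,
                           max_warps_per_sm=64):
    by_threads = max_threads_per_sm // threads_per_block
    warps_per_block = (threads_per_block + threads_per_warp - 1) // threads_per_warp
    by_warps = max_warps_per_sm // warps_per_block
    return min(by_threads, by_warps, max_blocks_per_sm)

resident_blocks_per_sm(256)  # 8 blocks of 256 threads fill an SM
resident_blocks_per_sm(32)   # capped at 32 by the blocks-per-SM limit
```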

Hardware-accelerated raytracing, video encoding, video decoding, and image decoding

The NVIDIA “Ampere” Datacenter GPUs that are designed for computational workloads do not include graphics acceleration features such as RT cores and hardware-accelerated video encoders. For example, RT cores for accelerated raytracing are not included in the A30 and A100 GPUs. Similarly, video encoding units (NVENC) are not included in these GPUs.

To accelerate computational workloads that require processing of image or video files, five JPEG decoding (NVJPG) units and five video decoding units (NVDEC) are included in the A100. Details are described in NVIDIA’s A100 for computer vision blog post.

For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.

The post In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators appeared first on Microway.

Detailed Specifications of the AMD EPYC “Rome” CPUs
Wed, 07 Aug 2019
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-amd-epyc-rome-cpus/

This article provides in-depth discussion and analysis of the 7nm AMD EPYC processor (codenamed “Rome” and based on AMD’s Zen2 architecture). EPYC “Rome” processors replace the previous “Naples” processors and are available for sale as of August 7th, 2019. We have also published an AMD EPYC “Rome” CPU Review that you may wish to read. Note: these have since been superseded by the “Milan” AMD EPYC CPUs.

These new CPUs are the second iteration of AMD’s EPYC server processor family. They remain compatible with the existing workstation and server platforms, but bring significant feature and performance improvements. Some of the new features (e.g., PCI-E 4.0) will require updated/revised platforms. If you’re looking to upgrade to or deploy these new CPUs, please speak with one of our experts to learn more.

Important features/changes in EPYC “Rome” CPUs include:

  • Up to 64 processor cores per socket (with options for 8-, 12-, 16-, 24-, 32-, and 48-cores)
  • Improved CPU clock speeds up to 3.1GHz (with Boost speeds up to 3.4GHz)
  • Increased computational performance:
    • Full support for 256-bit AVX2 instructions with two 256-bit FMA units per CPU core
      The previous “Naples” architecture split 256-bit instructions into two separate 128-bit operations
    • Up to 16 double-precision FLOPS per cycle per core
    • Double-precision floating point multiplies complete in 3 cycles (down from 4)
    • 15% increase in instructions completed per clock cycle (IPC) for integer operations
  • Memory capacity & performance features:
    • Eight-channel memory controller on each CPU
    • Support for DDR4 memory speeds up to 3200MHz (up from 2666MHz)
    • Up to 4TB memory per CPU socket
  • Up to 256MB L3 cache per CPU (up from 64MB)
  • Support for PCI-Express generation 4.0 (which doubles the throughput of gen 3.0)
  • Up to 128 lanes of PCI-Express per CPU socket
  • Improvements to NUMA architecture:
    • Simplified design with one NUMA domain per CPU Socket
    • Uniform latencies between CPU dies (plus fewer hops between cores)
    • Improved InfinityFabric performance (read speed per clock is doubled to 32 bytes)
  • Integrated in-silicon security mitigations for Spectre
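
The “16 double-precision FLOPS per cycle per core” figure above drives the headline performance numbers later in this article. A minimal sketch, assuming a 64-core EPYC 7702 at its published 2.0GHz base clock:

```python
# Per-core math implied by the feature list above: two 256-bit FMA
# units x 4 doubles per unit x 2 ops (multiply+add) = 16 FP64 FLOPS
# per cycle per core.
def epyc_rome_peak_gflops(cores, base_clock_ghz, flops_per_cycle=16):
    return cores * flops_per_cycle * base_clock_ghz

# 64-core EPYC 7702 at a 2.0 GHz base clock (clock value assumed from
# AMD's published specifications):
epyc_rome_peak_gflops(64, 2.0)  # 2048 GFLOPS, i.e. ~2 TFLOPS per socket
```

This matches the article’s statement that several “Rome” models reach roughly 2 TFLOPS of double-precision throughput.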

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our guidance for selecting among the EPYC options is as follows:

  • 8-core EPYC CPUs – not recommended for HPC
    While available for a low price, these models are not as cost-effective as many of the higher core count models.
  • 12-, 16-, and 24-core EPYC CPUs – suitable for most HPC workloads
    While not typically offering the best cost-effectiveness, they provide excellent performance at lower price points.
  • 32-core EPYC CPUs – excellent for HPC workloads
    These models offer excellent price/performance along with relatively high clock speeds and core counts
  • 48-core and 64-core EPYC CPUs – suitable for certain HPC workloads
    Although the highest core count models appear to provide the best cost-effectiveness and power efficiency, many applications exhibit diminishing returns at the highest core counts. For scalable applications that are not memory bandwidth bound, these EPYC CPUs will be excellent choices.

Microway operates a Test Drive cluster to assist in evaluating and comparing these options as users develop the specifications for their new HPC & AI deployments. We would be happy to help you evaluate AMD EPYC processors as you plan your purchase.

Unprecedented Computational Performance

The EPYC “Rome” processors deliver new capabilities and exceptional performance. Many models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and several models provide 2 TFLOPS. This performance is achieved by doubling the computational power of each core and doubling the number of cores. The plot below shows the performance range across this new CPU line-up:
[Chart: AMD EPYC “Rome” CPU theoretical GFLOPS performance with AVX2 instructions]

As shown above, the shaded/colored bars indicate the expected performance ranges for each CPU model. These peak performance numbers are achieved when executing 256-bit AVX2 instructions with FMA. Note that only a small set of codes issue almost exclusively AVX2 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than the peak FLOPS values shown above. Applications which have not been re-compiled with an appropriate compiler would not include AVX2 instructions and would thus achieve lower performance.

The dotted lines indicate the possible peak performance if all cores are operating at boosted clock speeds. While theoretically possible for short bursts, sustained performance at these levels is not expected. Sections of code with dense, vectorized instructions are very demanding, and typically result in the processor core slightly lowering its clock speed (this behavior is not unique to AMD CPUs). While AMD has not published specific clock speed expectations for such codes, Microway expects the EPYC “Rome” CPUs to operate near their “base” clock speed values even when executing code with intensive instructions.

The CPU models above are sorted by price (as discussed in the next section). The lowest-performance models provide fewer CPU cores, less cache, and slower memory speeds. Higher-end models offer high core counts for the best performance. HPC and AI groups are generally expected to favor the mid-range processor models, as the highest core count CPUs are priced at a premium.

Note that those models which only support single-CPU installations are separated on the left side of each plot.

AMD EPYC “Rome” Price Ranges

The new EPYC “Rome” processors span a fairly wide range of prices, so budget must be considered when selecting a CPU. While the entry-level models are under $1,000, the highest-end EPYC processors cost nearly $10,000 each. It would be frustrating to plan for 64-core processors when the budget cannot support the price. The plot below compares the prices of the EPYC “Rome” processors:
[Chart: Pricing of the AMD EPYC “Rome” processors]

All the CPUs in this article are sorted by price (as shown in the plot above). To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture. The color of each bar indicates the expected customer price per CPU:

  • Low price tier: prices below $1,000 per CPU
  • Mid price tier: prices between $1,000 and $2,000
  • High price tier: prices between $2,000 and $4,000
  • Premium price tier: prices above $4,000 per EPYC CPU

Most HPC users are expected to select CPU models around the high price tier. These models provide industry-leading performance (and excellent performance per dollar) for a price under $4,000 per processor. Applications can certainly leverage the premium EPYC processor models, but they will come at a higher price.

AMD “Rome” EPYC Processor Specifications

The set of tabs below compares the features and specifications of this new EPYC processor family. Take note that certain CPU SKUs are designed for single-socket systems (indicated with a P suffix on the part number). All other models may be used in either a single- or dual-socket system. The P-series AMD EPYC CPUs have a lower price and are thus the most cost-effective models, but remember that they are not available in dual-CPU systems.

Cost-Effectiveness and Power Efficiency of EPYC “Rome” CPUs

Overall, the AMD EPYC processors provide great value in price spent versus performance achieved. However, there is a spectrum of efficiency, with certain CPU models offering particularly compelling value. Also remember that the prices and power requirements for some of the top models are fairly high. Savvy readers may find the following facts useful:

  • The most cost-effective CPUs likely to be selected are EPYC 7452 and EPYC 7552
  • If a balance of cost-effectiveness and higher clock speed are needed, look to EPYC 7502
  • While the EPYC 7702 looks to be the most cost-effective on paper, it is important to consider that many applications may not be able to scale efficiently to 64 cores. Benchmark before making the selection.
  • Applications which can be satisfied by a single CPU will benefit greatly from the single-socket EPYC 7xx2P models
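
The cost-effectiveness comparison above can be sketched as theoretical GFLOPS per dollar. The list prices below are approximate launch figures and should be treated as assumptions; substitute current quotes when making a real decision:

```python
# Price/performance sketch using the 16-FLOPS-per-cycle-per-core rate
# discussed earlier. Prices are approximate launch list prices
# (assumptions, not quotes).
def gflops_per_dollar(cores, base_ghz, price_usd, flops_per_cycle=16):
    return cores * flops_per_cycle * base_ghz / price_usd

epyc_7452 = gflops_per_dollar(32, 2.35, 2025)  # ~0.59 GFLOPS/$
epyc_7502 = gflops_per_dollar(32, 2.5, 2600)   # ~0.49 GFLOPS/$
# Consistent with the guidance above: the 7452 leads on value, while
# the 7502 trades some value for a higher clock speed.
```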

The plots below compare the cost-effectiveness and power efficiency of these CPU models. The intent is to go beyond the raw “speeds and feeds” of the processors to determine which models will be most attractive for HPC and Deep Learning/AI deployments.

Recommended CPU Models for HPC & AI/Deep Learning

Although most of the EPYC CPUs will offer excellent performance, it is common for computationally-demanding sites to set a floor on CPU clock speeds (usually around 2.5GHz). The intent is to ensure that no workload suffers unacceptably low performance, as not all applications are well parallelized. While there are users who would prefer higher clock speeds, experience shows that most groups settle on a minimum clock speed around ~2.5GHz. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

Detailed Specifications of the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs
Tue, 02 Apr 2019
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-cascade-lake-sp-intel-xeon-processor-scalable-family-cpus/

This article provides in-depth discussion and analysis of the 14nm Intel Xeon Processor Scalable Family (formerly codenamed “Cascade Lake-SP” or “Cascade Lake Scalable Processor”). “Cascade Lake-SP” processors replace the previous 14nm “Skylake-SP” microarchitecture and are available for sale as of April 2, 2019. On February 24, 2020, a set of “Cascade Lake Refresh” Xeon models were released with increased clock speeds and improved cost/performance. These Xeon CPUs have been superseded by the 3rd-generation Intel Xeon ‘Ice Lake SP’ scalable processors.

These new CPUs are the second iteration of Intel’s Xeon Processor Scalable Family. They remain compatible with the existing workstation and server platforms, but bring incremental performance along with additional capabilities and options.

Important features/changes in Xeon Scalable Processor Family “Cascade Lake SP” CPUs include:

  • Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 16-, 18-, 20-, 24-, and 26-cores)
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
  • Introduction of new AVX-512 VNNI instruction:
    • Intel Deep Learning Boost (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • Memory capacity & performance features:
    • Six-channel memory controller on each CPU
    • Support for DDR4 memory speeds up to 2933MHz (up from 2666MHz)
    • Large-memory capabilities with Intel Optane DC Persistent Memory
    • All CPU models support up to 1TB-per-socket system memory
    • Optional CPUs support up to 4.5TB-per-socket system memory (only available on certain SKUs)
  • Introduction of Intel Speed Select processor models:
    • Certain processors support three distinct operating points
    • Each operating point provides a different number of CPU cores
    • CPU clock and Turbo Boost speeds optimized for each core count
  • Integrated hardware-based security mitigations against side-channel attacks
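
The VNNI fusion listed above can be modeled in a few lines. This is a functional sketch of what one VPDPBUSD operation computes, not production code: each 32-bit accumulator lane gains the dot product of four unsigned 8-bit values with four signed 8-bit values.

```python
# Model of the arithmetic VNNI fuses into a single instruction
# (previously three instructions: VPMADDUBSW, VPMADDWD, VPADDD).
def vpdpbusd(acc, a_u8, b_s8):
    """One VPDPBUSD step across all 32-bit accumulator lanes."""
    assert len(a_u8) == len(b_s8) == 4 * len(acc)
    return [c + sum(a_u8[4*i + j] * b_s8[4*i + j] for j in range(4))
            for i, c in enumerate(acc)]

# One lane: accumulator 0 plus 1*1 + 2*1 + 3*1 + 4*1 = 10
vpdpbusd([0], [1, 2, 3, 4], [1, 1, 1, 1])
```

Fusing the three-instruction sequence into one both saves instruction-issue bandwidth and avoids the intermediate 16-bit saturation step, which is why INT8 inference benefits.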

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Specialization of Intel Xeon CPUs

The new “Cascade Lake-SP” processors will be familiar to existing users. Just as in the previous generation, the processor family is divided into four tiers: Bronze, Silver, Gold, and Platinum. Bronze provides modest performance for a low price. The Silver and Gold models are in the price/performance range familiar to HPC users/architects. Platinum models are in a higher price range than HPC groups are typically accustomed to (the Platinum tier targets Enterprise workloads, and is priced accordingly).

However, this new generation is not simply a revision of the previous models. Increasingly, we are seeing processors that have been designed with a particular workload in mind. The “Cascade Lake SP” Xeons introduce several new specialized CPU models:

  • S: search optimized
  • N: network function virtualization (NFV) optimized
  • V: virtualization density optimized
  • Y: Intel speed select
  • U: single-socket optimized

In the case of the first two specializations (search and NFV), specific CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads. Running other workloads on these optimized CPUs will likely lead to CPU throttling, which would be undesirable. The virtualization density optimized models provide high CPU core counts within relatively modest power envelopes. However, the processor clock and memory clock frequencies are reduced to accomplish this. Considering these limitations, the search-, NFV-, and virtualization-optimized models will not be included in our review.

The single-socket optimized CPUs are indicated with a U suffix in the model name (e.g., Xeon 6210U). These CPUs are quite cost-effective for what they offer (a 6200-series CPU for a 5200-series price). However, they do not include UPI links and thus can only be installed in systems with a single processor.

Intel Speed Select CPUs are indicated with a Y suffix in the model name (e.g., Xeon 6240Y). Each of these three CPUs offers the same core count and clock speed as their non-Y counterpart. However, the system can be rebooted into a lower core-count mode which boosts the CPU clock and Turbo Boost speeds. The Speed Select models available in this generation are: 8260Y, 6240Y, and 4214Y. Although these models are not called out by name below, understand that alternate versions of Xeon 8260, 6240, and 4214 are available if you need core count & clock speed flexibility.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our guidance for selecting Xeon tiers is as follows:

  • Intel Xeon Bronze – not recommended for HPC
    Base-level model with low performance.
  • Intel Xeon Silver – suitable for entry-level HPC
    4200-series models offer slightly improved performance over previous generations.
  • Intel Xeon Gold – recommended for most HPC workloads
    The best balance of performance and price. In particular, the 6200-series models should be preferred over the 5200-series models, because they have twice the number of AVX-512 units.
  • Intel Xeon Platinum – recommended only for specific HPC workloads
    Although these 8200-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., high core count, large SMP, and large-memory Compute Nodes).

* Given their positioning, the Intel Xeon Bronze products will not be included in our review

Exceptional Computational Performance

The Xeon “Cascade Lake SP” processors deliver new capabilities and unprecedented performance. Most models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and a couple models provide 2 TFLOPS. This performance is achieved with high core counts and AVX-512 instructions with FMA (just as in the previous generation). The plots in the tabs below compare the performance ranges for these new CPUs:

As shown above, the shaded/colored bars indicate the expected performance ranges for each CPU model. The first plot shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than peak FLOPS. Applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and thus achieve lower performance. Those expected performance ranges are shown in the plot of AVX2 Instruction performance.

Although the ordering of the above plots may seem arbitrary, they are sorted by price (as discussed in the next section). The lowest-performance models provide fewer numbers of CPU cores and fewer AVX math units. Higher-end models provide a mix of higher core counts and higher clock speeds. A few CPU models, such as Xeon 6244 and Xeon 8256, strongly favor high clock speeds over CPU core count (which results in lower overall FLOPS throughput). HPC and AI groups are expected to favor the Intel Xeon Gold processor models.
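
The TFLOPS figures discussed above come from the same simple arithmetic as in previous generations: cores x FMA units x 16 FLOPS per cycle x clock. A sketch, using an illustrative 20-core Gold-class CPU and an assumed 2.0GHz all-core AVX-512 clock (actual AVX-512 clocks vary by model and workload):

```python
# Peak FP64 throughput sketch for "Cascade Lake SP" Xeons: each
# AVX-512 FMA unit retires 16 FP64 FLOPS per cycle (8 doubles x 2 ops),
# and the 6200/8200-series parts have two such units per core.
def xeon_peak_tflops(cores, avx512_clock_ghz, fma_units=2):
    return cores * fma_units * 16 * avx512_clock_ghz / 1000.0

# Illustrative 20-core part at an assumed 2.0 GHz all-core AVX-512 clock:
xeon_peak_tflops(20, 2.0)  # 1.28 TFLOPS
```

Setting `fma_units=1` models the Silver/lower-Gold parts, which is why the article recommends the 6200-series and above for FLOPS-bound work.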

Intel Xeon “Cascade Lake SP” Price Ranges

The pricing of the Xeon Processor Scalable Family spans a wide range, so budget must be kept at top of mind when selecting options. It would be frustrating to plan for 28-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Cascade Lake SP” processors:

[Chart: Intel Xeon “Cascade Lake SP” CPU prices]

As in the above plot, all the CPUs in this article are sorted by price. Most HPC users are expected to select CPU models from the Gold Xeon 6200-series. These models provide close to peak performance for a price under $4,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8200-series, such as very large memory nodes (>3TB system memory).

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Intel “Cascade Lake SP” Xeon Processor Scalable Family Specifications

The sets of tabs below compare the features and specifications of this new Xeon processor family. As you will see, the Silver (4200-series) and lower-end Gold (5200-series) CPU models offer fewer capabilities and lower performance. The higher-end Gold (6200-series) and Platinum (8200-series) offer more capabilities and higher performance. Additionally, certain CPU SKUs have special models integrating additional specializations:

  • Enabled for Intel Speed Select (indicated with a Y suffix on the part number)
  • Support for up to 4.5TB of memory per CPU socket (indicated with an L suffix on the part number)
    (these same CPUs have a lower-cost alternate SKU supporting 2TB memory per socket, indicated with an M suffix on the part number)
  • Designed for single CPU socket systems (indicated with a U suffix on the part number)
  • All Gold- and Platinum-series CPUs support Intel’s new Optane DC Persistent Memory

In addition to the specifications called out above, technical readers should note that the “Cascade Lake SP” CPU architecture inherits most of the architectural design of the previous “Skylake-SP” architecture, including the mesh processor layout, redesigned L2/L3 caches, greater UPI connectivity between CPU sockets, and improvements to the processor frequency speeds/turbo. A more comprehensive list of features is shown at the end of the article.

Clock Speeds & Turbo Boost

Just as in the previous generation, the “Cascade Lake-SP” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and generating more heat). While a CPU core is executing intensive vector tasks (AVX2 or AVX-512 instructions), the clock speed will be adjusted downwards to keep the processor within its power limits (TDP).

In effect, this will result in the processor running at a lower frequency than the standard clock speed advertised for each model. For that reason, each processor is assigned three frequency ranges:

  • AVX-512 mode: due to the high requirements of AVX-512/FMA instructions, clock speeds will be lowered while executing AVX-512 instructions *
  • AVX2 mode: due to the higher requirements of AVX2/FMA instructions, clock speeds will be somewhat lower while executing AVX instructions *
  • Non-AVX mode: while not executing “heavy” AVX instructions, the processor will operate at the “stock” frequency

Each of the “modes” above is actually a range of CPU clock speeds. The CPU will run at the highest speed possible for the particular set of CPU instructions that have been issued. It is worth noting that these modes are isolated to each core. Within a given CPU, some cores may be operating in AVX mode while others are operating in Non-AVX mode.

* Intel has applied a 1-millisecond hysteresis window on each CPU core to prevent rapid frequency transitions

AVX-512, AVX, and Non-AVX Turbo Boost in Xeon “Cascade Lake-SP” Scalable Family processors

Each CPU also includes the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX-512, AVX, Non-AVX).

The plots below demonstrate processor clock speeds under the following conditions:

  • All cores on the CPU actively running Non-AVX, AVX, or AVX-512 instructions
  • A single active core running Non-AVX, AVX, or AVX-512 instructions (all other cores on the CPU must be idle)

The dotted lines represent the range of clock speeds for Non-AVX instructions. The thin grey bars represent the range of clock speeds for AVX2/FMA instructions. The thicker shaded/colored bars represent the range of clock speeds for AVX-512/FMA instructions.

Note that despite the clear rules stated above, each Turbo Boost value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX-512, AVX, or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

Despite the perceived reduction in performance when running these vector instructions, keep in mind that AVX-512 roughly doubles the number of operations which can be completed per cycle. Although the clock speeds might be reduced by nearly 1GHz, the overall throughput is increased. HPC users should expect their processors to be running in AVX or AVX-512 mode most of the time.
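
The trade-off described above can be made concrete with assumed clock speeds (illustrative values only): even with a sizable downclock, AVX-512 comes out ahead on FMA-heavy code because per-cycle throughput doubles.

```python
# Per-core throughput comparison under the frequency modes described
# above. Clock values are assumptions for illustration; real AVX-512
# and non-AVX clocks vary by model and workload.
def relative_gflops(flops_per_cycle, clock_ghz):
    return flops_per_cycle * clock_ghz

avx2 = relative_gflops(16, 3.0)    # 48 GFLOPS/core at an assumed 3.0 GHz
avx512 = relative_gflops(32, 2.1)  # ~67 GFLOPS/core at an assumed 2.1 GHz
# AVX-512 wins despite running ~0.9 GHz slower.
```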

Cost-Effectiveness and Power Efficiency of Xeon “Cascade Lake SP” CPUs

Many of these new processors have the same price structure as earlier Xeon server CPU families. However, the prices and power requirements for some of the premium models are fairly high. Savvy readers may find the following facts useful:

  • HPC applications run best on the higher-end Gold and Platinum CPU models (6200- and 8200-series), as all of the lower-end CPUs provide only half the number of math units.
  • Applications which can be satisfied by a single CPU will benefit greatly from the single-socket Xeon 62xxU models
  • The Platinum models (8200-series) are generally targeted towards Enterprise and Finance – these carry higher prices than other models.

The plots below compare the cost-effectiveness and power efficiency of these CPU models. The intent is to go beyond the raw “speeds and feeds” of the processors to determine which models will be most attractive for HPC and Deep Learning/AI deployments.

Recommended CPU Models for HPC & AI/Deep Learning

Although many of these CPU models will offer excellent performance, it is common for HPC sites to set a floor on CPU clock speeds (usually around 2.5GHz). The intent is to ensure that no workload suffers too low of a performance. While there are users who would prefer higher clock speeds, experience shows that most groups settle on a value of 2.5GHz to 2.6GHz. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

Summary of features in Xeon Scalable Family “Cascade Lake-SP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. They also include lower-level changes that may be of interest to expert users. The list below provides a more detailed summary of relevant technology features in Cascade Lake-SP:

  • Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 16-, 18-, 20-, 24-, and 26-cores)
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
  • Introduction of new AVX-512 VNNI instruction:
    • Intel Deep Learning Boost – the new 8-bit Vector Neural Network Instruction (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • As introduced with “Haswell” and “Broadwell”, these CPUs continue to support 128-bit AVX and 256-bit AVX2 Advanced Vector Extensions with FMA3
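To make the VNNI fusion concrete, the sketch below emulates the arithmetic of a single 32-bit lane of VPDPBUSD in plain Python. This is an illustration of the instruction's semantics only (the function name is ours, and real code would use compiler intrinsics or a library such as oneDNN):

```python
def vpdpbusd_lane(acc, u8x4, s8x4):
    """Emulate one 32-bit lane of VPDPBUSD: acc += sum(u8[i] * s8[i]).

    u8x4: four unsigned 8-bit values (0..255)
    s8x4: four signed 8-bit values (-128..127)
    """
    return acc + sum(u * s for u, s in zip(u8x4, s8x4))

# One AVX-512 VNNI instruction performs this across 16 lanes at once --
# work that previously required three instructions (VPMADDUBSW,
# VPMADDWD, VPADDD).
result = vpdpbusd_lane(1000, [1, 2, 3, 4], [10, -10, 20, -20])
print(result)  # -> 970
```

Because the multiply, widen, and accumulate steps are fused into one operation, inference kernels built on 8-bit integer math issue fewer instructions per dot product.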

Memory capacity & performance features:

  • Six-channel memory controller on each CPU
  • Support for DDR4 memory speeds up to 2933MHz (up from 2666MHz)
  • Single DIMM per channel operates at up to 2933MHz; two DIMMs per channel operate at up to 2666MHz
  • Large-memory capabilities with Intel Optane DC Persistent Memory
  • All CPU models support up to 1TB-per-socket system memory
  • Optional CPU support for 2TB- or 4.5TB-per-socket system memory (only available on certain SKUs)
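The peak theoretical memory bandwidth per socket follows directly from the figures above: six channels of DDR4-2933, each 64 bits (8 bytes) wide. A quick back-of-the-envelope calculation (real-world STREAM results will be lower):

```python
channels = 6
transfers_per_sec = 2933e6   # DDR4-2933: 2933 MT/s
bytes_per_transfer = 8       # each memory channel is 64 bits wide

peak_gb_per_sec = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"Peak theoretical bandwidth: {peak_gb_per_sec:.1f} GB/s per socket")
# -> ~140.8 GB/s per socket
```

Populating two DIMMs per channel drops the data rate to 2666MHz, which reduces this peak proportionally.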

Introduction of Intel Speed Select processor models:

  • Certain processors support three distinct operating points
  • Each operating point provides a different number of CPU cores
  • CPU clock and Turbo Boost speeds optimized for each core count

Additional platform features:

  • Integrated hardware-based security mitigations against side-channel attacks
  • Fast links between CPU sockets with up to three 10.4GT/s UPI links
  • I/O connectivity of 48 lanes of generation 3.0 PCI-Express per CPU
  • CPU cores are arranged in an “Uncore” mesh interconnect
  • Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (48 PCI-E lanes per socket)
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. Clock speeds are boosted higher even when many cores are in use. There are three tiers of Turbo Boost clock speeds depending upon the type of instructions being executed:
    • Non-AVX: Operations that are not math intensive, or “light” AVX/AVX2 instructions which don’t involve multiply/FMA
    • AVX: Operations that heavily use the AVX/AVX2 units, or that use the AVX-512 unit (but not the multiply/FMA instructions)
    • AVX-512: Operations that heavily use the AVX-512 units, including multiply/FMA instructions
  • Intel QuickData Technology (CBDMA) doubles memory-to-memory copy performance and supports MMIO memory copy
  • Intel Volume Management Device (VMD) provides CPU-level device aggregation for NVMe SSD storage devices
  • Intel Virtual RAID on CPU (VROC) provides RAID support for NVMe SSD storage devices
  • Intel C620-series PCH chipset (formerly codenamed “Lewisburg”) with improved connectivity:
    • Up to four Intel X722 10GbE/1GbE Ethernet ports with iWARP RDMA support
    • PCI-Express generation 3.0 x4 connection from the PCH to the CPUs
    • Support for more integrated SATA3 6Gbps ports (up to 14)
    • Support for more integrated USB 3.0 ports (up to 10)
    • Integrated Intel QuickAssist Technology, which accelerates many cryptographic and compression/decompression workloads
  • Enhanced CPU Core Microarchitecture:
    • Larger and improved branch predictor, higher throughput decoder, larger out-of-order window
    • Improved scheduler and execution engine; improved throughput and latency of divide/sqrt
    • More load/store bandwidth, deeper load/store buffers, improved prefetcher
    • One or Two AVX-512 512-bit FMA units per core
    • Support for the following AVX-512 instruction types:
      AVX512-F, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-CD
    • 1MB dedicated L2 cache per core
    • A 10% (geomean) improvement in instructions per cycle (IPC) versus the “Broadwell” generation CPUs
  • Re-architected L2/L3 cache hierarchy:
    • Each CPU core contains 1MB L2 private cache (up from 256KB)
    • Each core’s private L2 acts as primary cache
    • Each CPU contains >1.3MB/core of shared L3 cache (for when the private L2 cache is exhausted)
    • The shared L3 cache is non-inclusive (does not keep copies of the L2 caches)
    • Larger 64-entry L2 TLB for 1GB pages (up from 16 entries)
  • Dual or Triple Ultra Path Interconnect (UPI) links between processor sockets improve communication speeds for data-intensive applications
  • RDSEED instruction for high-quality, non-deterministic, random seed values
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is frequently disabled for HPC applications.
  • Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point
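The AVX-512 figures above translate into a simple peak-FLOPS formula: cores × clock × FMA units × FLOPS per cycle per unit. The sketch below uses example values (a 24-core part at a 2.5GHz sustained AVX-512 clock, not a specific SKU; sustained AVX-512 clocks are typically below the advertised base clock):

```python
cores = 24            # example core count (not a specific SKU)
clock_ghz = 2.5       # example sustained AVX-512 clock speed
fma_units = 2         # one or two AVX-512 FMA units, depending on SKU
flops_per_cycle = 16  # double-precision FLOPS per cycle per FMA unit

peak_gflops = cores * clock_ghz * fma_units * flops_per_cycle
print(f"Peak FP64: {peak_gflops:.0f} GFLOPS ({peak_gflops / 1000:.2f} TFLOPS)")
# -> 1920 GFLOPS (1.92 TFLOPS)
```

Note that SKUs with a single AVX-512 FMA unit deliver half this figure, which is why the per-SKU FMA count matters when comparing models.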

The post Detailed Specifications of the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

Check for memory errors on NVIDIA GPUs
https://www.microway.com/knowledge-center-articles/check-for-memory-errors-on-nvidia-gpus/
Thu, 14 Feb 2019 18:10:35 +0000
Professional NVIDIA GPUs (the Tesla and Quadro products) are equipped with error-correcting code (ECC) memory, which allows the system to detect when memory errors occur. Smaller “single-bit” errors are transparently corrected. Larger “double-bit” memory errors will cause applications to crash, but are at least detected (GPUs without ECC memory would continue operating on the corrupted data).

Under some conditions, GPU memory events are reported to the Linux kernel, in which case you will see such errors in the system logs. However, the GPUs themselves also store the type and date of each event.

It’s important to note that not all ECC errors are due to hardware failures. Stray cosmic rays are known to cause bit flips. For this reason, memory is not considered “bad” when a single error occurs (or even when a number of errors occur). If you have a device reporting tens or hundreds of double-bit errors, please contact Microway tech support for review. You may also wish to review the NVIDIA documentation.

To review the current health of the GPUs in a system, use the nvidia-smi utility:

[root@node7 ~]# nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                           : Thu Feb 14 10:58:34 2019
Driver Version                      : 410.48

Attached GPUs                       : 4
GPU 00000000:18:00.0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

GPU 00000000:3B:00.0
    Retired Pages
        Single Bit ECC              : 15
        Double Bit ECC              : 0
        Pending                     : No

The output above shows one card with no issues and one card with a small number of single-bit errors (that card remains functional and in operation).

If the above report indicates that memory pages have been retired, then you may wish to see additional details (including when the pages were retired). If nvidia-smi reports Pending: Yes, then memory errors have occurred since the last time the system rebooted. In either case, there may be older page retirements that took place.

To review a complete listing of the GPU memory pages which have been retired (including the unique ID of each GPU), run:

[root@node7 ~]# nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv

gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c72e, Single Bit ECC
...
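When checking many GPUs or nodes, it can help to summarize that CSV output programmatically. A minimal sketch (the function name is ours; in practice you would feed it the captured output of the nvidia-smi command above):

```python
import csv
from collections import Counter
from io import StringIO

def count_retired_pages(nvidia_smi_csv):
    """Count retired memory pages per GPU from nvidia-smi CSV output."""
    reader = csv.DictReader(StringIO(nvidia_smi_csv), skipinitialspace=True)
    return dict(Counter(row["gpu_uuid"] for row in reader))

sample = """gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
"""
print(count_retired_pages(sample))
# -> {'GPU-9fa5168d-97bf-98aa-33b9-45329682f627': 2}
```

A per-GPU count like this makes it easy to flag devices whose retirement totals are trending upward between maintenance windows.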

A different type of output must be selected in order to read the timestamps of page retirements. The output is in XML format and may require a bit more effort to parse. In short, try running a report such as shown below:

[root@node7 ~]# nvidia-smi -i 1 -q -x | grep -i -A1 retired_page_addr

<retired_page_address>0x000000000005c05e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005ca0d</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005c72e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:31 2017</retired_page_timestamp>
...
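Rather than grep, the XML output can be parsed properly with Python's standard ElementTree module. This sketch assumes each address/timestamp pair is wrapped in a `retired_page` element, as the paired grep output above suggests (verify the element names against your driver's actual `nvidia-smi -q -x` output):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<nvidia_smi_log><gpu><retired_pages>
  <retired_page>
    <retired_page_address>0x000000000005c05e</retired_page_address>
    <retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
  </retired_page>
</retired_pages></gpu></nvidia_smi_log>"""

def retired_pages_with_timestamps(xml_text):
    """Extract (address, timestamp) pairs from 'nvidia-smi -q -x' output."""
    root = ET.fromstring(xml_text)
    return [(page.findtext("retired_page_address"),
             page.findtext("retired_page_timestamp"))
            for page in root.iter("retired_page")]

print(retired_pages_with_timestamps(SAMPLE))
```

Old timestamps accompanied by a low error count are usually benign; a cluster of recent timestamps is the pattern worth escalating.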

The post Check for memory errors on NVIDIA GPUs appeared first on Microway.

Optimizing the Performance of System Memory
https://www.microway.com/knowledge-center-articles/optimizing-the-performance-of-system-memory/
Wed, 12 Sep 2018 13:37:17 +0000
Compute-intensive applications typically require as much system memory bandwidth as can be provided. For this reason, it is very important that system memory be correctly configured and installed. Microway reviews all systems to ensure proper performance (both during the sales and production/integration stages), however we provide this resource as a reference for those who would like to understand the options.

Improperly-configured memory can result in significant performance reductions. For example, a misconfiguration on the latest Intel Xeon CPUs with 6-channel memory controllers can result in a 65% reduction in memory throughput. This can result in an application running at half the anticipated speed. As you’re considering a new system deployment, please work with our experts to ensure success.

The correct configuration depends upon several factors, including the type of CPUs, the product generation, and the design of the system motherboard. To use the tables below, first select which type and generation of system CPUs will be in use. Then look to the rows which show the optimal memory capacities.

It should be noted that we consider a 64GB DIMM to be the largest available capacity in a single memory slot. Although 128GB and 256GB DIMMs are available, their extreme price and limited availability have made them impractical for most customer use cases.
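The core rule behind most of those configuration tables is balance: every memory channel should hold the same number of identical DIMMs. A hypothetical helper illustrating that check (this is our simplification, not a Microway tool, and it ignores secondary rules such as mixed-rank restrictions):

```python
def is_balanced(dimms_per_channel, channels=6):
    """Check that every memory channel is populated identically.

    dimms_per_channel: number of DIMMs installed in each channel,
    e.g. [1, 1, 1, 1, 1, 1] for one DIMM per channel on 6 channels.
    """
    if len(dimms_per_channel) != channels:
        return False
    return len(set(dimms_per_channel)) == 1

print(is_balanced([1, 1, 1, 1, 1, 1]))  # -> True  (balanced)
print(is_balanced([1, 1, 1, 1, 1, 0]))  # -> False (one empty channel)
```

An unbalanced population like the second example forces part of the address space onto fewer channels, which is exactly the scenario that produces the large throughput reductions described above.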

The post Optimizing the Performance of System Memory appeared first on Microway.

In-Depth Comparison of NVIDIA Quadro “Turing” GPU Accelerators
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-quadro-turing-gpu-accelerators/
Tue, 21 Aug 2018 22:05:31 +0000
This article provides in-depth details of the NVIDIA Quadro RTX “Turing” GPUs. NVIDIA “Turing” GPUs bring an evolved core architecture and add dedicated ray tracing units to the previous-generation “Volta” architecture. Turing GPUs began shipping in late 2018.

Important features available in the “Turing” GPU architecture include:

  • New RT ray-tracing cores, enabling real-time ray-tracing performance for the first time
  • Evolved Deep Learning performance with over 130 Tensor TFLOPS (training) and 500 TOPS INT4 (inference) throughput
  • NVLink 2.0 between GPUs—when optional NVLink bridges are added—supporting up to 2 bricks and up to 100GB/sec bidirectional bandwidth
  • New GDDR6 memory with a substantial improvement in memory performance compared to previous-generation GPUs

Quadro “Turing” GPU Specifications

The table below summarizes the features of the available Quadro Turing GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

* FLOPS and TOPS calculations are presented at Max Boost
† Passively-cooled models are available with slightly reduced clock speeds

The post In-Depth Comparison of NVIDIA Quadro “Turing” GPU Accelerators appeared first on Microway.

In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-volta-gpu-accelerators/
Mon, 12 Mar 2018 22:25:26 +0000
This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed “Volta”). “Volta” GPUs improve upon the previous-generation “Pascal” architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018; Tesla V100S was released in late 2019. Note: these have since been superseded by the NVIDIA Ampere GPU architecture.

This page is intended to be a fast and easy reference of key specs for these GPUs. You may wish to browse our Tesla V100 Price Analysis and Tesla V100 GPU Review for more extended discussion.

Important features available in the “Volta” GPU architecture include:

  • Exceptional HPC performance with up to 8.2 TFLOPS double- and 16.4 TFLOPS single-precision floating-point performance.
  • Deep Learning training performance with up to 130 TFLOPS FP16 half-precision floating-point performance.
  • Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
  • Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU
  • NVLink enables an 8~10X increase in bandwidth between the Tesla GPUs and from GPUs to supported system CPUs (compared with PCI-E).
  • High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
  • Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
  • Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
  • Cooperative Groups – a new programming model introduced in CUDA 9 for organizing groups of communicating threads
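The double-precision figure above can be derived from the GPU's layout. The sketch below uses the GV100 configuration with an approximate ~1.6GHz boost clock (roughly the Tesla V100S operating point; the exact boost clock varies by model, so treat this as an estimate):

```python
sms = 80                # streaming multiprocessors in a full GV100
fp64_cores_per_sm = 32  # double-precision units per SM
flops_per_core = 2      # fused multiply-add counts as 2 FLOPS per cycle
boost_clock_ghz = 1.6   # approximate boost clock (varies by model)

peak_tflops = sms * fp64_cores_per_sm * flops_per_core * boost_clock_ghz / 1000
print(f"Peak FP64: ~{peak_tflops:.1f} TFLOPS")
# -> ~8.2 TFLOPS
```

The same formula with the standard Tesla V100's lower 1.53GHz boost clock yields its ~7.8 TFLOPS rating.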

Tesla “Volta” GPU Specifications

The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures

Feature                            | Kepler GK210                                                                   | Pascal GP100            | Volta GV100
Compute Capability ^               | 3.7                                                                            | 6.0                     | 7.0
Threads per Warp                   | 32                                                                             | 32                      | 32
Max Warps per SM                   | 64                                                                             | 64                      | 64
Max Threads per SM                 | 2048                                                                           | 2048                    | 2048
Max Thread Blocks per SM           | 16                                                                             | 32                      | 32
Max Concurrent Kernels             | 32                                                                             | 128                     | 128
32-bit Registers per SM            | 128 K                                                                          | 64 K                    | 64 K
Max Registers per Thread Block     | 64 K                                                                           | 64 K                    | 64 K
Max Registers per Thread           | 255                                                                            | 255                     | 255
Max Threads per Thread Block       | 1024                                                                           | 1024                    | 1024
L1 Cache Configuration             | split with shared memory                                                       | 24KB dedicated L1 cache | 32KB ~ 128KB (dynamic with shared memory)
Shared Memory Configurations       | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total)               | 64KB                    | configurable up to 96KB; remainder for L1 cache (128KB total)
Max Shared Memory per Thread Block | 48KB                                                                           | 48KB                    | 96KB*
Max X Grid Dimension               | 2^32-1                                                                         | 2^32-1                  | 2^32-1
Hyper-Q                            | Yes                                                                            | Yes                     | Yes
Dynamic Parallelism                | Yes                                                                            | Yes                     | Yes
Unified Memory                     | No                                                                             | Yes                     | Yes
Pre-Emption                        | No                                                                             | Yes                     | Yes

^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48KB requires dynamic shared memory
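The per-SM limits in the table interact: the number of resident thread blocks on an SM is the minimum allowed by each constraint. A simplified estimator using the Volta column's values (this is our illustration; it ignores shared-memory limits and register allocation granularity, which the CUDA occupancy calculator handles properly):

```python
def max_resident_blocks(threads_per_block, regs_per_thread,
                        max_threads_sm=2048, max_blocks_sm=32,
                        regs_per_sm=64 * 1024):
    """Estimate resident thread blocks per Volta SM (simplified)."""
    by_threads = max_threads_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(by_threads, by_regs, max_blocks_sm)

# 256-thread blocks at 32 registers per thread:
print(max_resident_blocks(256, 32))   # -> 8 blocks (2048 threads: full SM)
# Doubling register usage halves residency:
print(max_resident_blocks(256, 64))   # -> 4 blocks (register-file limited)
```

Sketches like this make clear why compiling with a lower register target can raise occupancy even though no table entry changed.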

Hardware-accelerated video encoding and decoding

All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.

The post In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators appeared first on Microway.
