Knowledge Center Archives - Microway
https://www.microway.com/category/knowledge-center-articles/

Common Maintenance Tasks (Clusters)
https://www.microway.com/knowledge-center-articles/test-knowledge-center-article/
Tue, 05 Mar 2024

The following items should be completed to maintain the health of your Linux cluster. For servers and workstations, please see Common Maintenance Tasks (Workstations and Servers).

Backup non-replaceable data

Remember that RAID is not a replacement for backups. If your system is stolen, hacked, or catches fire, your data will be gone forever. Automate this task or you will forget.

Compute clusters are built from a large group of computers, so there are many different places for data to hide. Make users aware of your backup policies and be certain they aren’t storing vital data on the compute nodes. Let them know which areas are scratch space (for temporary files) and which areas are regularly backed up and designed for user data.

Strongly consider keeping a backup image of the entire head node installation (including a copy of the compute node software image). Bare-metal recovery software is available if you’re not certain how to do this yourself.

As for the user data:

  • For many groups, a weekly or monthly cron job is fine. Write a script calling rsync or tar which writes the files to a separate server, NAS, or SAN. Place the script in /etc/cron.weekly/ or /etc/cron.monthly/.
  • Users with more complex requirements should look at AMANDA or Bacula
  • Tape backup systems are still available for those who prefer them. Contact us.

Verify the health of your Storage

Drive sectors can go bad silently. Schedule regular verifies to catch problems before they lead to data loss. Automate them or you will forget.

  • Linux Software RAID (mdadm) arrays can easily be kicked into verify mode. Many distributions (Red Hat, CentOS, Ubuntu) ship their own utilities for scheduling this. To manually start a verify, run this line for each RAID array (as root):
    echo check > /sys/block/md#/md/sync_action
    Monitor /proc/mdstat and the output of dmesg to follow the status of each verify.
  • Hardware RAID controllers provide their own methods for automated verifies and alert notification. Reference the controller’s manual.
  • Enterprise and parallel storage systems typically provide their own management interfaces (separate from your cluster management software). Familiarize yourself with these interfaces and enable e-mail alerts.
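For the software-RAID case above, a small cron-able script can start a verify pass on every md array at once. This is a sketch; the sysfs root is parameterized only so the loop can be exercised against a mock directory tree.

```shell
#!/bin/sh
# Start a verify ("check") pass on every Linux software RAID array.
# The sysfs root defaults to /sys/block; it is a parameter only so
# this sketch can be tested without real md devices.
start_md_verifies() {
    sysfs="${1:-/sys/block}"
    for action in "$sysfs"/md*/md/sync_action; do
        [ -e "$action" ] || continue        # glob matched nothing
        echo check > "$action"
        echo "verify started on ${action%/md/sync_action}"
    done
}

start_md_verifies "$@"
# Follow progress in /proc/mdstat and dmesg.
```

Dropped into /etc/cron.monthly/ (and run as root), this keeps every array verified without anyone having to remember the md device names.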

Monitor system alarms and system health

If Microway provided you with a preconfigured cluster, then we performed the software integration before the cluster arrived at your site. The cluster can monitor its own health (via MCMS™ or Bright Cluster Manager), but you should familiarize yourself with the user interface and double-check that e-mail alerts are being sent to the correct e-mail address.

Each system in the cluster also supports traditional monitoring and management features:

  • Preferred: learn how to use the IPMI capability for remote monitoring and management. You’ll spend a lot less time trekking to the datacenter.
  • Alternative: listen for system alarms and check for warning LEDs.

Don’t ignore alarms! If you put off repairs, a second failure may compound the first, and you could find your cluster needs to be rebuilt from scratch.
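The IPMI route lends itself to automation. Below is a hypothetical cron-able filter built on the standard ipmitool utility; the '|'-separated `ipmitool sensor` output format, with the status in the fourth column, is an assumption worth verifying against your own BMCs.

```shell
#!/bin/sh
# Filter `ipmitool sensor` output down to sensors that are not healthy,
# so a cron job can e-mail only the problems. Feed it live data with:
#   ipmitool -H <bmc-address> -U <user> -P <password> sensor | flag_bad_sensors
# Assumption: column 4 of the '|'-separated output is the sensor status
# ("ok", "cr", "nr", ...); "na"/"ns" entries are absent sensors.
flag_bad_sensors() {
    awk -F'|' '{
        status = $4
        gsub(/ /, "", status)
        if (status != "" && status != "ok" && status != "na" && status != "ns")
            print $0
    }'
}
```

Run across all BMC addresses from cron; because cron mails any non-empty output, a failing fan or over-temperature CPU generates an e-mail while healthy nodes stay silent.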

Schedule and Test System Software Updates

Although modern Linux distributions have made it very easy to keep software packages up-to-date, there are some pitfalls an administrator might encounter when updating software on a compute cluster.

Cluster software packages are usually not managed from the same software repository as the standard Linux packages, so the updater may unknowingly break compatibility. In particular, upgrading or changing the Linux kernel on your cluster may require manual re-configuration – particularly for systems with large/parallel storage, InfiniBand and/or GPU compute processor components. These types of systems usually require that kernel modules or other packages be recompiled against the new kernel. Test updates on a single system before making such changes on the entire cluster!

Please keep in mind that updating the software on your cluster may break existing functionality, so don’t update just for the sake of updating! Plan an update schedule and notify users in case there is downtime from unexpected snags.

Detailed Specifications of the “Ice Lake SP” Intel Xeon Processor Scalable Family CPUs
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-ice-lake-sp-intel-xeon-processor-scalable-family-cpus-2/
Tue, 06 Apr 2021
This article provides in-depth discussion and analysis of the 10nm Intel Xeon Processor Scalable Family (formerly codenamed “Ice Lake-SP” or “Ice Lake Scalable Processor”). These processors replace the previous 14nm “Cascade Lake-SP” microarchitecture and are available for sale as of April 6, 2021.

The “Ice Lake SP” CPUs are the 3rd generation of Intel’s Xeon Scalable Processor family. This generation brings new features, increased performance, and new server/workstation platforms. The Xeon “Ice Lake SP” CPUs cannot be installed into previous-generation systems. Those considering a new deployment are encouraged to review their options with one of our experts.

Highlights of the features in Xeon Scalable Processor Family “Ice Lake SP” CPUs include:

  • Up to 40 processor cores per socket (with options for 8, 12, 16, 18, 20, 24, 26, 28, 32, 36, and 38 cores)
  • Up to 38% higher per-core performance through micro-architecture improvements (at same clock speed vs “Cascade Lake SP”)
  • Significant memory performance & capacity increases:
    • Eight-channel memory controller on each CPU (up from six)
    • Support for DDR4 memory speeds up to 3200MHz (up from 2933MHz)
    • Large-memory capacity with Intel Optane Persistent Memory
    • All CPU models support up to 6TB per socket (combined system memory and Optane persistent memory)
  • Increased link speed between CPU sockets: 11.2GT/s UPI links (up from 10.4GT/s)
  • I/O Performance Improvements – more than twice the throughput of “Cascade Lake SP”:
    • PCI-Express generation 4.0 doubles the throughput of each PCI-E lane (compared to gen 3.0)
    • Support for 64 PCI-E lanes per CPU socket (up from 48 lanes)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Two AVX-512 FMA units per CPU core (available in all Ice Lake-SP CPU SKUs)
  • Continued support for deep learning inference with AVX-512 VNNI instruction:
    • Intel Deep Learning Boost (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • Improvements to Intel Speed Select processor configurability:
    • Performance Profiles: certain processors support three distinct core count/clock speed operating points
    • Base Frequency: specific CPU cores are given higher base clock speeds; the remaining cores run at lower speeds
    • Turbo Frequency: specific CPU cores are given higher turbo-boost speeds; the remaining cores run at lower speeds
    • Core Power: each CPU core is prioritized; when surplus frequency is available, it is given to high-priority cores
  • Integrated hardware-based security improvements and total memory encryption

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Continued Specialization of Xeon CPU SKUs

Those already familiar with Intel Xeon will see this processor family is divided into familiar tiers: Silver, Gold, and Platinum. The Silver and Gold models are in the price/performance range familiar to HPC/AI teams. Platinum models are in a higher price range. The low-end Bronze tier present in previous generations has been dropped.

Further, Intel continues to add new specialized CPU models that are optimized for particular workloads and environments. Many of these specialized SKUs are not relevant to readers here, but we summarize them briefly:

  • N: network function virtualization (NFV) optimized
  • P: virtualization-optimized (with a focus on clock frequency)
  • S: max SGX enclave size
  • T: designed for higher-temperature environments (NEBS)
  • V: virtualization-optimized (with focus on high-density/low-power)

Targeting specific workloads and environments provides the best performance and efficiency for those use cases. However, using these CPUs for other workloads may reduce performance, as the CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads; running other workloads on these optimized CPUs will likely lead to CPU throttling. Considering these limitations, the workload-optimized models above are not included in our review.

Four Xeon CPU specializations relevant to HPC & AI use cases

There are several specialized Xeon CPU options which are relevant to high performance computationally-intensive workloads. Each capability is summarized below and included in our analysis.

  • Liquid-cooled – Xeon 8368Q CPU: optimized for liquid-cooled deployment, this CPU SKU offers high core counts along with higher CPU clock frequencies. The high clock frequencies are made possible only through the more effective cooling provided by liquid-cooled datacenters.
  • Media, AI, and HPC – Xeon 8352M CPU: optimized for AVX-heavy vector instruction workloads as found in media processing, AI, and HPC; this CPU SKU offers improved performance per watt.
  • Performance Profiles – Y: a set of CPU SKUs with support for Intel Speed Select Technology – Performance Profiles. These CPUs are indicated with a Y suffix in the model name (e.g., Xeon 8352Y) and provide flexibility for those with mixed workloads. Each CPU supports three different operating profiles with separate CPU core count, base clock and turbo boost frequencies, as well as operating wattages (TDP). In other words, each CPU could be thought of as three different CPUs. Administrators switch between profiles via system BIOS, or through Operating Systems with support for this capability (Intel SST-PP). Note that several of the other specialized CPU SKUs also support multiple Performance Profiles (e.g., Xeon 8352M).
  • Single Socket – U: single-socket optimized. The CPUs designed for a single socket are indicated with a U suffix in the model name (e.g., Xeon 6312U). These CPUs are more cost-effective. However, they do not include UPI links and thus can only be installed in systems with a single processor.

Summary of Xeon “Ice Lake-SP” CPU tiers

With the Bronze CPU tier no longer present, all models in this CPU family are well-suited to HPC and AI (though some will offer more performance than others). Before diving into the details, we provide a high-level summary of this Xeon processor family:

  • Intel Xeon Silver – suitable for entry-level HPC
    The Xeon Silver 4300-series CPU models provide higher core counts and increased memory throughput compared to previous generations. However, their performance is limited compared to Gold and Platinum (particularly on Core Count, Clock Speed, Memory Performance, and UPI speed).
  • Intel Xeon Gold – recommended for most HPC workloads
    Xeon Gold 5300- and 6300-series CPUs provide the best balance of performance and price. In particular, the 6300-series models should be preferred over the 5300-series models, because the 6300-series CPUs offer improved Clock Speeds and Memory Performance.
  • Intel Xeon Platinum – only for specific HPC workloads
    Although 8300-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., highest core count, large L3 cache).

Xeon “Ice Lake SP” Computational Performance

With this new family of Xeon processors, Intel once again delivers unprecedented performance. Nearly every model provides over 1 TFLOPS (one trillion double-precision 64-bit floating-point operations per second), many models exceed 2 TFLOPS, and a few approach 3 TFLOPS. These performance levels are achieved through high core counts and AVX-512 instructions with FMA (as in the first and second Xeon Scalable generations). The plots below compare the performance ranges for these new CPUs:
[Chart: theoretical GFLOPS of the Intel Xeon “Ice Lake SP” CPUs with AVX-512 instructions]

[Chart: theoretical GFLOPS of the Intel Xeon “Ice Lake SP” CPUs with AVX2 instructions]

In the charts above, the shaded/colored bars indicate the expected performance range for each CPU model. The performance is a range rather than a specific value, because CPU clock frequencies scale up and down on a second-by-second basis. The precise achieved performance depends upon a variety of factors including temperature, power envelope, type of cooling technology, the load on each CPU core, and the type(s) of CPU instructions being issued to each core.

The first chart shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes can issue exclusively AVX-512 FMA instructions (e.g., HPL LINPACK). Most applications issue a mix of instructions and will achieve lower than peak FLOPS. Further, applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and will achieve lower performance. Computational applications which do not utilize AVX-512 instructions will most likely utilize AVX2 instructions (as shown in the second chart).

Intel Xeon “Ice Lake SP” Price Ranges

The pricing of the 3rd-generation Xeon Processor Scalable Family spans a wide range, so budget must be kept in mind when selecting options. It would be frustrating to plan on 38-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Ice Lake SP” processors:

As shown in the above plot, the CPUs in this article have been sorted by tier and by price. Most HPC users are expected to select CPU models from the Gold Xeon 6300-series; these models provide close to peak performance for around $3,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8300-series.

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Recommended Xeon CPU Models for HPC & AI/Deep Learning

As stated at the top, most of this new CPU family offers excellent performance. However, it is common for HPC sites to set a floor on CPU clock speeds (usually around 2.5GHz), with the intent that no workload suffers unacceptably low performance. While there are users who demand even higher clock speeds, experience shows that most groups settle on a minimum clock speed in the 2.5GHz to 2.6GHz range. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

[Chart: core counts of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

[Chart: AVX-512 throughput of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

[Chart: AVX2 throughput of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

[Chart: cost-effectiveness of Intel Xeon “Ice Lake SP” models with 2.5+GHz clock speeds]

Detailed Specifications of the AMD EPYC “Milan” CPUs
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-amd-epyc-milan-cpus/
Mon, 15 Mar 2021
This article provides in-depth discussion and analysis of the 7nm AMD EPYC processor (codenamed “Milan” and based on AMD’s Zen3 architecture). EPYC “Milan” processors replace the previous “Rome” processors and are available for sale as of March 15th, 2021.

These new CPUs are the third iteration of AMD’s EPYC server processor family. They are compatible with existing workstation and server platforms that supported “Rome”, but include new performance and security improvements. If you’re looking to upgrade to or deploy these new CPUs, please speak with one of our experts to learn more.

Important features/changes in EPYC “Milan” CPUs include:

  • Up to 64 processor cores per socket (with options for 8, 16, 24, 28, 32, 48, and 56 cores)
  • Improved CPU clock speeds up to 3.7GHz (with Max Boost speeds up to 4.1GHz)
  • Unified 32MB L3 cache shared between each set of 8 cores (instead of two separate 16MB caches)
  • Increase in instructions completed per clock cycle (IPC)
  • IOMMU for improved IO performance in virtualized environments
  • The security/memory encryption features present in “Rome”, along with SEV-SNP support (protecting against malicious hypervisors)
  • Plus all the advantages of the previous “Rome” generation:
    • Full support for 256-bit AVX2 instructions with two 256-bit FMA units per CPU core
    • Up to 16 double-precision FLOPS per cycle per core
    • Eight-channel memory controller on each CPU
    • Support for DDR4 memory speeds up to 3200MHz
    • Up to 4TB memory per CPU socket
    • Up to 256MB L3 cache per CPU
    • Support for PCI-Express generation 4.0 (which doubles the throughput of gen 3.0)
    • 128 lanes of PCI-Express 4.0 per CPU socket

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our general guidance for selecting among the EPYC options is shown below. Note that certain applications may deviate from this general advice (e.g., software which benefits from particularly high clock speeds or larger L3 cache per core).

  • 8-core EPYC CPUs – not recommended for HPC
    While perfect for particular applications, these models are not as cost-effective as many of the higher core count options.
  • 16-core to 28-core EPYC CPUs – suitable for most HPC workloads
    While not typically offering the best cost-effectiveness, they provide excellent performance at lower price points.
  • 32-core EPYC CPUs – excellent for HPC workloads
    These models offer excellent price/performance along with higher clock speeds and core counts.
  • 48-core to 64-core EPYC CPUs – suitable for certain HPC workloads
    Although these models with high core counts may provide the highest cost-effectiveness and power efficiency, some applications exhibit diminishing returns at the highest core counts. Scalable applications that are not memory bandwidth bound will benefit the most from these EPYC CPUs.

Microway provides a Test Drive cluster to assist in evaluating and comparing products as users determine the ideal specifications for their new HPC & AI deployments. We would be happy to help you evaluate AMD EPYC processors as you plan your next deployment.

AMD EPYC “Milan” Computational Performance

This latest iteration of EPYC CPUs offers excellent performance. However, many of the on-paper comparisons between this generation and the previous generation do not demonstrate large gains; application benchmarking will be needed to demonstrate many of them (such as those provided by the larger/unified L3 cache and the IPC improvements). That being said, most models in this generation provide at least 1 TFLOPS (one trillion double-precision 64-bit floating-point operations per second), and the 64-core CPUs provide over 2 TFLOPS. The plot below shows the expected performance across this new CPU line-up:
[Chart: theoretical GFLOPS of the AMD EPYC “Milan” CPUs with AVX2 instructions]

In the chart above, shaded/colored bars indicate the expected performance ranges for each CPU model on traditional HPC applications that use double-precision 64-bit math operations. Peak performance numbers are achieved when executing 256-bit AVX2 instructions with FMA. Note that only a small set of applications are able to use exclusively AVX2 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than the peak FLOPS values shown above. Applications which have not been re-compiled in recent years (with a compiler supporting AVX2 instructions) would achieve lower performance.
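The peak numbers above follow from the per-core throughput: two 256-bit AVX2 FMA units × 8 double-precision FLOPS per cycle = 16 DP FLOPS per cycle per core. A sketch (the 2.45GHz clock used for the 64-core model is an illustrative assumption):

```shell
#!/bin/sh
# Peak double-precision GFLOPS for an EPYC "Milan" CPU:
#   cores x clock (GHz) x 16 DP FLOPS/cycle (two 256-bit FMA units)
peak_avx2_gflops() {
    awk -v cores="$1" -v ghz="$2" \
        'BEGIN { printf "%.1f\n", cores * ghz * 16 }'
}

# 64-core model at an assumed 2.45GHz all-core clock:
peak_avx2_gflops 64 2.45    # prints 2508.8 (over 2 TFLOPS, as above)
```

As with the Xeon comparison, plug in the sustained all-core frequency for vectorized code rather than the advertised boost clock.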

The dotted lines above each bar indicate the possible peak performance were all CPU cores operating at boosted clock speeds. While theoretically possible for short amounts of time, sustained performance at these increased CPU frequencies is not expected. Sections of code with dense, vectorized instructions are very demanding, and typically result in each core slightly lowering clock speeds (a behavior not unique to AMD CPUs). While AMD has not published specific clock speed expectations for such codes, Microway expects the EPYC “Milan” CPUs to operate near their standard/published clock speed values when all cores are in use.

Throughout this article, the CPU models are sorted largely by price. The lowest-priced models provide fewer CPU cores and less L3 cache; higher-end models offer higher core counts and increased performance. HPC and AI groups are generally expected to favor the processor models in the middle of the pack, as the highest-core-count CPUs are priced at a premium.

Note that those models which only support single-CPU installations are separated on the left side of each plot.

AMD “Milan” EPYC Processor Specifications

The tabs below compare the features and specifications of this 3rd iteration of the EPYC processors. Note that CPU models ending with a P suffix are designed for single-socket systems and do not operate in dual-socket systems; all other CPU models are compatible with both single- and dual-socket systems. The P-series EPYC processors tend to be priced lower and can thus be quite cost-effective.

Editor’s note: complete pricing was not available at time of publication, so additional analysis of price and cost-effectiveness of each CPU SKU will be added to this article when available.

In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-ampere-gpu-accelerators/
Sat, 20 Jun 2020
This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by end of 2020).

Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second version dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below – speak with one of our GPU experts for a personalized summary of the options best suited to your needs.

Computational “Ampere” GPU architecture – important features and changes:

  • Exceptional HPC performance:
    • 9.7 TFLOPS FP64 double-precision floating-point performance
    • Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
    • 19.5 TFLOPS FP32 single-precision floating-point performance
  • Exceptional AI deep learning training and inference performance:
    • TensorFloat 32 (TF32) instructions improve performance without loss of accuracy
    • Sparse matrix optimizations potentially double training and inference performance
    • Speedups of 3x~20x for network training, with sparse TF32 TensorCores (vs Tesla V100)
    • Speedups of 7x~20x for inference, with sparse INT8 TensorCores (vs Tesla V100)
    • Tensor Cores support many instruction types: FP64, TF32, BF16, FP16, I8, I4, B1
  • High-speed HBM2 Memory delivers 40GB or 80GB capacity at 1.6TB/s or 2TB/s throughput
  • Multi-Instance GPU (MIG) allows each A100 GPU to run up to seven separate/isolated applications
  • 3rd-generation NVLink doubles transfer speeds between GPUs
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
  • Larger and Faster L1 Cache and Shared Memory for improved performance
  • Improved L2 Cache is twice as fast and nearly seven times as large as L2 on Tesla V100
  • Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x increase in L2 capacity.
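The headline 9.7 TFLOPS FP64 figure can be reproduced from the published core count and boost clock (values taken from the specification table later in this article):

```shell
#!/bin/sh
# Peak FP64 GFLOPS from CUDA core count and clock:
#   FP64 cores x boost clock (GHz) x 2 (one FMA = 2 FLOPs)
gpu_fp64_gflops() {
    awk -v cores="$1" -v ghz="$2" \
        'BEGIN { printf "%.1f\n", cores * ghz * 2 }'
}

# A100: 3,456 FP64 CUDA cores at a 1.410GHz boost clock:
gpu_fp64_gflops 3456 1.410    # prints 9745.9 (~9.7 TFLOPS)
# Tensor Core FP64 instructions double this, to ~19.5 TFLOPS.
```

The same arithmetic applies to any of the GPUs in the tables below, given their core counts and clocks.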

Visualization “Ampere” GPU architecture – important features and changes:

  • Double FP32 processing throughput with upgraded Streaming Multiprocessors (SM) that support FP32 computation on both datapaths
    (previous generations provided one dedicated FP32 path and one dedicated Integer path)
  • 2nd-generation RT cores provide up to a 2x increase in raytracing performance
  • 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
  • 3rd-generation NVLink provides up to 56.25 GB/sec bandwidth between pairs of GPUs in each direction
  • GDDR6X memory providing up to 768 GB/s of GPU memory throughput
  • 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
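The PCI-Express 4.0 claims above are easy to sanity-check: gen 4.0 runs at 16 GT/s per lane with 128b/130b encoding, so an x16 link carries about 31.5 GB/s in each direction (vendors often round this to 2 GB/s per lane, hence the "64 GB/s bidirectional" figures quoted for these GPUs). A sketch:

```shell
#!/bin/sh
# Usable PCI-E bandwidth per direction, in GB/s:
#   lanes x GT/s x (128/130 encoding efficiency) / 8 bits-per-byte
pcie_gbps_per_direction() {
    awk -v lanes="$1" -v gts="$2" \
        'BEGIN { printf "%.1f\n", lanes * gts * 128 / 130 / 8 }'
}

pcie_gbps_per_direction 16 16    # PCI-E 4.0 x16: prints 31.5
pcie_gbps_per_direction 16 8     # PCI-E 3.0 x16: prints 15.8
```

The two calls show the generation-over-generation doubling directly: gen 4.0 moves twice as much data per lane as gen 3.0.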

As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.

NVIDIA “Ampere” GPU Specifications

High Performance Computing & Deep Learning GPUs
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express versions of the NVIDIA A100 GPU feature lower TDPs than the SXM4 version (250W or 300W vs 400W). For this reason, the PCI-Express GPUs are not able to sustain peak performance in the same way as the higher-power part; their performance values are shown as a range, and actual performance will vary by workload.

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.

Feature | NVIDIA A30 PCI-E | NVIDIA A100 40GB PCI-E | NVIDIA A100 80GB PCI-E | NVIDIA A100 SXM4
GPU Chip | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100
TensorCore FP64 Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS
TensorCore TF32 Performance*† | 82 TFLOPS | 140 ~ 156 TFLOPS | 140 ~ 156 TFLOPS | 156 TFLOPS
TensorCore FP16/BF16 Performance*† | 165 TFLOPS | 281 ~ 312 TFLOPS | 281 ~ 312 TFLOPS | 312 TFLOPS
TensorCore INT8 Performance*† | 330 TOPS | 562 ~ 624 TOPS | 562 ~ 624 TOPS | 624 TOPS
TensorCore INT4 Performance*† | 661 TOPS | 1,123 ~ 1,248 TOPS | 1,123 ~ 1,248 TOPS | 1,248 TOPS
Double Precision (FP64) Performance* | 5.2 TFLOPS | 8.7 ~ 9.7 TFLOPS | 8.7 ~ 9.7 TFLOPS | 9.7 TFLOPS
Single Precision (FP32) Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS
Half Precision (FP16) Performance* | 41 TFLOPS | 70 ~ 78 TFLOPS | 70 ~ 78 TFLOPS | 78 TFLOPS
Brain Floating Point (BF16) Performance* | 20 TFLOPS | 35 ~ 39 TFLOPS | 35 ~ 39 TFLOPS | 39 TFLOPS
On-die Memory | 24GB HBM2 | 40GB HBM2 | 80GB HBM2 | 40GB HBM2 or 80GB HBM2e
Memory Bandwidth | 933 GB/s | 1,555 GB/s | 1,940 GB/s | 1,555 GB/s (40GB) or 2,039 GB/s (80GB)
L2 Cache | 40MB | 40MB | 40MB | 40MB
Interconnect | NVLink 3.0 (4 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0
GPU-to-GPU transfer bandwidth (bidirectional) | 200 GB/s | 600 GB/s | 600 GB/s | 600 GB/s
Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s | 64 GB/s
# of MIG instances supported | up to 4 | up to 7 | up to 7 | up to 7
# of SM Units | 56 | 108 | 108 | 108
# of Tensor Cores | 224 | 432 | 432 | 432
# of integer INT32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912
# of single-precision FP32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912
# of double-precision FP64 CUDA Cores | 1,792 | 3,456 | 3,456 | 3,456
GPU Base Clock | 930 MHz | 765 MHz | 1065 MHz | 1095 MHz
GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic
GPU Boost Clock | 1440 MHz | 1410 MHz | 1410 MHz | 1410 MHz
Compute Capability | 8.0 | 8.0 | 8.0 | 8.0
Workstation Support | no | no | no | no
Server Support | yes | yes | yes | yes
Cooling Type | Passive | Passive | Passive | Passive
Wattage (TDP) | 165W | 250W | 300W | 400W

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
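
The peak numbers above follow directly from the core counts and boost clocks listed in the table. A quick sanity-check sketch of that arithmetic (FMA counts as two FLOPS per cycle per FP64 CUDA core):

```python
# Peak-throughput arithmetic behind the table above (a sketch; the
# core counts and boost clocks are taken from the table itself).
def peak_tflops(num_cores, flops_per_core_per_cycle, boost_clock_ghz):
    """Theoretical peak = cores x FLOPs/cycle x clock."""
    return num_cores * flops_per_core_per_cycle * boost_clock_ghz / 1000.0

# NVIDIA A100: 3,456 FP64 CUDA cores, FMA = 2 FLOPs/cycle, 1410 MHz boost
a100_fp64 = peak_tflops(3456, 2, 1.41)   # ~9.7 TFLOPS, matching the table

# NVIDIA A30: 1,792 FP64 CUDA cores at a 1440 MHz boost clock
a30_fp64 = peak_tflops(1792, 2, 1.44)    # ~5.2 TFLOPS
```

The same formula reproduces the other precisions once the per-core FLOP rate for that datatype is substituted.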

Visualization & Ray Tracing GPUs
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for visualization and ray tracing. Note that these GPUs would not necessarily be connecting directly to a display device, but might be performing remote rendering from a datacenter.

To learn more about these GPUs and to review which are the best options for you, please speak with a GPU expert.

Models compared: NVIDIA RTX A5000, NVIDIA RTX A6000, NVIDIA A40

GPU Chip: Ampere GA102 (all models)
TensorCore Performance*:
  • RTX A5000: 55.6 TFLOPS† TF32; 111.1 TFLOPS† FP16/BF16; 222.2 TOPS† INT8; 444.4 TOPS† INT4
  • RTX A6000: 77.4 TFLOPS† TF32; 154.8 TFLOPS† FP16/BF16; 309.7 TOPS† INT8; 619.3 TOPS† INT4
  • A40: 74.8 TFLOPS† TF32; 149.7 TFLOPS† FP16/BF16; 299.3 TOPS† INT8; 598.7 TOPS† INT4
Double Precision (FP64) Performance*: RTX A5000: 0.4 TFLOPS; RTX A6000: 0.6 TFLOPS; A40: 0.6 TFLOPS
Single Precision (FP32) Performance*: RTX A5000: 27.8 TFLOPS; RTX A6000: 38.7 TFLOPS; A40: 37.4 TFLOPS
Integer (INT32) Performance*: RTX A5000: 13.9 TOPS; RTX A6000: 19.4 TOPS; A40: 18.7 TOPS
GPU Memory: RTX A5000: 24GB; RTX A6000: 48GB; A40: 48GB
Memory Bandwidth: RTX A5000: 768 GB/s; RTX A6000: 768 GB/s; A40: 696 GB/s
L2 Cache: 6MB (all models)
Interconnect: NVLink 3.0 + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards
GPU-to-GPU transfer bandwidth (bidirectional): 112.5 GB/s (all models)
Host-to-GPU transfer bandwidth (bidirectional): 64 GB/s (all models)
# of MIG instances supported: N/A
# of SM Units: RTX A5000: 64; RTX A6000 & A40: 84
# of RT Cores: RTX A5000: 64; RTX A6000 & A40: 84
# of Tensor Cores: RTX A5000: 256; RTX A6000 & A40: 336
# of integer INT32 CUDA Cores: RTX A5000: 8,192; RTX A6000 & A40: 10,752
# of single-precision FP32 CUDA Cores: RTX A5000: 8,192; RTX A6000 & A40: 10,752
# of double-precision FP64 CUDA Cores: RTX A5000: 128; RTX A6000 & A40: 168
GPU Base Clock: not published
GPU Boost Support: Yes – Dynamic
GPU Boost Clock: not published
Compute Capability: 8.6 (all models)
Workstation Support: RTX A5000 & RTX A6000: yes; A40: no
Server Support: RTX A5000 & RTX A6000: no; A40: yes
Cooling Type: RTX A5000 & RTX A6000: Active; A40: Passive
Wattage (TDP): RTX A5000: 230W; RTX A6000: 300W; A40: 300W

* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature

Several lower-end graphics cards and datacenter GPUs are also available, including RTX A2000, RTX A4000, A10, and A16. These GPUs offer similar capabilities, but with lower levels of performance and at lower price points.
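
Although NVIDIA does not publish boost clocks for these boards, an approximate value can be back-computed from the published FP32 numbers (a rough sketch; assumes 2 FLOPS per cycle per FP32 CUDA core with FMA):

```python
# Infer the approximate boost clock implied by a published FP32 peak.
# This is an estimate, not an NVIDIA-published specification.
def implied_boost_ghz(fp32_tflops, fp32_cores):
    # peak FLOPS = cores x 2 FLOPs/cycle x clock  =>  solve for clock
    return fp32_tflops * 1e12 / (fp32_cores * 2) / 1e9

rtx_a6000 = implied_boost_ghz(38.7, 10752)   # ~1.80 GHz
nvidia_a40 = implied_boost_ghz(37.4, 10752)  # ~1.74 GHz
rtx_a5000 = implied_boost_ghz(27.8, 8192)    # ~1.70 GHz
```

The lower implied clock of the A40 versus the RTX A6000 is consistent with its passive cooling and server orientation.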

Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures

Feature comparison across Pascal GP100, Volta GV100, and Ampere GA100:

Compute Capability*: Pascal: 6.0; Volta: 7.0; Ampere: 8.0
Threads per Warp: 32
Max Warps per SM: 64
Max Threads per SM: 2048
Max Thread Blocks per SM: 32
Max Concurrent Kernels: 128
32-bit Registers per SM: 64 K
Max Registers per Block: 64 K
Max Registers per Thread: 255
Max Threads per Block: 1024
L1 Cache Configuration: Pascal: 24KB dedicated cache; Volta: 32KB ~ 128KB, dynamic with shared memory; Ampere: 28KB ~ 192KB, dynamic with shared memory
Shared Memory Configurations: Pascal: 64KB; Volta: configurable up to 96KB, remainder for L1 Cache (128KB total); Ampere: configurable up to 164KB, remainder for L1 Cache (192KB total)
Max Shared Memory per SM: Pascal: 64KB; Volta: 96KB; Ampere: 164KB
Max Shared Memory per Thread Block: Pascal: 48KB; Volta: 96KB; Ampere: 160KB
Max X Grid Dimension: 2^31 - 1
Tensor Cores: Pascal: No; Volta & Ampere: Yes
Mixed Precision Warp-Matrix Functions: Pascal: No; Volta & Ampere: Yes
Hardware-accelerated async-copy: Pascal & Volta: No; Ampere: Yes
L2 Cache Residency Management: Pascal & Volta: No; Ampere: Yes
Dynamic Parallelism: Yes (all)
Unified Memory: Yes (all)
Preemption: Yes (all)

* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
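
The per-SM limits in the table interact with one another. As an illustration (using the limits shared by all three architectures above), the number of thread blocks resident on one SM is capped by whichever limit is reached first:

```python
# Residency arithmetic from the per-SM limits in the table above
# (a sketch; occupancy in practice is further limited by registers
# and shared memory per block).
def resident_blocks_per_sm(threads_per_block,
                           max_threads_per_sm=2048,
                           max_blocks_per_sm=32,
                           threads_per_warp=32,
                           max_warps_per_sm=64):
    by_threads = max_threads_per_sm // threads_per_block
    warps_per_block = (threads_per_block + threads_per_warp - 1) // threads_per_warp
    by_warps = max_warps_per_sm // warps_per_block
    return min(by_threads, by_warps, max_blocks_per_sm)

resident_blocks_per_sm(256)  # 8 blocks of 256 threads fill an SM
resident_blocks_per_sm(32)   # capped at 32 by the blocks-per-SM limit
```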

Hardware-accelerated raytracing, video encoding, video decoding, and image decoding

The NVIDIA “Ampere” Datacenter GPUs that are designed for computational workloads do not include graphics acceleration features such as RT cores and hardware-accelerated video encoders. For example, RT cores for accelerated raytracing are not included in the A30 and A100 GPUs. Similarly, video encoding units (NVENC) are not included in these GPUs.

To accelerate computational workloads that require processing of image or video files, five JPEG decoding (NVJPG) units and five video decoding units (NVDEC) are included in the A100. Details are described in NVIDIA’s A100 for computer vision blog post.

For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.

The post In-Depth Comparison of NVIDIA “Ampere” GPU Accelerators appeared first on Microway.

Detailed Specifications of the AMD EPYC “Rome” CPUs
Wed, 07 Aug 2019
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-amd-epyc-rome-cpus/

This article provides in-depth discussion and analysis of the 7nm AMD EPYC processor (codenamed “Rome” and based on AMD’s Zen2 architecture). EPYC “Rome” processors replace the previous “Naples” processors and are available for sale as of August 7th, 2019. We have also published an AMD EPYC “Rome” CPU Review that you may wish to read. Note: these have since been superseded by the “Milan” AMD EPYC CPUs.

These new CPUs are the second iteration of AMD’s EPYC server processor family. They remain compatible with the existing workstation and server platforms, but bring significant feature and performance improvements. Some of the new features (e.g., PCI-E 4.0) will require updated/revised platforms. If you’re looking to upgrade to or deploy these new CPUs, please speak with one of our experts to learn more.

Important features/changes in EPYC “Rome” CPUs include:

  • Up to 64 processor cores per socket (with options for 8-, 12-, 16-, 24-, 32-, and 48-cores)
  • Improved CPU clock speeds up to 3.1GHz (with Boost speeds up to 3.4GHz)
  • Increased computational performance:
    • Full support for 256-bit AVX2 instructions with two 256-bit FMA units per CPU core
      The previous “Naples” architecture split 256-bit instructions into two separate 128-bit operations
    • Up to 16 double-precision FLOPS per cycle per core
    • Double-precision floating point multiplies complete in 3 cycles (down from 4)
    • 15% increase in instructions completed per clock cycle (IPC) for integer operations
  • Memory capacity & performance features:
    • Eight-channel memory controller on each CPU
    • Support for DDR4 memory speeds up to 3200MHz (up from 2666MHz)
    • Up to 4TB memory per CPU socket
  • Up to 256MB L3 cache per CPU (up from 64MB)
  • Support for PCI-Express generation 4.0 (which doubles the throughput of gen 3.0)
  • Up to 128 lanes of PCI-Express per CPU socket
  • Improvements to NUMA architecture:
    • Simplified design with one NUMA domain per CPU Socket
    • Uniform latencies between CPU dies (plus fewer hops between cores)
    • Improved InfinityFabric performance (read speed per clock is doubled to 32 bytes)
  • Integrated in-silicon security mitigations for Spectre
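
The “16 double-precision FLOPS per cycle per core” figure above drives the headline performance numbers later in this article. A minimal sketch, assuming a 64-core EPYC 7702 at its published 2.0GHz base clock:

```python
# Per-core math implied by the feature list above: two 256-bit FMA
# units x 4 doubles per unit x 2 ops (multiply+add) = 16 FP64 FLOPS
# per cycle per core.
def epyc_rome_peak_gflops(cores, base_clock_ghz, flops_per_cycle=16):
    return cores * flops_per_cycle * base_clock_ghz

# 64-core EPYC 7702 at a 2.0 GHz base clock (clock value assumed from
# AMD's published specifications):
epyc_rome_peak_gflops(64, 2.0)  # 2048 GFLOPS, i.e. ~2 TFLOPS per socket
```

This matches the article’s statement that several “Rome” models reach roughly 2 TFLOPS of double-precision throughput.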

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our guidance for selecting among the EPYC options is as follows:

  • 8-core EPYC CPUs – not recommended for HPC
    While available for a low price, these models are not as cost-effective as many of the higher core count models.
  • 12-, 16-, and 24-core EPYC CPUs – suitable for most HPC workloads
    While not typically offering the best cost-effectiveness, they provide excellent performance at lower price points.
  • 32-core EPYC CPUs – excellent for HPC workloads
    These models offer excellent price/performance along with relatively high clock speeds and core counts
  • 48-core and 64-core EPYC CPUs – suitable for certain HPC workloads
    Although the highest core count models appear to provide the best cost-effectiveness and power efficiency, many applications exhibit diminishing returns at the highest core counts. For scalable applications that are not memory bandwidth bound, these EPYC CPUs will be excellent choices.

Microway operates a Test Drive cluster to assist in evaluating and comparing these options as users develop the specifications for their new HPC & AI deployments. We would be happy to help you evaluate AMD EPYC processors as you plan your purchase.

Unprecedented Computational Performance

The EPYC “Rome” processors deliver new capabilities and exceptional performance. Many models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and several models provide 2 TFLOPS. This performance is achieved by doubling the computational power of each core and doubling the number of cores. The plot below shows the performance range across this new CPU line-up:
[Chart: AMD EPYC “Rome” CPU theoretical GFLOPS performance with AVX2 instructions]

As shown above, the shaded/colored bars indicate the expected performance ranges for each CPU model. These peak performance numbers are achieved when executing 256-bit AVX2 instructions with FMA. Note that only a small set of codes issue almost exclusively AVX2 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than the peak FLOPS values shown above. Applications which have not been re-compiled with an appropriate compiler would not include AVX2 instructions and would thus achieve lower performance.

The dotted lines indicate the possible peak performance if all cores are operating at boosted clock speeds. While theoretically possible for short bursts, sustained performance at these levels is not expected. Sections of code with dense, vectorized instructions are very demanding, and typically result in the processor core slightly lowering its clock speed (this behavior is not unique to AMD CPUs). While AMD has not published specific clock speed expectations for such codes, Microway expects the EPYC “Rome” CPUs to operate near their “base” clock speed values even when executing code with intensive instructions.

The CPU models above are sorted by price (as discussed in the next section). The lowest-performance models provide fewer CPU cores, less cache, and slower memory speeds. Higher-end models offer high core counts for the best performance. HPC and AI groups are generally expected to favor the mid-range processor models, as the highest core count CPUs are priced at a premium.

Note that those models which only support single-CPU installations are separated on the left side of each plot.

AMD EPYC “Rome” Price Ranges

The new EPYC “Rome” processors span a fairly wide range of prices, so budget must be considered when selecting a CPU. While the entry-level models are under $1,000, the highest-end EPYC processors cost nearly $10,000 each. It would be frustrating to plan for 64-core processors when the budget cannot support the price. The plot below compares the prices of the EPYC “Rome” processors:
[Chart: Pricing of the AMD EPYC “Rome” processors]

All the CPUs in this article are sorted by price (as shown in the plot above). To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture. The color of each bar indicates the expected customer price per CPU:

  • Low price tier: prices below $1,000 per CPU
  • Mid price tier: prices between $1,000 and $2,000
  • High price tier: prices between $2,000 and $4,000
  • Premium price tier: prices above $4,000 per EPYC CPU

Most HPC users are expected to select CPU models around the high price tier. These models provide industry-leading performance (and excellent performance per dollar) for a price under $4,000 per processor. Applications can certainly leverage the premium EPYC processor models, but they will come at a higher price.

AMD “Rome” EPYC Processor Specifications

The set of tabs below compares the features and specifications of this new EPYC processor family. Take note that certain CPU SKUs are designed for single-socket systems (indicated with a P suffix on the part number). All other models may be used in either a single- or dual-socket system. The P-series AMD EPYC CPUs have a lower price and are thus the most cost-effective models, but remember that they are not available in dual-CPU systems.

Cost-Effectiveness and Power Efficiency of EPYC “Rome” CPUs

Overall, the AMD EPYC processors provide great value in price spent versus performance achieved. However, there is a spectrum of efficiency, with certain CPU models offering particularly compelling value. Also remember that the prices and power requirements for some of the top models are fairly high. Savvy readers may find the following facts useful:

  • The most cost-effective CPUs likely to be selected are EPYC 7452 and EPYC 7552
  • If a balance of cost-effectiveness and higher clock speed are needed, look to EPYC 7502
  • While the EPYC 7702 looks to be the most cost-effective on paper, it is important to consider that many applications may not be able to scale efficiently to 64 cores. Benchmark before making the selection.
  • Applications which can be satisfied by a single CPU will benefit greatly from the single-socket EPYC 7xx2P models
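
The cost-effectiveness comparison above can be sketched as theoretical GFLOPS per dollar. The list prices below are approximate launch figures and should be treated as assumptions; substitute current quotes when making a real decision:

```python
# Price/performance sketch using the 16-FLOPS-per-cycle-per-core rate
# discussed earlier. Prices are approximate launch list prices
# (assumptions, not quotes).
def gflops_per_dollar(cores, base_ghz, price_usd, flops_per_cycle=16):
    return cores * flops_per_cycle * base_ghz / price_usd

epyc_7452 = gflops_per_dollar(32, 2.35, 2025)  # ~0.59 GFLOPS/$
epyc_7502 = gflops_per_dollar(32, 2.5, 2600)   # ~0.49 GFLOPS/$
# Consistent with the guidance above: the 7452 leads on value, while
# the 7502 trades some value for a higher clock speed.
```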

The plots below compare the cost-effectiveness and power efficiency of these CPU models. The intent is to go beyond the raw “speeds and feeds” of the processors to determine which models will be most attractive for HPC and Deep Learning/AI deployments.

Recommended CPU Models for HPC & AI/Deep Learning

Although most of the EPYC CPUs will offer excellent performance, it is common for computationally-demanding sites to set a floor on CPU clock speeds (usually around 2.5GHz). The intent is to ensure that no workload suffers unacceptably low performance, as not all applications are well parallelized. While there are users who would prefer higher clock speeds, experience shows that most groups settle on a minimum clock speed around ~2.5GHz. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

Detailed Specifications of the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs
Tue, 02 Apr 2019
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-cascade-lake-sp-intel-xeon-processor-scalable-family-cpus/

This article provides in-depth discussion and analysis of the 14nm Intel Xeon Processor Scalable Family (formerly codenamed “Cascade Lake-SP” or “Cascade Lake Scalable Processor”). “Cascade Lake-SP” processors replace the previous 14nm “Skylake-SP” microarchitecture and are available for sale as of April 2, 2019. On February 24, 2020, a set of “Cascade Lake Refresh” Xeon models were released with increased clock speeds and improved cost/performance. These Xeon CPUs have been superseded by the 3rd-generation Intel Xeon ‘Ice Lake SP’ scalable processors.

These new CPUs are the second iteration of Intel’s Xeon Processor Scalable Family. They remain compatible with the existing workstation and server platforms, but bring incremental performance along with additional capabilities and options.

Important features/changes in Xeon Scalable Processor Family “Cascade Lake SP” CPUs include:

  • Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 16-, 18-, 20-, 24-, and 26-cores)
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
  • Introduction of new AVX-512 VNNI instruction:
    • Intel Deep Learning Boost (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • Memory capacity & performance features:
    • Six-channel memory controller on each CPU
    • Support for DDR4 memory speeds up to 2933MHz (up from 2666MHz)
    • Large-memory capabilities with Intel Optane DC Persistent Memory
    • All CPU models support up to 1TB-per-socket system memory
    • Optional CPUs support up to 4.5TB-per-socket system memory (only available on certain SKUs)
  • Introduction of Intel Speed Select processor models:
    • Certain processors support three distinct operating points
    • Each operating point provides a different number of CPU cores
    • CPU clock and Turbo Boost speeds optimized for each core count
  • Integrated hardware-based security mitigations against side-channel attacks
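
The VNNI fusion listed above can be modeled in a few lines. This is a functional sketch of what one VPDPBUSD operation computes, not production code: each 32-bit accumulator lane gains the dot product of four unsigned 8-bit values with four signed 8-bit values.

```python
# Model of the arithmetic VNNI fuses into a single instruction
# (previously three instructions: VPMADDUBSW, VPMADDWD, VPADDD).
def vpdpbusd(acc, a_u8, b_s8):
    """One VPDPBUSD step across all 32-bit accumulator lanes."""
    assert len(a_u8) == len(b_s8) == 4 * len(acc)
    return [c + sum(a_u8[4*i + j] * b_s8[4*i + j] for j in range(4))
            for i, c in enumerate(acc)]

# One lane: accumulator 0 plus 1*1 + 2*1 + 3*1 + 4*1 = 10
vpdpbusd([0], [1, 2, 3, 4], [1, 1, 1, 1])
```

Fusing the three-instruction sequence into one both saves instruction-issue bandwidth and avoids the intermediate 16-bit saturation step, which is why INT8 inference benefits.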

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Specialization of Intel Xeon CPUs

The new “Cascade Lake-SP” processors will be familiar to existing users. Just as in the previous generation, the processor family is divided into four tiers: Bronze, Silver, Gold, and Platinum. Bronze provides modest performance for a low price. The Silver and Gold models are in the price/performance range familiar to HPC users/architects. Platinum models are in a higher price range than HPC groups are typically accustomed to (the Platinum tier targets Enterprise workloads, and is priced accordingly).

However, this new generation is not simply a revision of the previous models. Increasingly, we are seeing processors that have been designed with a particular workload in mind. The “Cascade Lake SP” Xeons introduce several new specialized CPU models:

  • S: search optimized
  • N: network function virtualization (NFV) optimized
  • V: virtualization density optimized
  • Y: Intel speed select
  • U: single-socket optimized

In the case of the first two specializations (search and NFV), specific CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads. Running other workloads on these optimized CPUs will likely lead to CPU throttling, which would be undesirable. The virtualization density optimized models provide high CPU core counts within relatively modest power envelopes. However, the processor clock and memory clock frequencies are reduced to accomplish this. Considering these limitations, the search-, NFV-, and virtualization-optimized models will not be included in our review.

The single-socket optimized CPUs are indicated with a U suffix in the model name (e.g., Xeon 6210U). These CPUs are quite cost-effective for what they offer (a 6200-series CPU for a 5200-series price). However, they do not include UPI links and thus can only be installed in systems with a single processor.

Intel Speed Select CPUs are indicated with a Y suffix in the model name (e.g., Xeon 6240Y). Each of these three CPUs offers the same core count and clock speed as their non-Y counterpart. However, the system can be rebooted into a lower core-count mode which boosts the CPU clock and Turbo Boost speeds. The Speed Select models available in this generation are: 8260Y, 6240Y, and 4214Y. Although these models are not called out by name below, understand that alternate versions of Xeon 8260, 6240, and 4214 are available if you need core count & clock speed flexibility.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our guidance for selecting Xeon tiers is as follows:

  • Intel Xeon Bronze – not recommended for HPC
    Base-level model with low performance.
  • Intel Xeon Silver – suitable for entry-level HPC
    4200-series models offer slightly improved performance over previous generations.
  • Intel Xeon Gold – recommended for most HPC workloads
    The best balance of performance and price. In particular, the 6200-series models should be preferred over the 5200-series models, because they have twice the number of AVX-512 units.
  • Intel Xeon Platinum – recommended only for specific HPC workloads
    Although these 8200-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., high core count, large SMP, and large-memory Compute Nodes).

* Given their positioning, the Intel Xeon Bronze products will not be included in our review

Exceptional Computational Performance

The Xeon “Cascade Lake SP” processors deliver new capabilities and unprecedented performance. Most models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and a couple models provide 2 TFLOPS. This performance is achieved with high core counts and AVX-512 instructions with FMA (just as in the previous generation). The plots in the tabs below compare the performance ranges for these new CPUs:

As shown above, the shaded/colored bars indicate the expected performance ranges for each CPU model. The first plot shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than peak FLOPS. Applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and thus achieve lower performance. Those expected performance ranges are shown in the plot of AVX2 Instruction performance.

Although the ordering of the above plots may seem arbitrary, they are sorted by price (as discussed in the next section). The lowest-performance models provide fewer numbers of CPU cores and fewer AVX math units. Higher-end models provide a mix of higher core counts and higher clock speeds. A few CPU models, such as Xeon 6244 and Xeon 8256, strongly favor high clock speeds over CPU core count (which results in lower overall FLOPS throughput). HPC and AI groups are expected to favor the Intel Xeon Gold processor models.
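
The TFLOPS figures discussed above come from the same simple arithmetic as in previous generations: cores x FMA units x 16 FLOPS per cycle x clock. A sketch, using an illustrative 20-core Gold-class CPU and an assumed 2.0GHz all-core AVX-512 clock (actual AVX-512 clocks vary by model and workload):

```python
# Peak FP64 throughput sketch for "Cascade Lake SP" Xeons: each
# AVX-512 FMA unit retires 16 FP64 FLOPS per cycle (8 doubles x 2 ops),
# and the 6200/8200-series parts have two such units per core.
def xeon_peak_tflops(cores, avx512_clock_ghz, fma_units=2):
    return cores * fma_units * 16 * avx512_clock_ghz / 1000.0

# Illustrative 20-core part at an assumed 2.0 GHz all-core AVX-512 clock:
xeon_peak_tflops(20, 2.0)  # 1.28 TFLOPS
```

Setting `fma_units=1` models the Silver/lower-Gold parts, which is why the article recommends the 6200-series and above for FLOPS-bound work.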

Intel Xeon “Cascade Lake SP” Price Ranges

The pricing of the Xeon Processor Scalable Family spans a wide range, so budget must be kept at top of mind when selecting options. It would be frustrating to plan for 28-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Cascade Lake SP” processors:

[Chart: Intel Xeon “Cascade Lake SP” CPU prices]

As in the above plot, all the CPUs in this article are sorted by price. Most HPC users are expected to select CPU models from the Gold Xeon 6200-series. These models provide close to peak performance for a price under $4,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8200-series, such as very large memory nodes (>3TB system memory).

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Intel “Cascade Lake SP” Xeon Processor Scalable Family Specifications

The sets of tabs below compare the features and specifications of this new Xeon processor family. As you will see, the Silver (4200-series) and lower-end Gold (5200-series) CPU models offer fewer capabilities and lower performance. The higher-end Gold (6200-series) and Platinum (8200-series) offer more capabilities and higher performance. Additionally, certain CPU SKUs have special models integrating additional specializations:

  • Enabled for Intel Speed Select (indicated with a Y suffix on the part number)
  • Support for up to 4.5TB of memory per CPU socket (indicated with an L suffix on the part number)
    (these same CPUs have a lower-cost alternate SKU supporting 2TB memory per socket, indicated with an M suffix on the part number)
  • Designed for single CPU socket systems (indicated with a U suffix on the part number)
  • All Gold- and Platinum-series CPUs support Intel’s new Optane DC Persistent Memory

In addition to the specifications called out above, technical readers should note that the “Cascade Lake SP” CPU architecture inherits most of the architectural design of the previous “Skylake-SP” architecture, including the mesh processor layout, redesigned L2/L3 caches, greater UPI connectivity between CPU sockets, and improvements to the processor frequency speeds/turbo. A more comprehensive list of features is shown at the end of the article.

Clock Speeds & Turbo Boost

Just as in the previous generation, the “Cascade Lake-SP” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and generating more heat). While a CPU core is executing intensive vector tasks (AVX2 or AVX-512 instructions), the clock speed will be adjusted downwards to keep the processor within its power limits (TDP).

In effect, this will result in the processor running at a lower frequency than the standard clock speed advertised for each model. For that reason, each processor is assigned three frequency ranges:

  • AVX-512 mode: due to the high requirements of AVX-512/FMA instructions, clock speeds will be lowered while executing AVX-512 instructions *
  • AVX2 mode: due to the higher requirements of AVX2/FMA instructions, clock speeds will be somewhat lower while executing AVX instructions *
  • Non-AVX mode: while not executing “heavy” AVX instructions, the processor will operate at the “stock” frequency

Each of the “modes” above is actually a range of CPU clock speeds. The CPU will run at the highest speed possible for the particular set of CPU instructions that have been issued. It is worth noting that these modes are isolated to each core. Within a given CPU, some cores may be operating in AVX mode while others are operating in Non-AVX mode.

* Intel has applied a 1-millisecond hysteresis window on each CPU core to prevent rapid frequency transitions

AVX-512, AVX, and Non-AVX Turbo Boost in Xeon “Cascade Lake-SP” Scalable Family processors

Each CPU also includes the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX-512, AVX, Non-AVX).

The plots below demonstrate processor clock speeds under the following conditions:

  • All cores on the CPU actively running Non-AVX, AVX, or AVX-512 instructions
  • A single active core running Non-AVX, AVX, or AVX-512 instructions (all other cores on the CPU must be idle)

The dotted lines represent the range of clock speeds for Non-AVX instructions. The thin grey bars represent the range of clock speeds for AVX2/FMA instructions. The thicker shaded/colored bars represent the range of clock speeds for AVX-512/FMA instructions.

Note that despite the clear rules stated above, each Turbo Boost value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX-512, AVX, or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

Despite the perceived reduction in performance when running these vector instructions, keep in mind that AVX-512 roughly doubles the number of operations which can be completed per cycle. Although the clock speeds might be reduced by nearly 1GHz, the overall throughput is increased. HPC users should expect their processors to be running in AVX or AVX-512 mode most of the time.
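
The trade-off described above can be made concrete with assumed clock speeds (illustrative values only): even with a sizable downclock, AVX-512 comes out ahead on FMA-heavy code because per-cycle throughput doubles.

```python
# Per-core throughput comparison under the frequency modes described
# above. Clock values are assumptions for illustration; real AVX-512
# and non-AVX clocks vary by model and workload.
def relative_gflops(flops_per_cycle, clock_ghz):
    return flops_per_cycle * clock_ghz

avx2 = relative_gflops(16, 3.0)    # 48 GFLOPS/core at an assumed 3.0 GHz
avx512 = relative_gflops(32, 2.1)  # ~67 GFLOPS/core at an assumed 2.1 GHz
# AVX-512 wins despite running ~0.9 GHz slower.
```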

Cost-Effectiveness and Power Efficiency of Xeon “Cascade Lake SP” CPUs

Many of these new processors have the same price structure as earlier Xeon server CPU families. However, the prices and power requirements for some of the premium models are fairly high. Savvy readers may find the following facts useful:

  • HPC applications run best on the higher-end Gold and Platinum CPU models (6200- and 8200-series), as all of the lower-end CPUs provide only half the number of math units.
  • Applications which can be satisfied by a single CPU will benefit greatly from the single-socket Xeon 62xxU models
  • The Platinum models (8200-series) are generally targeted towards Enterprise and Finance – these carry higher prices than other models.

The plots below compare the cost-effectiveness and power efficiency of these CPU models. The intent is to go beyond the raw “speeds and feeds” of the processors to determine which models will be most attractive for HPC and Deep Learning/AI deployments.

Recommended CPU Models for HPC & AI/Deep Learning

Although many of these CPU models will offer excellent performance, it is common for HPC sites to set a floor on CPU clock speeds (usually around 2.5GHz). The intent is to ensure that no workload suffers too low of a performance. While there are users who would prefer higher clock speeds, experience shows that most groups settle on a value of 2.5GHz to 2.6GHz. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

Summary of features in Xeon Scalable Family “Cascade Lake-SP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. They also include lower-level changes that may be of interest to expert users. The list below provides a more detailed summary of relevant technology features in Cascade Lake-SP:

  • Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 16-, 18-, 20-, 24-, and 26-cores)
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
  • Introduction of new AVX-512 VNNI instruction:
    • Intel Deep Learning Boost – the new 8-bit Vector Neural Network Instruction (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • As introduced with “Haswell” and “Broadwell”, these CPUs continue to support 128-bit AVX and 256-bit AVX2 Advanced Vector Extensions with FMA3
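To make the VNNI fusion concrete, the sketch below emulates the arithmetic of a single 32-bit lane of VPDPBUSD in plain Python. This is an illustration of the instruction's semantics only (the function name is ours, and real code would use compiler intrinsics or a library such as oneDNN):

```python
def vpdpbusd_lane(acc, u8x4, s8x4):
    """Emulate one 32-bit lane of VPDPBUSD: acc += sum(u8[i] * s8[i]).

    u8x4: four unsigned 8-bit values (0..255)
    s8x4: four signed 8-bit values (-128..127)
    """
    return acc + sum(u * s for u, s in zip(u8x4, s8x4))

# One AVX-512 VNNI instruction performs this across 16 lanes at once --
# work that previously required three instructions (VPMADDUBSW,
# VPMADDWD, VPADDD).
result = vpdpbusd_lane(1000, [1, 2, 3, 4], [10, -10, 20, -20])
print(result)  # -> 970
```

Because the multiply, widen, and accumulate steps are fused into one operation, inference kernels built on 8-bit integer math issue fewer instructions per dot product.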

Memory capacity & performance features:

  • Six-channel memory controller on each CPU
  • Support for DDR4 memory speeds up to 2933MHz (up from 2666MHz)
  • Single DIMM per channel operates at up to 2933MHz; two DIMMs per channel operate at up to 2666MHz
  • Large-memory capabilities with Intel Optane DC Persistent Memory
  • All CPU models support up to 1TB-per-socket system memory
  • Optional CPU support for 2TB- or 4.5TB-per-socket system memory (only available on certain SKUs)
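The peak theoretical memory bandwidth per socket follows directly from the figures above: six channels of DDR4-2933, each 64 bits (8 bytes) wide. A quick back-of-the-envelope calculation (real-world STREAM results will be lower):

```python
channels = 6
transfers_per_sec = 2933e6   # DDR4-2933: 2933 MT/s
bytes_per_transfer = 8       # each memory channel is 64 bits wide

peak_gb_per_sec = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(f"Peak theoretical bandwidth: {peak_gb_per_sec:.1f} GB/s per socket")
# -> ~140.8 GB/s per socket
```

Populating two DIMMs per channel drops the data rate to 2666MHz, which reduces this peak proportionally.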

Introduction of Intel Speed Select processor models:

  • Certain processors support three distinct operating points
  • Each operating point provides a different number of CPU cores
  • CPU clock and Turbo Boost speeds optimized for each core count

Additional platform features:

  • Integrated hardware-based security mitigations against side-channel attacks
  • Fast links between CPU sockets with up to three 10.4GT/s UPI links
  • I/O connectivity of 48 lanes of generation 3.0 PCI-Express per CPU
  • CPU cores are arranged in an “Uncore” mesh interconnect
  • Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (48 PCI-E lanes per socket)
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. Clock speeds are boosted higher even when many cores are in use. There are three tiers of Turbo Boost clock speeds depending upon the type of instructions being executed:
    • Non-AVX: Operations that are not math intensive, or “light” AVX/AVX2 instructions which don’t involve multiply/FMA
    • AVX: Operations that heavily use the AVX/AVX2 units, or that use the AVX-512 unit (but not the multiply/FMA instructions)
    • AVX-512: Operations that heavily use the AVX-512 units, including multiply/FMA instructions
  • Intel QuickData Technology (CBDMA) doubles memory-to-memory copy performance and supports MMIO memory copy
  • Intel Volume Management Device (VMD) provides CPU-level device aggregation for NVMe SSD storage devices
  • Intel Virtual RAID on CPU (VROC) provides RAID support for NVMe SSD storage devices
  • Intel C620-series PCH chipset (formerly codenamed “Lewisburg”) with improved connectivity:
    • Up to four Intel X722 10GbE/1GbE Ethernet ports with iWARP RDMA support
    • PCI-Express generation 3.0 x4 connection from the PCH to the CPUs
    • Support for more integrated SATA3 6Gbps ports (up to 14)
    • Support for more integrated USB 3.0 ports (up to 10)
    • Integrated Intel QuickAssist Technology, which accelerates many cryptographic and compression/decompression workloads
  • Enhanced CPU Core Microarchitecture:
    • Larger and improved branch predictor, higher throughput decoder, larger out-of-order window
    • Improved scheduler and execution engine; improved throughput and latency of divide/sqrt
    • More load/store bandwidth, deeper load/store buffers, improved prefetcher
    • One or Two AVX-512 512-bit FMA units per core
    • Support for the following AVX-512 instruction types:
      AVX512-F, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-CD
    • 1MB dedicated L2 cache per core
    • A 10% (geomean) improvement in instructions per cycle (IPC) versus the “Broadwell” generation CPUs
  • Re-architected L2/L3 cache hierarchy:
    • Each CPU core contains 1MB L2 private cache (up from 256KB)
    • Each core’s private L2 acts as primary cache
    • Each CPU contains >1.3MB/core of shared L3 cache (for when the private L2 cache is exhausted)
    • The shared L3 cache is non-inclusive (does not keep copies of the L2 caches)
    • Larger 64-entry L2 TLB for 1GB pages (up from 16 entries)
  • Dual or Triple Ultra Path Interconnect (UPI) links between processor sockets improve communication speeds for data-intensive applications
  • RDSEED instruction for high-quality, non-deterministic, random seed values
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is frequently disabled for HPC applications.
  • Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point
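The AVX-512 figures above translate into a simple peak-FLOPS formula: cores × clock × FMA units × FLOPS per cycle per unit. The sketch below uses example values (a 24-core part at a 2.5GHz sustained AVX-512 clock, not a specific SKU; sustained AVX-512 clocks are typically below the advertised base clock):

```python
cores = 24            # example core count (not a specific SKU)
clock_ghz = 2.5       # example sustained AVX-512 clock speed
fma_units = 2         # one or two AVX-512 FMA units, depending on SKU
flops_per_cycle = 16  # double-precision FLOPS per cycle per FMA unit

peak_gflops = cores * clock_ghz * fma_units * flops_per_cycle
print(f"Peak FP64: {peak_gflops:.0f} GFLOPS ({peak_gflops / 1000:.2f} TFLOPS)")
# -> 1920 GFLOPS (1.92 TFLOPS)
```

Note that SKUs with a single AVX-512 FMA unit deliver half this figure, which is why the per-SKU FMA count matters when comparing models.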

The post Detailed Specifications of the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

Check for memory errors on NVIDIA GPUs
https://www.microway.com/knowledge-center-articles/check-for-memory-errors-on-nvidia-gpus/
Thu, 14 Feb 2019 18:10:35 +0000
Professional NVIDIA GPUs (the Tesla and Quadro products) are equipped with error-correcting code (ECC) memory, which allows the system to detect when memory errors occur. Smaller “single-bit” errors are transparently corrected. Larger “double-bit” memory errors will cause applications to crash, but are at least detected (GPUs without ECC memory would continue operating on the corrupted data).

Under some conditions, GPU memory events are reported to the Linux kernel, in which case you will see such errors in the system logs. However, the GPUs themselves also store the type and date of each event.

It’s important to note that not all ECC errors are due to hardware failures. Stray cosmic rays are known to cause bit flips. For this reason, memory is not considered “bad” when a single error occurs (or even when a number of errors occur). If you have a device reporting tens or hundreds of double-bit errors, please contact Microway tech support for review. You may also wish to review the NVIDIA documentation.

To review the current health of the GPUs in a system, use the nvidia-smi utility:

[root@node7 ~]# nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                           : Thu Feb 14 10:58:34 2019
Driver Version                      : 410.48

Attached GPUs                       : 4
GPU 00000000:18:00.0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

GPU 00000000:3B:00.0
    Retired Pages
        Single Bit ECC              : 15
        Double Bit ECC              : 0
        Pending                     : No

The output above shows one card with no issues and one card with a small number of single-bit errors (that card remains functional and in operation).

If the above report indicates that memory pages have been retired, then you may wish to see additional details (including when the pages were retired). If nvidia-smi reports Pending: Yes, then memory errors have occurred since the last time the system rebooted. In either case, there may be older page retirements that took place.

To review a complete listing of the GPU memory pages which have been retired (including the unique ID of each GPU), run:

[root@node7 ~]# nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv

gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c72e, Single Bit ECC
...
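When checking many GPUs or nodes, it can help to summarize that CSV output programmatically. A minimal sketch (the function name is ours; in practice you would feed it the captured output of the nvidia-smi command above):

```python
import csv
from collections import Counter
from io import StringIO

def count_retired_pages(nvidia_smi_csv):
    """Count retired memory pages per GPU from nvidia-smi CSV output."""
    reader = csv.DictReader(StringIO(nvidia_smi_csv), skipinitialspace=True)
    return dict(Counter(row["gpu_uuid"] for row in reader))

sample = """gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
"""
print(count_retired_pages(sample))
# -> {'GPU-9fa5168d-97bf-98aa-33b9-45329682f627': 2}
```

A per-GPU count like this makes it easy to flag devices whose retirement totals are trending upward between maintenance windows.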

A different type of output must be selected in order to read the timestamps of page retirements. The output is in XML format and may require a bit more effort to parse. In short, try running a report such as shown below:

[root@node7 ~]# nvidia-smi -i 1 -q -x | grep -i -A1 retired_page_addr

<retired_page_address>0x000000000005c05e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005ca0d</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005c72e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:31 2017</retired_page_timestamp>
...
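Rather than grep, the XML output can be parsed properly with Python's standard ElementTree module. This sketch assumes each address/timestamp pair is wrapped in a `retired_page` element, as the paired grep output above suggests (verify the element names against your driver's actual `nvidia-smi -q -x` output):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<nvidia_smi_log><gpu><retired_pages>
  <retired_page>
    <retired_page_address>0x000000000005c05e</retired_page_address>
    <retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
  </retired_page>
</retired_pages></gpu></nvidia_smi_log>"""

def retired_pages_with_timestamps(xml_text):
    """Extract (address, timestamp) pairs from 'nvidia-smi -q -x' output."""
    root = ET.fromstring(xml_text)
    return [(page.findtext("retired_page_address"),
             page.findtext("retired_page_timestamp"))
            for page in root.iter("retired_page")]

print(retired_pages_with_timestamps(SAMPLE))
```

Old timestamps accompanied by a low error count are usually benign; a cluster of recent timestamps is the pattern worth escalating.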

The post Check for memory errors on NVIDIA GPUs appeared first on Microway.

Optimizing the Performance of System Memory
https://www.microway.com/knowledge-center-articles/optimizing-the-performance-of-system-memory/
Wed, 12 Sep 2018 13:37:17 +0000
Compute-intensive applications typically require as much system memory bandwidth as can be provided. For this reason, it is very important that system memory be correctly configured and installed. Microway reviews all systems to ensure proper performance (both during the sales and production/integration stages), however we provide this resource as a reference for those who would like to understand the options.

Improperly-configured memory can result in significant performance reductions. For example, a misconfiguration on the latest Intel Xeon CPUs with 6-channel memory controllers can result in a 65% reduction in memory throughput. This can result in an application running at half the anticipated speed. As you’re considering a new system deployment, please work with our experts to ensure success.

The correct configuration depends upon several factors, including the type of CPUs, the product generation, and the design of the system motherboard. To use the tables below, first select which type and generation of system CPUs will be in use. Then look to the rows which show the optimal memory capacities.

It should be noted that we consider a 64GB DIMM to be the largest available capacity in a single memory slot. Although 128GB and 256GB DIMMs are available, their extreme price and limited availability have made them impractical for most customer use cases.
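The core rule behind most of those configuration tables is balance: every memory channel should hold the same number of identical DIMMs. A hypothetical helper illustrating that check (this is our simplification, not a Microway tool, and it ignores secondary rules such as mixed-rank restrictions):

```python
def is_balanced(dimms_per_channel, channels=6):
    """Check that every memory channel is populated identically.

    dimms_per_channel: number of DIMMs installed in each channel,
    e.g. [1, 1, 1, 1, 1, 1] for one DIMM per channel on 6 channels.
    """
    if len(dimms_per_channel) != channels:
        return False
    return len(set(dimms_per_channel)) == 1

print(is_balanced([1, 1, 1, 1, 1, 1]))  # -> True  (balanced)
print(is_balanced([1, 1, 1, 1, 1, 0]))  # -> False (one empty channel)
```

An unbalanced population like the second example forces part of the address space onto fewer channels, which is exactly the scenario that produces the large throughput reductions described above.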

The post Optimizing the Performance of System Memory appeared first on Microway.

In-Depth Comparison of NVIDIA Quadro “Turing” GPU Accelerators
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-quadro-turing-gpu-accelerators/
Tue, 21 Aug 2018 22:05:31 +0000
This article provides in-depth details of the NVIDIA Quadro RTX “Turing” GPUs. NVIDIA “Turing” GPUs bring an evolved core architecture and add dedicated ray tracing units to the previous-generation “Volta” architecture. Turing GPUs began shipping in late 2018.

Important features available in the “Turing” GPU architecture include:

  • New RT ray-tracing cores, enabling real-time ray-tracing performance for the first time
  • Evolved Deep Learning performance with over 130 Tensor TFLOPS (training) and 500 TOPS INT4 (inference) throughput
  • NVLink 2.0 between GPUs—when optional NVLink bridges are added—supporting up to 2 bricks and up to 100GB/sec bidirectional bandwidth
  • New GDDR6 memory with a substantial improvement in memory performance compared to previous-generation GPUs

Quadro “Turing” GPU Specifications

The table below summarizes the features of the available Quadro Turing GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

* FLOPS and TOPS calculations are presented at Max Boost
† Passively-cooled models are available with slightly reduced clock speeds

The post In-Depth Comparison of NVIDIA Quadro “Turing” GPU Accelerators appeared first on Microway.

In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators
https://www.microway.com/knowledge-center-articles/in-depth-comparison-of-nvidia-tesla-volta-gpu-accelerators/
Mon, 12 Mar 2018 22:25:26 +0000
This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed “Volta”). “Volta” GPUs improve upon the previous-generation “Pascal” architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018; Tesla V100S was released in late 2019. Note: these have since been superseded by the NVIDIA Ampere GPU architecture.

This page is intended to be a fast and easy reference of key specs for these GPUs. You may wish to browse our Tesla V100 Price Analysis and Tesla V100 GPU Review for more extended discussion.

Important features available in the “Volta” GPU architecture include:

  • Exceptional HPC performance with up to 8.2 TFLOPS double- and 16.4 TFLOPS single-precision floating-point performance.
  • Deep Learning training performance with up to 130 TFLOPS FP16 half-precision floating-point performance.
  • Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
  • Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU
  • NVLink enables an 8~10X increase in bandwidth between the Tesla GPUs and from GPUs to supported system CPUs (compared with PCI-E).
  • High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
  • Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
  • Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
  • Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
  • Cooperative Groups – a new programming model introduced in CUDA 9 for organizing groups of communicating threads
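The double-precision figure above can be derived from the GPU's layout. The sketch below uses the GV100 configuration with an approximate ~1.6GHz boost clock (roughly the Tesla V100S operating point; the exact boost clock varies by model, so treat this as an estimate):

```python
sms = 80                # streaming multiprocessors in a full GV100
fp64_cores_per_sm = 32  # double-precision units per SM
flops_per_core = 2      # fused multiply-add counts as 2 FLOPS per cycle
boost_clock_ghz = 1.6   # approximate boost clock (varies by model)

peak_tflops = sms * fp64_cores_per_sm * flops_per_core * boost_clock_ghz / 1000
print(f"Peak FP64: ~{peak_tflops:.1f} TFLOPS")
# -> ~8.2 TFLOPS
```

The same formula with the standard Tesla V100's lower 1.53GHz boost clock yields its ~7.8 TFLOPS rating.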

Tesla “Volta” GPU Specifications

The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures

Feature                            | Kepler GK210                                                                   | Pascal GP100            | Volta GV100
Compute Capability ^               | 3.7                                                                            | 6.0                     | 7.0
Threads per Warp                   | 32                                                                             | 32                      | 32
Max Warps per SM                   | 64                                                                             | 64                      | 64
Max Threads per SM                 | 2048                                                                           | 2048                    | 2048
Max Thread Blocks per SM           | 16                                                                             | 32                      | 32
Max Concurrent Kernels             | 32                                                                             | 128                     | 128
32-bit Registers per SM            | 128 K                                                                          | 64 K                    | 64 K
Max Registers per Thread Block     | 64 K                                                                           | 64 K                    | 64 K
Max Registers per Thread           | 255                                                                            | 255                     | 255
Max Threads per Thread Block       | 1024                                                                           | 1024                    | 1024
L1 Cache Configuration             | split with shared memory                                                       | 24KB dedicated L1 cache | 32KB ~ 128KB (dynamic with shared memory)
Shared Memory Configurations       | 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1 (128KB total)               | 64KB                    | configurable up to 96KB; remainder for L1 cache (128KB total)
Max Shared Memory per Thread Block | 48KB                                                                           | 48KB                    | 96KB*
Max X Grid Dimension               | 2^32-1                                                                         | 2^32-1                  | 2^32-1
Hyper-Q                            | Yes                                                                            | Yes                     | Yes
Dynamic Parallelism                | Yes                                                                            | Yes                     | Yes
Unified Memory                     | No                                                                             | Yes                     | Yes
Pre-Emption                        | No                                                                             | Yes                     | Yes

^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
* above 48KB requires dynamic shared memory
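The per-SM limits in the table interact: the number of resident thread blocks on an SM is the minimum allowed by each constraint. A simplified estimator using the Volta column's values (this is our illustration; it ignores shared-memory limits and register allocation granularity, which the CUDA occupancy calculator handles properly):

```python
def max_resident_blocks(threads_per_block, regs_per_thread,
                        max_threads_sm=2048, max_blocks_sm=32,
                        regs_per_sm=64 * 1024):
    """Estimate resident thread blocks per Volta SM (simplified)."""
    by_threads = max_threads_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(by_threads, by_regs, max_blocks_sm)

# 256-thread blocks at 32 registers per thread:
print(max_resident_blocks(256, 32))   # -> 8 blocks (2048 threads: full SM)
# Doubling register usage halves residency:
print(max_resident_blocks(256, 64))   # -> 4 blocks (register-file limited)
```

Sketches like this make clear why compiling with a lower register target can raise occupancy even though no table entry changed.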

Hardware-accelerated video encoding and decoding

All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.

The post In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators appeared first on Microway.
