Detailed Specifications of the “Ice Lake SP” Intel Xeon Processor Scalable Family CPUs
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-ice-lake-sp-intel-xeon-processor-scalable-family-cpus-2/
Tue, 06 Apr 2021 15:00:23 +0000

The post Detailed Specifications of the “Ice Lake SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

This article provides in-depth discussion and analysis of the 10nm Intel Xeon Processor Scalable Family (formerly codenamed “Ice Lake-SP” or “Ice Lake Scalable Processor”). These processors replace the previous 14nm “Cascade Lake-SP” microarchitecture and are available for sale as of April 6, 2021.

The “Ice Lake SP” CPUs are the 3rd generation of Intel’s Xeon Scalable Processor family. This generation brings new features, increased performance, and new server/workstation platforms. The Xeon “Ice Lake SP” CPUs cannot be installed into previous-generation systems. Those considering a new deployment are encouraged to review their plans with one of our experts.

Highlights of the features in Xeon Scalable Processor Family “Ice Lake SP” CPUs include:

  • Up to 40 processor cores per socket (with options for 8-, 12-, 16-, 18-, 20-, 24-, 26-, 28-, 32-, 36-, and 38-cores)
  • Up to 38% higher per-core performance through micro-architecture improvements (at same clock speed vs “Cascade Lake SP”)
  • Significant memory performance & capacity increases:
    • Eight-channel memory controller on each CPU (up from six)
    • Support for DDR4 memory speeds up to 3200MHz (up from 2933MHz)
    • Large-memory capacity with Intel Optane Persistent Memory
    • All CPU models support up to 6TB per socket (combined system memory and Optane persistent memory)
  • Increased link speed between CPU sockets: 11.2GT/s UPI links (up from 10.4GT/s)
  • I/O Performance Improvements – more than twice the throughput of “Cascade Lake SP”:
    • PCI-Express generation 4.0 doubles the throughput of each PCI-E lane (compared to gen 3.0)
    • Support for 64 PCI-E lanes per CPU socket (up from 48 lanes)
  • Continued high performance with the AVX-512 instruction capabilities of the previous generation:
    • AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Two AVX-512 FMA units per CPU core (available in all Ice Lake-SP CPU SKUs)
  • Continued support for deep learning inference with AVX-512 VNNI instruction:
    • Intel Deep Learning Boost (VNNI) provides significantly more efficient deep learning inference acceleration
    • Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
  • Improvements to Intel Speed Select processor configurability:
    • Performance Profiles: certain processors support three distinct core count/clock speed operating points
    • Base Frequency: specific CPU cores are given higher base clock speeds; the remaining cores run at lower speeds
    • Turbo Frequency: specific CPU cores are given higher turbo-boost speeds; the remaining cores run at lower speeds
    • Core Power: each CPU core is prioritized; when surplus frequency is available, it is given to high-priority cores
  • Integrated hardware-based security improvements and total memory encryption

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Continued Specialization of Xeon CPU SKUs

Those already familiar with Intel Xeon will see this processor family is divided into familiar tiers: Silver, Gold, and Platinum. The Silver and Gold models are in the price/performance range familiar to HPC/AI teams. Platinum models are in a higher price range. The low-end Bronze tier present in previous generations has been dropped.

Further, Intel continues to add new specialized CPU models that are optimized for particular workloads and environments. Many of these specialized SKUs are not relevant to readers here, but we summarize them briefly:

  • N: network function virtualization (NFV) optimized
  • P: virtualization-optimized (with a focus on clock frequency)
  • S: max SGX enclave size
  • T: designed for higher-temperature environments (NEBS)
  • V: virtualization-optimized (with focus on high-density/low-power)

Targeting specific workloads and environments provides the best performance and efficiency for those use cases. However, using these CPUs for other workloads may reduce performance, as the CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads. Running other workloads on these optimized CPUs will likely lead to CPU throttling, which would be undesirable. Considering these limitations, the above workload-optimized models will not be included in our review.

Four Xeon CPU specializations relevant to HPC & AI use cases

There are several specialized Xeon CPU options which are relevant to high performance computationally-intensive workloads. Each capability is summarized below and included in our analysis.

  • Liquid-cooled – Xeon 8368Q CPU: optimized for liquid-cooled deployment, this CPU SKU offers high core counts along with higher CPU clock frequencies. The high clock frequencies are made possible only through the more effective cooling provided by liquid-cooled datacenters.
  • Media, AI, and HPC – Xeon 8352M CPU: optimized for AVX-heavy vector instruction workloads as found in media processing, AI, and HPC; this CPU SKU offers improved performance per watt.
  • Performance Profiles – Y: a set of CPU SKUs with support for Intel Speed Select Technology – Performance Profiles. These CPUs are indicated with a Y suffix in the model name (e.g., Xeon 8352Y) and provide flexibility for those with mixed workloads. Each CPU supports three different operating profiles with separate CPU core count, base clock and turbo boost frequencies, as well as operating wattages (TDP). In other words, each CPU could be thought of as three different CPUs. Administrators switch between profiles via system BIOS, or through Operating Systems with support for this capability (Intel SST-PP). Note that several of the other specialized CPU SKUs also support multiple Performance Profiles (e.g., Xeon 8352M).
  • Single Socket – U: single-socket optimized. The CPUs designed for a single socket are indicated with a U suffix in the model name (e.g., Xeon 6312U). These CPUs are more cost-effective. However, they do not include UPI links and thus can only be installed in systems with a single processor.

Summary of Xeon “Ice Lake-SP” CPU tiers

With the Bronze CPU tier no longer present, all models in this CPU family are well-suited to HPC and AI (though some will offer more performance than others). Before diving into the details, we provide a high-level summary of this Xeon processor family:

  • Intel Xeon Silver – suitable for entry-level HPC
    The Xeon Silver 4300-series CPU models provide higher core counts and increased memory throughput compared to previous generations. However, their performance is limited compared to Gold and Platinum (particularly on Core Count, Clock Speed, Memory Performance, and UPI speed).
  • Intel Xeon Gold – recommended for most HPC workloads
    Xeon Gold 5300- and 6300-series CPUs provide the best balance of performance and price. In particular, the 6300-series models should be preferred over the 5300-series models, because the 6300-series CPUs offer improved Clock Speeds and Memory Performance.
  • Intel Xeon Platinum – only for specific HPC workloads
    Although 8300-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., highest core count, large L3 cache).

Xeon “Ice Lake SP” Computational Performance

With this new family of Xeon processors, Intel once again raises peak performance. Nearly every model provides over 1 TFLOPS (one trillion double-precision 64-bit floating-point operations per second), many models exceed 2 TFLOPS, and a few touch 3 TFLOPS. These performance levels are achieved through high core counts and AVX-512 instructions with FMA (as in the first and second Xeon Scalable generations). The plots in the tabs below compare the performance ranges for these new CPUs:
[Chart, tab 1 “AVX-512 Instruction Performance”: comparison of Intel Xeon Ice Lake SP CPU theoretical GFLOPS performance with AVX-512 instructions]
[Chart, tab 2 “AVX2 Instruction Performance”: comparison of Intel Xeon Ice Lake SP CPU theoretical GFLOPS performance with AVX2 instructions]

In the charts above, the shaded/colored bars indicate the expected performance range for each CPU model. The performance is a range rather than a specific value, because CPU clock frequencies scale up and down on a second-by-second basis. The precise achieved performance depends upon a variety of factors including temperature, power envelope, type of cooling technology, the load on each CPU core, and the type(s) of CPU instructions being issued to each core.

The first tab shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., HPL LINPACK). Most applications issue a mix of instructions and will achieve lower than peak FLOPS. Further, applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and thus achieve lower performance. Computational applications which do not utilize AVX-512 instructions will most likely utilize AVX2 instructions (as shown in the second tab, with AVX2 instruction performance).
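
The peak figures behind these charts follow from simple arithmetic. The sketch below is a minimal illustration rather than the method used to build the charts: it multiplies core count, clock speed, and FLOPS per cycle per core. The 32 FLOPS/cycle figure for AVX-512 comes from the two FMA units at 16 double-precision FLOPS each listed in the highlights. The 16 FLOPS/cycle figure for AVX2 assumes two 256-bit FMA units per core, and the 2.3GHz clock is a hypothetical sustained frequency chosen only for illustration.

```python
# Theoretical peak double-precision throughput for one CPU socket.
AVX512_FLOPS_PER_CYCLE = 2 * 16  # two FMA units x 16 DP FLOPS each (from the highlights)
AVX2_FLOPS_PER_CYCLE = 2 * 8     # assumption: two 256-bit FMA units x 8 DP FLOPS each


def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Return peak GFLOPS = cores * clock (GHz) * FLOPS per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    cores, clock_ghz = 40, 2.3  # hypothetical 40-core CPU sustaining 2.3GHz
    avx512 = peak_gflops(cores, clock_ghz, AVX512_FLOPS_PER_CYCLE)
    avx2 = peak_gflops(cores, clock_ghz, AVX2_FLOPS_PER_CYCLE)
    print(f"AVX-512 peak: {avx512:.0f} GFLOPS")
    print(f"AVX2 peak:    {avx2:.0f} GFLOPS")
```

Because real clock frequencies move up and down with the instruction mix, evaluating the function at the lowest and highest expected clock gives a range comparable to the shaded bars in the charts.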

Intel Xeon “Ice Lake SP” Price Ranges

The pricing of the 3rd-generation Xeon Processor Scalable Family spans a wide range, so budget must be kept in mind when selecting options. It would be frustrating to plan on 38-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Ice Lake SP” processors:

As shown in the above plot, the CPUs in this article have been sorted by tier and by price. Most HPC users are expected to select CPU models from the Gold Xeon 6300-series. These models provide close to peak performance for a price around $3,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8300-series.

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Recommended Xeon CPU Models for HPC & AI/Deep Learning

As stated at the top, most of this new CPU family offers excellent performance. However, it is common for HPC sites to set a floor on CPU clock speeds (usually around 2.5GHz) so that no workload suffers unacceptably low performance. While there are users who would demand even higher clock speeds, experience shows that most groups settle on a minimum clock speed in the 2.5GHz to 2.6GHz range. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

[Chart, tab 1 “2.5+GHz Core Counts”: comparison of Intel Xeon Ice Lake SP CPU core counts (for models with 2.5+GHz clock speed)]
[Chart, tab 2 “AVX-512 Performance”: comparison of Intel Xeon Ice Lake SP CPU throughput with AVX-512 instructions (models with 2.5+GHz clock speeds)]
[Chart, tab 3 “AVX2 Performance”: comparison of Intel Xeon Ice Lake SP CPU throughput with AVX2 instructions (models with 2.5+GHz clock speeds)]
[Chart, tab 4 “2.5+GHz Cost-Effectiveness”: comparison of Intel Xeon Ice Lake SP cost-effectiveness (models with 2.5+GHz clock speeds)]

Performance Characteristics of Common Transports and Buses
https://www.microway.com/knowledge-center-articles/performance-characteristics-of-common-transports-buses/
Fri, 19 Jul 2013 16:45:25 +0000

The post Performance Characteristics of Common Transports and Buses appeared first on Microway.

Memory

The following values are measured per CPU socket. They must be doubled or quadrupled to calculate the total memory bandwidth of a multiprocessor workstation or server. For dual-processor systems, multiply by two. For quad-processor systems, multiply by four.

Type | # Channels | Theoretical Bandwidth (unidirectional) | Typical Bandwidth (in Practice)
DDR4 3200MHz | Eight-Channel | 204.8 GB/s | 171.5 GB/s
DDR4 2933MHz | Six-Channel | 140.8 GB/s | 98 GB/s
DDR4 2666MHz | Six-Channel | 128 GB/s | 90 GB/s
DDR4 2400MHz | Quad-Channel | 76.8 GB/s | 64 GB/s
DDR4 2133MHz | Quad-Channel | 68.2 GB/s | 55.5 GB/s
DDR3 1866MHz | Quad-Channel | 59.7 GB/s | 42.8 GB/s
DDR3 1600MHz | Quad-Channel | 51.2 GB/s |
DDR3 1333MHz | Quad-Channel | 42.7 GB/s |
DDR3 1066MHz | Quad-Channel | 34.1 GB/s |
DDR3 1333MHz | Triple-Channel | 32.0 GB/s |
DDR3 1066MHz | Triple-Channel | 25.6 GB/s |
DDR3 800MHz | Triple-Channel | 19.2 GB/s |
DDR3 1866MHz | Dual-Channel | 29.9 GB/s |
DDR3 1600MHz | Dual-Channel | 25.6 GB/s |
DDR3 1333MHz | Dual-Channel | 21.3 GB/s |
DDR3 1066MHz | Dual-Channel | 17.0 GB/s |

Theoretical memory bandwidths are calculated with: 64 bits/transfer * DDR transfers/s * number of memory channels
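
As a minimal sketch, the formula above can be written out as follows. The function name is illustrative, and the code assumes that the "MHz" ratings in the table are millions of transfers per second and that 1 GB is 10^9 bytes.

```python
# Theoretical memory bandwidth per CPU socket, following the formula above.
BYTES_PER_TRANSFER = 64 // 8  # 64 bits per transfer on each memory channel


def memory_bandwidth_gbs(mega_transfers_per_s: float, channels: int) -> float:
    """Return theoretical bandwidth in GB/s (assuming 1 GB = 10**9 bytes)."""
    return BYTES_PER_TRANSFER * mega_transfers_per_s * channels / 1000


if __name__ == "__main__":
    theoretical = memory_bandwidth_gbs(3200, 8)  # DDR4 3200MHz, eight-channel
    typical = 171.5                              # typical value from the table
    print(f"theoretical per socket: {theoretical:.1f} GB/s")
    print(f"typical / theoretical:  {typical / theoretical:.0%}")
    # Dual-processor system: multiply the per-socket figure by two.
    print(f"dual-socket total:      {2 * theoretical:.1f} GB/s")
```

Comparing the typical and theoretical columns this way gives a quick estimate of how much of the theoretical bandwidth a platform delivers in practice.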


PCI-Express

PCI-E Generation | Lanes | Theoretical Bandwidth (unidirectional) | Typical Bandwidth (in Practice)
Gen 1 | x4 | 1,000 MB/s | 880 MB/s
Gen 1 | x8 | 2,000 MB/s | 1,760 MB/s
Gen 1 | x16 | 4,000 MB/s | 3,520 MB/s
Gen 2 | x4 | 2,000 MB/s | 1,600 MB/s
Gen 2 | x8 | 4,000 MB/s | 3,200 MB/s
Gen 2 | x16 | 8,000 MB/s | 6,400 MB/s
Gen 3 | x4 | 4,000 MB/s | 2,800 MB/s
Gen 3 | x8 | 8,000 MB/s | 5,600 MB/s
Gen 3 | x16 | 16,000 MB/s | 12,100 MB/s
Gen 4 | x16 | 32,000 MB/s | 26,200 MB/s
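
The theoretical PCI-E figures can be approximated from the per-lane signalling rate and the line encoding. The sketch below rests on assumptions that are not stated in this article: the rates (2.5, 5, 8, and 16 GT/s) and encodings (8b/10b for Gen 1 and 2, 128b/130b for Gen 3 and 4) are taken from the PCI-Express specifications, and the function name is illustrative.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per generation.
# Assumption: values taken from the PCI-Express specifications, not this article.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_bandwidth_mbs(generation: int, lanes: int) -> float:
    """Return theoretical unidirectional bandwidth in MB/s."""
    gigatransfers, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return gigatransfers * 1000 * efficiency / 8 * lanes


if __name__ == "__main__":
    for generation, lanes in [(1, 4), (2, 16), (3, 16), (4, 16)]:
        mbs = pcie_bandwidth_mbs(generation, lanes)
        print(f"Gen {generation} x{lanes}: {mbs:,.0f} MB/s")
```

Under these assumptions the Gen 1 and Gen 2 results match the table exactly, while Gen 3 and Gen 4 come out about 1.5% lower, which suggests the table rounds those generations to 1,000 and 2,000 MB/s per lane.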

NVIDIA GPU NVLink

The NVLink connectivity on a GPU can be split different ways depending upon the system platform design. Most NVLink 1.0 configurations split the connectivity two ways or four ways (20GB/s on each of four links). NVLink 2.0 configurations can split connectivity two, three, or six ways (25GB/s on each of six links). NVLink 3.0 supports up to twelve links (25GB/s per link).

NVLink Generation | Theoretical Bandwidth* (unidirectional) | Typical Bandwidth (in Practice)
NVLink 1.0 (4 bricks) | 80 GB/s | 73.4 GB/s
NVLink 2.0 (6 bricks) | 150 GB/s | 143.5 GB/s
NVLink 3.0 (12 bricks) | 300 GB/s | 276 GB/s
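
To illustrate how splitting the links affects the bandwidth to each peer, here is a small sketch using the per-link rates and link counts quoted above. It assumes the links are divided evenly among peers, which real platforms do not have to do, and the function name is illustrative.

```python
# (GB/s per link, number of links) for each NVLink generation, from the text above.
NVLINK = {"1.0": (20, 4), "2.0": (25, 6), "3.0": (25, 12)}


def bandwidth_per_peer(generation: str, ways: int) -> int:
    """Return unidirectional GB/s to each peer when links are split evenly."""
    per_link, links = NVLINK[generation]
    if links % ways != 0:
        raise ValueError(f"{links} links cannot be split evenly {ways} ways")
    return per_link * (links // ways)


if __name__ == "__main__":
    for generation, ways in [("1.0", 4), ("2.0", 3), ("3.0", 1)]:
        gbs = bandwidth_per_peer(generation, ways)
        print(f"NVLink {generation} split {ways} way(s): {gbs} GB/s per peer")
```

A split of one way reproduces the aggregate theoretical values in the table (80, 150, and 300 GB/s).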

SAS and SATA

Generation | Theoretical Bandwidth (unidirectional), 4x wide port | Typical Bandwidth (in Practice), SAS / SATA
1.5Gbps (SAS/SATA I) | 600 MB/s | 520 / 450 MB/s
3Gbps (SAS/SATA II) | 1,200 MB/s | 1,140 / 990 MB/s
6Gbps (SAS II/SATA III) | 2,400 MB/s | 2,280 / 1,975 MB/s
12Gbps SAS | 4,800 MB/s | 3,107 / — MB/s

Hard Drives and SSDs

Drive Type | Random IOPS | Sustained Sequential I/O
SAS/SATA 7,200RPM | 70 – 175 | 100 – 230 MB/s
SAS 10,000RPM | 275 – 300 | 125 – 200 MB/s
SAS 15,000RPM | 350 – 450 | 125 – 200 MB/s
SAS/SATA Solid State Drives (SSD) | 15,000 – 100,000 | 110 – 500 MB/s
PCI-E Solid State Drives (NVMe SSD) | 70,000 – 625,000 | 1,100 – 3,200 MB/s

Intel QuickPath Interconnect (QPI) and UltraPath Interconnect (UPI)

The values listed below describe a single QPI/UPI link on an Intel Xeon processor. There are typically two to three UPI links between CPU sockets, but this will vary by platform. Note that the Xeon product lines are segmented. Within a given processor series (e.g., Xeon Scalable “Cascade Lake-SP”), transfer speeds will vary from model to model.

Interconnect | Transfer Speed | Theoretical Bandwidth (unidirectional)
QPI | 4.8 GT/s | 9.6 GB/s
QPI | 5.6 GT/s | 11.2 GB/s
QPI | 6.4 GT/s | 12.8 GB/s
QPI | 7.2 GT/s | 14.4 GB/s
QPI | 8.0 GT/s | 16.0 GB/s
QPI | 9.6 GT/s | 19.2 GB/s
UPI | 10.4 GT/s | 20.8 GB/s
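
Every row of this table works out to two bytes per transfer. The sketch below applies that ratio, which is inferred from the table rather than stated by Intel here, to other cases. Extending it to the 11.2 GT/s UPI speed of “Ice Lake SP” and to a three-link platform are assumptions made for illustration.

```python
BYTES_PER_TRANSFER = 2  # inferred from the table, e.g. 9.6 GB/s at 4.8 GT/s


def link_bandwidth_gbs(transfer_speed_gts: float, links: int = 1) -> float:
    """Return theoretical unidirectional GB/s across `links` QPI/UPI links."""
    return transfer_speed_gts * BYTES_PER_TRANSFER * links


if __name__ == "__main__":
    print(f"UPI 10.4 GT/s, one link:    {link_bandwidth_gbs(10.4):.1f} GB/s")
    # Assumption: the same ratio holds for the 11.2 GT/s UPI links of "Ice Lake SP".
    print(f"UPI 11.2 GT/s, one link:    {link_bandwidth_gbs(11.2):.1f} GB/s")
    print(f"UPI 11.2 GT/s, three links: {link_bandwidth_gbs(11.2, links=3):.1f} GB/s")
```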

AMD Infinity Fabric

The values listed below describe a single Infinity Fabric link on an AMD EPYC processor. In dual-socket EPYC systems, there are typically three or four links between the CPU sockets. Within each EPYC CPU, each of the eight dies on the chip is connected to the I/O die via one Infinity Fabric link.

DDR4 Memory Speed Theoretical Bandwidth (unidirectional)
Zen2/Zen3 18GT/s 72 GB/s
Zen1 10.6GT/s 42.667 GB/s

Note that links between EPYC sockets include CRC overhead, which results in 8/9ths of the bandwidth values shown above (e.g., 37.9GB/s rather than 42.7GB/s).
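
The CRC adjustment is a single multiplication. A minimal sketch, with an illustrative function name, applied to the two theoretical values from the table:

```python
CRC_EFFICIENCY = 8 / 9  # socket-to-socket links carry 8/9ths of the listed bandwidth


def socket_link_bandwidth_gbs(theoretical_gbs: float) -> float:
    """Return the bandwidth of one inter-socket link after CRC overhead."""
    return theoretical_gbs * CRC_EFFICIENCY


if __name__ == "__main__":
    for name, theoretical in [("Zen1", 42.667), ("Zen2/Zen3", 72.0)]:
        adjusted = socket_link_bandwidth_gbs(theoretical)
        print(f"{name}: {theoretical} GB/s listed, {adjusted:.1f} GB/s between sockets")
```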


AMD HyperTransport Link

The values listed below describe a single HyperTransport link on an AMD Opteron processor. In many systems, there were dual HyperTransport links between the CPUs.

Generation | Transfers | Theoretical Bandwidth (unidirectional)
3.1 (Socket G34) | 6.4 GT/s (16-bit) | 12.8 GB/s

Fibre Channel (FC)

FC Rate | Theoretical Bandwidth (unidirectional)
2Gb | 200 MB/s
4Gb | 400 MB/s
8Gb | 800 MB/s
16Gb | 1600 MB/s
32Gb | 3200 MB/s

See also: Performance Characteristics of Common Network Fabrics
