power Archives - Microway

Detailed Specifications of the “Ice Lake SP” Intel Xeon Processor Scalable Family CPUs

Eliot Eshelman — Tue, 06 Apr 2021 15:00:23 +0000

This article provides in-depth discussion and analysis of the 10nm Intel Xeon Processor Scalable Family (formerly codenamed “Ice Lake-SP” or “Ice Lake Scalable Processor”). These processors replace the previous 14nm “Cascade Lake-SP” microarchitecture and are available for sale as of April 6, 2021.

The “Ice Lake SP” CPUs are the 3rd generation of Intel’s Xeon Scalable Processor family. This generation brings new features, increased performance, and new server/workstation platforms. The Xeon ‘Ice Lake SP’ CPUs cannot be installed into previous-generation systems. Those considering a new deployment are encouraged to review with one of our experts.

Highlights of the features in Xeon Scalable Processor Family “Ice Lake SP” CPUs include:

Up to 40 processor cores per socket (with options for 8-, 12-, 16-, 18-, 20-, 24-, 26-, 28-, 32-, 36-, and 38-cores)
Up to 38% higher per-core performance through micro-architecture improvements (at same clock speed vs “Cascade Lake SP”)
Significant memory performance & capacity increases:
- Eight-channel memory controller on each CPU (up from six)
- Support for DDR4 memory speeds up to 3200MHz (up from 2933MHz)
- Large-memory capacity with Intel Optane Persistent Memory
- All CPU models support up to 6TB per socket (combined system memory and Optane persistent memory)
Increased link speed between CPU sockets: 11.2GT/s UPI links (up from 10.4GT/s)
I/O Performance Improvements – more than twice the throughput of “Cascade Lake SP”:
- PCI-Express generation 4.0 doubles the throughput of each PCI-E lane (compared to gen 3.0)
- Support for 64 PCI-E lanes per CPU socket (up from 48 lanes)
Continued high performance with the AVX-512 instruction capabilities of the previous generation:
- AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
- Two AVX-512 FMA units per CPU core (available in all Ice Lake-SP CPU SKUs)
Continued support for deep learning inference with AVX-512 VNNI instruction:
- Intel Deep Learning Boost (VNNI) provides significant, more efficient deep learning inference acceleration
- Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
Improvements to Intel Speed Select processor configurability:
- Performance Profiles: certain processors support three distinct core count/clock speed operating points
- Base Frequency: specific CPU cores are given higher base clock speeds; the remaining cores run at lower speeds
- Turbo Frequency: specific CPU cores are given higher turbo-boost speeds; the remaining cores run at lower speeds
- Core Power: each CPU core is prioritized; when surplus frequency is available, it is given to high-priority cores
Integrated hardware-based security improvements and total memory encryption

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Continued Specialization of Xeon CPU SKUs

Those already familiar with Intel Xeon will see this processor family is divided into familiar tiers: Silver, Gold, and Platinum. The Silver and Gold models are in the price/performance range familiar to HPC/AI teams. Platinum models are in a higher price range. The low-end Bronze tier present in previous generations has been dropped.

Further, Intel continues to add new specialized CPU models that are optimized for particular workloads and environments. Many of these specialized SKUs are not relevant to readers here, but we summarize them briefly:

N: network function virtualization (NFV) optimized
P: virtualization-optimized (with a focus on clock frequency)
S: max SGX enclave size
T: designed for higher-temperature environments (NEBS)
V: virtualization-optimized (with focus on high-density/low-power)

Targeting specific workloads and environments provides the best performance and efficiency for those use cases. However, using these CPUs for other workloads may reduce performance, as the CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads. Running other workloads on these optimized CPUs will likely lead to CPU throttling, which would be undesirable. Considering these limitations, the above workload-optimized models will not be included in our review.

Four Xeon CPU specializations relevant to HPC & AI use cases

There are several specialized Xeon CPU options which are relevant to high performance computationally-intensive workloads. Each capability is summarized below and included in our analysis.

Liquid-cooled – Xeon 8368Q CPU: optimized for liquid-cooled deployment, this CPU SKU offers high core counts along with higher CPU clock frequencies. The high clock frequencies are made possible only through the more effective cooling provided by liquid-cooled datacenters.
Media, AI, and HPC – Xeon 8352M CPU: optimized for AVX-heavy vector instruction workloads as found in media processing, AI, and HPC; this CPU SKU offers improved performance per watt.
Performance Profiles – Y: a set of CPU SKUs with support for Intel Speed Select Technology – Performance Profiles. These CPUs are indicated with a Y suffix in the model name (e.g., Xeon 8352Y) and provide flexibility for those with mixed workloads. Each CPU supports three different operating profiles with separate CPU core count, base clock and turbo boost frequencies, as well as operating wattages (TDP). In other words, each CPU could be thought of as three different CPUs. Administrators switch between profiles via system BIOS, or through Operating Systems with support for this capability (Intel SST-PP). Note that several of the other specialized CPU SKUs also support multiple Performance Profiles (e.g., Xeon 8352M).
Single Socket – U: single-socket optimized. The CPUs designed for a single socket are indicated with a U suffix in the model name (e.g., Xeon 6312U). These CPUs are more cost-effective. However, they do not include UPI links and thus can only be installed in systems with a single processor.

Summary of Xeon “Ice Lake-SP” CPU tiers

With the Bronze CPU tier no longer present, all models in this CPU family are well-suited to HPC and AI (though some will offer more performance than others). Before diving into the details, we provide a high-level summary of this Xeon processor family:

Intel Xeon Silver – suitable for entry-level HPC
The Xeon Silver 4300-series CPU models provide higher core counts and increased memory throughput compared to previous generations. However, their performance is limited compared to Gold and Platinum (particularly on Core Count, Clock Speed, Memory Performance, and UPI speed).
Intel Xeon Gold – recommended for most HPC workloads
Xeon Gold 5300- and 6300-series CPUs provide the best balance of performance and price. In particular, the 6300-series models should be preferred over the 5300-series models, because the 6300-series CPUs offer improved Clock Speeds and Memory Performance.
Intel Xeon Platinum – only for specific HPC workloads
Although 8300-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., highest core count, large L3 cache).

Xeon “Ice Lake SP” Computational Performance

With this new family of Xeon processors, Intel once again delivers unprecedented performance. Nearly every model provides over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second), many models exceed 2 TFLOPS, and a few touch 3 TFLOPS. These performance levels are achieved through high core counts and AVX-512 instructions with FMA (as in the first and second Xeon Scalable generations). The plots in the tabs below compare the performance ranges for these new CPUs:
[tabby title=”AVX-512 Instruction Performance”]

[tabby title=”AVX2 Instruction Performance”]

[tabbyending]

In the charts above, the shaded/colored bars indicate the expected performance range for each CPU model. The performance is a range rather than a specific value, because CPU clock frequencies scale up and down on a second-by-second basis. The precise achieved performance depends upon a variety of factors including temperature, power envelope, type of cooling technology, the load on each CPU core, and the type(s) of CPU instructions being issued to each core.

The first tab shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., HPL LINPACK). Most applications issue a mix of instructions and will achieve lower than peak FLOPS. Further, applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and thus achieve lower performance. Computational applications which do not utilize AVX-512 instructions will most likely utilize AVX2 instructions (as shown in the second tab with AVX2 Instruction performance.

Intel Xeon “Ice Lake SP” Price Ranges

The pricing of the 3rd-generation Xeon Processor Scalable Family spans a wide range, so budget must be kept in mind when selecting options. It would be frustrating to plan on 38-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Cascade Lake SP” processors:

As shown in the above plot, the CPUs in this article have been sorted by tier and by price. Most HPC users are expected to select CPU models from the Gold Xeon 6300-series. These models provide close to peak performance for a price around $3,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8300-series

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Recommended Xeon CPU Models for HPC & AI/Deep Learning

As stated at the top, most of this new CPU family offers excellent performance. However, it is common for HPC sites to set a minimum floor on CPU clock speeds (usually around 2.5GHz), with the intent that no workload suffers too low of a performance. While there are users who would demand even higher clock speeds, experience shows that most groups settle on a minimum clock speed in the 2.5GHz to 2.6GHz range. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

[tabby title=”2.5+GHz Core Counts”]

[tabby title=”AVX-512 Performance”]

[tabby title=”AVX2 Performance”]

[tabby title=”2.5+GHz Cost-Effectiveness”]

[tabbyending]

The post Detailed Specifications of the “Ice Lake SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

Detailed Specifications of the AMD EPYC “Milan” CPUs

Eliot Eshelman — Mon, 15 Mar 2021 15:00:23 +0000

This article provides in-depth discussion and analysis of the 7nm AMD EPYC processor (codenamed “Milan” and based on AMD’s Zen3 architecture). EPYC “Milan” processors replace the previous “Rome” processors and are available for sale as of March 15th, 2021.

These new CPUs are the third iteration of AMD’s EPYC server processor family. They are compatible with existing workstation and server platforms that supported “Rome”, but include new performance and security improvements. If you’re looking to upgrade to or deploy these new CPUs, please speak with one of our experts to learn more.

Important features/changes in EPYC “Milan” CPUs include:

Up to 64 processor cores per socket (with options for 8-, 16-, 24-, 28-, 32-, 48-, and 56-cores)
Improved CPU clock speeds up to 3.7GHz (with Max Boost speeds up to 4.1GHz)
Unified 32MB L3 cache shared between each set of 8 cores (instead of two separate 16MB caches)
Increase in instructions completed per clock cycle (IPC)
IOMMU for improved IO performance in virtualized environments
The security/memory encryption features present in “Rome”, along with SEV-SNP support (protecting against malicious hypervisors)
Plus all the advantages of the previous “Rome” generation:
- Full support for 256-bit AVX2 instructions with two 256-bit FMA units per CPU core
- Up to 16 double-precision FLOPS per cycle per core
- Eight-channel memory controller on each CPU
- Support for DDR4 memory speeds up to 3200MHz
- Up to 4TB memory per CPU socket
- Up to 256MB L3 cache per CPU
- Support for PCI-Express generation 4.0 (which doubles the throughput of gen 3.0)
- 128 lanes of PCI-Express 4.0 per CPU socket

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our general guidance for selecting among the EPYC options is shown below. Note that certain applications may deviate from this general advice (e.g., software which benefits from particularly high clock speeds or larger L3 cache per core).

8-core EPYC CPUs – not recommended for HPC
While perfect for particular applications, these models are not as cost-effective as many of the higher core count options.
16-core to 28-core EPYC CPUs – suitable for most HPC workloads
While not typically offering the best cost-effectiveness, they provide excellent performance at lower price points.
32-core EPYC CPUs – excellent for HPC workloads
These models offer excellent price/performance along with higher clock speeds and core counts
48-core to 64-core EPYC CPUs – suitable for certain HPC workloads
Although these models with high core counts may provide the highest cost-effectiveness and power efficiency, some applications exhibit diminishing returns at the highest core counts. Scalable applications that are not memory bandwidth bound will benefit the most from these EPYC CPUs.

Microway provides a Test Drive cluster to assist in evaluating and comparing products as users determine the ideal specifications for their new HPC & AI deployments. We would be happy to help you evaluate AMD EPYC processors as you plan your next deployment.

AMD EPYC “Milan” Computational Performance

This latest iteration of EPYC CPUs offers excellent performance. However, many of the on-paper comparisons between this generation and the previous generation do not demonstrate large gains. Application benchmarking will be needed to demonstrate many of the gains (such as those provided by the larger/unified L3 cache and the IPC improvements). That being said, most models in this generation provide at least 1 TFLOPs (one teraflop of double-precision 64-bit compute per second) and the 64-core CPUs provide over 2 TFLOPS. The plot below shows the expected performance across this new CPU line-up:

In the chart above, shaded/colored bars indicate the expected performance ranges for each CPU model on traditional HPC applications that use double-precision 64-bit math operations. Peak performance numbers are achieved when executing 256-bit AVX2 instructions with FMA. Note that only a small set of applications are able to use exclusively AVX2 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than the peak FLOPS values shown above. Applications which have not been re-compiled in recent years (with a compiler supporting AVX2 instructions) would achieve lower performance.

The dotted lines above each bar indicate the possible peak performance were all CPU cores operating at boosted clock speeds. While theoretically possible for short amounts of time, sustained performance at these increased CPU frequencies is not expected. Sections of code with dense, vectorized instructions are very demanding, and typically result in each core slightly lowering clock speeds (a behavior not unique to AMD CPUs). While AMD has not published specific clock speed expectations for such codes, Microway expects the EPYC “Milan” CPUs to operate near their standard/published clock speed values when all cores are in use.

Throughout this article, the CPU models are sorted largely by price. The lowest-performance models provide fewer numbers of CPU cores and less L3 cache memory. Higher-end models offer high core counts for the increased performance. HPC and AI groups are generally expected to favor the processor models in the middle of the pack, as the highest core count CPUs are priced at a premium.

Note that those models which only support single-CPU installations are separated on the left side of each plot.

AMD “Milan” EPYC Processor Specifications

The tabs below compare the features and specifications of this 3rd iteration of the EPYC processors. Please notice that CPU models ending with a P suffix are designed for single-socket systems (and do not operate in dual-socket systems). All other CPU models are compatible with both single- or dual-socket systems. The P-series EPYC processors tend to be priced lower and can thus be quite cost-effective, however they are not available in dual-CPU systems.

AMD continues to increase the CPU frequencies of the EPYC CPUs. This ‘Milan’ generation offers multiple SKUs operating above 3GHz. Additionally, the boost clock frequencies are considerably increased – with many nearing or exceeding 4GHz. Each CPU core supports “boost” speeds enabling temporary boosts of speed over the base clock speed. The maximum Boost speed for each CPU model is shown as a dotted line.

The industry continues to push for increased performance, and power control within CPUs continues to advance. With ‘Milan’ EPYC CPUs, each model ships with a default power consumption setting. However, each can be adjusted in the BIOS to set a new configurable TDP (cTDP), which might be higher or lower than the default. The default TDP of each model is shown in the plot below. Note the lowest default wattage for EPYC “Milan” is 155 Watts, with the majority of models in the 200W~240W range, and two models at 280 Watts. Demanding computational users must be certain that the systems they select have received thorough thermal validation. Systems not designed for these wattages will run hot, throttle CPU speeds, and provide lower performance.

Editor’s note: complete pricing was not available at time of publication, so additional analysis of price and cost-effectiveness of each CPU SKU will be added to this article when available.

The post Detailed Specifications of the AMD EPYC “Milan” CPUs appeared first on Microway.

Detailed Specifications of the AMD EPYC “Rome” CPUs

Eliot Eshelman — Wed, 07 Aug 2019 23:00:34 +0000

This article provides in-depth discussion and analysis of the 7nm AMD EPYC processor (codenamed “Rome” and based on AMD’s Zen2 architecture). EPYC “Rome” processors replace the previous “Naples” processors and are available for sale as of August 7th, 2019. We also have provided an AMD EPYC “Rome” CPU Review that you may wish to review. Note: these have since been superseded by the “Milan” AMD EPYC CPUs.

These new CPUs are the second iteration of AMD’s EPYC server processor family. They remain compatible with the existing workstation and server platforms, but feature significant feature and performance improvements. Some of the new features (e.g., PCI-E 4.0) will require updated/revised platforms. If you’re looking to upgrade to or deploy these new CPUs, please speak with one of our experts to learn more.

Important features/changes in EPYC “Rome” CPUs include:

Up to 64 processor cores per socket (with options for 8-, 12-, 16-, 24-, 32-, and 48-cores)
Improved CPU clock speeds up to 3.1GHz (with Boost speeds up to 3.4GHz)
Increased computational performance:
- Full support for 256-bit AVX2 instructions with two 256-bit FMA units per CPU core
  The previous “Naples” architecture split 256-bit instructions into two separate 128-bit operations
- Up to 16 double-precision FLOPS per cycle per core
- Double-precision floating point multiplies complete in 3 cycles (down from 4)
- 15% increase in instructions completed per clock cycle (IPC) for integer operations
Memory capacity & performance features:
- Eight-channel memory controller on each CPU
- Support for DDR4 memory speeds up to 3200MHz (up from 2666MHz)
- Up to 4TB memory per CPU socket
Up to 256MB L3 cache per CPU (up from 64MB)
Support for PCI-Express generation 4.0 (which doubles the throughput of gen 3.0)
Up to 128 lanes of PCI-Express per CPU socket
Improvements to NUMA architecture:
- Simplified design with one NUMA domain per CPU Socket
- Uniform latencies between CPU dies (plus fewer hops between cores)
- Improved InfinityFabric performance (read speed per clock is doubled to 32 bytes)
Integrated in-silicon security mitigations for Spectre

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with HPC and Deep Learning deployments, our guidance for selecting among the EPYC options is as follows:

8-core EPYC CPUs – not recommended for HPC
While available for a low price, these models are not as cost-effective as many of the higher core count models.
12-, 16-, and 24-core EPYC CPUs – suitable for most HPC workloads
While not typically offering the best cost-effectiveness, they provide excellent performance at lower price points.
32-core EPYC CPUs – excellent for HPC workloads
These models offer excellent price/performance along with relatively high clock speeds and core counts
48-core and 64-core EPYC CPUs – suitable for certain HPC workloads
Although the highest core count models appear to provide the best cost-effectiveness and power efficiency, many applications exhibit diminishing returns at the highest core counts. For scalable applications that are not memory bandwidth bound, these EPYC CPUs will be excellent choices.

Microway operates a Test Drive cluster to assist in evaluating and comparing these options as users develop the specifications for their new HPC & AI deployments. We would be happy to help you evaluate AMD EPYC processors as you plan your purchase.

Unprecedented Computational Performance

The EPYC “Rome” processors deliver new capabilities and exceptional performance. Many models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and several models provide 2 TFLOPS. This performance is achieved by doubling the computational power of each core and doubling the number of cores. The plot below shows the performance range across this new CPU line-up:

As shown above, the shaded/colored bars indicate the expected performance ranges for each CPU model. These peak performance numbers are achieved when executing 256-bit AVX2 instructions with FMA. Note that only a small set of codes issue almost exclusively AVX2 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than the peak FLOPS values shown above. Applications which have not been re-compiled with an appropriate compiler would not include AVX2 instructions and would thus achieve lower performance.

The dotted lines indicate the possible peak performance if all cores are operating at boosted clock speeds. While theoretically possible for short bursts, sustained performance at these levels is not expected. Sections of code with dense, vectorized instructions are very demanding, and typically result in the processor core slightly lowering its clock speed (this behavior is not unique to AMD CPUs). While AMD has not published specific clock speed expectations for such codes, Microway expects the EPYC “Rome” CPUs to operate near their “base” clock speed values even when executing code with intensive instructions.

The CPU models above are sorted by price (as discussed in the next section). The lowest-performance models provide fewer numbers of CPU cores, less cache, and slower memory speeds. Higher-end models offer high core counts for the best performance. HPC and AI groups are generally expected to favor the mid-range processor models, as the highest core count CPUs are priced at a premium.

Note that those models which only support single-CPU installations are separated on the left side of each plot.

AMD EPYC “Rome” Price Ranges

The new EPYC “Rome” processors span a fairly wide range of prices, so budget must be considered when selecting a CPU. While the entry-level models are under $1,000, the highest-end EPYC processors cost nearly $10,000 each. It would be frustrating to plan for 64-core processors when the budget cannot support the price. The plot below compares the prices of the EPYC “Rome” processors:

All the CPUs in this article are sorted by price (as shown in the plot above). To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture. The color of each bar indicates the expected customer price per CPU:

Low price tier: prices below $1,000 per CPU
Mid price tier: prices between $1,000 and $2,000
High price tier: prices between $2,000 and $4,000
Premium price tier: prices above $4,000 per EPYC CPU

Most HPC users are expected to select CPU models around the high price tier. These models provide industry-leading performance (and excellent performance per dollar) for a price under $4,000 per processor. Applications can certainly leverage the premium EPYC processor models, but they will come at a higher price.

AMD “Rome” EPYC Processor Specifications

The set of tabs below compares the features and specifications of this new EPYC processor family. Take note that certain CPU SKUs are designed for single-socket systems (indicated with a P suffix on the part number). All other models may be used in either a single- or dual-socket system. The P-series AMD EPYC CPUs have a lower price and are thus the most cost-effective models, but remember that they are not available in dual-CPU systems.

As the industry pushes for increased performance, CPU power requirements have increased across the board. The lowest-wattage EPYC “Rome” CPU is 120 Watts, with the majority of models in the 155W~180W range. The highest core count models are over 200 Watts. HPC users must be certain that the systems they select have received thorough thermal validation. Systems which run hot will throttle CPU speeds and suffer lower performance.

Cost-Effectiveness and Power Efficiency of EPYC “Rome” CPUs

Overall, the AMD EPYC processors provide great value in price spent versus performance achieved. However, there is a spectrum of efficiency, with certain CPU models offering particularly compelling value. Also remember that the prices and power requirements for some of the top models are fairly high. Savvy readers may find the following facts useful:

The most cost-effective CPUs likely to be selected are EPYC 7452 and EPYC 7552
If a balance of cost-effectiveness and higher clock speed are needed, look to EPYC 7502
While the EPYC 7702 looks to be the most cost-effective on paper, it is important to consider that many applications may not be able to scale efficiently to 64 cores. Benchmark before making the selection.
Applications which can be satisfied by a single CPU will benefit greatly from the single-socket EPYC 7xx2P models

The plots below compare the cost-effectiveness and power efficiency of these CPU models. The intent is to go beyond the raw “speeds and feeds” of the processors to determine which models will be most attractive for HPC and Deep Learning/AI deployments.

Cost-Effectiveness for HPC & AI
CPU Power Efficiency

Historically, we have looked at the performance of each CPU and compared that with their costs. However, this presents a distorted view as it does not include all the other necessary components in an HPC/AI system (the server, system memory, high-speed fabric, etc). That simplistic comparison is shown further below, but first examine this plot which demonstrates the cost per FLOPS for a set of complete Compute Nodes with AMD EPYC Rome CPUs, 4GB of system memory per core, and 100Gbps EDR InfiniBand. As shown, the models with higher core counts tend to be the most cost-effective overall.

The plot below shows a simple comparison of CPU performance versus the number of FLOPS provided for each model. This may be useful when comparing to older CPU models, but for new projects we recommend the plot above.

Recommended CPU Models for HPC & AI/Deep Learning

Although most of the EPYC CPUs will offer excellent performance, it is common for computationally-demanding sites to set a floor on CPU clock speeds (usually around 2.5GHz). The intent is to ensure that no workload suffers too low of a performance, as not all are well parallelized. While there are users who would prefer higher clock speeds, experience shows that most groups settle on a minimum clock speed around ~2.5GHz. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

The post Detailed Specifications of the AMD EPYC “Rome” CPUs appeared first on Microway.

Detailed Specifications of the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs

Eliot Eshelman — Tue, 02 Apr 2019 17:00:03 +0000

This article provides in-depth discussion and analysis of the 14nm Intel Xeon Processor Scalable Family (formerly codenamed “Cascade Lake-SP” or “Cascade Lake Scalable Processor”). “Cascade Lake-SP” processors replace the previous 14nm “Skylake-SP” microarchitecture and are available for sale as of April 2, 2019. On February 24, 2020, a set of “Cascade Lake Refresh” Xeon models were released with increased clock speeds and improved cost/performance. These Xeon CPUs have been superseded by the 3rd-generation Intel Xeon ‘Ice Lake SP’ scalable processors.

These new CPUs are the second iteration of Intel’s Xeon Processor Scalable Family. They remain compatible with the existing workstation and server platforms, but bring incremental performance along with additional capabilities and options.

Important features/changes in Xeon Scalable Processor Family “Cascade Lake SP” CPUs include:

Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 16-, 18-, 20-, 24-, and 26-cores)
Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
Continued high performance with the AVX-512 instruction capabilities of the previous generation:
- AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
- Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
Introduction of new AVX-512 VNNI instruction:
- Intel Deep Learning Boost (VNNI) provides significant, more efficient deep learning inference acceleration
- Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
Memory capacity & performance features:
- Six-channel memory controller on each CPU
- Support for DDR4 memory speeds up to 2933MHz (up from 2666MHz)
- Large-memory capabilities with Intel Optane DC Persistent Memory
- All CPU models support up to 1TB-per-socket system memory
- Optional CPUs support up to 4.5TB-per-socket system memory (only available on certain SKUs)
Introduction of Intel Speed Select processor models:
- Certain processors support three distinct operating points
- Each operating point provides a different number of CPU cores
- CPU clock and Turbo Boost speeds optimized for each core count
Integrated hardware-based security mitigations against side-channel attacks

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC & AI applications.

Specialization of Intel Xeon CPUs

The new “Cascade Lake-SP” processors will be be familiar to existing users. Just as in the previous generation, the processor family is divided into four tiers: Bronze, Silver, Gold, and Platinum. Bronze provides modest performance for a low price. The Silver and Gold models are in the price/performance range familiar to HPC users/architects. Platinum models are in a higher price range than HPC groups are typically accustomed to (Platinum tier targets Enterprise workloads, and is priced accordingly).

However, this new generation is not simply a revision of the previous models. Increasingly, we are seeing processors that have been designed with a particular workload in mind. The “Cascade Lake SP” Xeons introduce several new specialized CPU models:

S: search optimized
N: network function virtualization (NFV) optimized
V: virtualization density optimized
Y: Intel speed select
U: single-socket optimized

In the case of the first two specializations (search and NFV), specific CPU clock frequencies and Turbo Boost speeds are guaranteed only for those specific workloads. Running other workloads on these optimized CPUs will likely lead to CPU throttling, which would be undesirable. The virtualization density optimized models provide high CPU core counts within relatively modest power envelopes. However, the processor clock and memory clock frequencies are reduced to accomplish this. Considering these limitations, the search-, NFV-, and virtualization-optimized models will not be included in our review

The single-socket optimized CPUs are indicated with a U suffix in the model name (e.g., Xeon 6210U). These CPUs are quite cost-effective for what they offer (a 6200-series CPU for a 5200-series price). However, they do not include UPI links and thus can only be installed in systems with a single processor.

Intel Speed Select CPUs are indicated with a Y suffix in the model name (e.g., Xeon 6240Y). Each of these three CPUs offers the same core count and clock speed as their non-Y counterpart. However, the system can be rebooted into a lower core-count mode which boosts the CPU clock and Turbo Boost speeds. The Speed Select models available in this generation are: 8260Y, 6240Y, and 4214Y. Although these models are not called out by name below, understand that alternate versions of Xeon 8260, 6240, and 4214 are available if you need core count & clock speed flexibility.

Intel Xeon Bronze – not recommended for HPC
Base-level model with low performance.
Intel Xeon Silver – suitable for entry-level HPC
4200-series models offer slightly improved performance over previous generations.
Intel Xeon Gold – recommended for most HPC workloads
The best balance of performance and price. In particular, the 6200-series models should be preferred over the 5200-series models, because they have twice the number of AVX-512 units
Intel Xeon Platinum – recommended only for specific HPC workloads
Although these 8200-series models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., high core count, large SMP, and large-memory Compute Nodes).

* Given their positioning, the Intel Xeon Bronze products will not be included in our review

Exceptional Computational Performance

The Xeon “Cascade Lake SP” processors deliver new capabilities and unprecedented performance. Most models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and a couple models provide 2 TFLOPS. This performance is achieved with high core counts and AVX-512 instructions with FMA (just as in the previous generation). The plots in the tabs below compare the performance ranges for these new CPUs:

AVX-512 Instructions
AVX2 Instructions

As shown above, the shaded/colored bars indicate the expected performance ranges for each CPU model. The first plot shows performance when using Intel’s AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than peak FLOPS. Applications which have not been re-compiled with an appropriate compiler will not include AVX-512 instructions and thus achieve lower performance. Those expected performance ranges are shown in the plot of AVX2 Instruction performance.

Although the ordering of the above plots may seem arbitrary, they are sorted by price (as discussed in the next section). The lowest-performance models provide fewer numbers of CPU cores and fewer AVX math units. Higher-end models provide a mix of higher core counts and higher clock speeds. A few CPU models, such as Xeon 6244 and Xeon 8256, strongly favor high clock speeds over CPU core count (which results in lower overall FLOPS throughput). HPC and AI groups are expected to favor the Intel Xeon Gold processor models.

Intel Xeon “Cascade Lake SP” Price Ranges

The pricing of the Xeon Processor Scalable Family spans a wide range, so budget must be kept at top of mind when selecting options. It would be frustrating to plan for 28-core processors when the budget cannot support a price of more than $10,000 per CPU. The plot below compares the prices of the Xeon “Cascade Lake SP” processors:

As in the above plot, all the CPUs in this article are sorted by price. Most HPC users are expected to select CPU models from the Gold Xeon 6200-series. These models provide close to peak performance for a price under $4,000 per processor. Certain specialized applications will leverage the Platinum Xeon 8200-series, such as very large memory nodes (>3TB system memory).

To ease comparisons, all of the plots in this article are ordered to match the above plot. Keep this pricing in mind as you review this article and plan your system architecture.

Intel “Cascade Lake SP” Xeon Processor Scalable Family Specifications

The sets of tabs below compare the features and specifications of this new Xeon processor family. As you will see, the Silver (4200-series) and lower-end Gold (5200-series) CPU models offer fewer capabilities and lower performance. The higher-end Gold (6200-series) and Platinum (8200-series) offer more capabilities and higher performance. Additionally, certain CPU SKUs have special models integrating additional specializations:

Enabled for Intel Speed Select (indicated with a Y suffix on the part number)
Support for up to 4.5TB of memory per CPU socket (indicated with an L suffix on the part number)
(these same CPUs have a lower-cost alternate SKU supporting 2TB memory per socket (indicated with an M suffix on the part number)
Designed for single CPU socket systems (indicated with a U suffix on the part number)
All Gold- and Platinum-series CPUs support Intel’s new Optane DC Persistent Memory

Memory performance of Intel Xeon “Cascade Lake-SP” is fairly straightforward, with the Silver CPUs providing a lower speed than the Gold and Platinum models. The amount of memory bandwidth available per CPU core is an important factor, but is simply a function of the number of cores. Users planning to run on CPUs with higher core counts need to ensure that each core won’t be starved of data.

It is important to note that some system platforms support two memory slots per memory channel (a total of 24 DIMMs in a dual-socket system). If both memory slots are populated with memory, the slots will run no faster than 2666MHz (this is simply an electrical/signaling limit).

The UPI capabilities of these CPUs are nearly identical to the previous generation. Each CPU supports two or three UPI links operating at 9.6GT/s to 10.4GT/s. Only the Xeon 6200-series and 8200-series support the higher number of UPI links, which allows greater connectivity between sockets. Dual-socket systems are the most popular configuration for HPC, but not all dual-socket platforms support all three UPI links – review your proposed system architecture.

Although dual-socket systems are the most common for HPC & AI workloads, there are use cases requiring larger or smaller numbers of CPUs. The plot below compares the various CPU socket counts supported by this processor line-up (ranging from a single socket to eight sockets). Take note that although the 5200-series CPUs support four sockets, they only provide dual UPI links. HPC users are advised to look to 6200- and 8200-series models for four-socket systems.

Although there are still processor models in the same power range as previous generations, an increasing number of models feature TDPs above 140 Watts. A few models even reach over 200 Watts. HPC users must be certain that the systems they select have received thorough thermal validation. Systems which run warm will suffer lower performance. In particular, care is recommended with higher clock speed CPUs (2.5+ GHz) which may reduce their clock speeds more aggressively to remain within thermal limits.

In addition to the specifications called out above, technical readers should note that the “Cascade Lake SP” CPU architecture inherits most of the architectural design of the previous “Skylake-SP” architecture, including the mesh processor layout, redesigned L2/L3 caches, greater UPI connectivity between CPU sockets, and improvements to the processor frequency speeds/turbo. A more comprehensive list of features is shown at the end of the article.

Clock Speeds & Turbo Boost

Just as in the previous generation, the “Cascade Lake-SP” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and generating more heat). While a CPU core is executing intensive vector tasks (AVX2 or AVX-512 instructions), the clock speed will be adjusted downwards to keep the processor within its power limits (TDP).

In effect, this will result in the processor running at a lower frequency than the standard clock speed advertised for each model. For that reason, each processor is assigned three frequency ranges:

AVX-512 mode: due to the high requirements of AVX-512/FMA instructions, clock speeds will be lowered while executing AVX-512 instructions *
AVX2 mode: due to the higher requirements of AVX2/FMA instructions, clock speeds will be somewhat lower while executing AVX instructions *
Non-AVX mode: while not executing “heavy” AVX instructions, the processor will operate at the “stock” frequency

Each of the “modes” above is actually a range of CPU clock speeds. The CPU will run at the highest speed possible for the particular set of CPU instructions that have been issued. It is worth noting that these modes are isolated to each core. Within a given CPU, some cores may be operating in AVX mode while others are operating in Non-AVX mode.

* Intel has applied a 1-millisecond hysteresis window on each CPU core to prevent rapid frequency transitions

AVX-512, AVX, and Non-AVX Turbo Boost in Xeon “Cascade Lake-SP” Scalable Family processors

Each CPU also includes the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX-512, AVX, Non-AVX).

The plots below demonstrate processor clock speeds under the following conditions:

All cores on the CPU actively running Non-AVX, AVX, or AVX-512 instructions
A single active core running Non-AVX, AVX, or AVX-512 instructions (all other cores on the CPU must be idle)

The dotted lines represent the range of clock speeds for Non-AVX instructions. The thin grey bars represent the range of clock speeds for AVX2/FMA instructions. The thicker shaded/colored bars represent the range of clock speeds for AVX-512/FMA instructions.

All CPU cores active
A single CPU core active

Note that despite the clear rules stated above, each Turbo Boost value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX-512, AVX, or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

Despite the perceived reduction in performance when running these vector instructions, keep in mind that AVX-512 roughly doubles the number of operations which can be completed per cycle. Although the clock speeds might be reduced by nearly 1GHz, the overall throughput is increased. HPC users should expect their processors to be running in AVX or AVX-512 mode most of the time.

Cost-Effectiveness and Power Efficiency of Xeon “Cascade Lake SP” CPUs

Many of these new processors have the same price structure as earlier Xeon server CPU families. However, the prices and power requirements for some of the premium models are fairly high. Savvy readers may find the following facts useful:

HPC applications run best on the higher-end Gold and Platinum CPU models (6200- and 8200-series), as all of the lower-end CPUs provide only half the number of math units.
Applications which can be satisfied by a single CPU will benefit greatly from the single-socket Xeon 62xxU models
The Platinum models (8200-series) are generally targeted towards Enterprise and Finance – these carry higher prices than other models.

Cost-Effectiveness for HPC & AI
CPU Power Efficiency

This plot compares the power requirements (TDP) versus performance throughput of each CPU. Although this generation includes some of the highest-wattage CPUs to date, each is actually quite power efficient. In fact, even the 205-Watt CPU models are among the top most-efficient models in this product line. Overall, any CPU selected from the Xeon 6200- or 8200-series will be close to the most efficient CPU on the market. Groups which select the lower-price 4200-series CPUs will end up spending more on power per useful work completed.

Recommended CPU Models for HPC & AI/Deep Learning

Although many of these CPU models will offer excellent performance, it is common for HPC sites to set a floor on CPU clock speeds (usually around 2.5GHz). The intent is to ensure that no workload suffers too low of a performance. While there are users who would prefer higher clock speeds, experience shows that most groups settle on a value of 2.5GHz to 2.6GHz. With that in mind, the comparisons below highlight only those CPU models which offer 2.5+GHz performance.

Summary of features in Xeon Scalable Family “Cascade Lake-SP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. They also include lower-level changes that may of interest to expert users. The list below provides a more detailed summary of relevant technology features in Cascade Lake-SP:

Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 16-, 18-, 20-, 24-, and 26-cores)
Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
Continued high performance with the AVX-512 instruction capabilities of the previous generation:
- AVX-512 instructions (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
- Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
Introduction of new AVX-512 VNNI instruction:
- Intel Deep Learning Boost – the new 8-bit Vector Neural Network Instruction (VNNI) provides significant, more efficient deep learning inference acceleration
- Combines three AVX-512 instructions (VPMADDUBSW, VPMADDWD, VPADDD) into a single VPDPBUSD operation
As introduced with “Haswell” and “Broadwell”, these CPUs continue to support 128-bit AVX and 256-bit AVX2 Advanced Vector Extensions with FMA3

Memory capacity & performance features:

Six-channel memory controller on each CPU
Support for DDR4 memory speeds up to 2933MHz (up from 2666MHz)
Single DIMM per channel operates at up to 2933MHz; two DIMMs per channel operate at up to 2666MHz
Large-memory capabilities with Intel Optane DC Persistent Memory
All CPU models support up to 1TB-per-socket system memory
Optional CPU support for 2TB- or 4.5TB-per-socket system memory (only available on certain SKUs)

Introduction of Intel Speed Select processor models:

Certain processors support three distinct operating points
Each operating point provides a different number of CPU cores
CPU clock and Turbo Boost speeds optimized for each core count

Integrated hardware-based security mitigations against side-channel attacks
Fast links between CPU sockets with up to three 10.4GT/s UPI links
I/O connectivity of 48 lanes of generation 3.0 PCI-Express per CPU
CPU cores are arranged in an “Uncore” mesh interconnect
Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (48 PCI-E lanes per socket)

Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. Clock speeds are boosted higher even when many cores are in use. There are three tiers of Turbo Boost clock speeds depending upon the type of instructions being executed:
- Non-AVX: Operations that are not math intensive, or “light” AVX/AVX2 instructions which don’t involve multiply/FMA
- AVX: Operations that heavily use the AVX/AVX2 units, or that use the AVX-512 unit (but not the multiply/FMA instructions)
- AVX-512: Operations that heavily use the AVX-512 units, including multiply/FMA instructions
Intel QuickData Technology (CBDMA) doubles memory-to-memory copy performance and supports MMIO memory copy
Intel Volume Management Device (VMD) provides CPU-level device aggregation for NVMe SSD storage devices
Intel Virtual RAID on CPU (VROC) provides RAID support for NVMe SSD storage devices
Intel C620-series PCH chipset (formerly codenamed “Lewisburg”) with improved connectivity:
- Up to four Intel X722 10GbE/1GbE Ethernet ports with iWARP RDMA support
- PCI-Express generation 3.0 x4 connection from the PCH to the CPUs
- Support for more integrated SATA3 6Gbps ports (up to 14)
- Support for more integrated USB 3.0 ports (up to 10)
- Integrated Intel QuickAssist Technology, which accelerates many cryptographic and compression/decompression workloads
Enhanced CPU Core Microarchitecture:
- Larger and improved branch predictor, higher throughput decoder, larger out-of-order window
- Improved scheduler and execution engine; improved throughput and latency of divide/sqrt
- More load/store bandwidth, deeper load/store buffers, improved prefetcher
- One or Two AVX-512 512-bit FMA units per core
- Support for the following AVX-512 instruction types:
  AVX512-F, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-CD
- 1MB dedicated L2 cache per core
- A 10% (geomean) improvement in instructions per cycle (IPC) versus the “Broadwell” generation CPUs
Re-architected L2/L3 cache hierarchy:
- Each CPU core contains 1MB L2 private cache (up from 256KB)
- Each core’s private L2 acts as primary cache
- Each CPU contains >1.3MB/core of shared L3 cache (for when the private L2 cache is exhausted)
- The shared L3 cache is non-inclusive (does not keep copies of the L2 caches)
- Larger 64-entry L2 TLB for 1GB pages (up from 16 entries)
Dual or Triple Ultra Path Interconnect (UPI) links between processor sockets improve communication speeds for data-intensive applications
RDSEED instruction for high-quality, non-deterministic, random seed values
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is frequently disabled for HPC applications.
Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point

The post Detailed Specifications of the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

Detailed Specifications of the “Skylake-SP” Intel Xeon Processor Scalable Family CPUs

Eliot Eshelman — Tue, 11 Jul 2017 16:15:48 +0000

This article provides in-depth discussion and analysis of the 14nm Intel Xeon Processor Scalable Family (formerly codenamed “Skylake-SP” or “Skylake Scalable Processor”). “Skylake-SP” processors replace the previous 14nm “Broadwell” microarchitecture (both the E5 and E7 Xeon families) and are available for sale as of July 11, 2017. Note: these have since been superseded by the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs.

Important changes available in Xeon Scalable Processor Family “Skylake-SP” CPUs include:

Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, 20-, 24-, and 26-cores)
Floating Point and Integer Instruction performance improvements:
- New AVX-512 instructions double performance
  (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
- Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
Memory capacity & performance improvements:
- Six-channel memory controller on each CPU (up from four-channel on previous platforms)
- Support for DDR4 memory speeds up to 2666MHz
- Optional 1.5TB-per-socket system memory support (only available on certain SKUs)
Faster links between CPU sockets with up to three 10.4GT/s UPI links (replacing the older QPI interconnect)
More I/O connectivity with 48 lanes of generation 3.0 PCI-Express per CPU (up from 40 lanes)
Optional 100Gbps Omni-Path fabric integrated into the processor (only available on certain SKUs)
CPU cores are arranged in an “Uncore” mesh interconnect (replacing the older dual-ring mesh interconnect)
Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
All 2-/4-/8-socket server product families (sometimes called EP 2S, EP 4S, and EX) are merged into a single product line
A new server platform (formerly codenamed “Purley”) to support this new CPU product family

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

A New Strategy with New Processor Tiers

With this new product release, Intel merges together all previous Xeon server product families into a single family. The old model numbers with which you might be familiar – E5-2600, E5-4600, E7-4800, E7-8800 – are now replaced by these “Skylake-SP” CPUs. While this opens up the possibility to select from a broad range of processor models for any given project, it requires attention to detail. There are more than 30 CPU models to select from in the Xeon Processor Scalable Family.

This processor family is divided into four tiers: Bronze, Silver, Gold, and Platinum. The Silver and Gold models are in the price range familiar to HPC users/architects. However, the Platinum models are in a higher price range than HPC groups are typically accustomed to. The Platinum tier targets Enterprise workloads, and is priced accordingly.

With that in mind, our analysis is divided into two sections:

CPU models which fit within the existing price ranges for mainstream HPC
CPU models which are of interest to HPC users, but come at a higher price

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with adoption of new HPC products, our guidance for selecting Xeon tiers is as follows:

Intel Xeon Bronze – Not recommended for HPC
Base-level models with low performance.
Intel Xeon Silver – Suitable for entry-level HPC
Slightly improved performance over previous generations.
Intel Xeon Gold – Recommended for most HPC workloads
The best balance of performance and price. In particular, the 6100-series models should be preferred over the 5100-series models, because they have twice the number of AVX-512 units
Intel Xeon Platinum – Recommended for specific HPC workloads
Although these models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., large SMP and large-memory Compute Nodes).

* Given their positioning, the Intel Xeon Bronze products will not be included in our review

Exceptional Computational Performance

The Xeon “Skylake-SP” processors bring new capabilities, new flexibility, and unprecedented performance. Many models provide over 1 TFLOPS (one teraflop of double-precision 64-bit performance per second) and a couple models provide nearly 2 TFLOPS. This performance is achieved with high core counts and the new AVX-512 instructions with FMA. The plots in the tabs below compare the performance ranges of the recommended CPU tiers:

Mid-Range CPU Tier Performance
High-End CPU Tier Performance

The shaded/colored bars indicate the expected performance ranges for each CPU using the new AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than peak FLOPS.

Notice that each plot shows two separate groups of CPUs separated by a gap. The CPU models on the left of each plot offer the highest numbers of CPU cores (with CPU clock frequency being a secondary priority). The CPU models on the right of each plot are optimized for the highest CPU clock speeds (with high CPU core count as the secondary priority). Intel describes these high clock speed models as “optimized for the highest per-core performance”. In previous generations, these “frequency-optimized” CPU models were typically the niche option. However, in this generation the CPU models which offer the highest per-core performance are expected to be the primary choices for HPC users – they provide base clock speeds in the 2GHz~3GHz range. The CPU models which do not prioritize clock speed are in the 1.5GHz~2GHz range, which many HPC users would consider to be too low.

Intel Xeon “Skylake-SP” Price Ranges

Because the pricing of the Xeon Processor Scalable Family spans such a wide range, budgets need to be kept at top of mind when selecting options. It would be frustrating to plan for 28-core processors when the budget cannot support a price of more than $10,000 per CPU.

The tabs below compare the prices of the various CPU tiers. As above, each plot is divided with high-core-count CPUs on the left and highest-per-core performance on the right.

Mid-Range CPU Tier Pricing
High-End CPU Tier Pricing

As the above plots show, the CPUs are sorted by price. All of the plots in this article are ordered to match the plots above. Keep the pricing in mind as you review the remainder of the information in this article.

Intel “Skylake-SP” Xeon Processor Scalable Family Specifications

The sets of tabs below compare the features and specifications of this new Xeon processor family. As you will see, the Silver (4100-series) and low-end Gold (5100-series) offer fewer capabilities and lower performance. The high-end Gold (6100-series) and Platinum (8100-series) offer more capabilities and higher performance. Additionally, certain models within the 6100-series and 8100-series have special models integrating additional specializations:

Enabled for up to 1.5TB of memory per CPU socket (indicated with an M suffix on the part number)
Including integrated 100Gbps Omni-Path interconnect (indicated with an F suffix on the part number)

In addition to the significant performance increases, there are notable changes to the “Skylake-SP” processor designs. These include a completely new mesh connectivity between the processor cores, redesigned L2/L3 caches, greater connectivity between CPU sockets, and new changes to the processor frequency speeds. These are discussed further in the sections below.

Most HPC groups should find that 12-core, 14-core, and 16-core models fit within their budget. Systems with up to 24-cores per CPU will not be shockingly expensive. However, the 26-core and 28-core models are only available within the Platinum tier and will be at a higher cost than most groups would consider cost-effective.

DDR4 Memory Speed

Mid-Range CPU Tier
High-End CPU Tier

As shown above, memory performance is fairly homogeneous across this CPU family. The amount of memory bandwidth available per CPU core will be an important factor, but is simply a function of the number of cores. Users planning to run on CPUs with higher core counts need to ensure that each core won’t be starved of data.

Intel has also enabled these CPUs to drive fully-populated systems at full memory speed. In previous generations, populating more than half of the memory slots would result in a modest reduction in memory speed.

L3 Cache Size

Mid-Range CPU Tier
High-End CPU Tier

Each CPU has been designed to offer at least 1.375MB of L3 cache per core. As shown above, there are several models which feature a larger quantity of L3 per core. Remember that each core also has 1MB of private L2 cache. In this generation, the L3 cache is largely seen as a fallback if data spills out of L2 (a “victim cache”).

Ultra Path Interconnect (UPI) Performance

Mid-Range CPU Tier
High-End CPU Tier

With the “Skylake-SP” architecture, Intel has replaced the older QPI interconnect with UPI. The throughput per link increases from 9.6GT/s to 10.4GT/s. Additionally, many CPU models support up to 3 UPI links per socket (compared to 2 QPI links in most earlier platforms). This allows greater connectivity between sockets, particularly on dual-socket systems which are the most popular configuration for HPC.

Power Consumption (TDP)

Mid-Range CPU Tier
High-End CPU Tier

Although there are still many models in the same power range as previous generations, there are an increasing number of models with TDPs above 140 Watts. A couple of models even reach over 200 Watts. For this generation, HPC users must be certain that the systems they use have gone through careful thermal validation. Systems which run warm will suffer lower performance.

Clock Speeds & Turbo Boost in Xeon “Skylake-SP” Scalable Family processors

With each new processor line, Intel introduces new architecture optimizations. The design of the “Skylake-SP” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and thus generating more heat). While a CPU core is executing intensive vector tasks (AVX or AVX-512 instructions), the clock speed will be adjusted downwards to keep the processor within its power limits (TDP).

In effect, this will result in the processor running at a lower frequency than the standard clock speed advertised for each model. For that reason, each “Skylake-SP” processor is assigned three “base” frequencies:

AVX-512 mode: due to the high requirements of AVX-512/FMA instructions, clock speeds will be lowered while executing AVX-512 instructions *
AVX mode: due to the higher power requirements of AVX2/FMA instructions, clock speeds will be somewhat lower while executing AVX instructions *
Non-AVX mode: while not executing AVX/AVX-512 instructions, the processor will operate at what would traditionally be considered the “stock” frequency

* Intel has applied a 1-millisecond hysteresis window on each CPU core to prevent rapid frequency transitions

AVX-512, AVX, and Non-AVX Turbo Boost

Just as in previous generations, “Skylake-SP” CPUs include the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX-512, AVX, Non-AVX).

The plots below demonstrate processor clock speeds under the following conditions:

All cores on the CPU actively running Non-AVX, AVX, or AVX-512 instructions
A single active core running Non-AVX, AVX, or AVX-512 instructions (all other cores on the CPU must be idle)

The dotted lines represent the range of clock speeds for Non-AVX instructions. The thin cyan bars represent the range of clock speeds for AVX2/FMA instructions. The thicker shaded/colored bars represent the range of clock speeds for AVX-512/FMA instructions.

Mid-Range CPU Tier – All Cores Active
High-End CPU Tier – All Cores Active

Mid-Range CPU Tier – Single Core Active
High-End CPU Tier – Single Core Active

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX-512, AVX, or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

Despite the perceived reduction in performance when running these vector instructions, keep in mind that AVX-512 roughly doubles the number of operations which can be completed per cycle. Although the clock speeds are reduced, the overall throughput is increased. HPC users should expect their processors to be running in AVX or AVX-512 mode most of the time.

Cost-Effectiveness and Power Efficiency of Xeon “Skylake-SP” CPUs

As mentioned earlier, many of the new processors have the same price structure as earlier Xeon E5 and E7 server CPU families. However, the prices and power requirements for some of the premium models are higher than in previous generations. Savvy readers may find the following facts useful:

HPC applications run best on the higher-end Gold and Platinum CPU models (6100- and 8100-series), as all of the lower-end CPUs provide only half the number of math units.
The Platinum models (8100-series) are generally targeted towards Enterprise and Finance – these carry higher prices than other models.

The plots below compare the price versus performance of these CPUs. In general, the Xeon 6100-series provide the most cost-effective performance. The Xeon 4100-series and Xeon 5100-series CPUs are available for a lower price, but they include only a single AVX-512 math unit and do not offer cost-effective performance.

Performance versus Price

Mid-Range CPU Tier
High-End CPU Tier

The plots below compare the power requirements (TDP) versus performance of each CPU. Although this generation includes some of the highest-wattage CPUs to date, each is actually quite power efficient. In fact, both of the 205 Watt CPU models are among the top three most efficient models in this product line.

Performance versus Power

Mid-Range CPU Tier
High-End CPU Tier

Summary of features in Xeon Scalable Family “Skylake-SP” processors

Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, 20-, 24-, and 26-cores)
Floating Point and Integer Instruction performance improvements:
- New AVX-512 instructions double performance
  (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
- Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
- As introduced with “Haswell” and “Broadwell”, these CPUs continue to support 128-bit AVX and 256-bit AVX2 Advanced Vector Extensions with FMA3
Memory capacity & performance improvements:
- Six-channel memory controller on each CPU (up from four-channel on previous platforms)
- Support for DDR4 memory speeds up to 2666MHz
- Support for operating DDR4 memory at full speed, even with two memory DIMMs installed per channel
- Optional 1.5TB-per-socket system memory support (only available on certain SKUs)
Faster links between CPU sockets with up to three 10.4GT/s UPI links (replacing the older QPI interconnect)
More I/O connectivity with 48 lanes of generation 3.0 PCI-Express per CPU (up from 40 lanes)
Optional 100Gbps Omni-Path fabric integrated into the processor (only available on certain SKUs)
CPU cores are arranged in an “Uncore” mesh interconnect (replacing the older dual-ring mesh interconnect)
Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (48 PCI-E lanes per socket)
Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With “Skylake-SP”, clock speeds are boosted higher even when many cores are in use. There are three tiers of Turbo Boost clock speeds depending upon the type of instructions being executed:
- Non-AVX: Operations that are not math intensive, or that use AVX/AVX2 instructions which don’t involve multiply/FMA
- AVX: Operations that heavily use the AVX/AVX2 units, or that use the AVX-512 unit (but not the multiply/FMA instructions)
- AVX-512: Operations that heavily use the AVX-512 units, including multiply/FMA instructions
Intel QuickData Technology (CBDMA) doubles memory-to-memory copy performance and supports MMIO memory copy
Intel Volume Management Device (VMD) provides CPU-level device aggregation for NVMe SSD storage devices
Intel Virtual RAID on CPU (VROC) provides RAID support for NVMe SSD storage devices
A new Intel C620-series PCH chipset (formerly codenamed “Lewisburg”) with improved connectivity:
- Up to four Intel X722 10GbE/1GbE Ethernet ports with iWARP RDMA support
- PCI-Express generation 3.0 x4 connection from the PCH to the CPUs (previous generations used PCI-E gen 2.0)
- Support for more integrated SATA3 6Gbps ports (up to 14)
- Support for more integrated USB 3.0 ports (up to 10)
- Integrated Intel QuickAssist Technology, which accelerates many cryptographic and compression/decompression workloads
Enhancements to the CPU Core Microarchitecture:
- Larger and improved branch predictor, higher throughput decoder, larger out-of-order window
- Improved scheduler and execution engine; improved throughput and latency of divide/sqrt
- More load/store bandwidth, deeper load/store buffers, improved prefetcher
- One or Two AVX-512 512-bit FMA units per core (compared to only one on desktop “Skylake” models)
- Support for the following AVX-512 instruction types:
  AVX512-F, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-CD
- 1MB L2 cache per core (compared to only 256KB L2 on desktop “Skylake” models)
- A 10% (geomean) improvement in instructions per cycle (IPC) versus the previous-generation Broadwell CPUs
Re-architected L2/L3 cache hierarchy:
- Each CPU core contains 1MB L2 private cache (up from 256KB)
- Each core’s private L2 acts as primary cache
- Each CPU contains >1.3MB/core of shared L3 cache (for when the private L2 caches overflow)
- The shared L3 cache is now non-inclusive (does not keep copies of the L2 caches)
- Larger 64-entry L2 TLB for 1GB pages (up from 16 entries)
Dual or Triple Ultra Path Interconnect (UPI) links between processor sockets improve communication speeds for data-intensive applications
Introduction of the RDSEED instruction for high-quality, non-deterministic, random seed values
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is frequently disabled for HPC applications.
Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point

The post Detailed Specifications of the “Skylake-SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

Detailed Specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors

Eliot Eshelman — Thu, 31 Mar 2016 16:30:59 +0000

This article provides in-depth discussion and analysis of the 14nm Xeon E5-2600v4 series processors (formerly codenamed “Broadwell-EP”). “Broadwell” processors replace the previous 22nm “Haswell” microarchitecture and are available for sale as of March 31, 2016. For an introduction, read our blog post Intel Xeon E5-2600 v4 “Broadwell” Processor ReviewNote: these have since been superceded by the Intel Xeon Processor Scalable Family CPUs.

Important changes available in E5-2600v4 “Broadwell-EP” include:

Up to 22 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, and 20-cores)
Support for DDR4 memory speeds up to 2400MHz
Floating Point Instruction performance improvements:
- Faster floating point multiplier completes operations in 3 cycles (down from 5 cycles)
- 1024 Radix divider for reduced latency
- Split Scalar divides for increased parallelism/bandwidth
- Faster vector Gather
- As introduced with Haswell, Broadwell continues to support AVX2 and FMA3 instructions for significant speedups of floating-point multiplication and addition operations
Extract more parallelism in scheduling micro-operations:
- Reduced instruction latencies on ADC, CMOV and PCLMULQDQ
- Larger out-of-order scheduler, with 64 entries (up from 60 entries)
- Improved address prediction for branches and returns, with an expanded 10-way Branch Prediction Unit Target Array (up from 8-way)
Improved performance on large data sets:
- Larger L2 Translation Lookaside Buffer (TLB), with 1.5k entries (up from 1K entries)
- A new L2 TLB for 1GB pages (with 16 entries)
- Addition of a second TLB page miss handler for parallel page walks

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Exceptional Computational Performance

The Xeon E5-2600v4 processors provide the highest performance available to date in a socketed CPU. Many of the higher-end models provide well over 500 GFLOPS (more than half a TFLOPS). Much of this performance is made possible through the use of AVX2 with FMA3 instructions. The plot below compares the peak performance of these CPUs with and without FMA instructions:

The colored bars indicate performance using only AVX instructions; the grey bars indicate theoretical peak performance when using AVX with FMA. Note that only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Most applications will issue a variety of instructions, which will result in lower than peak FLOPS. Expect the achieved performance for well-parallelized & optimized applications to fall between the grey and colored bars.

Intel Xeon E5-2600v4 Series Specifications

The tabs below compare the features and specifications of the new model line. Intel has divided the CPUs into several groups:

Standard: cost-effective CPUs with moderate performance
Advanced: CPUs offering the highest performance for most applications
High Core Count: ideal for highly multi-threaded applications; CPUs providing the highest number of processor cores (sometimes sacrificing clock frequency in favor of core count)
Frequency Optimized: ideal for non-parallel/single-threaded applications; CPUs with the highest clock speeds (sacrificing number of cores in order to provide the highest frequencies)

Although these processors introduce significant performance increases, technical readers will see that many of the changes are incremental: increased core counts, improved DDR memory speed, etc. However, processor clock speeds/frequencies have not seen significant improvements.

In fact, in some cases the CPU frequency has been lowered from the previous models. Processor frequency and Turbo Boost behavior have changed fairly significantly in the last two CPU releases (“Haswell” and “Broadwell”). Those metrics are discussed in further detail in the next section.

Model	AVX Frequency	AVX Turbo Boost	Core Count	Memory Speed	L3 Cache	QPI Speed	TDP (Watts)
E5-2699v4	1.80 GHz	3.60 GHz	22	2400 Mhz	55 MB	9.6 GT/s	145W
E5-2698v4	20	50 MB	135W
E5-2697Av4	2.20 GHz	3.10 GHz	16	40 MB	145W
E5-2697v4	2.00 GHz	3.60 GHz	18	45 MB
E5-2695v4	1.70 GHz	3.10 GHz	120W
E5-2683v4	3.00 GHz	16	40 MB
E5-2690v4	2.10 GHz	3.50 GHz	14	35 MB	135W
E5-2680v4	1.90 GHz	3.30 GHz	120W
E5-2660v4	1.70 GHz	3.20 GHz	105W
E5-2650v4	1.80 GHz	2.80 GHz	12	30 MB
E5-2640v4	2.00 GHz	3.40 GHz	10	2133 Mhz	25 MB	8 GT/s	90W
E5-2630v4	1.80 GHz	3.10 GHz	85W
E5-2620v4	3.00 GHz	8	20 MB
E5-2687Wv4	2.60 GHz	3.40 GHz	12	2400 Mhz	30 MB	9.6 GT/s	160W
E5-2667v4	3.50 GHz	8	25 MB	135W
E5-2643v4	2.80 GHz	3.60 GHz	6	20 MB
E5-2637v4	3.20 GHz	4	15 MB
E5-2623v4	2.20 GHz	3.20 GHz	2133 Mhz	10 MB	8 GT/s	85W

HPC groups do not typically choose Intel’s “Basic” and “Low Power” models – those skus are not shown.

Clock Speeds & Turbo Boost in Xeon E5-2600v4 series “Broadwell” processors

With each new processor line, Intel introduces new architecture optimizations. The design of the “Broadwell” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and thus generating more heat). While a CPU core is executing intensive vector tasks (AVX instructions), the clock speed may be reduced to keep the processor within its power limits (TDP).

In effect, this may result in the processor running at a lower frequency than the “base” clock speed advertised for each model. For that reason, each “Broadwell” processor is assigned two “base” frequencies:

AVX mode: due to the higher power requirements of AVX instructions, clock speeds may be somewhat lower while executing AVX instructions *
Non-AVX mode: while not executing AVX instructions, the processor will operate at what would traditionally be considered the “stock” frequency

* a CPU core will return to Non-AVX mode 1 millisecond after AVX instructions complete

It is worth noting that these modes are isolated to each core. Within a given CPU, some cores may be operating in AVX mode while others are operating in Non-AVX mode. In the previous generation, AVX instructions running on a single core would cause all cores to run in AVX mode.

AVX and Non-AVX Turbo Boost

Just as in previous architectures, “Broadwell” CPUs include the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX vs. Non-AVX).

The two plots below show that processor clock speeds can be categorized as:

All cores on the CPU actively running Non-AVX instructions
All cores on the CPU actively running AVX instructions
A single active core running Non-AVX instructions (all other cores on the CPU must be idle)
A single active core running AVX instructions (all other cores on the CPU must be idle)

Clock Speeds for All-Core Operation
Clock Speeds for Single-Core Operation

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

When examining the differences between AVX and Non-AVX instructions, notice that Non-AVX instructions typically result in no more than a 100MHz to 200MHz increase in the highest clock speed. However, AVX instructions may cause clock speeds to drop by 300MHz to 400MHz if they are particularly intensive.

Recall that AVX2 introduces support for both integer and floating-point instructions, which means any compute-intensive application will be using such instructions (if it has been properly designed and compiled). HPC users should expect their processors to be running in AVX mode most of the time.

Top Clock Speeds for Specific Core Counts

When workloads leave some CPU cores idle, the Xeon E5-2600v4 processors are able to use that headroom to increase the clock speed of the cores which are performing work. Just as with other Turbo Boost scenarios, the precise speed increase will depend upon the CPU model. It will also depend upon how many CPU cores are active.

We advise users to consider how many CPU cores their application is able to saturate. The tabs below detail the peak Turbo Boost frequencies for each CPU model, sorted by the number of active cores:

All of the above plots show CPU frequencies for applications utilizing AVX instructions. The colored bars indicate the worst-case scenario – CPUs will run at least this fast. The grey bars indicate the expected clock speeds for most workloads.

Cost-Effectiveness and Power Efficiency of Xeon E5-2600v4 CPUs

The “Broadwell-EP” processors have nearly the same price structure and power requirements as earlier Xeon E5-2600 products, so their cost-effectiveness and power-efficiency should be quite attractive to HPC users. Savvy readers may find the following facts useful:

HPC applications run best on the Advanced CPU models; they typically do not scale well on the High-Core-Count models.
The High-Core-Count models are more common in Enterprise and Finance – these carry higher prices than other E5-2600 models.
The following graphs depict the cost-effectiveness and power-efficiency of only the CPU itself. In many cases, HPC users will find that once they’ve taken the full platform and cluster design into account, the cost-effectiveness of an Advanced CPU may be higher than these plots demonstrate.

Summary of features in Xeon E5-2600v4 “Broadwell-EP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. The list below provides a summary of relevant technology features:

Up to 22 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, and 20-cores)
Support for Quad-channel ECC DDR4 memory speeds up to 2400MHz
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket)
Floating Point Instruction performance improvements:
- Faster floating point multiplier completes operations in 3 cycles (down from 5 cycles)
- 1024 Radix divider for reduced latency
- Split Scalar divides for increased parallelism/bandwidth
- Faster vector Gather
As introduced with “Haswell”, “Broadwell” continues to supportAdvanced Vector Extensions (AVX 2.0):
- effectively double the throughput of integer and floating-point operations with math units expanded from 128-bits to 256-bits
- introduce Fused Multiply Add (FMA3) instructions which allow a multiply and an accumulate instruction to be completed in a single cycle (effectively doubling the FLOPS/clock from 8 to 16 for each core of a CPU)
- add support for additional instructions, including Gather and vector shift
- F16C 16-bit Floating-Point conversion instructions accelerate data conversion between 16-bit and 32-bit floating point formats
Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0, (introduced in “Sandy Bridge”) clock speeds are boosted more frequently, to higher speeds and for longer periods of time. With “Haswell” and “Broadwell”, top clock speeds depend upon the type of instructions (AVX vs. Non-AVX).
Extract more parallelism in scheduling micro-operations:
- Reduced instruction latencies on ADC, CMOV and PCLMULQDQ
- Larger out-of-order scheduler, with 64 entries (up from 60 entries)
- Introduction of the ADCX and ADOX instructions to speed up cryptography
- Improved address prediction for branches and returns, with an expanded 10-way Branch Prediction Unit Target Array (up from 8-way)
Improved performance on large data sets:
- Larger L2 Translation Lookaside Buffer (TLB), with 1.5k entries (up from 1K entries)
- A new L2 TLB for 1GB pages (with 16 entries)
- Addition of a second TLB page miss handler for parallel page walks
Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
Transactional Synchronization Extensions (TSX) improve the parallelism of multi-threaded applications with synchronization locks
Introduction of the RDSEED instruction for high-quality, non-deterministic, random seed values
Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) forDirected I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
Intel APIC Virtualization (APICv) provides increased virtualization performance
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.
Improved energy efficiency with Per Core P-States and independent uncore frequency control
Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point
DDR4 CRC provides better memory reliability and data integrity by detecting memory bus faults during write
ECRC for PCI-Express provides optional data integrity protection for systems using PCI-Express switches or bridges

The post Detailed Specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors appeared first on Microway.

Detailed Specifications of the Intel Xeon E5-4600 v3 “Haswell-EP” Processors

Eliot Eshelman — Mon, 15 Jun 2015 22:12:22 +0000

This article provides in-depth discussion and analysis of the 22nm Xeon E5-4600 v3 series processors (formerly codenamed “Haswell-EP”). “Haswell” processors replace the previous 22nm “Ivy Bridge” microarchitecture and are available for sale as of June 1, 2015. For an introduction, read our blog post Xeon E5-4600v3 4-socket CPU Review

Important changes available in E5-4600 v3 “Haswell-EP” include:

Up to 18 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14- and 16-cores)
Support for DDR4 memory speeds up to 2133MHz
Advanced Vector Extensions version 2.0 (AVX2 instructions):
- allow 256-bit wide operations for both integer and floating-point numbers (the older AVX instructions supported only floating-point operations)
- introduce Fused Multiply Add FMA3 instructions, which allow a multiply and an accumulate instruction to be completed in a single cycle (potentially doubling throughput for floating-point applications – up to 16 FLOPS per cycle)
- add support for additional instructions, including Gather and vector shift
Improved energy efficiency with Per Core P-States and independent uncore frequency control

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Exceptional Computational Performance

The Xeon E5-4600 v3 processors provide some of the highest performance available to date in a socketed CPU (similar to their dual-socket “Haswell-EP” counterparts). For the first time, this architecture offers a single CPU capable of more than half a TeraFLOPS (500 GFLOPS) and total system performance over 2 TFLOPS!. This is made possible through the use of AVX2 with FMA3 instructions. The plot below compares the peak performance of a single CPU with and without FMA instructions:

Intel Xeon E5-4600 v3 Series Specifications

The tabs below compare the features and specifications of the new model line. Intel has divided the CPUs into several groups:

Standard: cost-effective CPUs with moderate performance
Advanced: CPUs offering the highest performance for most applications
High Core Count: ideal for well-parallelized applications; CPUs providing the highest number of processor cores (sometimes sacrificing clock frequency in favor of core count)
Frequency Optimized: ideal for non-parallel/single-threaded applications; CPUs with the highest clock speeds (sacrificing number of cores in order to provide the highest frequencies)

In fact, in some cases the CPU frequency has been lowered from the previous models. Processor frequency and Turbo Boost behavior have changed significantly with this release. Those metrics are discussed in further detail in the next section.

Model	Frequency	Frequency (AVX)	Turbo Boost	Core Count	L3 Cache	QPI Speed	Memory Speed	TDP (Watts)
E5-4669 v3	2.10 GHz	1.80 GHz	2.90 GHz	18	45MB	9.6 GT/s	2133 MHz	135W
E5-4667 v3	2.00 GHz	1.70 GHz	2.90 GHz	16	40MB
E5-4660 v3	2.10 GHz	1.80 GHz	2.90 GHz	14	35MB	120W
E5-4650 v3	2.10 GHz	1.80 GHz	2.80 GHz	12	30MB	105W
E5-4640 v3	1.90 GHz	1.60 GHz	2.60 GHz	8.0 GT/s	1866 MHz
E5-4620 v3	2.00 GHz	1.70 GHz	2.60 GHz	10	25MB
E5-4610 v3	1.70 GHz	1.70 GHz	None	6.4 GT/s	1600 MHz

HPC groups do not typically choose Intel’s “Basic” models (e.g., E5-4610 v3)

Intel Xeon E5-4600v3 Frequency Optimized SKUs

Model	Frequency	Frequency (AVX)	Turbo Boost	Core Count	L3 Cache	QPI Speed	Memory Speed	TDP (Watts)
E5-4655 v3	2.90 GHz	2.60 GHz	3.20 GHz	6	30MB	9.6 GT/s	2133 MHz	135W
E5-4627 v3	2.60 GHz	2.30 GHz	3.20 GHz	10	25MB

The above SKUs offer better memory bandwidth per core

Clock Speeds & Turbo Boost in Xeon E5-4600 v3 series “Haswell” processors

With each new processor line, Intel introduces new architecture optimizations. The design of the “Haswell” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and thus generating more heat). While a CPU core is executing intensive vector tasks (AVX instructions), the clock speed may be reduced to keep the processor within its power limits (TDP).

In effect, this may result in the processor running at a lower frequency than the “base” clock speed advertised for each model. For that reason, each “Haswell” processor model is assigned two “base” frequencies:

AVX mode: due to the higher power requirements of AVX instructions, clock speeds may be somewhat lower while executing AVX instructions *
Non-AVX mode: while not executing AVX instructions, the processor will operate at what would traditionally be considered the “stock” frequency

* a CPU core will return to Non-AVX mode 1 millisecond after AVX instructions complete

AVX and Non-AVX Turbo Boost

Just as in previous architectures, “Haswell” CPUs include the Turbo Boost feature which causes each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. With the “Haswell” architecture, Turbo Boost speed increases also depend upon the types of instructions (AVX vs. Non-AVX).

The two plots below show that processor clock speeds can be categorized as:

All cores on the CPU actively running Non-AVX instructions
All cores on the CPU actively running AVX instructions
A single active core running Non-AVX instructions (all other cores on the CPU must be idle)
A single active core running AVX instructions (all other cores on the CPU must be idle)

Clock Speeds for All-Core Operation
Clock Speeds for Single-Core Operation

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

When examining the differences between AVX and Non-AVX instructions, notice that Non-AVX instructions do not result in dramatically higher Turbo Boost speeds. With the exception of the E5-4620 v3, none of the grey bars rises any higher than the colored bars. Thus, for most CPUs the maximum possible Turbo Boost speed is the same when using AVX and Non-AVX instructions. However, heavy usage of AVX instructions may reduce the clock speed by as much as 300MHz.

Of course, it is worth remembering that the usage of AVX instructions can result in as much as a 100% increase in performance. It is much better to leverage AVX instructions – gaining the 100% increase in instruction throughput and suffering the small 5% to 15% CPU clock speed penalty. It would be unwise to turn off AVX with the expectation that overall performance would increase.

Top Clock Speeds for Specific Core Counts

When workloads leave some CPU cores idle, the Xeon E5-4600 v3 processors are able to use that headroom to increase the clock speed of the cores which are performing work. Just as with other Turbo Boost scenarios, the precise speed increase will depend upon the CPU model. It will also depend upon how many CPU cores are active.

We advise users to consider how many CPU cores their application is able to saturate. The tabs below detail the peak Turbo Boost frequencies for each CPU model, sorted by the number of active cores:

Cost-Effectiveness and Power Efficiency of Xeon E5-4600 v3 CPUs

The “Haswell-EP” processors have nearly the same price structure and power requirements as earlier Xeon E5-4600 products, so their cost-effectiveness and power-efficiency should be quite attractive to HPC users. Savvy readers may find the following facts useful:

The Xeon E5-4627 v3 CPUs are typically optimized for HPC workloads. Additionally, they feature pricing attractive to HPC groups.
The power requirement (TDP) for each model has increased by 5 Watts over the previous generation. This is due to integration of the Voltage Regulator Modules (VRMs) which were previously placed on the motherboard. Thus, CPU TDP increases 5W and motherboard TDP decreases 5W.
The following graphs depict the cost-effectiveness and power-efficiency of only the CPU itself. In many cases, HPC users will find that once they’ve taken the full platform and cluster design into account, the cost-effectiveness of a higher core count CPU may be more beneficial than these plots demonstrate.

Summary of features in Xeon E5-4600 v3 “Haswell-EP” processors

Up to 18 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14- and 16-cores)
Support for Quad-channel ECC DDR4 memory speeds up to 2133MHz
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket)
Advanced Vector Extensions (AVX 2.0):
- effectively double the throughput of integer and floating-point operations with math units expanded from 128-bits to 256-bits
- introduce Fused Multiply Add (FMA3) instructions which allow a multiply and an accumulate instruction to be completed in a single cycle (effectively doubling the FLOPS/clock from 8 to 16 for each core of a CPU)
- add support for additional instructions, including Gather and vector shift
- F16C 16-bit Floating-Point conversion instructions accelerate data conversion between 16-bit and 32-bit floating point formats
Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0, (introduced in “Sandy Bridge”) clock speeds are boosted more frequently, to higher speeds and for longer periods of time. With “Haswell”, top clock speeds depend upon the type of instructions (AVX vs. Non-AVX).
Faster Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
Improved energy efficiency with Per Core P-States and independent uncore frequency control
Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) forDirected I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
Intel APIC Virtualization (APICv) provides increased virtualization performance
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

The post Detailed Specifications of the Intel Xeon E5-4600 v3 “Haswell-EP” Processors appeared first on Microway.

Detailed Specifications of the Intel Xeon E5-2600v3 “Haswell-EP” Processors

Eliot Eshelman — Mon, 08 Sep 2014 17:00:21 +0000

This article provides in-depth discussion and analysis of the 22nm Xeon E5-2600v3 series processors (formerly codenamed “Haswell-EP”). “Haswell” processors replace the previous 22nm “Ivy Bridge” microarchitecture and are available for sale as of September 8, 2014. Note: these have since been superceded by Xeon E5-2600v4 Broadwell-EP Processors.

Important changes available in E5-2600v3 “Haswell-EP” include:

Up to 18 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14- and 16-cores)
Support for DDR4 memory speeds up to 2133MHz
Advanced Vector Extensions version 2.0 (AVX2 instructions):
- allow 256-bit wide operations for both integer and floating-point numbers (the older AVX instructions supported only floating-point operations)
- introduce Fused Multiply Add FMA3 instructions, which allow a multiply and an accumulate instruction to be completed in a single cycle (potentially doubling throughput for floating-point applications – up to 16 FLOPS per cycle)
- add support for additional instructions, including Gather and vector shift
Improved energy efficiency with Per Core P-States and independent uncore frequency control

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Exceptional Computational Performance

The Xeon E5-2600v3 processors introduce the highest performance available to date in a socketed CPU. For the first time, a single CPU is capable of more than half a TeraFLOPS (500 GFLOPS). This is made possible through the use of AVX2 with FMA3 instructions. The plot below compares the peak performance of these CPUs with and without FMA instructions:

Intel Xeon E5-2600v3 Series Specifications

The tabs below compare the features and specifications of the new model line. Intel has divided the CPUs into several groups:

Standard: cost-effective CPUs with moderate performance
Advanced: CPUs offering the highest performance for most applications
High Core Count: ideal for well-parallelized applications; CPUs providing the highest number of processor cores (sometimes sacrificing clock frequency in favor of core count)
Frequency Optimized: ideal for non-parallel/single-threaded applications; CPUs with the highest clock speeds (sacrificing number of cores in order to provide the highest frequencies)

Model	AVX Frequency	AVX Turbo Boost	Core Count	Memory Speed	L3 Cache	QPI Speed	TDP (Watts)
E5-2699v3	1.90 GHz	3.30 GHz	18	2133 MHz	45MB	9.6 GT/s	145W
E5-2698v3	16	40MB	135W
E5-2697v3	2.20 GHz	3.30 GHz	14	35MB	145W
E5-2695v3	1.90 GHz	3.00 GHz	120W
E5-2683v3	1.70 GHz	2.70 GHz
E5-2690v3	2.30 GHz	3.20 GHz	12	30MB	135W
E5-2680v3	2.10 GHz	3.10 GHz	120W
E5-2670v3	2.00 GHz	2.90 GHz
E5-2687Wv3	2.70 GHz	3.50 GHz	10	25MB	160W
E5-2660v3	2.20 GHz	3.10 GHz	105W
E5-2650v3	2.00 GHz	2.80 GHz
E5-2667v3	2.70 GHz	3.50 GHz	8	20MB	135W
E5-2640v3	2.20 GHz	3.40 GHz	1866 MHz	8 GT/s	90W
E5-2630v3	2.10 GHz	3.20 GHz	85W
E5-2643v3	2.80 GHz	3.50 GHz	6	2133 MHz	9.6 GT/s	135W
E5-2620v3	2.10 GHz	3.20 GHz	1866 MHz	15MB	8 GT/s	85W
E5-2637v3	3.20 GHz	3.60 GHz	4	2133 MHz	9.6 GT/s	135W
E5-2623v3	2.70 GHz	3.50 GHz	1866 MHz	10MB	8 GT/s	105W

HPC groups do not typically choose Intel’s “Basic” and “Low Power” models – those skus are not shown.

Clock Speeds & Turbo Boost in Xeon E5-2600v3 series “Haswell” processors

AVX mode: due to the higher power requirements of AVX instructions, clock speeds may be somewhat lower while executing AVX instructions *
Non-AVX mode: while not executing AVX instructions, the processor will operate at what would traditionally be considered the “stock” frequency

* a CPU core will return to Non-AVX mode 1 millisecond after AVX instructions complete

AVX and Non-AVX Turbo Boost

The two plots below show that processor clock speeds can be categorized as:

All cores on the CPU actively running Non-AVX instructions
All cores on the CPU actively running AVX instructions
A single active core running Non-AVX instructions (all other cores on the CPU must be idle)
A single active core running AVX instructions (all other cores on the CPU must be idle)

Clock Speeds for All-Core Operation
Clock Speeds for Single-Core Operation

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

Top Clock Speeds for Specific Core Counts

When workloads leave some CPU cores idle, the Xeon E5-2600v3 processors are able to use that headroom to increase the clock speed of the cores which are performing work. Just as with other Turbo Boost scenarios, the precise speed increase will depend upon the CPU model. It will also depend upon how many CPU cores are active.

We advise users to consider how many CPU cores their application is able to saturate. The tabs below detail the peak Turbo Boost frequencies for each CPU model, sorted by the number of active cores:

Cost-Effectiveness and Power Efficiency of Xeon E5-2600v3 CPUs

The “Haswell-EP” processors have nearly the same price structure and power requirements as earlier Xeon E5-2600 products, so their cost-effectiveness and power-efficiency should be quite attractive to HPC users. Savvy readers may find the following facts useful:

Although v3 Xeons follow the same price steps as their v2 counterparts, three High-Core-Count models were late additions. These models are higher performing and carry higher prices than previous E5-2600 models.
The power requirement (TDP) for each model has increased by 5 Watts over the previous generation. This is due to integration of the Voltage Regulator Modules (VRMs) which were previously placed on the motherboard. Thus, CPU TDP increases 5W and motherboard TDP decreases 5W.
The following graphs depict the cost-effectiveness and power-efficiency of only the CPU itself. In many cases, HPC users will find that once they’ve taken the full platform and cluster design into account, the cost-effectiveness of a higher core count CPU may be more beneficial than these plots demonstrate.

Summary of features in Xeon E5-2600v3 “Haswell-EP” processors

Up to 18 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14- and 16-cores)
Support for Quad-channel ECC DDR4 memory speeds up to 2133MHz
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket)
Advanced Vector Extensions (AVX 2.0):
- effectively double the throughput of integer and floating-point operations with math units expanded from 128-bits to 256-bits
- introduce Fused Multiply Add (FMA3) instructions which allow a multiply and an accumulate instruction to be completed in a single cycle (effectively doubling the FLOPS/clock from 8 to 16 for each core of a CPU)
- add support for additional instructions, including Gather and vector shift
- F16C 16-bit Floating-Point conversion instructions accelerate data conversion between 16-bit and 32-bit floating point formats
Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0, (introduced in “Sandy Bridge”) clock speeds are boosted more frequently, to higher speeds and for longer periods of time. With “Haswell”, top clock speeds depend upon the type of instructions (AVX vs. Non-AVX).
Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
Improved energy efficiency with Per Core P-States and independent uncore frequency control
Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) forDirected I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
Intel APIC Virtualization (APICv) provides increased virtualization performance
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

The post Detailed Specifications of the Intel Xeon E5-2600v3 “Haswell-EP” Processors appeared first on Microway.

In-Depth Comparison of Intel Xeon E5-4600v2 “Ivy Bridge” Processors

Eliot Eshelman — Wed, 05 Mar 2014 16:18:50 +0000

This article provides in-depth discussion and analysis of the 22nm Xeon E5-4600v2 series processors (formerly codenamed “Ivy Bridge”). These “Ivy Bridge” processors improve upon the previous 32nm “Sandy Bridge” microarchitecture and are available for sale as of March 3, 2014. For an introduction, read our blog post reviewing E5-4600v2.

Important changes available in E5-4600v2 “Ivy Bridge” include:

Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
Support for DDR3 memory speeds up to 1866MHz
Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
Intel APIC Virtualization (APICv) provides increased virtualization performance

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Intel Xeon E5-4600v2 Series Specifications

The table below provides an overview of the product line. To compare a particular metric from model-to-model, select one of the tabs above.

Model	Frequency	Turbo Boost	Core Count	Memory Speed	L3 Cache	QPI Speed	TDP (Watts)
E5-4657Lv2	2.40 GHz	2.90 GHz	12	1866 MHz	30MB	8 GT/S	115
E5-4650v2	2.40 GHz	2.90 GHz	10	25MB	95W
E5-4640v2	2.20 GHz	2.70 GHz	20MB
E5-4627v2	3.30 GHz	3.60 GHz	8	16MB	7.2 GT/S	130W
E5-4620v2	2.60 GHz	3.00 GHz	1600 MHz	20MB	95W
E5-4610v2	2.30 GHz	2.70 GHz	16MB

HPC groups do not typically choose Intel’s “Basic” models – those skus are not shown.

Intel Turbo Boost in Xeon E5-4600v2 series “Ivy Bridge” processors

Summary

The plot below details the turbo-boosted clock speeds for each model. As additional CPU cores are loaded, less power and thermal headroom remains, which results in lower clock speeds. When load diminishes and fewer CPU cores are loaded, additional power and thermal headroom is available. The end result is higher CPU clock speeds for fewer numbers of active cores.

Choose for your applications

If the applications you use are only lightly-threaded, you may wish to choose processors with the highest possible clock speed. This will deliver the best performance for applications that are not able to leverage the higher core counts of these latest processor models.

Review the tabs above – from 12 to 1 – to select the appropriate processor model given the number of CPU cores you expect to use.

Summary of features in Xeon E5-4600v2 “Ivy Bridge” processors

Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
Support for Quad-channel ECC DDR3 memory speeds up to 1866MHz
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket). Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
Advanced Vector Extensions (AVX) accelerate floating point operations used in HPC & technical computing applications. This technology expands the math unit from 128-bits to 256-bits, effectively doubling throughput. AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0, (introduced in “Sandy Bridge”) clock speeds are boosted more frequently, to higher speeds and for longer periods of time.
Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
Intel Intelligent Power Technology reduces individual idling cores to near-zero power. Power gates adjust processors and memory to the lowest available power state to meet workload requirements without impacting performance.
Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) for Directed I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
Intel APIC Virtualization (APICv) provides increased virtualization performance
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

More information is available in Intel’s Xeon E5-4600v2 Product Brief.

The post In-Depth Comparison of Intel Xeon E5-4600v2 “Ivy Bridge” Processors appeared first on Microway.

In-Depth Comparison of Intel Xeon E5-2600v2 “Ivy Bridge” Processors

Eliot Eshelman — Tue, 10 Sep 2013 16:01:56 +0000

This article provides in-depth discussion and analysis of the 22nm Xeon E5-2600v2 series processors (formerly codenamed “Ivy Bridge”). “Ivy Bridge” processors improve upon the previous 32nm “Sandy Bridge” microarchitecture and are available for sale as of September 10, 2013. For an introduction, read our blog post Intel Xeon E5-2600v2 “Ivy Bridge” Processor Review

Important changes available in E5-2600v2 “Ivy Bridge” include:

Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
Support for DDR3 memory speeds up to 1866MHz
Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
Intel APIC Virtualization (APICv) provides increased virtualization performance

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Intel Xeon E5-2600v2 Series Specifications

The table below provides an overview of the product line. To compare a particular metric from model-to-model, select one of the tabs above.

Model	Frequency	Turbo Boost	Core Count	Memory Speed	L3 Cache	QPI Speed	TDP (Watts)
E5-2697v2	2.70 GHz	3.50 GHz	12	1866 MHz	30MB	8 GT/s	130W
E5-2695v2	2.40 GHz	3.20 GHz	115W
E5-2690v2	3.00 GHz	3.60 GHz	10	25MB	130W
E5-2680v2	2.80 GHz	3.60 GHz	115W
E5-2670v2	2.50 GHz	3.30 GHz
E5-2660v2	2.20 GHz	3.00 GHz	95W
E5-2650v2	2.60 GHz	3.40 GHz	8	20MB
E5-2640v2	2.00 GHz	2.50 GHz	1600 MHz	7.2 GT/s
E5-2687Wv2	3.40 GHz	4.00 GHz	1866 MHz	25MB	8 GT/s	150W
E5-2667v2	3.30 GHz	4.00 GHz	130W
E5-2630v2	2.60 GHz	3.10 GHz	6	1600 MHz	15MB	7.2 GT/s	80W
E5-2620v2	2.10 GHz	2.60 GHz
E5-2643v2	3.50 GHz	3.80 GHz	1866 MHz	25MB	8 GT/s	130W
E5-2637v2	3.50 GHz	3.80 GHz	4	15MB

HPC groups do not typically choose Intel’s “Basic” and “Low Power” models – those skus are not shown.

Intel Turbo Boost in Xeon E5-2600v2 series “Ivy Bridge” processors

Summary

Choose for your applications

Review the tabs above – from 12 to 1 – to select the appropriate processor model given the number of CPU cores you expect to use.

Summary of features in Xeon E5-2600v2 “Ivy Bridge” processors

Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
Support for Quad-channel ECC DDR3 memory speeds up to 1866MHz
Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket). Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
Advanced Vector Extensions (AVX) accelerate floating point operations used in HPC & technical computing applications. This technology expands the math unit from 128-bits to 256-bits, effectively doubling throughput. AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0, (introduced in “Sandy Bridge”) clock speeds are boosted more frequently, to higher speeds and for longer periods of time.
Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
Intel Intelligent Power Technology reduces individual idling cores to near-zero power. Power gates adjust processors and memory to the lowest available power state to meet workload requirements without impacting performance.
Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) forDirected I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
Intel APIC Virtualization (APICv) provides increased virtualization performance
Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

The post In-Depth Comparison of Intel Xeon E5-2600v2 “Ivy Bridge” Processors appeared first on Microway.