SPEC Archives - Microway
https://www.microway.com/tag/spec/

Detailed Specifications of the “Skylake-SP” Intel Xeon Processor Scalable Family CPUs
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-skylake-sp-intel-xeon-processor-scalable-family-cpus/
Tue, 11 Jul 2017 16:15:48 +0000

The post Detailed Specifications of the “Skylake-SP” Intel Xeon Processor Scalable Family CPUs appeared first on Microway.

This article provides in-depth discussion and analysis of the 14nm Intel Xeon Processor Scalable Family (formerly codenamed “Skylake-SP” or “Skylake Scalable Processor”). “Skylake-SP” processors replace the previous 14nm “Broadwell” microarchitecture (both the E5 and E7 Xeon families) and are available for sale as of July 11, 2017. Note: these have since been superseded by the “Cascade Lake SP” Intel Xeon Processor Scalable Family CPUs.

Important changes available in Xeon Scalable Processor Family “Skylake-SP” CPUs include:

  • Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, 20-, 24-, and 26-cores)
  • Floating Point and Integer Instruction performance improvements:
    • New AVX-512 instructions double performance
      (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
  • Memory capacity & performance improvements:
    • Six-channel memory controller on each CPU (up from four-channel on previous platforms)
    • Support for DDR4 memory speeds up to 2666MHz
    • Optional 1.5TB-per-socket system memory support (only available on certain SKUs)
  • Faster links between CPU sockets with up to three 10.4GT/s UPI links (replacing the older QPI interconnect)
  • More I/O connectivity with 48 lanes of generation 3.0 PCI-Express per CPU (up from 40 lanes)
  • Optional 100Gbps Omni-Path fabric integrated into the processor (only available on certain SKUs)
  • CPU cores are arranged in an “Uncore” mesh interconnect (replacing the older dual-ring interconnect)
  • Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
  • All 2-/4-/8-socket server product families (sometimes called EP 2S, EP 4S, and EX) are merged into a single product line
  • A new server platform (formerly codenamed “Purley”) to support this new CPU product family

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

A New Strategy with New Processor Tiers

With this new product release, Intel merges together all previous Xeon server product families into a single family. The old model numbers with which you might be familiar – E5-2600, E5-4600, E7-4800, E7-8800 – are now replaced by these “Skylake-SP” CPUs. While this opens up the possibility to select from a broad range of processor models for any given project, it requires attention to detail. There are more than 30 CPU models to select from in the Xeon Processor Scalable Family.

This processor family is divided into four tiers: Bronze, Silver, Gold, and Platinum. The Silver and Gold models are in the price range familiar to HPC users/architects. However, the Platinum models are in a higher price range than HPC groups are typically accustomed to. The Platinum tier targets Enterprise workloads, and is priced accordingly.

With that in mind, our analysis is divided into two sections:

  • CPU models which fit within the existing price ranges for mainstream HPC
  • CPU models which are of interest to HPC users, but come at a higher price

Before diving into the details, it helps to keep in mind the following recommendations. Based on our experience with adoption of new HPC products, our guidance for selecting Xeon tiers is as follows:

  • Intel Xeon Bronze – Not recommended for HPC
    Base-level models with low performance.
  • Intel Xeon Silver – Suitable for entry-level HPC
    Slightly improved performance over previous generations.
  • Intel Xeon Gold – Recommended for most HPC workloads
    The best balance of performance and price. In particular, the 6100-series models should be preferred over the 5100-series models, because they have twice the number of AVX-512 units.
  • Intel Xeon Platinum – Recommended for specific HPC workloads
    Although these models provide the highest performance, their higher price makes them suitable only for particular workloads which require their specific capabilities (e.g., large SMP and large-memory Compute Nodes).

* Given their positioning, the Intel Xeon Bronze products will not be included in our review

Exceptional Computational Performance

The Xeon “Skylake-SP” processors bring new capabilities, new flexibility, and unprecedented performance. Many models provide over 1 TFLOPS (one trillion double-precision, 64-bit floating-point operations per second) and a couple of models provide nearly 2 TFLOPS. This performance is achieved with high core counts and the new AVX-512 instructions with FMA. The plots in the tabs below compare the performance ranges of the recommended CPU tiers:

The shaded/colored bars indicate the expected performance ranges for each CPU using the new AVX-512 instructions with FMA. Note that only a small set of codes will be capable of issuing exclusively AVX-512 FMA instructions (e.g., LINPACK). Most applications issue a variety of instructions and will achieve lower than peak FLOPS.
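The peak figures discussed here follow from simple arithmetic: cores, times clock speed, times the number of AVX-512 FMA units per core, times 16 double-precision FLOPS per cycle per unit. The sketch below is illustrative only. The function name and the 28-core, 2.0GHz example are assumptions, not values taken from the plots.

```python
def peak_gflops(cores, clock_ghz, fma_units_per_core, flops_per_cycle_per_unit=16):
    """Theoretical peak double-precision GFLOPS for one CPU socket.

    flops_per_cycle_per_unit=16 is the per-unit AVX-512 FMA figure quoted
    in this article; every other argument is supplied by the caller.
    """
    return cores * clock_ghz * fma_units_per_core * flops_per_cycle_per_unit


# Hypothetical example: a 28-core CPU sustaining 2.0 GHz under AVX-512 load.
two_units = peak_gflops(cores=28, clock_ghz=2.0, fma_units_per_core=2)
one_unit = peak_gflops(cores=28, clock_ghz=2.0, fma_units_per_core=1)
print(f"two FMA units: {two_units:.0f} GFLOPS, one FMA unit: {one_unit:.0f} GFLOPS")
```

Real applications will land below this ceiling, because the clock speed under AVX-512 load varies and few codes issue only FMA instructions.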

Notice that each plot shows two separate groups of CPUs separated by a gap. The CPU models on the left of each plot offer the highest numbers of CPU cores (with CPU clock frequency being a secondary priority). The CPU models on the right of each plot are optimized for the highest CPU clock speeds (with high CPU core count as the secondary priority). Intel describes these high clock speed models as “optimized for the highest per-core performance”. In previous generations, these “frequency-optimized” CPU models were typically the niche option. However, in this generation the CPU models which offer the highest per-core performance are expected to be the primary choices for HPC users – they provide base clock speeds in the 2GHz~3GHz range. The CPU models which do not prioritize clock speed are in the 1.5GHz~2GHz range, which many HPC users would consider to be too low.

Intel Xeon “Skylake-SP” Price Ranges

Because the pricing of the Xeon Processor Scalable Family spans such a wide range, budgets need to be kept top of mind when selecting options. It would be frustrating to plan for 28-core processors, only to find that the budget cannot support a price of more than $10,000 per CPU.

The tabs below compare the prices of the various CPU tiers. As above, each plot is divided with high-core-count CPUs on the left and highest-per-core performance on the right.

As the above plots show, the CPUs are sorted by price. All of the plots in this article are ordered to match the plots above. Keep the pricing in mind as you review the remainder of the information in this article.

Intel “Skylake-SP” Xeon Processor Scalable Family Specifications

The sets of tabs below compare the features and specifications of this new Xeon processor family. As you will see, the Silver (4100-series) and low-end Gold (5100-series) offer fewer capabilities and lower performance. The high-end Gold (6100-series) and Platinum (8100-series) offer more capabilities and higher performance. Additionally, certain models within the 6100-series and 8100-series have special models integrating additional specializations:

  • Enabled for up to 1.5TB of memory per CPU socket (indicated with an M suffix on the part number)
  • Including integrated 100Gbps Omni-Path interconnect (indicated with an F suffix on the part number)

In addition to the significant performance increases, there are notable changes to the “Skylake-SP” processor designs. These include a completely new mesh connectivity between the processor cores, redesigned L2/L3 caches, greater connectivity between CPU sockets, and changes to processor clock speed behavior. These are discussed further in the sections below.

Number of Cores per CPU

Most HPC groups should find that 12-core, 14-core, and 16-core models fit within their budget. Systems with up to 24 cores per CPU will not be shockingly expensive. However, the 26-core and 28-core models are only available within the Platinum tier and will cost more than most groups would consider cost-effective.

DDR4 Memory Speed

As shown above, memory performance is fairly homogeneous across this CPU family. The amount of memory bandwidth available per CPU core will be an important factor, but is simply a function of the number of cores. Users planning to run on CPUs with higher core counts need to ensure that each core won’t be starved of data.
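To make the per-core bandwidth concern concrete, the following sketch estimates theoretical peak memory bandwidth per socket and per core. It assumes six channels at DDR4-2666 (the maximum quoted in this article) and 8 bytes per transfer for a 64-bit DDR4 channel. The function names are illustrative, and sustained bandwidth in practice will be lower than this ceiling.

```python
def socket_bandwidth_gbs(channels=6, mega_transfers_per_s=2666, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth of one socket in GB/s (10^9 bytes/s)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


def bandwidth_per_core_gbs(cores, **memory_config):
    """Peak bandwidth divided evenly across all cores of one socket."""
    return socket_bandwidth_gbs(**memory_config) / cores


print(f"per socket: {socket_bandwidth_gbs():.1f} GB/s")
for cores in (12, 16, 28):
    print(f"{cores} cores: {bandwidth_per_core_gbs(cores):.2f} GB/s per core")
```

The per-socket figure is the same for every model running at the same memory speed, so the per-core share shrinks as the core count grows.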

Intel has also enabled these CPUs to drive fully-populated systems at full memory speed. In previous generations, populating more than half of the memory slots would result in a modest reduction in memory speed.

L3 Cache Size

Each CPU has been designed to offer at least 1.375MB of L3 cache per core. As shown above, there are several models which feature a larger quantity of L3 per core. Remember that each core also has 1MB of private L2 cache. In this generation, the L3 cache is largely seen as a fallback if data spills out of L2 (a “victim cache”).
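As a rough illustration of these cache figures, the sketch below computes the minimum cache capacity for a given core count from the per-core values quoted above (1MB of L2 and at least 1.375MB of L3). The names are illustrative. Adding the two capacities together assumes that the L3 does not duplicate L2 contents, which matches the non-inclusive design described later in this article.

```python
L2_MB_PER_CORE = 1.0        # private L2 per core, as quoted in this article
MIN_L3_MB_PER_CORE = 1.375  # minimum shared L3 per core, as quoted in this article


def min_cache_mb(cores):
    """Return (total L2, minimum total L3, combined) in MB for one CPU."""
    l2 = cores * L2_MB_PER_CORE
    l3 = cores * MIN_L3_MB_PER_CORE
    return l2, l3, l2 + l3


l2, l3, combined = min_cache_mb(28)
print(f"28 cores: {l2} MB L2 + {l3} MB L3 = {combined} MB")
```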

Ultra Path Interconnect (UPI) Performance

With the “Skylake-SP” architecture, Intel has replaced the older QPI interconnect with UPI. The throughput per link increases from 9.6GT/s to 10.4GT/s. Additionally, many CPU models support up to 3 UPI links per socket (compared to 2 QPI links in most earlier platforms). This allows greater connectivity between sockets, particularly on dual-socket systems which are the most popular configuration for HPC.
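The combined effect of faster links and an additional link can be estimated with simple arithmetic. This sketch uses only the transfer rates and link counts quoted above. The function name is illustrative, and the result is a raw transfer-rate comparison rather than a measured bandwidth.

```python
def aggregate_gts(links, gts_per_link):
    """Total transfer rate across all socket-to-socket links, in GT/s."""
    return links * gts_per_link


qpi_total = aggregate_gts(links=2, gts_per_link=9.6)   # most earlier platforms
upi_total = aggregate_gts(links=3, gts_per_link=10.4)  # Skylake-SP models with three links

print(f"per-link increase: {10.4 / 9.6 - 1:.1%}")
print(f"aggregate increase: {upi_total / qpi_total - 1:.1%}")
```

CPU models with only two UPI links gain just the per-link improvement.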

Power Consumption (TDP)

Although there are still many models in the same power range as previous generations, there are an increasing number of models with TDPs above 140 Watts. A couple of models even reach over 200 Watts. For this generation, HPC users must be certain that the systems they use have gone through careful thermal validation. Systems which run warm will suffer lower performance.

Clock Speeds & Turbo Boost in Xeon “Skylake-SP” Scalable Family processors

With each new processor line, Intel introduces new architecture optimizations. The design of the “Skylake-SP” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and thus generating more heat). While a CPU core is executing intensive vector tasks (AVX or AVX-512 instructions), the clock speed will be adjusted downwards to keep the processor within its power limits (TDP).

In effect, this will result in the processor running at a lower frequency than the standard clock speed advertised for each model. For that reason, each “Skylake-SP” processor is assigned three “base” frequencies:

  • AVX-512 mode: due to the high power requirements of AVX-512/FMA instructions, clock speeds will be lowered while executing AVX-512 instructions *
  • AVX mode: due to the higher power requirements of AVX2/FMA instructions, clock speeds will be somewhat lower while executing AVX instructions *
  • Non-AVX mode: while not executing AVX/AVX-512 instructions, the processor will operate at what would traditionally be considered the “stock” frequency

Each of the “modes” above is actually a range of CPU clock speeds. The CPU will run at the highest speed possible for the particular set of CPU instructions that have been issued. It is worth noting that these modes are isolated to each core. Within a given CPU, some cores may be operating in AVX mode while others are operating in Non-AVX mode.

* Intel has applied a 1-millisecond hysteresis window on each CPU core to prevent rapid frequency transitions

AVX-512, AVX, and Non-AVX Turbo Boost

Just as in previous generations, “Skylake-SP” CPUs include the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX-512, AVX, Non-AVX).

The plots below demonstrate processor clock speeds under the following conditions:

  • All cores on the CPU actively running Non-AVX, AVX, or AVX-512 instructions
  • A single active core running Non-AVX, AVX, or AVX-512 instructions (all other cores on the CPU must be idle)

The dotted lines represent the range of clock speeds for Non-AVX instructions. The thin cyan bars represent the range of clock speeds for AVX2/FMA instructions. The thicker shaded/colored bars represent the range of clock speeds for AVX-512/FMA instructions.

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX-512, AVX, or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

Despite the perceived reduction in performance when running these vector instructions, keep in mind that AVX-512 roughly doubles the number of operations which can be completed per cycle. Although the clock speeds are reduced, the overall throughput is increased. HPC users should expect their processors to be running in AVX or AVX-512 mode most of the time.
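The trade-off can be checked with a back-of-the-envelope calculation. The sketch below assumes 16 double-precision FLOPS per cycle per core for AVX2 with FMA and 32 for a core with two AVX-512 FMA units. The clock speeds are hypothetical placeholders, not values for any specific model.

```python
AVX2_FLOPS_PER_CYCLE = 16    # per core, AVX2 with FMA
AVX512_FLOPS_PER_CYCLE = 32  # per core, assuming two AVX-512 FMA units


def per_core_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak GFLOPS of a single core at the given clock."""
    return clock_ghz * flops_per_cycle


def avx512_speedup(avx2_clock_ghz, avx512_clock_ghz):
    """Ratio of AVX-512 peak throughput to AVX2 peak throughput for one core."""
    return (per_core_gflops(avx512_clock_ghz, AVX512_FLOPS_PER_CYCLE)
            / per_core_gflops(avx2_clock_ghz, AVX2_FLOPS_PER_CYCLE))


# Hypothetical clocks: 2.5 GHz in AVX mode, 2.0 GHz in AVX-512 mode.
print(f"speedup: {avx512_speedup(2.5, 2.0):.2f}x")
```

Under these assumptions, AVX-512 remains ahead as long as its clock speed stays above half of the AVX2 clock speed.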

Cost-Effectiveness and Power Efficiency of Xeon “Skylake-SP” CPUs

As mentioned earlier, many of the new processors have the same price structure as earlier Xeon E5 and E7 server CPU families. However, the prices and power requirements for some of the premium models are higher than in previous generations. Savvy readers may find the following facts useful:

  • HPC applications run best on the higher-end Gold and Platinum CPU models (6100- and 8100-series), as all of the lower-end CPUs provide only one AVX-512 FMA unit per core (half the number of math units).
  • The Platinum models (8100-series) are generally targeted towards Enterprise and Finance – these carry higher prices than other models.

The plots below compare the price versus performance of these CPUs. In general, the Xeon 6100-series provide the most cost-effective performance. The Xeon 4100-series and Xeon 5100-series CPUs are available for a lower price, but they include only a single AVX-512 math unit and do not offer cost-effective performance.

Performance versus Price

The plots below compare the power requirements (TDP) versus performance of each CPU. Although this generation includes some of the highest-wattage CPUs to date, each is actually quite power efficient. In fact, both of the 205 Watt CPU models are among the top three most efficient models in this product line.

Performance versus Power
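The two comparisons above reduce to a pair of ratios: peak GFLOPS per dollar and peak GFLOPS per Watt. The sketch below shows the calculation. Both example CPUs are hypothetical placeholders, and none of the numbers are taken from the plots.

```python
def efficiency(peak_gflops, tdp_watts, price_usd):
    """Return power efficiency and cost-effectiveness for one CPU."""
    return {
        "gflops_per_watt": peak_gflops / tdp_watts,
        "gflops_per_dollar": peak_gflops / price_usd,
    }


# Hypothetical CPUs: (peak GFLOPS, TDP in Watts, price in USD).
candidates = {
    "hypothetical-A": (1000.0, 125.0, 2000.0),
    "hypothetical-B": (1600.0, 200.0, 8000.0),
}

for name, spec in candidates.items():
    result = efficiency(*spec)
    print(f"{name}: {result['gflops_per_watt']:.1f} GFLOPS/W, "
          f"{result['gflops_per_dollar']:.2f} GFLOPS/$")
```

In this made-up example both CPUs are equally power efficient, but the cheaper one delivers more performance per dollar, which is the pattern described above for high-wattage premium models.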

Summary of features in Xeon Scalable Family “Skylake-SP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. They also include lower-level changes that may be of interest to expert users. The list below provides a more detailed summary of relevant technology features in Skylake-SP:

  • Up to 28 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, 20-, 24-, and 26-cores)
  • Floating Point and Integer Instruction performance improvements:
    • New AVX-512 instructions double performance
      (up to 16 double-precision FLOPS per cycle per AVX-512 FMA unit)
    • Up to two AVX-512 FMA units per CPU core (depends upon CPU SKU)
    • As introduced with “Haswell” and “Broadwell”, these CPUs continue to support 128-bit AVX and 256-bit AVX2 Advanced Vector Extensions with FMA3
  • Memory capacity & performance improvements:
    • Six-channel memory controller on each CPU (up from four-channel on previous platforms)
    • Support for DDR4 memory speeds up to 2666MHz
    • Support for operating DDR4 memory at full speed, even with two memory DIMMs installed per channel
    • Optional 1.5TB-per-socket system memory support (only available on certain SKUs)
  • Faster links between CPU sockets with up to three 10.4GT/s UPI links (replacing the older QPI interconnect)
  • More I/O connectivity with 48 lanes of generation 3.0 PCI-Express per CPU (up from 40 lanes)
  • Optional 100Gbps Omni-Path fabric integrated into the processor (only available on certain SKUs)
  • CPU cores are arranged in an “Uncore” mesh interconnect (replacing the older dual-ring interconnect)
  • Optimized Turbo Boost profiles allow higher frequencies even when many CPU cores are in use
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (48 PCI-E lanes per socket)
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With “Skylake-SP”, clock speeds are boosted higher even when many cores are in use. There are three tiers of Turbo Boost clock speeds depending upon the type of instructions being executed:
    • Non-AVX: Operations that are not math intensive, or that use AVX/AVX2 instructions which don’t involve multiply/FMA
    • AVX: Operations that heavily use the AVX/AVX2 units, or that use the AVX-512 unit (but not the multiply/FMA instructions)
    • AVX-512: Operations that heavily use the AVX-512 units, including multiply/FMA instructions
  • Intel QuickData Technology (CBDMA) doubles memory-to-memory copy performance and supports MMIO memory copy
  • Intel Volume Management Device (VMD) provides CPU-level device aggregation for NVMe SSD storage devices
  • Intel Virtual RAID on CPU (VROC) provides RAID support for NVMe SSD storage devices
  • A new Intel C620-series PCH chipset (formerly codenamed “Lewisburg”) with improved connectivity:
    • Up to four Intel X722 10GbE/1GbE Ethernet ports with iWARP RDMA support
    • PCI-Express generation 3.0 x4 connection from the PCH to the CPUs (previous generations used PCI-E gen 2.0)
    • Support for more integrated SATA3 6Gbps ports (up to 14)
    • Support for more integrated USB 3.0 ports (up to 10)
    • Integrated Intel QuickAssist Technology, which accelerates many cryptographic and compression/decompression workloads
  • Enhancements to the CPU Core Microarchitecture:
    • Larger and improved branch predictor, higher throughput decoder, larger out-of-order window
    • Improved scheduler and execution engine; improved throughput and latency of divide/sqrt
    • More load/store bandwidth, deeper load/store buffers, improved prefetcher
    • One or Two AVX-512 512-bit FMA units per core (compared to only one on desktop “Skylake” models)
    • Support for the following AVX-512 instruction types:
      AVX512-F, AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-CD
    • 1MB L2 cache per core (compared to only 256KB L2 on desktop “Skylake” models)
    • A 10% (geomean) improvement in instructions per cycle (IPC) versus the previous-generation Broadwell CPUs
  • Re-architected L2/L3 cache hierarchy:
    • Each CPU core contains 1MB L2 private cache (up from 256KB)
    • Each core’s private L2 acts as primary cache
    • Each CPU contains at least 1.375MB/core of shared L3 cache (for when the private L2 caches overflow)
    • The shared L3 cache is now non-inclusive (does not keep copies of the L2 caches)
    • Larger 64-entry L2 TLB for 1GB pages (up from 16 entries)
  • Dual or Triple Ultra Path Interconnect (UPI) links between processor sockets improve communication speeds for data-intensive applications
  • Introduction of the RDSEED instruction for high-quality, non-deterministic, random seed values
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is frequently disabled for HPC applications.
  • Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point
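One practical way to confirm that a system exposes the AVX-512 instruction types listed above is to inspect the CPU flags reported by the operating system. The sketch below assumes a Linux system, where /proc/cpuinfo lists these features as avx512f, avx512vl, avx512bw, avx512dq, and avx512cd. The function name is illustrative.

```python
REQUIRED_FLAGS = {"avx512f", "avx512vl", "avx512bw", "avx512dq", "avx512cd"}


def missing_avx512_flags(cpuinfo_text):
    """Return the required AVX-512 flags absent from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return REQUIRED_FLAGS - flags
    # No "flags" line found (for example, on a non-x86 or non-Linux system).
    return set(REQUIRED_FLAGS)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""
    missing = missing_avx512_flags(text)
    print("all listed AVX-512 subsets present" if not missing
          else f"missing: {sorted(missing)}")
```

The flags only show which instruction subsets are supported. They do not reveal whether a given model has one or two AVX-512 FMA units per core.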

Detailed Specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors
https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-intel-xeon-e5-2600v4-broadwell-ep-processors/
Thu, 31 Mar 2016 16:30:59 +0000

The post Detailed Specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors appeared first on Microway.

This article provides in-depth discussion and analysis of the 14nm Xeon E5-2600v4 series processors (formerly codenamed “Broadwell-EP”). “Broadwell” processors replace the previous 22nm “Haswell” microarchitecture and are available for sale as of March 31, 2016. For an introduction, read our blog post, Intel Xeon E5-2600 v4 “Broadwell” Processor Review. Note: these have since been superseded by the Intel Xeon Processor Scalable Family CPUs.

Important changes available in E5-2600v4 “Broadwell-EP” include:

  • Up to 22 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, and 20-cores)
  • Support for DDR4 memory speeds up to 2400MHz
  • Floating Point Instruction performance improvements:
    • Faster floating point multiplier completes operations in 3 cycles (down from 5 cycles)
    • 1024 Radix divider for reduced latency
    • Split Scalar divides for increased parallelism/bandwidth
    • Faster vector Gather
    • As introduced with Haswell, Broadwell continues to support AVX2 and FMA3 instructions for significant speedups of floating-point multiplication and addition operations
  • Extract more parallelism in scheduling micro-operations:
    • Reduced instruction latencies on ADC, CMOV and PCLMULQDQ
    • Larger out-of-order scheduler, with 64 entries (up from 60 entries)
    • Improved address prediction for branches and returns, with an expanded 10-way Branch Prediction Unit Target Array (up from 8-way)
  • Improved performance on large data sets:
    • Larger L2 Translation Lookaside Buffer (TLB), with 1.5K entries (up from 1K entries)
    • A new L2 TLB for 1GB pages (with 16 entries)
    • Addition of a second TLB page miss handler for parallel page walks

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Exceptional Computational Performance

The Xeon E5-2600v4 processors provide the highest performance available to date in a socketed CPU. Many of the higher-end models provide well over 500 GFLOPS (more than half a TFLOPS). Much of this performance is made possible through the use of AVX2 with FMA3 instructions. The plot below compares the peak performance of these CPUs with and without FMA instructions:

Plot of Xeon E5-2600v4 Theoretical Peak Performance (GFLOPS)

The colored bars indicate performance using only AVX instructions; the grey bars indicate theoretical peak performance when using AVX with FMA. Note that only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Most applications will issue a variety of instructions, which will result in lower than peak FLOPS. Expect the achieved performance for well-parallelized & optimized applications to fall between the grey and colored bars.
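The gap between the two sets of bars follows from the FLOPS-per-cycle figures given later in this article: 8 per core with AVX alone, and 16 per core with AVX2 and FMA3. The sketch below reproduces the arithmetic. The 22-core, 2.0GHz example is hypothetical and is not read from the plot.

```python
AVX_FLOPS_PER_CYCLE = 8   # per core, AVX without FMA
FMA_FLOPS_PER_CYCLE = 16  # per core, AVX2 with FMA3


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak double-precision GFLOPS for one CPU socket."""
    return cores * clock_ghz * flops_per_cycle


# Hypothetical example: 22 cores sustaining 2.0 GHz while running AVX code.
without_fma = peak_gflops(22, 2.0, AVX_FLOPS_PER_CYCLE)
with_fma = peak_gflops(22, 2.0, FMA_FLOPS_PER_CYCLE)
print(f"AVX only: {without_fma:.0f} GFLOPS, AVX with FMA: {with_fma:.0f} GFLOPS")
```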

Intel Xeon E5-2600v4 Series Specifications

The tabs below compare the features and specifications of the new model line. Intel has divided the CPUs into several groups:

  • Standard: cost-effective CPUs with moderate performance
  • Advanced: CPUs offering the highest performance for most applications
  • High Core Count: ideal for highly multi-threaded applications; CPUs providing the highest number of processor cores (sometimes sacrificing clock frequency in favor of core count)
  • Frequency Optimized: ideal for non-parallel/single-threaded applications; CPUs with the highest clock speeds (sacrificing number of cores in order to provide the highest frequencies)

Although these processors introduce significant performance increases, technical readers will see that many of the changes are incremental: increased core counts, improved DDR memory speed, etc. However, processor clock speeds/frequencies have not seen significant improvements.

In fact, in some cases the CPU frequency has been lowered from the previous models. Processor frequency and Turbo Boost behavior have changed fairly significantly in the last two CPU releases (“Haswell” and “Broadwell”). Those metrics are discussed in further detail in the next section.

Clock Speeds & Turbo Boost in Xeon E5-2600v4 series “Broadwell” processors

With each new processor line, Intel introduces new architecture optimizations. The design of the “Broadwell” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and thus generating more heat). While a CPU core is executing intensive vector tasks (AVX instructions), the clock speed may be reduced to keep the processor within its power limits (TDP).

In effect, this may result in the processor running at a lower frequency than the “base” clock speed advertised for each model. For that reason, each “Broadwell” processor is assigned two “base” frequencies:

  1. AVX mode: due to the higher power requirements of AVX instructions, clock speeds may be somewhat lower while executing AVX instructions *
  2. Non-AVX mode: while not executing AVX instructions, the processor will operate at what would traditionally be considered the “stock” frequency

* a CPU core will return to Non-AVX mode 1 millisecond after AVX instructions complete

It is worth noting that these modes are isolated to each core. Within a given CPU, some cores may be operating in AVX mode while others are operating in Non-AVX mode. In the previous generation, AVX instructions running on a single core would cause all cores to run in AVX mode.
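The per-core behavior can be pictured with a toy model: each core tracks when its own AVX instructions last completed and stays in AVX mode until 1 millisecond has passed. This is only a simplified illustration of the rule stated above, not a description of how the hardware implements it, and the names are hypothetical.

```python
AVX_HOLD_US = 1000  # a core returns to Non-AVX mode 1 millisecond after AVX work completes


def core_mode(now_us, last_avx_done_us):
    """Return the mode of one core at time now_us (microseconds).

    last_avx_done_us is when that core's most recent AVX instructions
    completed, or None if it has not executed any.
    """
    if last_avx_done_us is not None and now_us - last_avx_done_us < AVX_HOLD_US:
        return "AVX"
    return "Non-AVX"


# Each core is evaluated independently: core 0 finished AVX work 500 us ago,
# core 1 has not run any AVX instructions.
last_avx_done = [1000, None]
print([core_mode(1500, t) for t in last_avx_done])
```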

AVX and Non-AVX Turbo Boost

Just as in previous architectures, “Broadwell” CPUs include the Turbo Boost feature which allows each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. However, Turbo Boost speed increases also depend upon the types of instructions (AVX vs. Non-AVX).

The two plots below show that processor clock speeds can be categorized as:

  1. All cores on the CPU actively running Non-AVX instructions
  2. All cores on the CPU actively running AVX instructions
  3. A single active core running Non-AVX instructions (all other cores on the CPU must be idle)
  4. A single active core running AVX instructions (all other cores on the CPU must be idle)

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

When examining the differences between AVX and Non-AVX instructions, notice that Non-AVX instructions typically result in no more than a 100MHz to 200MHz increase in the highest clock speed. However, AVX instructions may cause clock speeds to drop by 300MHz to 400MHz if they are particularly intensive.

Recall that AVX2 introduces support for both integer and floating-point instructions, which means any compute-intensive application will be using such instructions (if it has been properly designed and compiled). HPC users should expect their processors to be running in AVX mode most of the time.

Top Clock Speeds for Specific Core Counts

When workloads leave some CPU cores idle, the Xeon E5-2600v4 processors are able to use that headroom to increase the clock speed of the cores which are performing work. Just as with other Turbo Boost scenarios, the precise speed increase will depend upon the CPU model. It will also depend upon how many CPU cores are active.

We advise users to consider how many CPU cores their application is able to saturate. The tabs below detail the peak Turbo Boost frequencies for each CPU model, sorted by the number of active cores:

All of the above plots show CPU frequencies for applications utilizing AVX instructions. The colored bars indicate the worst-case scenario – CPUs will run at least this fast. The grey bars indicate the expected clock speeds for most workloads.

Cost-Effectiveness and Power Efficiency of Xeon E5-2600v4 CPUs

The “Broadwell-EP” processors have nearly the same price structure and power requirements as earlier Xeon E5-2600 products, so their cost-effectiveness and power-efficiency should be quite attractive to HPC users. Savvy readers may find the following facts useful:

  • HPC applications run best on the Advanced CPU models; they typically do not scale well on the High-Core-Count models.
  • The High-Core-Count models are more common in Enterprise and Finance – these carry higher prices than other E5-2600 models.
  • The following graphs depict the cost-effectiveness and power-efficiency of only the CPU itself. In many cases, HPC users will find that once they’ve taken the full platform and cluster design into account, the cost-effectiveness of an Advanced CPU may be higher than these plots demonstrate.

Summary of features in Xeon E5-2600v4 “Broadwell-EP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. The list below provides a summary of relevant technology features:

  • Up to 22 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18-, and 20-cores)
  • Support for Quad-channel ECC DDR4 memory speeds up to 2400MHz
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket)
  • Floating Point Instruction performance improvements:
    • Faster floating point multiplier completes operations in 3 cycles (down from 5 cycles)
    • 1024 Radix divider for reduced latency
    • Split Scalar divides for increased parallelism/bandwidth
    • Faster vector Gather
  • As introduced with “Haswell”, “Broadwell” continues to support Advanced Vector Extensions (AVX 2.0):
    • effectively double the throughput of integer and floating-point operations with math units expanded from 128-bits to 256-bits
    • introduce Fused Multiply Add (FMA3) instructions which allow a multiply and an accumulate instruction to be completed in a single cycle (effectively doubling the FLOPS/clock from 8 to 16 for each core of a CPU)
    • add support for additional instructions, including Gather and vector shift
    • F16C 16-bit Floating-Point conversion instructions accelerate data conversion between 16-bit and 32-bit floating point formats
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0 (introduced in “Sandy Bridge”), clock speeds are boosted more frequently, to higher speeds and for longer periods of time. With “Haswell” and “Broadwell”, top clock speeds depend upon the type of instructions (AVX vs. Non-AVX).
  • Extract more parallelism in scheduling micro-operations:
    • Reduced instruction latencies on ADC, CMOV and PCLMULQDQ
    • Larger out-of-order scheduler, with 64 entries (up from 60 entries)
    • Introduction of the ADCX and ADOX instructions to speed up cryptography
    • Improved address prediction for branches and returns, with an expanded 10-way Branch Prediction Unit Target Array (up from 8-way)
  • Improved performance on large data sets:
    • Larger L2 Translation Lookaside Buffer (TLB), with 1.5k entries (up from 1K entries)
    • A new L2 TLB for 1GB pages (with 16 entries)
    • Addition of a second TLB page miss handler for parallel page walks
  • Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
  • Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
  • Transactional Synchronization Extensions (TSX) improve the parallelism of multi-threaded applications with synchronization locks
  • Introduction of the RDSEED instruction for high-quality, non-deterministic, random seed values
  • Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
  • 32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) for Directed I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.
  • Improved energy efficiency with Per Core P-States and independent uncore frequency control
  • Hardware Controlled Power Management for more rapid and efficient decisions on optimal P- and C-State operating point
  • DDR4 CRC provides better memory reliability and data integrity by detecting memory bus faults during write
  • ECRC for PCI-Express provides optional data integrity protection for systems using PCI-Express switches or bridges
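The TLB figures in the list above can be turned into a rough "reach" estimate, i.e. how much memory the TLB can map without a page walk (entries × page size). The sketch below is only an illustration: it assumes that "1K" and "1.5k" mean 1024 and 1536 entries and that the standard entries hold 4KB pages, neither of which is stated explicitly in this article.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered by a TLB: number of entries times the page size."""
    return entries * page_size_bytes

# Assumption: standard L2 TLB entries map 4KB pages.
old_reach = tlb_reach_bytes(1024, 4 * KIB)   # previous generation, "1K entries"
new_reach = tlb_reach_bytes(1536, 4 * KIB)   # "1.5k entries"
huge_reach = tlb_reach_bytes(16, 1 * GIB)    # new 16-entry TLB for 1GB pages

print(f"4KB pages, 1024 entries: {old_reach // MIB} MiB")
print(f"4KB pages, 1536 entries: {new_reach // MIB} MiB")
print(f"1GB pages, 16 entries:   {huge_reach // GIB} GiB")
```

Under these assumptions, the 1GB-page TLB covers far more memory than the 4KB-page TLB, which is why large pages tend to matter for applications with large data sets.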

Detailed Specifications of the Intel Xeon E5-2600v3 “Haswell-EP” Processors https://www.microway.com/knowledge-center-articles/detailed-specifications-intel-xeon-e5-2600v3-haswell-ep-processors/ https://www.microway.com/knowledge-center-articles/detailed-specifications-intel-xeon-e5-2600v3-haswell-ep-processors/#respond Mon, 08 Sep 2014 17:00:21 +0000 http://https://www.microway.com/?post_type=incsub_wiki&p=4559 This article provides in-depth discussion and analysis of the 22nm Xeon E5-2600v3 series processors (formerly codenamed “Haswell-EP”). “Haswell” processors replace the previous 22nm “Ivy Bridge” microarchitecture and are available for sale as of September 8, 2014. Note: these have since been superseded by Xeon E5-2600v4 Broadwell-EP Processors. Important changes available in E5-2600v3 “Haswell-EP” include: With […]

This article provides in-depth discussion and analysis of the 22nm Xeon E5-2600v3 series processors (formerly codenamed “Haswell-EP”). “Haswell” processors replace the previous 22nm “Ivy Bridge” microarchitecture and are available for sale as of September 8, 2014. Note: these have since been superseded by Xeon E5-2600v4 Broadwell-EP Processors.

Important changes available in E5-2600v3 “Haswell-EP” include:

  • Up to 18 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14- and 16-cores)
  • Support for DDR4 memory speeds up to 2133MHz
  • Advanced Vector Extensions version 2.0 (AVX2 instructions):
    • allow 256-bit wide operations for both integer and floating-point numbers (the older AVX instructions supported only floating-point operations)
    • introduce Fused Multiply Add FMA3 instructions, which allow a multiply and an accumulate instruction to be completed in a single cycle (potentially doubling throughput for floating-point applications – up to 16 FLOPS per cycle)
    • add support for additional instructions, including Gather and vector shift
  • Improved energy efficiency with Per Core P-States and independent uncore frequency control

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Exceptional Computational Performance

The Xeon E5-2600v3 processors introduce the highest performance available to date in a socketed CPU. For the first time, a single CPU is capable of more than half a TeraFLOPS (500 GFLOPS). This is made possible through the use of AVX2 with FMA3 instructions. The plot below compares the peak performance of these CPUs with and without FMA instructions:

Plot of Xeon E5-2600v3 Theoretical Peak Performance (GFLOPS)

The colored bars indicate performance using only AVX instructions; the grey bars indicate theoretical peak performance when using AVX with FMA. Note that only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Most applications will issue a variety of instructions, which will result in lower than peak FLOPS. Expect the achieved performance for well-parallelized & optimized applications to fall between the grey and colored bars.
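The relationship between the two sets of bars can be sketched as a simple calculation: theoretical peak is cores × clock speed × FLOPS per cycle, with 8 FLOPS per cycle per core for AVX alone and 16 with FMA3 (the figures quoted in this article). The 18-core count is the top of this product line; the 2.0GHz clock is a hypothetical sustained AVX frequency chosen purely for illustration, not a specification of any particular model.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS = cores x clock (GHz) x FLOPS per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

CORES = 18             # highest core count in this product line
AVX_CLOCK_GHZ = 2.0    # hypothetical AVX-mode clock, for illustration only

avx_only = peak_gflops(CORES, AVX_CLOCK_GHZ, 8)    # AVX without FMA
avx_fma = peak_gflops(CORES, AVX_CLOCK_GHZ, 16)    # AVX2 with FMA3

print(f"AVX only:  {avx_only:.0f} GFLOPS")
print(f"AVX + FMA: {avx_fma:.0f} GFLOPS")
```

With these assumed inputs the FMA figure lands above 500 GFLOPS, consistent with the "more than half a TeraFLOPS" claim; real applications should be expected to achieve less.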

Intel Xeon E5-2600v3 Series Specifications

The tabs below compare the features and specifications of the new model line. Intel has divided the CPUs into several groups:

  • Standard: cost-effective CPUs with moderate performance
  • Advanced: CPUs offering the highest performance for most applications
  • High Core Count: ideal for well-parallelized applications; CPUs providing the highest number of processor cores (sometimes sacrificing clock frequency in favor of core count)
  • Frequency Optimized: ideal for non-parallel/single-threaded applications; CPUs with the highest clock speeds (sacrificing number of cores in order to provide the highest frequencies)

Although these processors introduce significant performance increases, technical readers will see that many of the changes are incremental: increased core counts, improved DDR memory speed, etc. However, processor clock speeds/frequencies have not seen significant improvements.

In fact, in some cases the CPU frequency has been lowered from the previous models. Processor frequency and Turbo Boost behavior have changed significantly with this release. Those metrics are discussed in further detail in the next section.

Clock Speeds & Turbo Boost in Xeon E5-2600v3 series “Haswell” processors

With each new processor line, Intel introduces new architecture optimizations. The design of the “Haswell” architecture acknowledges that highly-parallel/vectorized applications place the highest load on the processor cores (requiring more power and thus generating more heat). While a CPU core is executing intensive vector tasks (AVX instructions), the clock speed may be reduced to keep the processor within its power limits (TDP).

In effect, this may result in the processor running at a lower frequency than the “base” clock speed advertised for each model. For that reason, each “Haswell” processor model is assigned two “base” frequencies:

  1. AVX mode: due to the higher power requirements of AVX instructions, clock speeds may be somewhat lower while executing AVX instructions *
  2. Non-AVX mode: while not executing AVX instructions, the processor will operate at what would traditionally be considered the “stock” frequency

* a CPU core will return to Non-AVX mode 1 millisecond after AVX instructions complete
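As a rough mental model of the two base frequencies, the sketch below picks which one applies to a core based on how recently it executed AVX instructions. The 1 millisecond hold time comes from the footnote above; the function name, the treatment of the exact 1 ms boundary, and the 2.3GHz/1.9GHz values are illustrative assumptions, and real hardware behavior is more involved.

```python
from typing import Optional

AVX_HOLD_MS = 1.0  # per the footnote: back to Non-AVX mode 1 ms after AVX completes

def base_clock_ghz(ms_since_last_avx: Optional[float],
                   non_avx_base_ghz: float,
                   avx_base_ghz: float) -> float:
    """Return the base frequency that applies to one core (simplified model).

    ms_since_last_avx is None when the core has not executed AVX instructions.
    """
    if ms_since_last_avx is not None and ms_since_last_avx < AVX_HOLD_MS:
        return avx_base_ghz      # still within the AVX hold window
    return non_avx_base_ghz      # traditional "stock" frequency

# Hypothetical model with a 2.3GHz Non-AVX base and a 1.9GHz AVX base.
print(base_clock_ghz(0.2, 2.3, 1.9))   # recently ran AVX
print(base_clock_ghz(5.0, 2.3, 1.9))   # AVX finished long ago
print(base_clock_ghz(None, 2.3, 1.9))  # never ran AVX
```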

AVX and Non-AVX Turbo Boost

Just as in previous architectures, “Haswell” CPUs include the Turbo Boost feature which causes each processor core to operate well above the “base” clock speed during most operations. The precise clock speed increase depends upon the number & intensity of tasks running on each CPU. With the “Haswell” architecture, Turbo Boost speed increases also depend upon the types of instructions (AVX vs. Non-AVX).

The two plots below show that processor clock speeds can be categorized as:

  1. All cores on the CPU actively running Non-AVX instructions
  2. All cores on the CPU actively running AVX instructions
  3. A single active core running Non-AVX instructions (all other cores on the CPU must be idle)
  4. A single active core running AVX instructions (all other cores on the CPU must be idle)

Note that despite the clear rules stated above, each value is still a range of clock speeds. Because workloads are so diverse, Intel is unable to guarantee one specific clock speed for AVX or Non-AVX instructions. Users are guaranteed that cores will run within a specific range, but each application will have to be benchmarked to determine which frequencies a CPU will operate at.

When examining the differences between AVX and Non-AVX instructions, notice that Non-AVX instructions typically result in no more than a 100MHz to 200MHz increase in the highest clock speed. However, AVX instructions may cause clock speeds to drop by 300MHz to 400MHz if they are particularly intensive.

Recall that AVX2 introduces support for both integer and floating-point instructions, which means any compute-intensive application will be using such instructions (if it has been properly designed and compiled). HPC users should expect their processors to be running in AVX mode most of the time.

Top Clock Speeds for Specific Core Counts

When workloads leave some CPU cores idle, the Xeon E5-2600v3 processors are able to use that headroom to increase the clock speed of the cores which are performing work. Just as with other Turbo Boost scenarios, the precise speed increase will depend upon the CPU model. It will also depend upon how many CPU cores are active.

We advise users to consider how many CPU cores their application is able to saturate. The tabs below detail the peak Turbo Boost frequencies for each CPU model, sorted by the number of active cores:
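One way to reason about such per-core-count data is as a lookup table keyed by the number of active cores and the instruction type. The sketch below uses placeholder frequencies for a hypothetical 12-core model; none of the numbers describe a real Xeon E5-2600v3 CPU, and the table layout is an assumption made for illustration.

```python
# Placeholder turbo table for one hypothetical CPU model. Each entry is
# (max active cores, Non-AVX turbo GHz, AVX turbo GHz). Illustrative values only.
TURBO_TABLE = [
    (2, 3.6, 3.3),
    (4, 3.4, 3.1),
    (8, 3.2, 2.9),
    (12, 3.0, 2.7),
]

def peak_turbo_ghz(active_cores: int, uses_avx: bool) -> float:
    """Look up the peak turbo frequency for a given number of active cores."""
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    for max_cores, non_avx_ghz, avx_ghz in TURBO_TABLE:
        if active_cores <= max_cores:
            return avx_ghz if uses_avx else non_avx_ghz
    raise ValueError("active_cores exceeds the core count of this model")

for n in (1, 4, 12):
    print(n, peak_turbo_ghz(n, uses_avx=False), peak_turbo_ghz(n, uses_avx=True))
```

An application that can only saturate a few cores would be read off the top rows of such a table, while a fully parallel AVX code would be read off the last row.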

All of the above plots show CPU frequencies for applications utilizing AVX instructions. The colored bars indicate the worst-case scenario – CPUs will run at least this fast. The grey bars indicate the expected clock speeds for most workloads.

Cost-Effectiveness and Power Efficiency of Xeon E5-2600v3 CPUs

The “Haswell-EP” processors have nearly the same price structure and power requirements as earlier Xeon E5-2600 products, so their cost-effectiveness and power-efficiency should be quite attractive to HPC users. Savvy readers may find the following facts useful:

  • Although v3 Xeons follow the same price steps as their v2 counterparts, three High-Core-Count models were late additions. These models are higher performing and carry higher prices than previous E5-2600 models.
  • The power requirement (TDP) for each model has increased by 5 Watts over the previous generation. This is due to integration of the Voltage Regulator Modules (VRMs) which were previously placed on the motherboard. Thus, CPU TDP increases 5W and motherboard TDP decreases 5W.
  • The following graphs depict the cost-effectiveness and power-efficiency of only the CPU itself. In many cases, HPC users will find that once they’ve taken the full platform and cluster design into account, the cost-effectiveness of a higher core count CPU may be higher than these plots demonstrate.
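The last point above can be illustrated with a small calculation. The sketch compares two hypothetical CPUs on GFLOPS per dollar, first for the CPU alone and then with a fixed platform cost added; all prices, GFLOPS figures, TDPs, and the platform cost are made-up values for illustration, not data from this article.

```python
def gflops_per_dollar(peak_gflops: float, cpu_price: float,
                      platform_cost: float = 0.0) -> float:
    """Cost-effectiveness: peak GFLOPS divided by total cost."""
    return peak_gflops / (cpu_price + platform_cost)

def gflops_per_watt(peak_gflops: float, tdp_watts: float) -> float:
    """Power efficiency of the CPU itself: peak GFLOPS divided by TDP."""
    return peak_gflops / tdp_watts

# Hypothetical CPUs: (peak GFLOPS, price in dollars, TDP in watts).
mid_range = (400.0, 1000.0, 105.0)
high_core = (600.0, 2000.0, 145.0)
PLATFORM_COST = 3000.0  # hypothetical cost of the rest of the node

for name, (gflops, price, tdp) in (("mid-range", mid_range), ("high-core", high_core)):
    cpu_only = gflops_per_dollar(gflops, price)
    full_node = gflops_per_dollar(gflops, price, PLATFORM_COST)
    print(f"{name}: {cpu_only:.2f} GFLOPS/$ (CPU only), "
          f"{full_node:.2f} GFLOPS/$ (with platform), "
          f"{gflops_per_watt(gflops, tdp):.2f} GFLOPS/W")
```

With these assumed inputs the mid-range CPU looks better when only the CPU price is counted, but the higher core count CPU comes out ahead once the fixed platform cost is included.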

Summary of features in Xeon E5-2600v3 “Haswell-EP” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. The list below provides a summary of relevant technology features:

  • Up to 18 processor cores per socket (with options for 4-, 6-, 8-, 10-, 12-, 14- and 16-cores)
  • Support for Quad-channel ECC DDR4 memory speeds up to 2133MHz
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket)
  • Advanced Vector Extensions (AVX 2.0):
    • effectively double the throughput of integer and floating-point operations with math units expanded from 128-bits to 256-bits
    • introduce Fused Multiply Add (FMA3) instructions which allow a multiply and an accumulate instruction to be completed in a single cycle (effectively doubling the FLOPS/clock from 8 to 16 for each core of a CPU)
    • add support for additional instructions, including Gather and vector shift
    • F16C 16-bit Floating-Point conversion instructions accelerate data conversion between 16-bit and 32-bit floating point formats
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0 (introduced in “Sandy Bridge”), clock speeds are boosted more frequently, to higher speeds and for longer periods of time. With “Haswell”, top clock speeds depend upon the type of instructions (AVX vs. Non-AVX).
  • Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
  • Improved energy efficiency with Per Core P-States and independent uncore frequency control
  • Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
  • Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
  • 32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) for Directed I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

In-Depth Comparison of Intel Xeon E5-4600v2 “Ivy Bridge” Processors https://www.microway.com/knowledge-center-articles/depth-comparison-intel-xeon-e5-4600v2-ivy-bridge-processors/ https://www.microway.com/knowledge-center-articles/depth-comparison-intel-xeon-e5-4600v2-ivy-bridge-processors/#respond Wed, 05 Mar 2014 16:18:50 +0000 http://https://www.microway.com/?post_type=incsub_wiki&p=3628 This article provides in-depth discussion and analysis of the 22nm Xeon E5-4600v2 series processors (formerly codenamed “Ivy Bridge”). These “Ivy Bridge” processors improve upon the previous 32nm “Sandy Bridge” microarchitecture and are available for sale as of March 3, 2014. For an introduction, read our blog post reviewing E5-4600v2. Important changes available in E5-4600v2 “Ivy […]

This article provides in-depth discussion and analysis of the 22nm Xeon E5-4600v2 series processors (formerly codenamed “Ivy Bridge”). These “Ivy Bridge” processors improve upon the previous 32nm “Sandy Bridge” microarchitecture and are available for sale as of March 3, 2014. For an introduction, read our blog post reviewing E5-4600v2.

Important changes available in E5-4600v2 “Ivy Bridge” include:

  • Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
  • Support for DDR3 memory speeds up to 1866MHz
  • Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
  • AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
  • Intel APIC Virtualization (APICv) provides increased virtualization performance

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Intel Xeon E5-4600v2 Series Specifications

Intel Turbo Boost in Xeon E5-4600v2 series “Ivy Bridge” processors

Summary of features in Xeon E5-4600v2 “Ivy Bridge” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. The list below provides a summary of relevant technology features:

  • Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
  • Support for Quad-channel ECC DDR3 memory speeds up to 1866MHz
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket). Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
  • Advanced Vector Extensions (AVX) accelerate floating point operations used in HPC & technical computing applications. This technology expands the math unit from 128-bits to 256-bits, effectively doubling throughput. AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0 (introduced in “Sandy Bridge”), clock speeds are boosted more frequently, to higher speeds and for longer periods of time.
  • Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
  • Intel Intelligent Power Technology reduces individual idling cores to near-zero power. Power gates adjust processors and memory to the lowest available power state to meet workload requirements without impacting performance.
  • Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
  • Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
  • 32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) for Directed I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

More information is available in Intel’s Xeon E5-4600v2 Product Brief.

In-Depth Comparison of Intel Xeon E5-2600v2 “Ivy Bridge” Processors https://www.microway.com/knowledge-center-articles/in-depth-comparison-and-analysis-intel-xeon-e5-2600v2-ivy-bridge-processor/ https://www.microway.com/knowledge-center-articles/in-depth-comparison-and-analysis-intel-xeon-e5-2600v2-ivy-bridge-processor/#respond Tue, 10 Sep 2013 16:01:56 +0000 http://https://www.microway.com/?post_type=incsub_wiki&p=3056 This article provides in-depth discussion and analysis of the 22nm Xeon E5-2600v2 series processors (formerly codenamed “Ivy Bridge”). “Ivy Bridge” processors improve upon the previous 32nm “Sandy Bridge” microarchitecture and are available for sale as of September 10, 2013. For an introduction, read our blog post Intel Xeon E5-2600v2 “Ivy Bridge” Processor Review Important changes […]

This article provides in-depth discussion and analysis of the 22nm Xeon E5-2600v2 series processors (formerly codenamed “Ivy Bridge”). “Ivy Bridge” processors improve upon the previous 32nm “Sandy Bridge” microarchitecture and are available for sale as of September 10, 2013. For an introduction, read our blog post Intel Xeon E5-2600v2 “Ivy Bridge” Processor Review.

Important changes available in E5-2600v2 “Ivy Bridge” include:

  • Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
  • Support for DDR3 memory speeds up to 1866MHz
  • Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
  • AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
  • Intel APIC Virtualization (APICv) provides increased virtualization performance

With a product this complex, it’s very difficult to cover every aspect of the design. Here, we concentrate primarily on the performance of the processors for HPC applications.

Intel Xeon E5-2600v2 Series Specifications

Intel Turbo Boost in Xeon E5-2600v2 series “Ivy Bridge” processors

Summary of features in Xeon E5-2600v2 “Ivy Bridge” processors

In addition to the capabilities mentioned at the top of this article, these processors include many of the successful features from earlier Xeon designs. The list below provides a summary of relevant technology features:

  • Up to 12 processor cores per socket (with options for 4-, 6-, 8- and 10-cores)
  • Support for Quad-channel ECC DDR3 memory speeds up to 1866MHz
  • Direct PCI-Express (generation 3.0) connections between each CPU and peripheral devices such as network adapters, GPUs and coprocessors (40 PCI-E lanes per socket). Improved PCI-Express generation 3.0 support with improved compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for P2P transfers
  • Advanced Vector Extensions (AVX) accelerate floating point operations used in HPC & technical computing applications. This technology expands the math unit from 128-bits to 256-bits, effectively doubling throughput. AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats
  • Turbo Boost technology improves performance under peak loads by increasing processor clock speeds. With version 2.0 (introduced in “Sandy Bridge”), clock speeds are boosted more frequently, to higher speeds and for longer periods of time.
  • Dual Quick Path Interconnect (QPI) links between processor sockets improve communication speeds for multi-threaded applications
  • Intel Intelligent Power Technology reduces individual idling cores to near-zero power. Power gates adjust processors and memory to the lowest available power state to meet workload requirements without impacting performance.
  • Intel Data Direct I/O Technology increases performance and reduces latency by allowing Intel ethernet controllers and adapters to talk directly with the processor cache
  • Advanced Encryption Standard New Instructions (AES-NI) accelerate encryption and decryption for fast, affordable data protection and security
  • 32-bit & 64-bit Intel Virtualization Technology (VT/VT-x) for Directed I/O (VT-d) and Connectivity (VT-c) deliver faster performance for core virtualization processes and provide built-in hardware support for I/O virtualization.
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
  • Hyper-Threading technology allows two threads to “share” a processor core for improved resource usage. Although useful for some workloads, it is not recommended for HPC applications.

Estimating the Performance of a New Computer System https://www.microway.com/knowledge-center-articles/estimating-the-performance-of-a-new-computer-system/ https://www.microway.com/knowledge-center-articles/estimating-the-performance-of-a-new-computer-system/#respond Mon, 05 Aug 2013 03:36:08 +0000 http://https://www.microway.com/?post_type=incsub_wiki&p=2752 Transistor density doubles every ~2 years. Ideally, this means we can provide you computers that contain twice as much memory and perform twice as fast. In practice, it’s much more nuanced. Benchmark Your Application The best performance comparison will always be benchmark testing of the application(s) that will run on your computer. You can check […]

Transistor density doubles every ~2 years. Ideally, this means we can provide you computers that contain twice as much memory and perform twice as fast. In practice, it’s much more nuanced.

Benchmark Your Application

The best performance comparison will always be benchmark testing of the application(s) that will run on your computer. You can check with your software vendor to determine if they have a list of such benchmark results. In our experience, most software vendors don’t have the resources to maintain such lists (or the lists are several years out of date).

If you’re willing, you may contact one of Microway’s technical experts to request a Test Drive. We can provide a remote log-in to one of our systems to demonstrate the performance increase. Certain scientific applications are pre-installed, so get in touch!

Alternatives to Benchmarking

If no benchmarks are published for your application(s), it’s still possible to find results tailored to your field. Read through the list of applications and problem descriptions below. If any are similar to your application, contact a Microway representative to learn what speedups you should expect.

One of our representatives can also give you rough guidelines on the factors that have changed between the generation of hardware you’re currently using and the latest generation. For example, Intel Xeon CPUs released before 2012 typically completed four floating-point operations per clock cycle. With the release of CPUs supporting AVX, this increased to eight floating-point operations per cycle.
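Such rough guidelines can be sketched as a product of per-factor ratios between two hardware generations. In the example below, the change from four to eight floating-point operations per cycle comes from the paragraph above; the core counts and clock speeds are hypothetical values chosen for illustration, and the result is a theoretical upper bound rather than a prediction for any real application.

```python
def speedup_factors(old: dict, new: dict) -> dict:
    """Break a theoretical peak speedup into per-factor ratios (new / old)."""
    factors = {key: new[key] / old[key]
               for key in ("cores", "clock_ghz", "flops_per_cycle")}
    factors["combined"] = (factors["cores"]
                           * factors["clock_ghz"]
                           * factors["flops_per_cycle"])
    return factors

# Hypothetical systems; only the FLOPS-per-cycle values come from the text.
old_cpu = {"cores": 6, "clock_ghz": 3.0, "flops_per_cycle": 4}
new_cpu = {"cores": 8, "clock_ghz": 2.4, "flops_per_cycle": 8}

for name, ratio in speedup_factors(old_cpu, new_cpu).items():
    print(f"{name}: {ratio:.2f}x")
```

Note that a factor can be below 1.0 (here, the lower clock speed) while the combined estimate still improves; benchmarking the actual application remains the only reliable measure.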

Industry-Standard CPU Performance Benchmarks – SPEC CPU2006

SPEC CPU2006 Benchmarks (Floating Point)

Application Area | Benchmark | Brief Description
CFD | LESlie3d | Large-Eddy Simulations with Linear-Eddy Model in 3D. Supports turbulence phenomena such as mixing, combustion, acoustics and general fluid mechanics. For SPEC, solves a subset of such flows, namely the temporal mixing layer. This type of flow occurs in the mixing regions of all combustors that employ fuel injection (which is nearly all combustors).
CFD | bwaves | Simulates blast waves in 3D transonic transient laminar viscous flow.
CFD | LBM | Implements the “Lattice-Boltzmann Method” to simulate incompressible fluids in 3D (computationally the most important part of a larger code which is used in the field of material science to simulate the behavior of fluids with free surfaces, in particular the formation and movement of gas bubbles in metal foams).
Computational Electromagnetics | GemsFDTD | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method. For SPEC, the radar cross section (RCS) of a perfectly conducting (PEC) object is computed.
FEA | CalculiX | Finite element code for linear and nonlinear 3D structural applications. Its solver, CrunchiX, uses the SPOOLES solver library.
FEA | deal II | Program library targeted at adaptive finite elements and error estimation. For SPEC, solves a Helmholtz-type equation with non-constant coefficients in 3D (used for incompressible fluid flow, static or time-harmonic electromagnetics, static and quasi-static elasto-plasticity, general relativity, and implicit time stepping schemes for seismic, acoustic, and electromagnetic applications).
Image Ray-tracing | POVRAY | For SPEC, renders a 1280×1024 anti-aliased image of a chessboard and abstract objects; all object surfaces are procedurally textured (e.g., Perlin noise function).
Linear Programming, Optimization | SoPlex | Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.
Molecular Dynamics | GROMACS | Simulates the Newtonian equations of motion for systems with hundreds to millions of particles. For SPEC, performs a simulation of the protein Lysozyme in a solution of water and ions (23,179 atoms).
Molecular Dynamics | NAMD | Simulates large biomolecular systems. For SPEC, simulates apolipoprotein A-I (92,224 atoms).
Quantum Chemistry | GAMESS | Supports a wide range of quantum chemical computations. For SPEC, self-consistent field (SCF) calculations are performed using the Restricted Hartree-Fock method, Restricted open-shell Hartree-Fock, and Multi-Configuration SCF.
Quantum Chemistry | Tonto | Open source quantum chemistry package. The test case places a constraint on a molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data.
Quantum Chromodynamics | MILC | A gauge field generating program for lattice gauge theory programs with dynamical quarks. For SPEC, the serial version of su3imp is used.
Physics / CFD | ZEUS-MP | A CFD code developed for the simulation of astrophysical phenomena (ideal, non-relativistic, hydrodynamics and magnetohydrodynamics, including externally applied gravitational fields and self-gravity). For SPEC, simulates a 3D blastwave with the presence of a uniform magnetic field along the x-direction.
Physics / General Relativity | CactusADM | Solves the Einstein evolution equations, which describe how spacetime curves in response to its matter content (a set of ten coupled nonlinear partial differential equations). A staggered-leapfrog numerical method is used to carry out the update.
Speech Recognition | Sphinx-3 | A widely-known speech recognition system from Carnegie Mellon University.
Weather | WRF | Next-generation mesoscale numerical weather prediction system for scales from meters to thousands of kilometers. For SPEC, simulates a 30km area over 2 days.

Official SPEC CPU2006 Benchmarks (Floating Point) descriptions

SPEC CPU2006 Benchmarks (Integer)

Application Area | Benchmark | Brief Description
Artificial Intelligence | Sjeng | A highly-ranked chess program that also plays several chess variants. It attempts to find the best move via a combination of alpha-beta or priority proof number tree searches, advanced move ordering, positional evaluation and heuristic forward pruning.
Artificial Intelligence | gobmk | Plays the game of Go, a simply described but deeply complex game. The program plays Go and executes a set of commands to analyze Go positions.
C Compiler | gcc | Based on gcc Version 3.2 – generates code for an AMD Opteron processor.
Combinatorial Optimization | MCF | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
Compression | bzip2 | Julian Seward’s bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
Discrete Event Simulation | OMNeT++ | Uses the OMNeT++ discrete event simulator to model a large Ethernet campus network (about 8000 computers and 900 switches/hubs).
Gene Sequence Search | HMMER | Protein sequence analysis using profile hidden Markov models (profile HMMs). Used in computational biology to search for patterns in DNA sequences.
Path-finding | A-star | Pathfinding library for 2D maps, including the well-known A* algorithm.
Physics / Quantum Computing | libquantum | Simulates a quantum computer, running Shor’s polynomial-time factorization algorithm.
Programming Language | perlbench | Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC’s tool that checks benchmark outputs).
Video Compression | H264ref | A reference implementation of H.264/AVC – encodes a video stream using 2 parameter sets. The H.264/AVC standard has largely replaced the older MPEG2 standard.
XML Processing | Xalan-C++ | Transforms XML documents to other document types (100 MB test data).

Official SPEC CPU2006 Benchmarks (Integer) descriptions
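
To give a feel for the branch-heavy, pointer-chasing work that the integer benchmarks exercise, here is a minimal sketch of the A* algorithm mentioned in the path-finding row above, run on a small 2D grid. It is not the SPEC benchmark's implementation; the function name `astar`, the grid encoding (`#` for a wall), the 4-connected moves with unit cost, and the Manhattan-distance heuristic are all assumptions made for illustration.

```python
import heapq


def astar(grid, start, goal):
    """Return the length of a shortest 4-connected path, or None if unreachable.

    Hypothetical helper for illustration: grid is a list of equal-length
    strings where '#' marks a wall; start and goal are (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates on a unit-cost 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost = {start: 0}
    frontier = [(heuristic(start), 0, start)]  # entries are (f = g + h, g, cell)
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        frontier,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None


demo_grid = [
    ".#.",
    ".#.",
    "...",
]
print(astar(demo_grid, (0, 0), (0, 2)))  # path must detour around the wall
```

Unlike the floating-point kernels, almost all of the time here goes to comparisons, hash lookups and priority-queue updates, so performance depends on integer throughput, branch prediction and cache latency rather than on vector units.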


]]>