High Performance Computing Archives - Microway
We Speak HPC & AI

2nd Gen AMD EPYC “Rome” CPU Review: A Groundbreaking Leap for HPC
https://www.microway.com/hpc-tech-tips/amd-epyc-rome-cpu-review/
Wed, 07 Aug 2019

The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.

2nd Generation AMD EPYC “Rome” CPUs are offered with 8 to 64 cores and base clock speeds from 2.0-3.2GHz. They are available in dual socket SKUs as well as a select number of single socket only SKUs.

Important changes in AMD EPYC “Rome” CPUs include:

  • Up to 64 cores, 2X the maximum of the previous generation, for a massive advancement in aggregate throughput
  • PCI-E Gen4 support, a first for an x86 server CPU, delivering 2X the I/O bandwidth of the x86 competition
  • 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
  • DDR4-3200 support for improved memory bandwidth across 8 channels, reaching up to ~204GB/sec per socket (see the back-of-the-envelope sketch after this list)
  • Next Generation Infinity Fabric, with higher bandwidth for intra- and inter-die connections and roots in PCI-E Gen4
  • New 14nm + 7nm chiplet architecture that separates the 14nm I/O die from the 7nm compute core dies to yield the performance-per-watt benefits of the new TSMC 7nm process node
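
To put rough numbers on these claims, here is a back-of-the-envelope sketch in C. It covers theoretical peaks only: the 16 DP FLOPS/cycle figure reflects Zen2’s two 256-bit FMA units per core, and the clock and channel counts are the EPYC 7742 values quoted later in this article.

```c
/* Theoretical peak FLOPS and memory bandwidth for a 2nd Gen EPYC socket.
 * These are paper numbers; sustained results depend on the workload. */
#include <stdio.h>

int main(void) {
    int    cores           = 64;    /* EPYC 7742 */
    double base_ghz        = 2.25;  /* base clock, GHz */
    int    flops_per_cycle = 16;    /* DP FLOPS/core/cycle: 2 FMA x 4 doubles x 2 ops */

    printf("Peak DP compute: %.0f GFLOPS per socket (at base clock)\n",
           cores * base_ghz * flops_per_cycle);

    int    channels = 8;      /* DDR4 memory channels per socket */
    double mts      = 3200;   /* DDR4-3200 transfers per second (millions) */
    double bytes    = 8;      /* 64-bit data path per channel */

    printf("Peak memory bandwidth: %.1f GB/sec per socket\n",
           channels * mts * bytes / 1000.0);   /* ~204.8 GB/sec */
    return 0;
}
```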

Leadership HPC Performance

There’s no other way to say it: the 2nd Generation AMD EPYC “Rome” CPUs (EPYC 7xx2) break new ground for HPC performance. In our experience, we haven’t seen this kind of advancement in CPU performance in many years, at least not without exotic architectural changes. This leap applies across floating point and integer applications.

Note: This article focuses on SPEC benchmark performance (which is rooted in real integer and floating point applications). If you’re hunting for a more raw FLOPS/dollar calculation, please visit our Knowledge Center Article on AMD EPYC 7xx2 “Rome” CPUs.

Floating Point Benchmark Performance

In short: at the top bin, you may see up to 2.12X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_fp_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.79X the performance.
[Chart: AMD Rome SPECfp 2017 vs Xeon CPUs - top bin]

Integer Benchmark Performance

Integer performance largely mirrors the same story. At the top bin, you may see up to 2.49X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_int_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.90X the performance.
[Chart: AMD Rome SPECint 2017 vs Xeon CPUs - top bin]

What Makes EPYC 7xx2 Series Perform Strongly?

Contributions towards this leap in performance come from a combination of:

  • 2X the FLOPS per core available in the new architecture
  • Improved performance of the Zen2 microarchitecture
  • Moderate increases in clock speeds
  • Most importantly, dramatic increases in core count

These last 2 items are facilitated by the new 7nm process node and the chiplet architecture of EPYC. Couple that with the advantages in memory bandwidth, and you have a recipe for HPC performance.

Performance Outlook


Given the dramatic increase in core count coupled with Zen2, we predict that most of the 32-core models and above (about half of AMD’s SKU stack) are likely to outperform the top Xeon Platinum 8200 series SKU. Stay tuned for the SPEC benchmarks that confirm this assertion.

If you’re comparing against more modest Xeon Gold 62xx or Silver 52xx/42xx SKUs, we predict an even more dramatic performance uplift. This is the first time in many years we’ve seen such an incredibly competitive product from the AMD Server Group.

Class Leading Price/Performance

AMD EPYC 7xx2 series isn’t just impressive from an absolute performance perspective. It’s also a price performance machine.

Examine these same two top-bin SKUs once again:
[Chart: AMD Rome SPECfp 2017 vs Xeon CPUs - price/performance]

The top-bin AMD SKU does 1.79X the floating point work of the Xeon Platinum 8280 at approximately 2/3 the price. It also delivers 2.12X the floating point performance of the Xeon Gold 6252, at roughly similar price/performance.

Should you be willing to accept more modest core counts with the lower cost SKUs, these comparisons only get better.

Finally, if you’re looking to roughly match or exceed the performance of the top-bin Xeon Gold 6252 SKU, we predict you’ll be able to do so with the 24-core EPYC 7352. This will be at just over 1/3 the price of the Xeon socket.

This much more typical comparison is emblematic of the price-performance advantage AMD has delivered in the new generation of CPUs. Stay tuned for more benchmark results and charts to support the prediction.

A Few Caveats: Performance Tuning & Out of the Box

Application Performance Engineers have spent years optimizing applications for the most widely available x86 server CPU. For a number of years now, that has meant Intel’s Xeon processors. The benchmarks presented here represent performance-tuned results.

We don’t yet have great data on how easy it is to achieve optimized performance with these new AMD “Rome” CPUs. For those of us who have been in HPC for some time, we know that out-of-the-box performance and optimized performance can mean very different things.

AMD does recommend specific compilers (AOCC, GCC, LLVM) and libraries (BLIS in place of BLAS, libFLAME in place of LAPACK) to achieve optimized results with all EPYC CPUs. We don’t yet have a complete understanding of how much these help end users achieve these superior results. Does it require a lot of tuning to reach the most exceptional performance?

AMD has, however, released a new Compiler Options Quick Reference Guide for the new CPUs. We strongly recommend using these flags and options when tuning your application.
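
As a hedged illustration (not AMD’s official guidance), the sort of loop below is what those flags target. The compile lines in the header comment are a sketch: -march=znver2 is the Zen2 target in GCC 9+ and recent LLVM/AOCC; consult the Quick Reference Guide for the authoritative option set.

```c
/* saxpy.c -- a trivial kernel for experimenting with Zen2-targeted flags.
 * Illustrative compile lines (verify against AMD's guide):
 *   gcc   -O3 -march=znver2 -funroll-loops saxpy.c -o saxpy
 *   clang -O3 -march=znver2 saxpy.c -o saxpy
 */
#include <stdio.h>

void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* should auto-vectorize to 256-bit FMAs */
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy(8, 2.0f, x, y);
    printf("%f\n", y[7]);         /* prints 16.000000 */
    return 0;
}
```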

Chiplet and Multi-Die Architecture: IO and Compute Dies

[Image: AMD EPYC Rome die]

One of the chief innovations in the 2nd Generation AMD EPYC CPUs is in the evolution of the multi-die architecture pioneered in the first EPYC CPUs.

Rather than create one monolithic, hard-to-yield die, AMD has opted to lash smaller “chiplets” together in a single socket with Infinity Fabric technology.

Compute Dies (now in 7nm)

Up to 8 compute chiplets (formally, Core Complex Dies or CCDs) are brought together to create a single socket. These CCDs take advantage of the latest 7nm TSMC process node. By using 7nm for the compute cores in 2nd Generation EPYC, AMD takes advantage of the space and power efficiencies of the latest process—without the yield issues of a single monolithic die.

What does it mean for you? More cores than anticipated in a single socket, a reasonable power efficiency for the core count, and a less costly CPU.

The 14nm IO Die

In 2nd Generation EPYC CPUs, AMD has gone a step further with the chiplet architecture. The compute chiplets are now complemented by a separate I/O die. The I/O die contains the memory controllers, PCI-Express controllers, and the Infinity Fabric connection to the remote socket. This also resolves the NUMA affinity quirks of the 1st generation EPYC processors.

Moreover, the I/O die is created on the established 14nm process node. It’s less important that this die capture the power efficiencies of the 7nm process.

DDR4-3200 and Improved Memory Bandwidth

AMD EPYC 7xx2 series improves its theoretical memory bandwidth when compared to both its predecessor and the competition.

DDR4-3200 DIMMs are supported, and they are clocked 20% faster than DDR4-2666 and 9% faster than DDR4-2933.
In summary, the platform offers:

  • Compared to Cascade Lake-SP (Xeon Platinum/Gold 82xx, 62xx): Up to a 45% improvement in memory bandwidth
  • Compared to Skylake-SP (Xeon Platinum/Gold 81xx, 61xx): Up to a 60% improvement in memory bandwidth
  • Compared to AMD EPYC 7xx1 Series (Naples): Up to a 20% improvement in memory bandwidth



These comparisons assume a system where only the first DIMM per channel is populated. Part of this memory bandwidth advantage is derived from the increase in DIMM speeds (DDR4-3200 vs 2933/2666); part of it is derived from EPYC’s 8 memory channels (vs 6 on Xeon Skylake/Cascade Lake-SP).

While we’ve yet to see final STREAM testing numbers for the new CPUs, we anticipate they will largely reflect the changes in theoretical memory bandwidth.
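
For readers who want a rough sanity check of their own, a minimal STREAM-style triad kernel is sketched below. This single-threaded version will not saturate a socket; the official STREAM benchmark, built with OpenMP and with threads pinned across all 8 channels, is the right tool for reportable numbers.

```c
/* stream_triad.c -- minimal STREAM-style triad (a = b + scalar*c).
 * Allocates ~1.5GB; compile with: gcc -O2 stream_triad.c -o triad */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 26)   /* 64M doubles per array (512MB each) */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* touch all pages before timing so page faults aren't measured */
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* 24 bytes of traffic per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Triad: %.1f GB/sec (check value: %.1f)\n",
           3.0 * N * sizeof(double) / sec / 1e9, a[N - 1]);
    free(a); free(b); free(c);
    return 0;
}
```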

PCI-E Gen4 Support: 2X the I/O bandwidth

EPYC “Rome” CPUs have an integrated PCI-E generation 4.0 controller on the I/O die. Each PCI-E lane doubles in maximum theoretical bandwidth to 4GB/sec (bidirectional).

A 16 lane connection (PCI-E x16 4.0 slot) can now deliver up to 64GB/sec of bidirectional bandwidth (32GB/uni). That’s 2X the bandwidth compared to first generation EPYC and the x86 competition.
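
The per-lane arithmetic behind those figures, as a small worked example. These are theoretical maxima (PCI-E Gen3 and Gen4 both use 128b/130b encoding); real transfers lose a bit more to protocol overhead.

```c
/* PCI-E theoretical bandwidth: GT/s x encoding efficiency / 8 bits x lanes */
#include <stdio.h>

double pcie_gb_per_s(double gt_per_s, int lanes) {
    double encoding = 128.0 / 130.0;              /* 128b/130b line code */
    return gt_per_s * encoding / 8.0 * lanes;     /* GB/sec, one direction */
}

int main(void) {
    printf("Gen3 x16: %.1f GB/sec per direction\n", pcie_gb_per_s(8.0, 16));
    printf("Gen4 x16: %.1f GB/sec per direction\n", pcie_gb_per_s(16.0, 16));
    /* Gen4 x16: ~31.5 GB/sec each way, ~63 GB/sec bidirectional --
     * the "64GB/sec" and "32GB/uni" figures above are the rounded values */
    return 0;
}
```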

Broadening Support for High Bandwidth I/O Devices

[Image: Mellanox ConnectX-6 adapter]
The new support allows for higher bandwidth connections to InfiniBand and other fabric adapters, storage adapters, NVMe SSDs, and, in the future, GPU accelerators and FPGAs.

Some of these devices, like Mellanox ConnectX-6 200Gb HDR InfiniBand adapters, were unable to realize their maximum bandwidth in a PCI-E Gen3 x16 slot. Their performance should improve in a PCI-E Gen4 x16 slot with 2nd Generation AMD EPYC Processors.

2nd Generation AMD EPYC “Rome” is the only x86 server CPU with PCI-E Gen4 support at its launch in 3Q 2019. However, we have seen PCI-E Gen4 support before in the POWER9 platform.

System Support for PCI-E Gen4

Unlike in the previous generation AMD EPYC “Naples” CPUs, there is no strong affinity of PCI-E lanes to a particular chiplet inside the processor. In Rome, all I/O traffic routes through the I/O die, and all chiplets reach PCI-E devices through this die.

In order to support PCI-E Gen4, server and motherboard manufacturers are producing brand new versions of their platforms. Not every Rome-ready platform supports Gen4, so if this is a requirement be sure to specify this to your hardware vendor. Our team can help you select a server with full Gen4 capability.

Infinity Fabric

[Diagram: AMD Infinity Fabric]
Deeply interrelated with PCI-Express Gen4, AMD has also improved the Infinity Fabric link between chiplets and sockets with the new generation of EPYC CPUs.

AMD’s Infinity Fabric has many commonalities with PCI-Express used to connect I/O devices. With 2nd Generation AMD EPYC “Rome” CPUs, the link speed of Infinity Fabric has doubled. This allows for higher bandwidth communication between dies on the same socket and to dies on remote sockets.

The result should be improved application performance for NUMA-aware and especially non-NUMA-aware applications. The increased bandwidth should also help hide any transport bandwidth issues to I/O devices on a remote socket. The overall result is “smoother” performance when applications scale across multiple chiplets and sockets.

SKUs and Strategies to Consider for HPC Clusters

Here is the complete list of SKUs and 1KU (1000-unit) prices (source: AMD). Please note that these are the prices of CPUs sold to channel integrators, not those of fully integrated systems with these CPUs.

Dual Socket SKUs

SKU    Cores  Base Clock  Boost Clock  L3 Cache  TDP    Price
7742   64     2.25 GHz    3.4 GHz      256MB     225W   $6950
7702   64     2.0 GHz     3.35 GHz     256MB     200W   $6450
7642   48     2.3 GHz     3.3 GHz      256MB     225W   $4775
7552   48     2.2 GHz     3.3 GHz      192MB     200W   $4025
7542   32     2.9 GHz     3.4 GHz      128MB     225W   $3400
7502   32     2.5 GHz     3.35 GHz     128MB     180W   $2600
7452   32     2.35 GHz    3.35 GHz     128MB     155W   $2025
7402   24     2.8 GHz     3.35 GHz     128MB     180W   $1783
7352   24     2.3 GHz     3.2 GHz      128MB     155W   $1350
7302   16     3.0 GHz     3.3 GHz      128MB     155W   $978
7282   16     2.8 GHz     3.2 GHz      64MB      120W   $650
7272   12     2.9 GHz     3.2 GHz      64MB      120W   $625
7262   8      3.2 GHz     3.4 GHz      128MB     155W   $575
7252   8      3.2 GHz     3.4 GHz      64MB      120W   $475

EPYC 7742 or 7702 (64c): Select a High-End SKU, yield up to 2X the performance

Assuming your application scales with core count, and maximum performance at a premium cost fits within your budget, you can’t beat the top 64-core EPYC 7742 or 7702 SKUs. These will deliver greater throughput on a wide variety of multi-threaded applications.

Anything above EPYC 7452 (32c, 48c): Select a Mid-High Level SKU, reach new performance heights

While these SKUs aren’t inexpensive, they take application performance to new heights and break new benchmark ground. If your application is multi-threaded, you can take full advantage of that performance. From a price/performance perspective, these SKUs may also be attractive.

EPYC 7452 (32c): Select a Mid Level SKU, improve price performance vs previous generation EPYC

Previous generation AMD EPYC 7xx1 Series CPUs also featured 32 cores. However, the 32 core entrant in the new 7xx2 stack is far less costly than the prior generation while delivering greater memory bandwidth and 2X the FLOPS per core.

EPYC 7452 (32c): Select a Mid Level SKU, match top Xeon Gold and Platinum with far better price/performance

If you’re optimizing for price/performance compared to the top Intel Xeon Platinum 8200 or Xeon Gold 6200 series SKUs, consider this SKU or ones near it. We predict this to be at or near the price/performance sweet-spot for the new platform.

EPYC 7402 (24c): Select a Mid Level SKU, come close to top Xeon Gold and Platinum SKUs

The comparatively high 2.8GHz base clock of this SKU also makes it well suited to applications that reward per-core performance.

EPYC 7272-7402 (12, 16, or 24c): Select an affordable SKU, yield better performance and price/performance

Treat these SKUs as much more affordable alternatives to most Xeon Gold or Silver CPUs. We’ll await further benchmarks to see exactly where the further sweet-spots are compared to these SKUs. They also compare favorably from a price/performance standpoint to 1st Generation EPYC 7xx1 processors with 12, 16, or 24 cores. Same performance, fewer dollars!

Single Socket Performance

As with the previous generation, AMD is heavily promoting the concept of replacing dual socket Intel Xeon servers with single sockets of 2nd Generation AMD EPYC “Rome.” They are producing discounted “P” SKUs, which support single socket platforms only, to further boost the price-performance advantage of these systems.

Single Socket SKUs

SKU     Cores  Base Clock  Boost Clock  L3 Cache  TDP    Price
7702P   64     2.0 GHz     3.35 GHz     256MB     200W   $4425
7502P   32     2.5 GHz     3.35 GHz     128MB     180W   $2300
7402P   24     2.8 GHz     3.35 GHz     128MB     180W   $1250
7302P   16     3.0 GHz     3.3 GHz      128MB     155W   $825
7232P   8      3.1 GHz     3.2 GHz      32MB      120W   $450

Due to the boosted capability of the new CPUs, a single socket configuration may be an increasingly viable alternative to a dual socket Xeon platform for many workloads.

Next Steps: get started today!

Read More

If you’d like to read more speeds and feeds about these new processors, check out our article with detailed specifications of the 2nd Gen AMD EPYC “Rome” CPUs. We summarize and compare the specifications of each model, and provide guidance over and beyond what you’ve seen here.

Try 2nd Gen AMD EPYC CPUs for Yourself

Groups which prefer to verify performance before making a decision are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with AMD EPYC CPUs, large memory, and more.

Browse Our Navion AMD EPYC Product Line

WhisperStation – Ultra-Quiet AMD EPYC workstations
Servers – High performance AMD EPYC rackmount servers
Clusters – Leadership performance clusters from 5-500 nodes

Intel Xeon Scalable “Cascade Lake SP” Processor Review
https://www.microway.com/hpc-tech-tips/intel-xeon-scalable-cascade-lake-sp-processor-review/
Tue, 02 Apr 2019

With the launch of the latest Intel Xeon Scalable processors (previously code-named “Cascade Lake SP”), a new standard is set for high performance computing hardware. These latest Xeon CPUs bring increased core counts, faster memory, and faster clock speeds. They are compatible with the existing workstation and server platforms that have been shipping since mid-2017. Starting today, Microway is shipping these new CPUs across our entire line of turn-key Xeon workstations, systems, and clusters.

Important changes in Intel Xeon Scalable “Cascade Lake SP” Processors include:

  • Higher CPU core counts for many SKUs in the product stack
  • Improved CPU clock speeds (with Turbo Boost up to 4.4GHz)
  • Introduction of the new AVX-512 VNNI instruction for Intel Deep Learning Boost, which
    provides significantly more efficient deep learning inference acceleration (see the sketch after this list)
  • Higher memory capacity & performance:
    • Most CPU models provide increased memory speeds
    • Support for DDR4 memory speeds up to 2933MHz
    • Large-memory capabilities with Intel Optane DC Persistent Memory
    • Support for up to 4.5TB-per-socket system memory
  • Integrated hardware-based security mitigations against side-channel attacks
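
A hedged sketch of what the VNNI bullet means in practice: the new vpdpbusd instruction multiplies unsigned 8-bit activations by signed 8-bit weights and accumulates groups of four products into 32-bit sums, work that previously took a three-instruction AVX-512 sequence. The example below uses the corresponding compiler intrinsic; it requires a VNNI-capable CPU (such as these new Xeons) and GCC 9+ or a recent Clang.

```c
/* int8 dot product via AVX-512 VNNI (Intel Deep Learning Boost) */
#include <immintrin.h>
#include <stdio.h>

__attribute__((target("avx512f,avx512vnni")))
int dot_i8(const unsigned char *a, const signed char *b, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {             /* 64 int8 pairs per step */
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        acc = _mm512_dpbusd_epi32(acc, va, vb);   /* the VNNI instruction */
    }
    return _mm512_reduce_add_epi32(acc);          /* sum the 16 int32 lanes */
}

int main(void) {
    unsigned char a[64];
    signed char   b[64];
    for (int i = 0; i < 64; i++) { a[i] = 1; b[i] = 2; }
    printf("%d\n", dot_i8(a, b, 64));             /* prints 128 */
    return 0;
}
```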

More for Your Dollar: performance uplift

With an increase in core counts, clock speeds, and memory speeds, applications will achieve better performance across the board. Particularly in the lower-end Xeon 4200- and 5200-series CPUs, the cost-effectiveness of the processors has increased considerably. The plot below compares the price of each processor against its performance. Both the current “Cascade Lake SP” and previous-generation “Skylake-SP” are shown:

[Chart: Intel Xeon Cascade Lake SP cost-effectiveness vs Skylake-SP for applications with AVX-512 instructions]
In the diagram above, the wide colored bars indicate the price performance of these new Xeon CPUs. The dots indicate the price performance of the previous generation, which allows us to compare the two generations SKU by SKU (though a few of the newer models do not have previous-generation counterparts). In this comparison, lower values are better and indicate a higher quantity of computation per dollar spent.

Same SKU – More Performance

As shown above, many models offer more performance than their previous-generation counterpart. Here we highlight models which are showing particularly substantial improvements:

  • Xeon 4210 is 34% more price-performant than Xeon 4110
  • Xeon 4214 is 30% more price-performant than Xeon 4114
  • Xeon 4216 is 25% more price-performant than Xeon 4116
  • Xeon 5218 is 40% more price-performant than Xeon 5118
  • Xeon 5220 is 34% more price-performant than Xeon 5120
  • Xeon 6242 saw an 8% increase in clock speed and ~10% reduction in price
  • Xeon 8270 is 28% more price-performant than Xeon 8170

To summarize: this latest generation will provide more performance for the same cost if you stick with the model numbers you’ve been using. In the next section, we’ll review opportunities for cost reduction.

More for Less: Select a more modest Cascade Lake SKU for the same core count or performance

With generational improvements, it’s not unusual for a new CPU to replace a higher-end version of the older generation. There are many cases where this is true in the Cascade Lake Xeon CPUs, so be sure to consider if you can leverage such savings.

Guaranteed savings

  • Xeon 4208 replaces the Xeon 4110: providing the same 8 cores for a lower price
  • Xeon 4210 replaces the Xeon 4114: providing the same 10 cores for a lower price
  • Xeon 4214 surpasses the Xeon 4116: providing the same 12 cores at higher clock speeds
  • Xeon 5218 surpasses the Xeon 5120: providing more cores, higher clock speeds, and faster memory speeds

Worthy of consideration

  • Xeon 4216 may replace most of the 5100-series: Xeon 5115, 5118 and 5120
    Nearly all specifications are equivalent, but the UPI speed of the Xeon 4216 is 9.6GT/s rather than 10.4GT/s
  • Xeon 6230 likely replaces the Xeon 6130, 6138, 6140: providing the same or more cores for a lower price
  • Xeon 6240 competes with every Xeon 6100-series model
    with the exception that it does not provide 3+GHz processor frequencies

Greater Memory Bandwidth

For computationally-intensive applications, rapid access to data is critical. Thus, memory speed increases are valuable improvements. This generation of CPUs brings a 10% improvement to the Xeon 5200-series (2666MHz; up from 2400MHz) and the Xeon 6200-/8200-series (2933MHz; up from 2666MHz). This means that the Xeon 5200-series CPUs are more competitive (they’re running memory at the same speed as last generation’s Xeon 6100- and 8100-series processors). And the higher-end Xeon 6200-/8200-series CPUs have a 10% memory performance advantage over all others.

While a 10% improvement may seem to be only a modest improvement, keep in mind that it’s essentially a free upgrade. Combined with the other features and improvements discussed above, you can be confident you’re making the right choice by upgrading to these newest Intel Xeon Scalable CPUs.

Enabling Very Large Memory Capacity

With the official launch of Intel Optane DC Persistent Memory, it is now possible to deploy systems with multiple terabytes of system memory. Well-equipped systems provide each Xeon CPU with six Optane memory modules (alongside six standard memory modules). This results in up to 3TB of Optane memory and 1.5TB of standard DRAM per CPU! Look for more information on these possibilities as HPC sites begin adopting and exploring this new technology.

Transitioning from the “Skylake-SP” Intel Xeon Scalable CPUs

Because the new “Cascade Lake SP” CPUs are socket-compatible with the previous-generation “Skylake SP” CPUs, the upgrade path is simple. All existing platforms that support the earlier CPUs can also accept these new CPUs. This also simplifies the choice for those considering a new system: the new CPUs use existing, proven platforms. There’s little risk in selecting the latest and highest-performance components. HPC sites adding to existing clusters will find they have a choice: spend the same for increased performance or spend less for the same performance. Below are peak performance comparisons of the previous generation CPUs with the new generation:

[Chart: Peak performance comparison, Cascade Lake SP vs Skylake-SP]
The wider/colored bars indicate peak performance for the new Xeon CPUs. The slim grey bars indicate peak performance for the previous-generation Xeon CPUs. Without exception, the new CPUs are expected to outperform their predecessors. The widest margins of improvement are in the lower-end Xeon 4200- and 5200-series.

Standout performance in a single socket

This generation introduces three CPU models designed for single-socket systems (providing very high throughput at relatively low-cost). They provide 20+ CPU cores at prices as much as $2,000 less than their multi-socket counterparts. If your workload performs well with a single CPU, these SKUs will be incredibly valuable:

  • Xeon 6209U outperforms nearly all of last generation’s Xeon Gold 6100-series CPUs
  • Xeon 6210U outperforms all Xeon 6100-series and many 6200-series CPUs
  • Xeon 6212U outperforms several of the Xeon 8100-series CPUs

The only exception to the above would be for applications which require very high clock speeds, as these single-socket CPU models do not provide base processor frequencies higher than 2.5GHz. The strength of these single-socket processors is in high throughput (via high core count) and decent clock speeds.

Next Steps: get started today!

Read More

If you’d like to read more about these new processors, check out our article with detailed specifications of the Intel Xeon “Cascade Lake SP” CPUs. We summarize and compare the specifications of each model, and provide guidance on which models are likely to be best suited to computationally-intensive HPC & Deep Learning applications.

Try Intel Xeon Scalable CPUs for Yourself

Groups which prefer to verify performance before making a decision are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with Intel Xeon Scalable CPUs, large memory, and more.

Speak with an Expert

If you’re expecting to be upgrading or deploying new systems in the coming months, our experts would be happy to help you consider your options and design a custom cluster optimized to your workloads. We also help groups writing budget proposals to ensure they’re requesting the correct resources. Please get in touch!

Tesla V100 “Volta” GPU Review
https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/
Thu, 28 Sep 2017

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, one giant leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here’s a summary of the key enhancements vs Tesla P100 GPUs:

                           Tesla V100 (NVLink)  Tesla V100 (PCI-E)  Tesla P100 (NVLink)  Tesla P100 (PCI-E)  Ratio V100:P100
DP TFLOPS                  7.8                  7.0                 5.3                  4.7                 ~1.4-1.5X
SP TFLOPS                  15.7                 14                  9.3                  8.74                ~1.4-1.5X
TensorFLOPS                125                  112                 21.2 (FP16)          18.7 (FP16)         ~6X
Interface (bidirec. BW)    300GB/sec            32GB/sec            160GB/sec            32GB/sec            1.88X NVLink, 9.38X PCI-E
Memory Bandwidth           900GB/sec            900GB/sec           720GB/sec            720GB/sec           1.25X
CUDA Cores (Tensor Cores)  5120 (640)           5120 (640)          3584                 3584

Performance of Tesla GPUs, Generation to Generation

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium for the NVLink-enabled SXM2 form factor GPUs. We’re excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:

Performance Summary

Increases in relative performance are widely workload dependent. But early testing demonstrates HPC performance advancing approximately 50% in just a 12-month period.
[Chart: Tesla V100 HPC performance]
If you haven’t made the jump to Tesla P100 yet, Tesla V100 is an even more compelling proposition.

For Deep Learning, Tesla V100 delivers a massive leap in performance. This extends to training time, and it also brings unique advancements for inference as well.
[Chart: Tesla V100 deep learning performance summary]

Enhancements for Every Workload

Now that we’ve learned a bit about each GPU, let’s review the breakthrough advancements for Tesla V100.

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

NVLink helps you to transcend the bottlenecks in PCI-Express based systems and dramatically improve the data flow to GPU accelerators.

The total data pipe for each Tesla V100 with NVLink GPUs is now 300GB/sec of bidirectional bandwidth—that’s nearly 10X the data flow of PCI-E x16 3.0 interfaces!

Bandwidth

NVIDIA has enhanced the bandwidth of each individual NVLink brick by 25%: from 40GB/sec (20+20GB/sec in each direction) to 50GB/sec (25+25GB/sec) of bidirectional bandwidth.

This improved signaling helps carry the load of data intensive applications.

Number of Bricks

But enhanced NVLink isn’t just about simple signaling improvements. Point-to-point NVLink connections are divided into “bricks,” or links. Each brick delivers 50GB/sec of bidirectional bandwidth.

From 4 bricks to 6

Over and above the signaling improvements, Tesla V100 with NVLink increases the number of NVLink “bricks” that are embedded into each GPU: from 4 bricks to 6.

This 50% increase in brick quantity delivers a large increase in bandwidth for the world’s most data intensive workloads. It also allows for a more diverse set of system designs and configurations.

The NVLink Bank Account

NVIDIA NVLink technology continues to be a breakthrough for both HPC and Deep Learning workloads. But as with previous generations of NVLink, there’s a design choice related to a simple question:

Where do you want to spend your NVLink bandwidth?

Think about NVLink bricks as a “bank account” of bandwidth to spend. Each NVLink system design strikes a different balance in where it “spends the funds.” You may wish to:

  • Spend links on GPU:GPU communication
  • Focus on increasing the number of GPUs
  • Broaden CPU:GPU bandwidth to overcome the PCI-E bottleneck
  • Create a balanced system, or prioritize design choices solely for a single workload

There are good reasons for each, or combinations, of these choices. DGX-1V, NumberSmasher 1U GPU Server with NVLink, and future IBM Power Systems products all set different priorities.

We’ll dive more deeply into these choices in a future post.

Net-net: the NVLink bandwidth increases from 160GB/sec to 300GB/sec (bidirectional), and a number of new, diverse HPC and AI hardware designs are now possible.
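
That net-net is simple arithmetic, shown here as a worked example:

```c
/* NVLink aggregate bandwidth: bricks x per-brick bidirectional rate */
#include <stdio.h>

int main(void) {
    printf("Tesla P100: %d bricks x %d GB/sec = %d GB/sec\n", 4, 40, 4 * 40);
    printf("Tesla V100: %d bricks x %d GB/sec = %d GB/sec\n", 6, 50, 6 * 50);
    return 0;
}
```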

Programming Improvements

Tesla V100 and CUDA 9 bring a host of improvements for GPU programmers. These include:

  • Cooperative Groups
  • A new L1 cache + shared memory design that simplifies programming
  • A new SIMT model that relieves the need to program to fit 32-thread warps

We won’t explore these in detail in this post, but we encourage you to explore NVIDIA’s CUDA 9 documentation for more.

What does Tesla V100 mean for me?

Tesla V100 GPUs mean a massive change for many workloads. But each workload will see different impacts. Here’s a short summary of what we see:

  1. An increase in performance on HPC applications (an on-paper FLOPS increase of 50%, with diverse real-world impacts)
  2. A massive leap for Deep Learning Training
  3. 1 GPU, many Deep Learning workloads
  4. New system designs, better tuned to your applications
  5. Radical new, and radically simpler, programming paradigms (coherence)

Can I use Deep Learning?
https://www.microway.com/hpc-tech-tips/can-i-use-deep-learning/
Thu, 30 Jun 2016

If you’ve been reading the press this year, you’ve probably seen mention of deep learning or machine learning. You’ve probably gotten the impression they can do anything and solve every problem. It’s true that computers can be better than humans at recognizing people’s faces or playing the game Go. However, it’s not the solution to every problem. We want to help you understand if you can use deep learning. And if so, how it will help you.

Just as they have for decades, computers performing deep learning are running a specific set of instructions written by their programmers. Only now, we have a method which allows them to learn from their mistakes until they’re doing the task with high accuracy.

If you have a lot of data (images, videos, text, numbers, etc), you can use that data to train your computers on what you want done with the information. The result, an artificial neural network trained for this specific task, can then process any new data you provide.
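
To make “learn from their mistakes” concrete, here is a toy single-neuron sketch in C. This is not deep learning (real networks stack many layers of such units and train them with backpropagation on GPUs), but the feedback loop is the same idea: predict, measure the error, nudge the weights, and repeat until the answers are accurate.

```c
/* A perceptron learning the AND function by correcting its mistakes */
#include <stdio.h>

int main(void) {
    double x[4][2] = {{0,0}, {0,1}, {1,0}, {1,1}};  /* training inputs */
    int    t[4]    = { 0,     0,     0,     1   };  /* desired outputs */
    double w[2] = {0, 0}, bias = 0, rate = 0.1;

    for (int epoch = 1; epoch <= 100; epoch++) {
        int mistakes = 0;
        for (int i = 0; i < 4; i++) {
            int out = (w[0]*x[i][0] + w[1]*x[i][1] + bias > 0);
            int err = t[i] - out;                   /* the "mistake" */
            if (err) {
                w[0] += rate * err * x[i][0];       /* nudge toward the answer */
                w[1] += rate * err * x[i][1];
                bias += rate * err;
                mistakes++;
            }
        }
        if (!mistakes) {                            /* converged */
            printf("learned AND after %d epochs\n", epoch);
            return 0;
        }
    }
    return 0;
}
```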

We’ve written a detailed post on recent developments in Deep Learning applications. Below is a brief summary.

What types of problems are being solved using Deep Learning?

Computer Vision

If you have a lot of imaging data or photographs, then deep learning should certainly be considered. Deep learning has been used extensively in the field of computer vision. Examples include image classification (describing the items in a picture) and image enhancement (removing defects or fog from photographs). It is also vital to many of the self-driving car projects.

Written Language and Speech

Deep Learning has also been used extensively with language. Certain types of networks are able to pick clues and meaning from written text. Others have been created to translate between different languages. You may have noticed that smartphones have recently become much more accurate at recognizing spoken language – a clear demonstration of the ability of deep learning.

Scientific research, engineering, and medicine

Materials scientists have used deep learning to predict how alloys will perform – allowing them to investigate 800,000 candidates while conducting only 36 actual, real-world tests. Such success promises dramatic improvements in the speed and efficiency of such projects in the future.

Physicists researching the Higgs boson have used deep learning to clean up their data and better understand what happens when they witness one of these particles. Simply dealing with the data from CERN’s Large Hadron Collider has been a significant challenge for these scientists.

Those studying life science and medicine are looking to use these methods for a variety of tasks, such as:

  • determining the shape of correctly-folded proteins (some diseases are caused by proteins that are not shaped correctly)
  • processing large quantities of bioinformatics data (such as the genomes in DNA)
  • categorizing the possible uses of drugs
  • detecting new information simply by examining blood

If you have large quantities of data, consider using deep learning

Meteorologists are working to predict thunderstorms by sending weather data through a specialized neural network. Astronomers may be able to get a handle on the vast quantities of images and data that are captured by modern telescopes. Hospitals are expected to be using deep learning for cancer detection. There are many other success stories, and new papers are being published every month.

For details on recent projects, read our blog post on deep learning applications.

Want to use Deep Learning?

If you think you could use deep learning, Microway’s experts will design and build a high-performance deep learning system for you. We’d love to talk with you.

Intel Xeon E5-2600 v4 “Broadwell” Processor Review
https://www.microway.com/hpc-tech-tips/intel-xeon-e5-2600-v4-broadwell-processor-review/
Thu, 31 Mar 2016

Today we begin shipping Intel’s new Xeon E5-2600 v4 processors. They provide more CPU cores, more cache, faster memory access and more efficient operation. These are based upon the Intel microarchitecture code-named “Broadwell” – we expect them to be the HPC processors of choice.

Important changes in Xeon E5-2600 v4 include:

  • Up to 22 processor cores per CPU
  • Support for DDR4 memory speeds up to 2400MHz
  • Faster Floating Point Instruction performance
  • Improved parallelism in scheduling micro-operations
  • Improved performance for large data sets

Move faster with Xeon E5-2600 v4

Expect these new processors to be more nimble than their predecessors. A variety of microarchitecture improvements have been added to increase parallelism, speed up processing time, and strip out inefficiencies from previous models. Broadwell reduces the time to complete a multiplication by 40% (division operations also complete more quickly). Each core’s ability to optimize instruction ordering has been improved by ~6%. The tables which manage on-die L2 cache have been expanded to speed up memory operations. Several CPU instruction latencies have been reduced. Overall, Intel expects these new CPUs to complete at least 5% more instructions on every clock cycle.

For complete details, please see our Detailed Analysis of the Intel Xeon E5-2600v4 “Broadwell-EP” Processors

Transitioning from “Haswell” E5-2600 v3 Series Xeons

Because the new “Broadwell” CPUs are socket-compatible with the previous-generation “Haswell” CPUs, the upgrade path is simple. All existing platforms that support v3 CPUs can also accept v4 CPUs. This also simplifies the choice for those considering a new system: the new CPUs use existing, proven platforms. There’s little risk in selecting the latest and highest-performance components. Those who are adding to existing HPC clusters will find they have a choice: spend the same for increased performance or spend less for the same performance. Here is a comparison of the older generation with this new generation:

[Chart: Xeon E5-2600v4 vs Xeon E5-2600v3 theoretical peak performance when using FMA3 and AVX instructions]

Get more for less – improved cost-effectiveness

Because each CPU core offers increased performance, and many models offer a higher core count, a lower-end CPU model can match performance with many of the older CPU models. Here are a few comparisons of note:

  • Xeon E5-2630v4 offers performance equivalent to the E5-2640v3 (and can even challenge the E5-2650v3)
  • Xeon E5-2640v4 matches the E5-2650v3 in nearly every case
  • Xeon E5-2650v4 matches the E5-2660v3 in nearly every case (and challenges the E5-2670v3)
  • Xeon E5-2660v4 will beat the E5-2670v3 and E5-2680v3 on well-parallelized applications
  • Xeon E5-2680v4 and Xeon E5-2690v4 best almost every E5-2600v3 CPU

Notable adjustments

Note that the E5-2670 CPU model has been removed from the line-up. This simplifies choice and did not come as a surprise to us: the majority of our customers had been selecting the E5-2680 and E5-2690 over the E5-2670. As noted above, the E5-2650 v4 or E5-2660 v4 can easily stand in for the older E5-2670 v3.

The E5-2623 CPU model has been modified in such a way that it isn’t ideal for the same workloads. Previously, it was a relatively high-clock-speed model available at a low price. However, the base clock speed has been adjusted downwards by 18%.

Next Steps – Putting Xeon E5-2600 v4 into Production

All of our Xeon workstations, servers & clusters are immediately available with these new CPUs. They are socket-compatible with all Xeon E5-2600 v3 platforms, so your existing systems can also be upgraded.

Intel Xeon E5-4600v3 “Haswell” 4-socket CPU Review
https://www.microway.com/hpc-tech-tips/intel-xeon-e5-4600v3-cpu-review/
Mon, 01 Jun 2015

Intel has launched new 4-socket Xeon E5-4600v3 CPUs. They are the perfect choice for “just beyond dual socket” system scaling. Leverage them for larger memory capacity, faster memory bandwidth, and higher core-count when you aren’t ready for a multi-system purchase.

Here are a few of the main technical improvements:

  • DDR4-2133 memory support, for increased memory bandwidth
  • Up to 18 cores per socket, faster QPI links up to 9.6GT/sec between sockets
  • Up to 48 DIMMs per server, for a maximum of 3TB memory
  • Haswell core microarchitecture with new instructions

Why pick a 4-socket Xeon E5-4600v3 CPU over a 2 socket solution?

Increased memory space vs 2 socket

Dual socket systems max out at 512GB affordably (1TB at cost); however, many HPC users have models that outgrow that memory space. Xeon E5-4600v3 systems double the DIMM count for up to 1.5TB affordably (3TB at higher cost).

For applications like ANSYS, COMSOL, and other CAE, multiphysics, and CFD suites, this can be a game changer. Traditionally, achieving these types of memory capacities required large multi-node cluster installations. Running simulations across such a cluster almost always requires more effort. The Xeon E5-4600v3 permits larger models to run on a single system with a familiar single OS instance. Don’t underestimate the power of ease-of-use.

Increased core count vs 2 socket

Hand-in-hand with the memory space comes core count. What good is loading up big models if you can’t scale compute throughput to run the simulations? The Xeon E5-4600v3 CPUs mean systems deliver up to 72 cores. Executing at that scale means a faster time to solution for you and more work accomplished.

Increased aggregate memory bandwidth

One overlooked aspect of 4P systems is superior memory bandwidth. Intel integrates the same memory controller found in the Xeon E5-2600v3 CPUs into each Xeon E5-4600v3 socket. However, there are twice as many CPUs in each system: the net result is 2X the aggregate memory bandwidth per system.

Increased memory bandwidth per core (by selecting 4 sockets but fewer cores per socket)

Users might be concerned about memory bandwidth per CPU core. We find that CFD and multiphysics applications are especially sensitive. But a 4-socket system presents unique opportunities: you may select fewer cores per socket while achieving the same core count.

If you select smartly, you will have 2X the memory bandwidth per core available in your system vs. a 2 socket solution. This strategy can also be used to maximize throughput for a software license with a hard core count ceiling.
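
A quick worked version of this strategy, reaching 48 total cores two different ways. The ~68GB/sec per-socket figure is the DDR4-2133 peak discussed below; exact per-core numbers will vary with DIMM population and access pattern.

```c
/* Same core count, different socket counts: bandwidth per core doubles */
#include <stdio.h>

int main(void) {
    double per_socket_bw = 68.0;   /* GB/sec peak, 4 channels of DDR4-2133 */
    int total_cores = 48;

    for (int sockets = 2; sockets <= 4; sockets += 2) {
        double bw = sockets * per_socket_bw;
        printf("%d sockets x %d cores: %.0f GB/sec total, %.2f GB/sec per core\n",
               sockets, total_cores / sockets, bw, bw / total_cores);
    }
    return 0;
}
```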

Detailed Technical Improvements

You’ve heard the why, but the nuts and bolts generation-to-generation improvements matter too. Let’s review in detail:

DDR4-2133 memory support – bandwidth and efficiency

Memory bandwidth is critical for HPC users. CFD, CAE/simulation, life-sciences and custom coded applications benefit most. With the new CPUs, you’ll see the following improvements over Xeon E5-4600v2:

  • Entry-level “Basic” CPU operates memory at 1600MHz (increase of 20%)
  • Mid-level “Standard” CPUs now operate memory at 1866MHz (increase of 16%)
  • Higher-end “Advanced,” “High Core Count,” & “Frequency Optimized” CPUs now support up to 4 DIMMs per socket at 2133MHz (increase of 14%), or 8 DIMMs per socket with LR-DIMMs

The increase in memory clocks means Xeon E5-4600v3 delivers more memory bandwidth per socket, up to 68GB/sec. Moreover, DDR4 DIMMs operate at 1.2v resulting in a substantial power-efficiency gain.

Increased core counts – more for your money

Throughout the stack, core counts are increasing:

  • Xeon E5-4610v3 and E5-4620v3: 10 cores per socket, a 25% core count increase over the previous generation
  • Xeon E5-4640v3, E5-4650v3: 12 cores per socket, a 50% core count increase over the previous generation
  • E5-4669v3: 18 cores per socket, a 33% core count increase over the previous generation
  • New E5-4660v3 SKU delivers 14 cores per socket with a reasonable 120W TDP

Increased core counts mean deploying larger jobs, scheduling more HPC users on the same system, and deploying more virtual machines. They also help increase the aggregate throughput of your systems. You can do far more work with Xeon E5-4600v3.

Memory latency and DIMM size

DDR4 doesn’t just mean faster clocks – it also brings with it support for fewer compromises and larger DIMM sizes. 32GB DIMMs are now available as registered as well as load reduced (32GB DDR4-2133 RDIMMs vs. 32GB DDR4-2133 LRDIMMs) modules. The shift to a traditional register in an RDIMM from a specialty buffer in an LRDIMM means a substantial latency decrease.

Advances in manufacturing for DDR4 also mean larger DIMM sizes. 64GB LRDIMMs are now being manufactured to help support that outstanding 3TB memory capacity.

Haswell microarchitecture and AVX2

AVX2 is an advanced CPU instruction set that debuted in the Haswell architecture and has shown strong benefits:

  • New floating point FMA, with up to 2X the FLOPS per core (16 FLOPS/clock)
  • 256-bit wide integer vector instructions

These new instructions are extremely consequential. We encourage you to learn more about these improvements, and how to compile for the new instructions, with our post on AVX2 Optimization.
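
As a small illustration of the FMA instructions described above: the intrinsic below fuses a multiply and an add across eight single-precision lanes in one instruction, which is exactly where the “2X the FLOPS per core” figure comes from.

```c
/* fma.c -- one fused multiply-add across 8 floats.
 * Compile with: gcc -O2 -mavx2 -mfma fma.c -o fma */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);
    __m256 r = _mm256_fmadd_ps(a, b, c);   /* r = a*b + c in all 8 lanes */

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%f\n", out[0]);                /* prints 7.000000 */
    return 0;
}
```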

Intel Xeon E5-4600v3 Series Specifications

Model      Frequency  Frequency (AVX)  Turbo Boost  Core Count  L3 Cache  QPI Speed  Memory Speed  TDP (Watts)
E5-4669v3  2.10 GHz   1.80 GHz         2.90 GHz     18          45MB      9.6 GT/s   2133 MHz      135W
E5-4667v3  2.00 GHz   1.70 GHz         2.90 GHz     16          40MB      9.6 GT/s   2133 MHz      135W
E5-4660v3  2.10 GHz   1.80 GHz         2.90 GHz     14          35MB      9.6 GT/s   2133 MHz      120W
E5-4650v3  2.10 GHz   1.80 GHz         2.80 GHz     12          30MB      9.6 GT/s   2133 MHz      105W
E5-4640v3  1.90 GHz   1.60 GHz         2.60 GHz     12          30MB      8.0 GT/s   1866 MHz      105W
E5-4620v3  2.00 GHz   1.70 GHz         2.60 GHz     10          25MB      8.0 GT/s   1866 MHz      105W
E5-4610v3  1.70 GHz   1.70 GHz         None         10          25MB      6.4 GT/s   1600 MHz      105W

HPC groups do not typically choose Intel’s “Basic” models (e.g., E5-4610v3)

Intel Xeon E5-4600v3 Frequency Optimized SKUs

Model      Frequency  Frequency (AVX)  Turbo Boost  Core Count  L3 Cache  QPI Speed  Memory Speed  TDP (Watts)
E5-4655v3  2.90 GHz   2.60 GHz         3.20 GHz     6           30MB      9.6 GT/s   2133 MHz      135W
E5-4627v3  2.60 GHz   2.30 GHz         3.20 GHz     10          25MB      9.6 GT/s   2133 MHz      135W

The above SKUs offer better memory bandwidth per core

Next steps

We think the improvements in the Xeon E5-4600v3 CPUs make them a unique alternative to far more complicated HPC installations and a worthwhile upgrade from their predecessors. Want to learn more about the Xeon E5-4600v3 CPUs? Talk with an expert and assess how they might fit your HPC needs.

Introduction to RAID for HPC Customers
https://www.microway.com/hpc-tech-tips/introduction-raid-hpc-customers/
Mon, 06 Apr 2015

There is a lot of material available on RAID, describing the technologies, the options, and the pitfalls.  However, there isn’t a great deal on RAID from an HPC perspective.  We’d like to provide an introduction to RAID, clear up a few misconceptions, share with you some best practices, and explain what sort of configurations we recommend for different use cases.

What is RAID?

Originally known as Redundant Array of Inexpensive Disks, the acronym is now more commonly considered to stand for Redundant Array of Independent Disks.  The main benefits to RAID are improved disk read/write performance, increased redundancy, and the ability to increase logical volume sizes.

RAID is able to perform these functions primarily through striping, mirroring, and parity.  Striping is when files are broken down into segments, which are then placed on different drives.  Because the files are spread across multiple drives that are running in parallel, performance is improved.  Mirroring is when data is duplicated on the fly across drives.  Parity within the context of RAID refers to when data redundancy is distributed across all drives so that when one or more (depending on the RAID level) drives fail, the data can be reconstructed from the remaining drives.
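
A toy sketch of the parity idea in C: the parity block is simply the XOR of the data blocks, so any single missing block can be rebuilt by XORing the parity with the survivors. (Real RAID 5 rotates parity across the drives, and RAID 6 adds a second, differently computed syndrome.)

```c
/* XOR parity: lose any one block, rebuild it from the rest */
#include <stdio.h>

int main(void) {
    unsigned char d0 = 0x4D, d1 = 0x57, d2 = 0x21;  /* three data blocks */
    unsigned char p  = d0 ^ d1 ^ d2;                /* the parity block */

    unsigned char rebuilt = p ^ d0 ^ d2;            /* "drive 1" has failed */
    printf("rebuilt 0x%02X, original 0x%02X\n", rebuilt, d1);
    return 0;
}
```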

RAID comes in a variety of flavors, with the most common being the following.

[Diagram: RAID 0]
RAID 0 (striping)

Redundancy: none
Number of drives providing space: all drives

The riskiest RAID setup, RAID 0 stripes blocks of data across drives and provides no redundancy.  If one drive fails, all data is lost.  The benefits of RAID 0 are that you get increased drive performance and no storage space taken up by data parity.  Due to the real risk of total data loss, though, we normally do not recommend RAID 0 except for certain use cases. A fast, temporary scratch volume is one of these exceptions.

[Diagram: RAID 1]
RAID 1 (mirroring)

Redundancy: typically 1 drive
Number of drives providing space: typically n-1

RAID 1 is in many ways the opposite of RAID 0. Data is mirrored on all disks in the RAID array so if one or more drives fail, as long as at least one half of the mirror is functioning, the data remains intact.  The main downside is that you realize only half the spindle count in usable space.  Some performance gains can be realized during drive read, but not during write.  We usually recommend that operating systems be housed on 2 drives in RAID 1.

[Diagram: RAID 5]
RAID 5 (single parity)

Redundancy: 1 drive
Number of drives providing space: n-1

Next we have the very common but very misunderstood RAID 5.  In this case, data blocks are striped across disks like in RAID 0, but so are parity blocks.  These parity blocks allow the RAID array to still function in the event of a single drive failure. This redundancy comes at the cost of losing a single drive’s worth of space.  Due to the striping similarity that it shares with RAID 0, RAID 5 enjoys increased read/write performance.

The misconception concerning RAID 5 is that many people think single-drive parity is a robust safeguard against data loss. In fact, single-drive parity becomes very risky once a drive has failed, while you wait for or begin the array rebuild. The increased I/O activity of a rebuild is exactly the type of situation likely to cause a second drive failure, at which point no protection remains.

[Diagram: RAID 6]
RAID 6 (double parity)

Redundancy: 2 drives
Number of drives providing space: n-2

RAID 6 builds upon RAID 5 with a second parity block, tolerating two drive failures instead of one.  Generally we find that losing a second disk’s worth of capacity is a fair tradeoff for the increased redundancy and we often recommend RAID 6 for larger arrays still requiring strong performance.

[Diagram: RAID 10]
RAID 10 (striped mirroring)

Redundancy: 1 drive per volume in the span
Number of drives providing space: n/2

As you can see from the diagram, RAID 10 is a combination of RAID 1 mirroring and RAID 0 striping.  It is, in essence, a stripe of mirrors, so creating a RAID 10 is possible with even drive-counts greater than 2 (i.e. 4, 6, 8, etc.).  RAID 10 can offer a very reasonable balance of performance and redundancy, with the primary concern for some users being the reduced storage space.  Since each volume in the RAID 0 span is made up of RAID 1 mirrors, a full half of the drives are used for redundancy.  Also, while the risk of data loss is greatly reduced, it is still possible.  If multiple drives within one volume fail at the same time, information could be lost.  In practice, though, this is uncommon, and RAID 10 is normally considered to be an extremely secure form of RAID.

RAID 60 (striped double parity)

Redundancy: 2 drives per volume in the span
Number of drives providing space: n-2 * number of spans

[Diagram: RAID 60]

RAID 60 is a less common configuration for many consumers but sees a lot of use in enterprise and HPC environments.  Similar in concept to RAID 10, RAID 60 is a group of RAID 6 volumes striped into a single RAID 0 span.  Unlike most common RAID 10 configurations, where each pair of drives may add another volume to the RAID 10 span, administrators have control over the number of RAID 6 volumes within the RAID 60 span.  For example, 24 drives can be arranged in four different RAID 60 configurations:

[Table: RAID 60 configurations for 24 drives]

Valid configurations are those that meet the following two criteria: 1) the number of drives must be evenly divisible by the number of volumes, and 2) each volume must contain no fewer than four drives (the minimum required for a RAID 6 volume).  As the number of volumes increases, so do the redundancy and performance, but so does the amount of wasted space.  Each usage case is unique, but the rule of thumb for most users is between 8 and 12 drives per volume.  As the chart indicates, the higher the number of stripe volumes, the less usable capacity you have.  More stripe volumes do improve performance, however.  Note that the example with six stripe volumes has the same capacity as a RAID 10 volume and thus would be better served by such a configuration.
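
Those two rules are easy to check programmatically. A small sketch that enumerates the valid layouts for a 24-drive RAID 60 span (reproducing the four configurations described above):

```c
/* Enumerate valid RAID 60 layouts for a fixed drive count */
#include <stdio.h>

int main(void) {
    int drives = 24;
    for (int volumes = 2; volumes <= drives / 4; volumes++) {
        if (drives % volumes != 0) continue;   /* rule 1: even split */
        int per_vol = drives / volumes;
        if (per_vol < 4) continue;             /* rule 2: >= 4 drives per RAID 6 */
        printf("%d volumes x %d drives -> %d drives of usable capacity\n",
               volumes, per_vol, volumes * (per_vol - 2));
    }
    return 0;
}
```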

RAID 50 (striped single parity)

Redundancy: 1 drive per volume in the span
Number of drives providing space: n-1 * number of spans

It’s worth mentioning that RAID 50 configurations have very similar structures to RAID 60, only with striped RAID 5 volumes instead of RAID 6.  RAID 50 does improve slightly on RAID 5’s redundancy characteristics, but we still don’t always recommend it. Consequently, RAID 60 configurations are far more common for our customers.

Conclusion

If you prefer a visual review of these concepts, the Intel RAID group has produced a strong video.

There are other RAID configurations, but those listed above are the most common.  If you have other questions about storage or other HPC topics, be sure to contact us below.

Intel Xeon E5-2600 v3 “Haswell” Processor Review
https://www.microway.com/hardware/intel-xeon-e5-2600-v3-haswell-processor-review/
Mon, 08 Sep 2014

Update:

As of March 31, 2016, we recommend version four of these Intel Xeon CPUs. Please see our new post: Intel Xeon E5-2600 v4 “Broadwell” Processor Review.

Intel has launched brand new Xeon E5-2600 v3 CPUs with groundbreaking new features. These CPUs build upon the leading performance of their predecessors with a more robust microarchitecture, faster memory, wider buses, and increased core counts and clock speeds. The result is dramatically improved performance for HPC.

Important changes available in E5-2600 v3 “Haswell” include:

  • Support for brand new DDR4-2133 memory
  • Up to 18 processor cores per socket (with SKU options ranging from 4 to 18 cores)
  • Improved AVX 2.0 Instructions with:
    • New floating point FMA, with up to 2X the FLOPS per core (16 FLOPS/clock)
    • 256-bit wide integer vector instructions
  • A revised C610 Series Chipset delivering substantially improved I/O for every server (SATA, USB 3.0)
  • Increased L1, L2 cache bandwidth and faster QPI links
  • Slightly tweaked “Grantley” socket (Socket R3) and platforms

DDR4: Memory Architecture for the Present and Future

Xeon E5-2600 v3 is one of the first server CPUs to support DDR4 memory. DDR4 is big news: it takes advantage of a new design with fewer chips on each module, lower voltages, and superior power efficiency (20% less power per module). Apart from the benefits today, these changes ensure DDR4 DIMMs are primed to accept ever higher chip densities and clock speeds that exceed those of today’s DDR4-2133 modules. Physical characteristics of the DIMMs themselves have changed too: a slight curvature for easier seating and more pins on each module.

Memory Performance

On top of the new JEDEC standard for the DIMMs themselves, Intel has increased the memory speed stepping for all Xeon E5-2600 v3 CPU SKUs. The result is a 13-20% increase in memory performance:

  • Entry-level “Basic” CPUs now support 1600MHz memory (a 20% increase)
  • Mid-level “Standard” CPUs now support 1866MHz memory (a 16% increase)
  • Higher-end “Advanced,” “High Core Count,” & “Frequency Optimized” CPUs now support up to 4 DIMMs per socket at 2133MHz (a 14% increase)

Finally, it’s worth noting that configurations which populate 3 DIMMs per channel (previously up to a 40% performance penalty) or use LR-DIMMs (previously a 14-40% penalty, depending on population) run at far higher frequencies on these new CPUs than on the prior generation.

In short, DDR4 means even higher memory bandwidth today – a critical driver of HPC performance. It pairs nicely with the increased core counts of the new CPUs.
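
As a back-of-the-envelope check (our own arithmetic, assuming the standard Haswell-EP figure of four DDR4 channels per socket and the 8-byte width of a DDR transfer), peak per-socket memory bandwidth works out as follows:

```python
# Theoretical peak memory bandwidth per socket, in GB/s (decimal).
def peak_mem_bandwidth_gbs(channels: int, megatransfers: int, bytes_per_transfer: int = 8) -> float:
    return channels * megatransfers * bytes_per_transfer / 1e3  # MT/s * bytes = MB/s

print(peak_mem_bandwidth_gbs(4, 2133))  # DDR4-2133 on Haswell-EP -> ~68.3 GB/s
print(peak_mem_bandwidth_gbs(4, 1866))  # DDR3-1866 on the prior generation -> ~59.7 GB/s
```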

New Instructions – AVX 2.0

One of the primary drivers of the Xeon E5-2600 CPUs’ robust performance has been wider instructions, termed AVX (Advanced Vector Extensions). Intel has made its largest improvement to AVX in three years with Haswell’s addition of AVX 2.0:

256-bit integer instructions

Sandy Bridge and Ivy Bridge CPUs delivered class-leading floating-point performance due to a 256-bit floating-point unit in each core. This unit was twice as wide as that in previous Xeon CPUs and enabled twice the FLOPS of competing CPUs.

The integer unit remained at 128 bits wide (identical in Sandy Bridge and Ivy Bridge), but integer performance was buttressed by comparatively high clock speeds and Turbo Boost features.

With Xeon E5-2600 v3, Intel has widened the integer unit to the same 256 bits. The result is faster performance on many integer codes, even on CPUs with lower clock speeds. For example, the integer performance of the 12-core, 2.7GHz Ivy Bridge E5-2697 v2 lands roughly between that of two Haswell processors: the E5-2660 v3 (10-core, 2.6GHz) and the E5-2670 v3 (12-core, 2.3GHz).

FMA

AVX 2.0 also features a new fused multiply-add (FMA) instruction. For codes that perform multiply and add operations in short succession, FMA halves the number of instructions required, delivering up to 2X the FLOPS in the regions of code that leverage it. This proves extremely consequential for math and science algorithms. Since floating-point performance is most important to our customers, we discuss these improvements in more detail below.
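
To see why FMA matters so much for the LINPACK results discussed below, here is a hedged sketch of the peak-FLOPS arithmetic. It is our own helper (not Intel’s methodology), assuming 8 double-precision FLOPS per clock per core with AVX and 16 with AVX 2.0 FMA; real clock speeds drop under heavy AVX load, so treat these as upper bounds:

```python
# Theoretical peak double-precision GFLOPS: cores * GHz * FLOPS-per-clock.
def peak_gflops(cores: int, base_ghz: float, flops_per_clock: int) -> float:
    return cores * base_ghz * flops_per_clock

ivy = peak_gflops(12, 2.7, 8)       # Xeon E5-2697 v2 (AVX)        -> 259.2 GFLOPS
haswell = peak_gflops(18, 2.3, 16)  # Xeon E5-2699 v3 (AVX2 + FMA) -> 662.4 GFLOPS
print(f"peak ratio: {haswell / ivy:.2f}x")  # ~2.6x peak; measured LINPACK gains are smaller
```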

Performance – Faster in Nearly Every Metric

Much like with the Sandy Bridge generation of Xeons, Intel has plugged in a new architecture, improved memory performance, and increased core counts and clock speeds all at once.

Excluding the new instructions, users should generally expect at least a 10% increase in per-core performance. Coupled with the memory changes and new instructions, this means dramatic gains (SPEC CPU2006 benchmarks):

  • Xeon E5-2620 v2 to v3: 18% performance improvement
  • Xeon E5-2630 – E5-2697 v2 to E5-2630 – E5-2697 v3: between 22% and 29% performance improvement ¹
  • Xeon E5-2697 v2 to Xeon E5-2698 v3/E5-2699 v3: between 27% and 32% performance improvement ²

¹ Transitioning from the same-numbered v2 SKU to its v3 SKU (e.g., Xeon E5-2640 v2 to Xeon E5-2640 v3: 2.0GHz vs. 2.6GHz) often bundles an increase in core count, clock speed, memory performance, and the architecture improvements. The performance increase stated represents the net gain of these factors. DDR4 memory may result in a higher system cost.

² These two new high-end Haswell processors have no equivalent Ivy Bridge SKU and thus enjoy the largest performance deltas.

Theoretical Performance and LINPACK

Below is a chart of the theoretical peak performance (FLOPS) of the new Haswell-EP (Xeon E5-2600 v3) CPUs when using the new instructions. Note that the Haswell E5-2630 v3 is roughly equivalent to the flagship Ivy Bridge E5-2697 v2 (whose performance suffers without support for the new instructions).

[Chart: Theoretical peak performance of Xeon E5-2600 v3 vs Xeon E5-2600 v2 when using FMA3 and AVX instructions]

Keep in mind, however, that these are peak theoretical numbers; depending upon how much your applications can take advantage of FMA, the real-world gains may be far lower (see our Detailed Specifications). The 20%-30% increases mentioned earlier come from the SPEC CPU2006 benchmarks, which execute a suite of real-world applications.

Another dramatic comparison is the Xeon E5-2697 v2 (2.7GHz, 12-core) against the new Xeon E5-2699 v3 (2.3GHz, 18-core) on LINPACK. The new model represents a 91% increase in performance. The main reason for this substantial improvement is the new AVX 2.0 instruction set, specifically FMA; the increase in core count also contributes.

Should you prefer the most apples-to-apples architecture comparison, the Xeon E5-2697 v2 (2.7GHz, 12-core) against the Xeon E5-2690 v3 (2.6GHz, 12-core) shows a 54% increase in LINPACK performance.

Transitioning from “Ivy Bridge” E5-2600 v2 Series Xeons

Xeon E5-2600 v3 and Xeon E5-2600 v2 CPUs do not use the same CPU socket, and DDR4 does come with a cost premium. Some large installations may still find a price/performance argument for the Ivy Bridge CPUs, and a few platforms (e.g., complex Phi- & GPU-accelerated servers) will take time to transition to the new CPU socket.

However, end users who are willing to invest slightly more will find attractive new SKUs to leverage in their clusters, servers, and workstations. All new CPUs offer faster memory speeds and QPI transfers. Applications which effectively leverage the new FMA instructions should be able to achieve higher performance than flagship v2 CPUs using almost any of the v3 CPUs.

Comparisons of note (providing increased value for your dollar):

  • Xeon E5-2640 v2 transition to Xeon E5-2630 v3: same core count, faster clock speed, faster memory, and a lower price
  • Xeon E5-2650 v2 transition to Xeon E5-2640 v3: identical core count, clock speed, and Turbo Boost speed, yet at a lower cost
  • Xeon E5-2695 v2 and E5-2697 v2 transition to Xeon E5-2690 v3: similar base and turbo speeds at a lower price
  • Xeon E5-2695 v2 and E5-2697 v2 transition to Xeon E5-2683 v3: for well-threaded applications able to accept a lower clock speed, the two extra cores of the E5-2683 v3 will outperform at a much lower price

Nearly all processor transitions come at similar or lower costs on the CPU-side. Customers may choose to apply the savings towards their DDR4 memory capacity.

Further Grantley Platform Improvements

C610 Series Chipset

Some end-users found the earlier C600 chipset needed to be supplemented to meet their needs. Intel has added features that address many of these situations:

  1. SATA: Increase from 2 SATA3 + 4 SATA2 to at least 6 SATA3 ports
  2. USB: USB 3.0 support now native to the chipset, rather than board manufacturers adding a supplemental chip
  3. Ethernet: More common deployment of RJ45-based 10GigE; a new 40GigE controller (Fortville)

QPI Links

Intel’s Quick Path Interconnect link between the two CPU sockets now features faster speeds for every SKU:

  • Entry-level “Basic” CPUs at 7.2 GT/sec
  • Mid-level “Standard” CPUs at 8.0 GT/sec
  • Higher-end “Advanced,” “High Core Count,” & “Frequency Optimized” CPUs at 9.6 GT/sec

QPI allows each CPU rapid access to memory attached to the other (non-local) socket. A rough bandwidth sketch follows.
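
For a rough sense of scale (our own arithmetic, assuming the standard QPI width of 16 data bits, i.e. 2 bytes, per transfer in each direction), per-link bandwidth is simply the transfer rate times two:

```python
# Approximate QPI bandwidth per link: 2 bytes per transfer per direction.
for gt_per_sec in (7.2, 8.0, 9.6):
    print(f"{gt_per_sec} GT/s -> {gt_per_sec * 2:.1f} GB/s per direction")
# Each Xeon E5-2600 v3 CPU has two QPI links, doubling the socket-to-socket aggregate.
```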

Next Steps – Putting Xeon E5-2600 v3 into Production

As always, please contact an HPC expert if you would like to discuss in further detail. You may also wish to review our products which leverage these new Xeon processors.

For more analysis of the Xeon E5-2600 v3 processor series, please read:

Detailed Specifications of the Intel Xeon E5-2600v3 “Haswell-EP” Processors

Intel’s Xeon E5 Resource Page

Summary of Intel Xeon E5-2600 v3 Series Specifications

Model       | Stock Frequency | Max Turbo Boost | Core Count | Memory Speed | L3 Cache | QPI Speed | TDP
E5-2699 v3  | 2.30 GHz        | 3.60 GHz        | 18         | 2133 MHz     | 45MB     | 9.6 GT/s  | 145W
E5-2698 v3  | 2.30 GHz        | 3.60 GHz        | 16         | 2133 MHz     | 40MB     | 9.6 GT/s  | 135W
E5-2697 v3  | 2.60 GHz        | 3.60 GHz        | 14         | 2133 MHz     | 35MB     | 9.6 GT/s  | 145W
E5-2695 v3  | 2.30 GHz        | 3.30 GHz        | 14         | 2133 MHz     | 35MB     | 9.6 GT/s  | 120W
E5-2683 v3  | 2.00 GHz        | 3.00 GHz        | 14         | 2133 MHz     | 35MB     | 9.6 GT/s  | 120W
E5-2690 v3  | 2.60 GHz        | 3.50 GHz        | 12         | 2133 MHz     | 30MB     | 9.6 GT/s  | 135W
E5-2680 v3  | 2.50 GHz        | 3.30 GHz        | 12         | 2133 MHz     | 30MB     | 9.6 GT/s  | 120W
E5-2670 v3  | 2.30 GHz        | 3.10 GHz        | 12         | 2133 MHz     | 30MB     | 9.6 GT/s  | 120W
E5-2687W v3 | 3.10 GHz        | 3.50 GHz        | 10         | 2133 MHz     | 25MB     | 9.6 GT/s  | 160W
E5-2660 v3  | 2.60 GHz        | 3.30 GHz        | 10         | 2133 MHz     | 25MB     | 9.6 GT/s  | 105W
E5-2650 v3  | 2.30 GHz        | 3.00 GHz        | 10         | 2133 MHz     | 25MB     | 9.6 GT/s  | 105W
E5-2667 v3  | 3.20 GHz        | 3.60 GHz        | 8          | 2133 MHz     | 20MB     | 9.6 GT/s  | 135W
E5-2640 v3  | 2.60 GHz        | 3.40 GHz        | 8          | 1866 MHz     | 20MB     | 8 GT/s    | 90W
E5-2630 v3  | 2.40 GHz        | 3.20 GHz        | 8          | 1866 MHz     | 20MB     | 8 GT/s    | 85W
E5-2643 v3  | 3.40 GHz        | 3.70 GHz        | 6          | 2133 MHz     | 20MB     | 9.6 GT/s  | 135W
E5-2620 v3  | 2.40 GHz        | 3.20 GHz        | 6          | 1866 MHz     | 15MB     | 8 GT/s    | 85W
E5-2637 v3  | 3.50 GHz        | 3.70 GHz        | 4          | 2133 MHz     | 15MB     | 9.6 GT/s  | 135W
E5-2623 v3  | 3.00 GHz        | 3.50 GHz        | 4          | 1866 MHz     | 10MB     | 8 GT/s    | 105W

HPC groups do not typically choose Intel’s “Basic” and “Low Power” models, so those SKUs are not shown.

The post Intel Xeon E5-2600 v3 “Haswell” Processor Review appeared first on Microway.

PCI-Express Root Complex Confusion? https://www.microway.com/hpc-tech-tips/pci-express-root-complex-confusion/ Fri, 02 May 2014 21:12:16 +0000

I’ve had several customers comment to me that it’s difficult to find someone that can speak with them intelligently about PCI-E root complex questions. And yet, it’s of vital importance when considering multi-CPU systems that have various PCI-Express devices (most often GPUs or coprocessors).

First, please feel free to contact one of Microway’s experts. We’d be happy to work with you on your project to ensure your design will function correctly (both in theory and in practice). We also diagram most GPU platforms we sell, as well as explain their advantages, in our GPU Solutions Guide.

It is tempting to simply count the PCI-Express slots in the systems you’re evaluating and assume they’re all the same. Unfortunately, it’s not so simple, because each CPU provides only a fixed number of PCI-Express lanes (and thus a fixed amount of bandwidth). Additionally, certain high-performance features, such as NVIDIA’s GPU Direct technology, require that all components be attached to the same PCI-Express root complex. Servers and workstations with multiple processors have multiple PCI-Express root complexes. We dive deeply into these issues in our post about Common PCI-Express Myths.

To illustrate, let’s look at the PCI-Express design of Microway’s latest 8-GPU Octoputer server:
[Diagram: Microway’s OctoPuter PCI-E tree]

It’s a bit difficult to parse, but the important points are:

  • Two CPUs are shown in blue at the bottom of the diagram. Each CPU contains one PCI-Express tree.
  • Each CPU provides 32 lanes of PCI-Express generation 3.0 (split as two x16 connections).
  • PCI-Express switches (the purple boxes labeled PEX8747) further expand each CPU’s tree out to four x16 PCI-Express gen 3.0 slots.
  • The remaining 8 lanes of PCI-E from each CPU (along with 4 lanes from the Southbridge chipset) provide connections for the remaining PCI-E slots. Although these slots are not compatible with accelerator cards, they are excellent for networking and/or storage cards.

    Having one additional x8 slot on each CPU allows the accelerators to communicate directly with storage or high-speed networks without leaving the PCI-E root complex. For technologies such as GPU Direct, this means rapid RDMA transfers between the GPUs and the network (which can significantly improve performance).

In total, you end up with eight x16 slots and two x8 slots evenly divided between two PCI-Express root complexes. The final x4 slot can be used for low-end devices.
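
When weighing those x16, x8, and x4 slots, a quick sketch of PCI-E gen 3.0 bandwidth helps. This is our own helper, assuming the standard gen 3.0 figures of 8 GT/s per lane and 128b/130b encoding:

```python
# Approximate PCI-E gen 3.0 bandwidth per direction: 8 GT/s per lane, 128b/130b encoding.
def pcie3_gb_per_sec(lanes: int) -> float:
    return lanes * 8 * (128 / 130) / 8  # Gb/s -> GB/s

for lanes in (16, 8, 4):
    print(f"x{lanes}: {pcie3_gb_per_sec(lanes):.2f} GB/s per direction")
# x16 -> ~15.75 GB/s per direction, which is why the accelerators get the x16 slots.
```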

While the layout above may not be ideal for all projects, it performs well for many applications. We have a variety of other options available (including large amounts of devices on a single PCI-E root complex). We’d be happy to discuss further with you.

The post PCI-Express Root Complex Confusion? appeared first on Microway.

Intel Xeon E5-4600 v2 “Ivy Bridge” Processor Review https://www.microway.com/hpc-tech-tips/intel-xeon-e5-4600v2-ivy-bridge-processor-review/ Tue, 04 Mar 2014 15:13:49 +0000

Many within the HPC community have been eagerly awaiting the new Intel Xeon E5-4600 v2 CPUs. To those already familiar with the “Ivy Bridge” architecture in the Xeon E5-2600 v2 processors, many of the updated features of these 4-socket Xeon E5-4600 v2 CPUs should seem very familiar. Read on to learn the details.

Important changes available in the Xeon E5-4600 v2 “Ivy Bridge” CPUs include:

  • Up to 12 processor cores per socket (with options for 4, 6, 8, and 10 cores)
  • Support for DDR3 memory speeds up to 1866MHz
  • AVX has been extended to support F16C (16-bit Floating-Point conversion instructions) to accelerate data conversion between 16-bit and 32-bit floating point formats. These operations are of particular importance to graphics and image processing applications.
  • Intel APIC Virtualization (APICv) provides increased virtualization performance
  • Improved PCI-Express generation 3.0 support with superior compatibility and new features: atomics, x16 non-transparent bridge & quadrupled read buffers for point-to-point transfers

Intel Xeon E5-4600 v2 Series Specifications

Model       | Frequency | Turbo Boost | Core Count | Memory Speed | L3 Cache | QPI Speed | TDP
E5-4657L v2 | 2.40 GHz  | 2.90 GHz    | 12         | 1866 MHz     | 30MB     | 8 GT/s    | 115W
E5-4650 v2  | 2.40 GHz  | 2.90 GHz    | 10         | 1866 MHz     | 25MB     | 8 GT/s    | 95W
E5-4640 v2  | 2.20 GHz  | 2.70 GHz    | 10         | 1866 MHz     | 20MB     | 8 GT/s    | 95W
E5-4627 v2  | 3.30 GHz  | 3.60 GHz    | 8          | 1866 MHz     | 16MB     | 7.2 GT/s  | 130W
E5-4620 v2  | 2.60 GHz  | 3.00 GHz    | 8          | 1600 MHz     | 20MB     | 7.2 GT/s  | 95W
E5-4610 v2  | 2.30 GHz  | 2.70 GHz    | 8          | 1600 MHz     | 16MB     | 7.2 GT/s  | 95W

HPC groups do not typically choose Intel’s “Basic” and “Low Power” models, so those SKUs are not shown.

More for Your Dollar – Performance Uplift

With an increase in core count, clock speed and memory speed, HPC applications will achieve better performance on these new Xeons. Depending on the choice of SKU, users should expect to see 10% to 30% performance improvement for floating-point applications (model-to-model) without spending more. Even greater speedups are possible by upgrading to the new 12-core Xeon E5-4657L v2:

  • Xeon E5-4620 transition to Xeon E5-4657L v2: 63% performance improvement
  • Xeon E5-4640 transition to Xeon E5-4657L v2: 50% performance improvement
  • Xeon E5-4650 transition to Xeon E5-4657L v2: 33% performance improvement

More for Less – Switch SKUs without a Performance Penalty

Rather than spending the same amount for more performance, some users may prefer to spend less to achieve the same performance they’re seeing today. Given the microarchitecture improvements in “Ivy Bridge,” you’re still likely to come out at least a few percent ahead at the same core count and clock speed.

Replacing Old Servers & Clusters

If your systems are a few years old, you may be able to replace several with a single new server. The AVX instruction set, introduced with the previous generation of Xeons, provides a solid 2X performance improvement by increasing the width of the math units from 128-bits to 256-bits. Combined with other improvements in Xeon E5-4600 v2, you will be able to achieve the performance of older systems using just a single core from the “Ivy Bridge” architecture.

Transitioning from “Sandy Bridge” E5-4600 series Xeons

Given the increased core counts & higher memory speeds, lower-end Xeon E5-4600 v2 processors may replace older Xeon E5-4600 processors with improved aggregate performance.

Rather than increasing clock speeds for all “Ivy Bridge” SKUs, Intel has decided to offer some of the E5-4600 v2 processors with more cores but at a slightly slower clock speed. Specifically, the eight-core, 2.3GHz E5-4610 v2 (vs the six-core, 2.4GHz E5-4610) and the ten-core, 2.2GHz E5-4640 v2 (vs the eight-core, 2.4GHz E5-4640). Increased core counts, improved memory speeds, and Turbo Boost capabilities nearly always result in superior server performance.

Comparisons of note include:

  • Xeon E5-4610 v2 delivers additional value over Xeon E5-4610:  the new CPU has 33% more cores and faster memory at the same cost, but also a slower clock speed (a disadvantage only for poorly-threaded applications)
  • Xeon E5-4620 transitions to Xeon E5-4610 v2: same core count, but a faster clock speed and faster memory at a lower cost
  • Xeon E5-4640 v2 delivers additional value over E5-4640: 25% more cores and faster memory at the same cost, but also a slower clock speed (a disadvantage only for poorly-threaded applications)
  • Xeon E5-4650 transitions to Xeon E5-4627 v2: Same physical core count, but faster memory and clock speed at a much lower price point. Caveats include a slower QPI speed and no hyperthreading, the latter of which is of lesser importance to HPC.

Intel’s strategy with these CPUs makes sense, since four-socket systems tend to run software that takes advantage of higher core count more than higher clock speed.  Sacrificing 100MHz or 200MHz for two extra cores is almost always going to be a very favorable exchange.

Surprising Benchmark Performance Results

Despite some of the caveats mentioned above, the performance results achieved so far have been quite impressive. Benchmark numbers for the industry-standard floating-point SPEC fp_rate2006 suggest that even the modest Xeon E5-4610 v2 CPUs will stand up against the best of the dual-socket “Ivy Bridge” Xeon CPUs and the best of the previous-generation “Sandy Bridge” quad-socket CPUs.

[Chart: SPEC fp_rate2006 results for Xeon E5-4600 v2 compared with Xeon E5-4600 and E5-2600 v2]

Considering that a quad-socket server equipped with E5-4610 v2 CPUs is priced comparably to a dual-socket server with E5-2697 v2 (and considerably below a server with E5-4650 CPUs), we expect great success for this product line.

Greater Memory Bandwidth

Similar to what Intel did with the Xeon E5-2600 v2 series, memory performance is boosted across the board with Xeon E5-4600 v2 (Ivy Bridge):

  • Entry-level “Basic” CPUs now support 1333MHz memory
  • Mid-level “Standard” CPUs now support 1600MHz memory
  • Higher-end “Advanced”, “High Performance” & “Frequency Optimized” CPUs now support up to 4 DIMMs per socket at 1866MHz (in select configurations)

This 16-20% memory performance uplift for Xeon E5-4600 v2 is a critical performance boost for memory-intensive applications.
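
As a quick sanity check of that figure (our own arithmetic): memory bandwidth scales roughly linearly with DIMM speed, so the uplift per tier is just the ratio of new to old supported speeds:

```python
# Ratio of new to old supported DIMM speeds for the mid and upper tiers.
for old_mhz, new_mhz in ((1333, 1600), (1600, 1866)):
    print(f"{old_mhz} -> {new_mhz} MHz: +{(new_mhz / old_mhz - 1) * 100:.0f}%")
# Prints +20% and +17%, consistent with the quoted 16-20% uplift.
```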

Special Note, Xeon E5-4627 v2 for CFD, FEA, and Multiphysics

Many CFD, FEA, and Multiphysics applications prioritize clock speed (for their least-threaded regions), core count (for well-threaded solvers), and, above all, memory bandwidth. The Xeon E5-4627 v2 SKU pairs the memory performance and core count of a 4-socket system with a high base clock speed. Previously, customers had to sacrifice one for the other.

Microway thinks this will be a winning combination for users whose models exceed the memory capacity of a 2-socket system. We anticipate extended discussion of this SKU with users of these applications.

Conclusion

As always, please contact an HPC expert if you would like to discuss in further detail. Intel has produced an Intel Xeon E5-4600 v2 Product Brief that’s available on our Knowledge Base. You may also wish to review our products which leverage these new Xeon processors.

For more analysis of the Xeon E5-4600 v2 processor series, please read:
In-Depth Comparison of Intel Xeon E5-4600v2 “Ivy Bridge” Processors

The post Intel Xeon E5-4600 v2 “Ivy Bridge” Processor Review appeared first on Microway.
