HPC Archives - Microway | We Speak HPC & AI
https://www.microway.com/tag/hpc/

NVIDIA Tesla V100 Price Analysis
https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/
Wed, 09 May 2018

Now that NVIDIA has launched their new Tesla V100 32GB GPUs, the next questions from many customers are “What is the Tesla V100 Price?” “How does it compare to Tesla P100?” “How about Tesla V100 16GB?” and “Which GPU should I buy?”

Tesla V100 32GB GPUs are shipping in volume, and our full line of Tesla V100 GPU-accelerated systems are ready for the new GPUs. If you’re planning a new project, we’d be happy to help steer you towards the right choices.

Tesla V100 Price

The table below gives a quick breakdown of the Tesla V100 GPU price, performance and cost-effectiveness:

| Tesla GPU model | Price | Double-Precision Performance (FP64) | Dollars per TFLOPS | Deep Learning Performance (TensorFLOPS or 1/2 Precision) | Dollars per DL TFLOPS |
| Tesla V100 PCI-E 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7 TFLOPS | $1,523 ($1,637 for 32GB) | 112 TFLOPS | $95.21 ($102.30 for 32GB) |
| Tesla P100 PCI-E 16GB | $7,374* | 4.7 TFLOPS | $1,569 | 18.7 TFLOPS | $394.33 |
| Tesla V100 SXM 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7.8 TFLOPS | $1,367 ($1,469 for 32GB) | 125 TFLOPS | $85.31 ($91.66 for 32GB) |
| Tesla P100 SXM2 16GB | $9,428* | 5.3 TFLOPS | $1,779 | 21.2 TFLOPS | $444.72 |

* single-unit list price before any applicable discounts (ex: EDU, volume)
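
For readers who want to reproduce the dollars-per-TFLOPS math, here is a small Python sketch using the 16GB list prices from the table above (the 32GB variants simply substitute the $11,458 price):

# Reproduces the $/TFLOPS columns in the table above (16GB list prices).
gpus = {
    "Tesla V100 PCI-E 16GB": {"price": 10664, "fp64_tflops": 7.0, "dl_tflops": 112},
    "Tesla V100 SXM 16GB":   {"price": 10664, "fp64_tflops": 7.8, "dl_tflops": 125},
    "Tesla P100 PCI-E 16GB": {"price": 7374,  "fp64_tflops": 4.7, "dl_tflops": 18.7},
    "Tesla P100 SXM2 16GB":  {"price": 9428,  "fp64_tflops": 5.3, "dl_tflops": 21.2},
}

for name, g in gpus.items():
    print(f"{name}: ${g['price'] / g['fp64_tflops']:,.0f} per FP64 TFLOPS, "
          f"${g['price'] / g['dl_tflops']:,.2f} per DL TFLOPS")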

Key Points

  • Tesla V100 delivers a big advance in absolute performance, in just 12 months
  • Tesla V100 PCI-E maintains similar price/performance value to Tesla P100 for Double Precision Floating Point, but it has a higher entry price
  • Tesla V100 delivers dramatic absolute performance & dramatic price/performance gains for AI
  • Tesla P100 remains a reasonable price/performance GPU choice, in select situations
  • Tesla P100 will still dramatically outperform a CPU-only configuration

Tesla V100 Double Precision HPC: Pay More for the GPU, Get More Performance

VMD visualization of a nucleosome

You’ll notice that Tesla V100 delivers an almost 50% increase in double precision performance. This is crucial for many HPC codes, and a variety of applications have been shown to mirror this performance boost. In addition, Tesla V100 now offers the option of 2X the memory of Tesla P100 16GB for memory-bound workloads.

Tesla V100 is a compelling choice for HPC workloads: it will almost always deliver the greatest absolute performance. However, in the right situation a Tesla P100 can still deliver reasonable price/performance as well.

Both Tesla P100 and V100 GPUs should be considered for GPU-accelerated HPC clusters and servers. A Microway expert can help you evaluate what’s best for your needs and applications, and/or provide you with remote benchmarking resources.

Tesla V100 for Deep Learning: Enormous Advancement & Value - The New Standard


If your goal is maximum Deep Learning performance, Tesla V100 represents an enormous on-paper leap. The dedicated TensorCores have huge performance potential for deep learning applications. NVIDIA has even coined a new unit, the “TensorFLOP,” to measure this gain. Tesla V100 delivers a 6X on-paper advancement.

If your budget allows you to purchase at least one Tesla V100, it’s the right GPU to invest in for deep learning performance. For the first time, the beefy Tesla V100 GPU is compelling not just for AI Training, but for AI Inference as well (unlike Tesla P100).

Moreover, only a selection of Deep Learning frameworks fully take advantage of TensorCores today. As more DL frameworks are optimized to use the new TensorCore instructions, the gains will grow. Even before major optimizations, many workloads have already advanced 3X-4X.

Finally, there is no longer an SXM cost premium for Tesla V100 GPUs (and only a modest premium for SXM-enabled host servers). Nearly all DL applications benefit greatly from GPU-to-GPU NVLink; a selection of HPC applications (ex: AMBER) do as well today.

If you’re running DL frameworks, select Tesla V100 and, if possible, the SXM-enabled GPUs and servers.

FLOPS vs Real Application Performance

Unless you know for certain that your workload’s performance tracks raw FLOPS, we strongly discourage anyone from making purchasing decisions strictly based upon $/FLOP calculations.

While the generalizations above are useful, application performance differs dramatically from any simplistic FLOPS calculation. Device-to-device bandwidth, host-to-device bandwidth, GPU memory bandwidth, and code maturity are all levers just as influential as FLOPS on realized application performance.

Here are some of NVIDIA’s own performance results across real applications:


You’ll see that some codes scale similarly to the on-paper FLOPS gains, and others are frankly far more removed.

At most, use such simplistic FLOPS and price/performance calculations to guide higher-level decision-making: to predict how new hardware will compare against your prior testing of FLOPS vs. actual performance, to narrow down which GPUs to consider, to decide what to purchase for a proof of concept, or to identify appropriate GPUs to test remotely and validate actual application performance.

No one should buy based upon price/performance per FLOP; most should buy based upon price/performance per workload (or basket of workloads).
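
As an illustration of what “price/performance per workload” could look like in practice, here is a hedged Python sketch. The application names and speedup figures below are placeholders, not measurements; substitute your own benchmark results (or numbers from a remote test drive) for your actual application mix:

# Hypothetical workload-weighted price/performance comparison (placeholder data).
gpu_prices = {"Tesla P100 PCI-E 16GB": 7374, "Tesla V100 PCI-E 16GB": 10664}
measured_speedup = {                      # speedup vs. CPU-only -- replace with your results
    "Tesla P100 PCI-E 16GB": {"app_A": 8.0, "app_B": 3.5},
    "Tesla V100 PCI-E 16GB": {"app_A": 12.0, "app_B": 4.5},
}
workload_mix = {"app_A": 0.7, "app_B": 0.3}   # fraction of cluster time per application

for gpu, price in gpu_prices.items():
    weighted = sum(workload_mix[app] * s for app, s in measured_speedup[gpu].items())
    print(f"{gpu}: ${price / weighted:,.0f} per unit of workload-weighted speedup")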

When Paper Performance + Intuition Collide with Reality

While the above guidelines are helpful, there is still a wide diversity of workloads out in the field. Apart from testing that steers you to one GPU or another, here are some good reasons we’ve seen customers cite (or have advised them on) when making a different selection:

[Image: Tesla V100 SXM2.0 GPU]
  • Your application has shown diminishing returns to advances in GPU performance in the past (Tesla P100 might be a price/performance choice)
  • Your budget doesn’t allow for even a single Tesla V100 (pick Tesla P100, still great speedups)
  • Your budget allows for a server with 2 Tesla P100s, but not 2 Tesla V100s (pick 2 Tesla P100s over 1 Tesla V100)
  • Your application is GPU memory capacity-bound (pick Tesla V100 32GB)
  • There are workload sharing considerations (ex: preferred scheduler only allocates whole GPUs)
  • Your application isn’t multi-GPU enabled (pick Tesla V100, the most powerful single GPU)
  • Your application is GPU memory bandwidth limited (test it, but potential case for Tesla P100)

Further Resources

You may wish to reference our Comparison of Tesla “Volta” GPUs, which summarizes the technical improvements made in these new GPUs, or our Tesla V100 GPU Review for a more extended discussion.

If you’re looking to see how these GPUs will be deployed in production, read our NVIDIA GPU Clusters page. As always, please feel free to reach out to us if you’d like to get a better understanding of these latest HPC systems and what they can do for you.

Tesla V100 “Volta” GPU Review
https://www.microway.com/hpc-tech-tips/tesla-v100-volta-gpu-review/
Thu, 28 Sep 2017

The next generation NVIDIA Volta architecture is here. With it comes the new Tesla V100 “Volta” GPU, the most advanced datacenter GPU ever built.

Volta is NVIDIA’s 2nd GPU architecture in ~12 months, and it builds upon the massive advancements of the Pascal architecture. Whether your workload is in HPC, AI, or even remote visualization & graphics acceleration, Tesla V100 has something for you.

Two Flavors, One Giant Leap: Tesla V100 PCI-E & Tesla V100 with NVLink

For those who love speeds and feeds, here’s a summary of the key enhancements vs. Tesla P100 GPUs:

|  | Tesla V100 with NVLink | Tesla V100 PCI-E | Tesla P100 with NVLink | Tesla P100 PCI-E | Ratio Tesla V100:P100 |
| DP TFLOPS | 7.8 TFLOPS | 7.0 TFLOPS | 5.3 TFLOPS | 4.7 TFLOPS | ~1.4-1.5X |
| SP TFLOPS | 15.7 TFLOPS | 14 TFLOPS | 9.3 TFLOPS | 8.74 TFLOPS | ~1.4-1.5X |
| TensorFLOPS | 125 TFLOPS | 112 TFLOPS | 21.2 TFLOPS (1/2 precision) | 18.7 TFLOPS (1/2 precision) | ~6X |
| Interface (bidirectional BW) | 300GB/sec | 32GB/sec | 160GB/sec | 32GB/sec | 1.88X (NVLink), 9.38X (PCI-E) |
| Memory Bandwidth | 900GB/sec | 900GB/sec | 720GB/sec | 720GB/sec | 1.25X |
| CUDA Cores (Tensor Cores) | 5120 (640) | 5120 (640) | 3584 | 3584 |  |

Performance of Tesla GPUs, Generation to Generation

Selecting the right Tesla V100 for you:

With Tesla P100 “Pascal” GPUs, there was a substantial price premium to the NVLink-enabled SXM2.0 form factor GPUs. We’re excited to see things even out for Tesla V100.

However, that doesn’t mean selecting a GPU is as simple as picking one that matches a system design. Here’s some guidance to help you evaluate your options:

Performance Summary

Increases in relative performance are widely workload dependent. But early testing demonstrates HPC performance advancing approximately 50% in just a 12-month period.
[Chart: Tesla V100 HPC Performance]
If you haven’t made the jump to Tesla P100 yet, Tesla V100 is an even more compelling proposition.

For Deep Learning, Tesla V100 delivers a massive leap in performance. This extends to training time, and it also brings unique advancements for inference as well.
[Chart: Deep Learning Performance Summary - Tesla V100]

Enhancements for Every Workload

Now that we’ve learned a bit about each GPU, let’s review the breakthrough advancements for Tesla V100.

Next Generation NVIDIA NVLink Technology (NVLink 2.0)

NVLink helps you to transcend the bottlenecks in PCI-Express based systems and dramatically improve the data flow to GPU accelerators.

The total data pipe for each Tesla V100 with NVLink GPUs is now 300GB/sec of bidirectional bandwidth—that’s nearly 10X the data flow of PCI-E x16 3.0 interfaces!

Bandwidth

NVIDIA has enhanced the bandwidth of each individual NVLink brick by 25%: from 40GB/sec (20GB/sec in each direction) to 50GB/sec (25GB/sec in each direction) of bidirectional bandwidth.

This improved signaling helps carry the load of data intensive applications.

Number of Bricks

But enhanced NVLink isn’t just about simple signaling improvements. Point-to-point NVLink connections are divided into “bricks” or links. Each brick delivers 50GB/sec of bidirectional bandwidth on Tesla V100 (up from 40GB/sec on Tesla P100).

From 4 bricks to 6

Over and above the signaling improvements, Tesla V100 with NVLink increases the number of NVLink “bricks” that are embedded into each GPU: from 4 bricks to 6.

This 50% increase in brick quantity delivers a large increase in bandwidth for the world’s most data intensive workloads. It also allows for a more diverse set of system designs and configurations.
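
The aggregate figures quoted in this post fall directly out of bricks multiplied by per-brick bandwidth; here is a quick back-of-envelope check in Python:

# Aggregate NVLink bandwidth = number of bricks x bidirectional bandwidth per brick
for gpu, bricks, gb_per_brick in [("Tesla P100", 4, 40), ("Tesla V100", 6, 50)]:
    total = bricks * gb_per_brick
    print(f"{gpu}: {bricks} bricks x {gb_per_brick} GB/s = {total} GB/s bidirectional")
# Tesla P100: 4 bricks x 40 GB/s = 160 GB/s bidirectional
# Tesla V100: 6 bricks x 50 GB/s = 300 GB/s bidirectional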

The NVLink Bank Account

NVIDIA NVLink technology continues to be a breakthrough for both HPC and Deep Learning workloads. But as with previous generations of NVLink, there’s a design choice related to a simple question:

Where do you want to spend your NVLink bandwidth?

Think of NVLink bricks as a bandwidth “bank account.” Each NVLink system design strikes a different balance in where it “spends the funds.” You may wish to:

  • Spend links on GPU:GPU communication
  • Focus on increasing the number of GPUs
  • Broaden CPU:GPU bandwidth to overcome the PCI-E bottleneck
  • Create a balanced system, or prioritize design choices solely for a single workload

There are good reasons for each of these choices, or for combinations of them. DGX-1V, NumberSmasher 1U GPU Server with NVLink, and future IBM Power Systems products all set different priorities.

We’ll dive more deeply into these choices in a future post.

Net-net: the NVLink bandwidth increases from 160GB/sec to 300GB/sec (bidirectional), and a number of new, diverse HPC and AI hardware designs are now possible.

Programming Improvements

Tesla V100 and CUDA 9 bring a host of improvements for GPU programmers. These include:

  • Cooperative Groups
  • A new combined L1 cache and shared memory that simplifies programming
  • A new SIMT model that relieves the need to program to fit 32-thread warps

We won’t explore these in detail in this post, but we encourage you to read NVIDIA’s CUDA 9 and Volta architecture materials to learn more.

What does Tesla V100 mean for me?

Tesla V100 GPUs mean a massive change for many workloads. But each workload will see different impacts. Here’s a short summary of what we see:

  1. An increase in performance on HPC applications (on paper FLOPS increase of 50%, diverse real-world impacts)
  2. A massive leap for Deep Learning Training
  3. 1 GPU, many Deep Learning workloads
  4. New system designs, better tuned to your applications
  5. Radical new, and radically simpler, programming paradigms (coherence)

GPU-accelerated HPC Containers with Singularity
https://www.microway.com/hpc-tech-tips/gpu-accelerated-hpc-containers-singularity/
Tue, 11 Apr 2017

Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow.

This is a problem that containers can help to solve. HPC groups have had some success with more traditional containers (e.g., Docker), but there are security concerns that have made them difficult to use on HPC systems. Singularity, the new tool from the creator of CentOS and Warewulf, aims to resolve these issues.

Singularity helps you to step away from the complex dependencies of your software apps. It enables you to assemble these complex toolchains into a single unified tool that you can use just as simply as you’d use any built-in Linux command. A tool that can be moved from system to system without effort.

Surprising Simplicity

Of course, HPC tools are traditionally quite complex, so users seem to expect Singularity containers to also be complex. Just as virtualization is hard for novices to wrap their heads around, the operation of Singularity containers can be disorienting. For that reason, I encourage you to think of your Singularity containers as a single file; a single tool. It’s an executable that you can use just like any other program. It just happens to have all its dependencies built in.

This means it’s not doing anything tricky with your data files. It’s not doing anything tricky with the network. It’s just a program that you’ll be running like any other. Just like any other program, it can read data from any of your files; it can write data to any local directory you specify. It can download data from the network; it can accept connections from the network. InfiniBand, Omni-Path and/or MPI are fully supported. Once you’ve created it, you really don’t think of it as a container anymore.

GPU-accelerated HPC Containers

When it comes to utilizing the GPUs, Singularity will see the same GPU devices as the host system. It will respect any device selections or restrictions put in place by the workload manager (e.g., SLURM). You can package your applications into GPU-accelerated HPC containers and leverage the flexibilities provided by Singularity. For example, run Ubuntu containers on an HPC cluster that uses CentOS Linux; run binaries built for CentOS on your Ubuntu system.

As part of this effort, we have contributed a Singularity image for TensorFlow back to the Singularity community. This image is available pre-built for all users on our GPU Test Drive cluster. It’s a fantastically easy way to compare the performance of CPU-only and GPU-accelerated versions of TensorFlow. All one needs to do is switch between executables:

Executing the pre-built TensorFlow for CPUs

[eliot@node2 ~]$ tensorflow_cpu ./hello_world.py
Hello, TensorFlow!
42

Executing the pre-built TensorFlow with GPU acceleration

[eliot@node2 ~]$ tensorflow_gpu ./hello_world.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:06:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:07:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB

[...]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
Hello, TensorFlow!
42

As shown above, the tensorflow_cpu and tensorflow_gpu executables include everything that’s needed for TensorFlow. You can just think of them as ready-to-run applications that have all their dependencies built in. All you need to know is where the Singularity container image is stored on the filesystem.
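
The hello_world.py script itself isn’t shown here; a minimal TensorFlow 1.x script producing the output above would look roughly like the following (an assumed reconstruction, not the exact file from the Test Drive cluster):

import tensorflow as tf   # TensorFlow 1.x API, as packaged in the container

# Build a trivial graph: a greeting constant plus a small addition.
hello = tf.constant('Hello, TensorFlow!')
a = tf.constant(10)
b = tf.constant(32)

with tf.Session() as sess:                 # GPU device placement is logged at session creation
    print(sess.run(hello).decode())        # -> Hello, TensorFlow!
    print(sess.run(a + b))                 # -> 42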

Caveats of GPU-accelerated HPC containers with Singularity

In earlier versions of Singularity, the nature of NVIDIA GPU drivers required a couple of extra steps during the configuration of GPU-accelerated containers. Although GPU support is still listed as experimental, Singularity now offers a --nv flag which passes through the appropriate driver/library files. In most cases, you will find that no additional steps are needed to access NVIDIA GPUs with a Singularity container. Give it a try!

Taking the next step on GPU-accelerated HPC containers

There are still many use cases left to be discovered. Singularity containers open up a lot of exciting capabilities. As an example, we are leveraging Singularity on our OpenPOWER systems (which provide full NVLink connectivity between CPUs and GPUs). All the benefits of Singularity are just as relevant on these platforms. The Singularity images cannot be directly transferred between x86 and POWER8 CPUs, but the same style of Singularity recipes may be used. Users can run a pre-built TensorFlow image on x86 nodes and a complementary image on POWER8 nodes. They don’t have to keep all the internals and dependencies in mind as they build their workflows.

Generating reproducible results is another anticipated benefit of Singularity. Groups can publish complete and ready-to-run containers alongside their results. Singularity’s flexibility will allow those containers to continue operating flawlessly for years to come – even if they move to newer hardware or different operating system versions.

If you’d like to see Singularity in action for yourself, request an account on our GPU Test Drive cluster. For those looking to deploy systems and clusters leveraging Singularity, we provide fully-integrated HPC clusters with Singularity ready-to-run. We can also assist by building optimized libraries, applications, and containers. Contact an HPC expert.

This post was updated 2017-06-02 to reflect recent changes in GPU support.

NVIDIA Tesla P100 Price Analysis
https://www.microway.com/hpc-tech-tips/nvidia-tesla-p100-price-analysis/
Mon, 01 Aug 2016

Now that NVIDIA has launched their new Pascal GPUs, the next question is “What is the Tesla P100 Price?”

Although it’s still a month or two before shipments of P100 start, the specifications and pricing of Microway’s Tesla P100 GPU-accelerated systems are available. If you’re planning a new project for delivery later this year, we’d be happy to help you get on board. These new GPUs are exceptionally powerful.

Tesla P100 Price

The table below gives a quick breakdown of the Tesla P100 GPU price, performance and cost-effectiveness:

| Tesla GPU model | Price | Double-Precision Performance (FP64) | Dollars per TFLOPS |
| Tesla P100 PCI-E 12GB | $5,899* | 4.7 TFLOPS | $1,255 |
| Tesla P100 PCI-E 16GB | $7,374* | 4.7 TFLOPS | $1,569 |
| Tesla P100 SXM2 16GB | $9,428* | 5.3 TFLOPS | $1,779 |

* single-unit price before any applicable discounts

As one would expect, the price does increase for the higher-end models with more memory and NVLink connectivity. However, the cost-effectiveness of these new P100 GPUs is quite clear: the dollars per TFLOPS of the previous-generation Tesla K40 and K80 GPUs are $2,342 and $1,807 (respectively). That makes any of the Tesla P100 GPUs an excellent choice. Depending upon the comparison, HPC centers should expect the new “Pascal” Tesla GPUs to be as much as twice as cost-effective as the previous generation. Additionally, the Tesla P100 GPUs provide much faster memory and include a number of powerful new features.
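
A quick check of the “as much as twice as cost-effective” claim, using only the figures quoted above (a simple Python sketch, no new data):

# Dollars-per-FP64-TFLOPS: Tesla P100 12GB vs. the previous-generation figures quoted above.
p100_12gb = 5899 / 4.7          # ~= $1,255 per TFLOPS
k40, k80 = 2342.0, 1807.0       # previous-generation $/TFLOPS quoted above
print(f"Tesla K40 vs P100 12GB: {k40 / p100_12gb:.2f}x the cost per TFLOPS")   # ~1.87x
print(f"Tesla K80 vs P100 12GB: {k80 / p100_12gb:.2f}x the cost per TFLOPS")   # ~1.44x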

You may wish to reference our Comparison of Tesla “Pascal” GPUs, which summarizes the technical improvements made in these new GPUs and compares each of the new Tesla P100 GPU models. If you’re looking to see how these GPUs will be deployed in production, read our Tesla GPU Clusters page. As always, please feel free to reach out to us if you’d like to get a better understanding of these latest HPC systems and what they can do for you.

Intel Xeon E5-4600v3 “Haswell” 4-socket CPU Review
https://www.microway.com/hpc-tech-tips/intel-xeon-e5-4600v3-cpu-review/
Mon, 01 Jun 2015

Intel has launched new 4-socket Xeon E5-4600v3 CPUs. They are the perfect choice for “just beyond dual socket” system scaling. Leverage them for larger memory capacity, faster memory bandwidth, and higher core-count when you aren’t ready for a multi-system purchase.

Here are a few of the main technical improvements:

  • DDR4-2133 memory support, for increased memory bandwidth
  • Up to 18 cores per socket, faster QPI links up to 9.6GT/sec between sockets
  • Up to 48 DIMMs per server, for a maximum of 3TB memory
  • Haswell core microarchitecture with new instructions

Why pick a 4-socket Xeon E5-4600v3 CPU over a 2-socket solution?

Increased memory space vs 2 socket

Dual socket systems max out at 512GB affordably (1TB at cost); however, many HPC users have models that outgrow that memory space. Xeon E5-4600v3 systems double the DIMM count for up to 1.5TB affordably (3TB at higher cost).

For applications like ANSYS, COMSOL, and other CAE, multiphysics, and CFD suites, this can be a game changer. Traditionally, achieving these kinds of memory capacities required large multi-node cluster installations, and running simulations on such a cluster is almost always more effort. The Xeon E5-4600v3 permits larger models to run on a single system with a familiar single OS instance. Don’t underestimate the power of ease-of-use.
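
The capacity figures above are simple DIMM arithmetic (an illustration; DIMM sizes are discussed further below):

# 4-socket Xeon E5-4600v3 servers provide up to 48 DIMM slots.
dimm_slots = 48
print(f"{dimm_slots} x 32GB RDIMMs  = {dimm_slots * 32 / 1024:.1f} TB")   # 1.5 TB "affordable" tier
print(f"{dimm_slots} x 64GB LRDIMMs = {dimm_slots * 64 / 1024:.0f} TB")   # 3 TB maximum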

Increased core count vs 2 socket

Hand-in-hand with the memory space comes core count. What good is loading up big models if you can’t scale compute throughput to run the simulations? With Xeon E5-4600v3 CPUs, systems deliver up to 72 cores. Executing at that scale means a faster time to solution for you and more work accomplished.

Increased aggregate memory bandwidth

One overlooked aspect of 4P systems is superior memory bandwidth. Intel integrates the same memory controller found in the Xeon E5-2600v3 CPUs into each Xeon E5-4600v3 socket. However, there are twice as many CPUs in each system, so the net result is 2X the aggregate memory bandwidth per system.

Increased memory bandwidth per core (by selecting 4 sockets but fewer cores per socket)

Users might be concerned about memory bandwidth per CPU core; we find that CFD and multiphysics applications are especially sensitive to it. But a 4-socket system presents a unique opportunity: you may select fewer cores per socket while achieving the same total core count.

If you select smartly, you will have 2X the memory bandwidth per core available in your system vs. a 2-socket solution. This strategy can also be used to maximize throughput for a software license with a hard core-count ceiling.
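
Here is an illustrative sketch of that trade-off in Python. The core counts are hypothetical examples, and the ~68GB/sec per-socket figure comes from the DDR4-2133 discussion below:

# Same total core count, spread across 2 vs. 4 sockets (each socket has its own memory controller).
PER_SOCKET_BW = 68.0   # GB/s theoretical peak per socket with DDR4-2133

def bandwidth_per_core(sockets, cores_per_socket):
    return sockets * PER_SOCKET_BW / (sockets * cores_per_socket)

print(bandwidth_per_core(sockets=2, cores_per_socket=16))   # ~4.25 GB/s per core (32 cores total)
print(bandwidth_per_core(sockets=4, cores_per_socket=8))    # ~8.5  GB/s per core (32 cores total)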

Detailed Technical Improvements

You’ve heard the why, but the nuts and bolts generation-to-generation improvements matter too. Let’s review in detail:

DDR4-2133 memory support- bandwidth and efficiency

Memory bandwidth is critical for HPC users. CFD, CAE/simulation, life-sciences and custom coded applications benefit most. With the new CPUs, you’ll see the following improvements over Xeon E5-4600v2:

  • Entry-level “Basic” CPU operates memory at 1600MHz (increase of 20%)
  • Mid-level “Standard” CPUs now operate memory at 1866MHz (increase of 16%)
  • Higher-end “Advanced,” “High Core Count,” & “Frequency Optimized” CPUs now support up to 4 DIMMs per socket at 2133MHz (increase of 14%), or 8 DIMMs per socket with LR-DIMMs

The increase in memory clocks means Xeon E5-4600v3 delivers more memory bandwidth per socket, up to 68GB/sec. Moreover, DDR4 DIMMs operate at 1.2V, resulting in a substantial power-efficiency gain.
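
For those curious where the “up to 68GB/sec” figure comes from, it is the theoretical peak of four DDR4-2133 channels per socket (a back-of-envelope calculation):

# Theoretical per-socket memory bandwidth: channels x transfer rate x bytes per transfer.
channels_per_socket = 4       # Haswell-EP memory channels per socket
transfer_rate_mts   = 2133    # DDR4-2133, mega-transfers per second
bytes_per_transfer  = 8       # 64-bit wide channel
peak_gb_per_sec = channels_per_socket * transfer_rate_mts * bytes_per_transfer / 1000
print(f"{peak_gb_per_sec:.1f} GB/s per socket")   # ~68.3 GB/s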

Increased core counts – more for your money

Throughout the stack, core counts are increasing:

  • Xeon E5-4610v3 and E5-4620v3: 10 cores per socket, a 25% core count increase over the previous generation
  • Xeon E5-4640v3, E5-4650v3: 12 cores per socket, a 50% core count increase over the previous generation
  • E5-4669v3: 18 cores per socket, a 33% core count increase over the previous generation
  • New E5-4660v3 SKU delivers 14 cores per socket with a reasonable 120W TDP

Increased core counts mean deploying larger jobs, scheduling more HPC users on the same system, and deploying more virtual machines. They also help increase the aggregate throughput of your systems. You can do far more work with Xeon E5-4600v3.

Memory latency and DIMM size

DDR4 doesn’t just mean faster clocks – it also brings support for fewer compromises and larger DIMM sizes. 32GB DIMMs are now available as registered (32GB DDR4-2133 RDIMMs) as well as load-reduced (32GB DDR4-2133 LRDIMMs) modules. The shift from the specialty buffer in an LRDIMM to a traditional register in an RDIMM means a substantial latency decrease.

Advances in manufacturing for DDR4 also mean larger DIMM sizes. 64GB LRDIMMs are now being manufactured to help support that outstanding 3TB memory capacity.

Haswell microarchitecture and AVX2

AVX2 is an advanced CPU instruction set that debuted in the Haswell architecture and has shown strong benefits:

  • New floating point FMA, with up to 2X the FLOPS per core (16 FLOPS/clock)
  • 256-bit wide integer vector instructions

These new instructions are extremely consequential. We encourage you to learn more about these improvements, and how to compile for the new instructions, with our post on AVX2 Optimization.
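
As a worked example of what 16 FLOPS/clock means at the system level, here is a back-of-envelope peak FP64 estimate for the top-bin E5-4669v3, using figures from the specification table below (real applications will achieve only a fraction of this):

# Peak FP64 throughput with AVX2 FMA: cores x AVX clock x 16 FLOPS/clock.
cores, avx_ghz, flops_per_clock = 18, 1.8, 16      # E5-4669v3, AVX base frequency
per_socket_gflops = cores * avx_ghz * flops_per_clock
print(f"Per socket:      {per_socket_gflops:.0f} GFLOPS")                 # ~518 GFLOPS
print(f"4-socket system: {4 * per_socket_gflops / 1000:.2f} TFLOPS")      # ~2.07 TFLOPS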

Intel Xeon E5-4600v3 Series Specifications

| Model | Frequency | Frequency (AVX) | Turbo Boost | Core Count | L3 Cache | QPI Speed | Memory Speed | TDP (Watts) |
| E5-4669v3 | 2.10 GHz | 1.80 GHz | 2.90 GHz | 18 | 45MB | 9.6 GT/s | 2133 MHz | 135W |
| E5-4667v3 | 2.00 GHz | 1.70 GHz | 2.90 GHz | 16 | 40MB | 9.6 GT/s | 2133 MHz | 135W |
| E5-4660v3 | 2.10 GHz | 1.80 GHz | 2.90 GHz | 14 | 35MB | 9.6 GT/s | 2133 MHz | 120W |
| E5-4650v3 | 2.10 GHz | 1.80 GHz | 2.80 GHz | 12 | 30MB | 9.6 GT/s | 2133 MHz | 105W |
| E5-4640v3 | 1.90 GHz | 1.60 GHz | 2.60 GHz | 12 | 30MB | 8.0 GT/s | 1866 MHz | 105W |
| E5-4620v3 | 2.00 GHz | 1.70 GHz | 2.60 GHz | 10 | 25MB | 8.0 GT/s | 1866 MHz | 105W |
| E5-4610v3 | 1.70 GHz | 1.70 GHz | None | 10 | 25MB | 6.4 GT/s | 1600 MHz | 105W |

HPC groups do not typically choose Intel’s “Basic” models (e.g., E5-4610v3)

Intel Xeon E5-4600v3 Frequency Optimized SKUs

| Model | Frequency | Frequency (AVX) | Turbo Boost | Core Count | L3 Cache | QPI Speed | Memory Speed | TDP (Watts) |
| E5-4655v3 | 2.90 GHz | 2.60 GHz | 3.20 GHz | 6 | 30MB | 9.6 GT/s | 2133 MHz | 135W |
| E5-4627v3 | 2.60 GHz | 2.30 GHz | 3.20 GHz | 10 | 25MB | 9.6 GT/s | 2133 MHz | 135W |

The above SKUs offer better memory bandwidth per core

Next steps

We think the improvements in the Xeon E5-4600v3 CPUs make them a unique alternative to far more complicated HPC installations and a worthwhile upgrade from their predecessors. Want to learn more about the Xeon E5-4600v3 CPUs? Talk with an expert and assess how they might fit your HPC needs.
