Pitfalls Archives - Microway
https://www.microway.com/category/pitfalls/

Common PCI-Express Myths for GPU Computing Users
Posted May 4, 2015 | https://www.microway.com/hpc-tech-tips/common-pci-express-myths-gpu-computing/

At Microway we design a lot of GPU computing systems. One of the strengths of GPU computing is the flexibility of the PCI-Express bus. Assuming the server has appropriate power and thermals, it enables us to attach GPUs with no special interface modifications. We can even swap to new GPUs under many circumstances. However, we encounter a lot of misinformation about PCI-Express and GPUs. Here are a number of myths about PCI-E:

1. PCI-Express is controlled through the chipset

No longer true in modern Intel CPU-based platforms. Beginning with the Sandy Bridge CPU architecture in 2012 (Xeon E5 series CPUs, Xeon E3 series CPUs, Core i7-2xxx and newer), Intel integrated the PCI-Express controller into the CPU die itself. Bringing PCI-Express onto the CPU die came with a substantial latency benefit. This was a major change in platform design, and Intel coupled it with the addition of PCI-Express Gen3 support.

AMD Opteron 6300/4300 CPUs are still the exception: PCI-Express is delivered only through the AMD SR56xx chipset (PCI-E Gen2 only) for these platforms. They will slightly underperform competing Intel Xeon platforms when paired with PCI-Express Gen2 GPUs (Tesla K20/K20X) due to the latency differential. Opteron 6300/4300 CPUs will substantially underperform competing Xeon platforms when PCI-Express Gen3 GPUs are installed.

2. A host system with the newest Intel CPU architecture always delivers optimal performance

Not always true. Intel tends to launch its newest CPU architectures on its lowest-end CPU products first. Once they are proven in lower-end applications, the architecture migrates up to higher-end segments months or even years later. The problem? The lowest-end, newest-architecture CPUs can feature the fewest PCI-Express lanes per socket:

| CPU | Core i7-5xxx (expected) | Xeon E3-1200v3 / Core i7-47xx/48xx | Xeon E5-1600v3 / Core i7-58xx/59xx | Xeon E5-2400v2 | Xeon E5-2600v3 |
|---|---|---|---|---|---|
| CPU Socket | Likely Socket 1150 | Socket 1150 | Socket 2011-3/R3 | Socket 1356 | Socket 2011-3/R3 |
| CPU Core Architecture | Broadwell | Haswell | Haswell | Ivy Bridge | Haswell |
| Launch Date | 2015 | Q2 2013 | Q3 2014 | Q1 2014 | Q3 2014 |
| PCI-Express Lanes Per Motherboard | Likely 16 Gen3 | 16 Gen3 | 40 Gen3 (Xeon), 28-40 Gen3 (Core i7) | 48 Gen3 (both CPUs populated) | 80 Gen3 (both CPUs populated) |

Socket 1150 CPUs debuted in mid-2013 and were the only offering with the latest and greatest Haswell architecture for over a year; however, the CPUs available only delivered 16 PCI-Express Gen3 lanes per socket. It was tempting for some users to outfit a system with a modestly priced (and “latest”) Core i7-4700 series “Haswell” CPU during this period. However, this choice could have fundamentally hindered application performance. We’ll see this again when Intel debuts Broadwell for the same socket in 2015.

 

3. The least expensive host system possible is best when paired with multiple GPUs

Not necessarily, and in many cases certainly not. It all comes down to how your application works, how long its execution time is, and whether PCI-Express transfers are happening throughout. An attempt at small cost savings could have big consequences. Here are a few examples that counter this myth:

a. Applications running entirely on GPUs with many device-to-device transfers

Your application may be performing almost all of its work on the GPUs and orchestrating constant CUDA device-to-device transfers throughout its run. But a host motherboard and CPU with insufficient PCI-Express lanes may not allow full bandwidth transfers between the GPUs, and that could cripple your job performance.

Many inexpensive Socket 1150 motherboards (max 16 PCI-E lanes) have this issue: install 2 GPUs into what appear as x16 slots, and both operate as x8 links electrically. The forced operation at x8 speeds means that a maximum of half the optimal bandwidth is available for your device-to-device transfers. A capable PCI-Express switch may change the game for your performance in this situation.
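If you're unsure what link width your GPUs have actually negotiated, you can check at runtime rather than guessing from the slot's mechanical size. Here is a minimal sketch using the pynvml package (Python bindings for NVIDIA's management library); the package choice is our own assumption for illustration, and keep in mind that some GPUs drop to a narrower link when idle, so check while the card is busy:

```python
# Sketch: report each GPU's negotiated PCI-E link width vs. its maximum.
# Assumes the pynvml package and an NVIDIA driver; names are illustrative.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):
            name = name.decode()
        cur = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)  # width negotiated right now
        top = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)   # width the device/slot supports
        note = "full width" if cur == top else "below maximum (check slot wiring and load)"
        print(f"GPU {i} ({name}): x{cur} of x{top} -- {note}")
finally:
    pynvml.nvmlShutdown()
```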

b. Applications with extremely short execution time on each GPU

In this case, the data transfer may be the largest piece of total job execution time. If you purchase a low-end CPU without sufficient PCI-Express lanes (and bandwidth) to serve simultaneous transfers to/from all your GPUs, the contention will result in poor application performance.

c. Applications constantly streaming data into and out of the GPU

The classic example here is video/signals processing. If you have a constant stream of HD video or signals data being processed by the GPU in real-time, restricting the size of the pipe to your processing devices (GPUs) is a poor design decision.
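A quick back-of-the-envelope estimate shows when the pipe itself becomes the bottleneck. Every number in the sketch below (frame size, frame rate, stream count, and the roughly 12 GB/s and 6 GB/s practical figures for Gen3 x16 and x8 links) is an assumption for illustration, not a measurement:

```python
# Rough estimate: can a PCI-E link keep up with an uncompressed video workload?
# All figures are illustrative assumptions.
frame_bytes = 1920 * 1080 * 4        # one 1080p frame at 4 bytes per pixel
fps = 60
streams = 8                          # simultaneous camera/signal feeds
# Data streaming in, plus processed frames streaming back out:
required_gbs = 2 * frame_bytes * fps * streams / 1e9

practical_gbs = {"Gen3 x16": 12.0, "Gen3 x8": 6.0}   # approximate achievable bandwidth
for link, bw in practical_gbs.items():
    verdict = "fits" if required_gbs < bw else "saturated"
    print(f"{link}: need {required_gbs:.1f} GB/s of ~{bw:.0f} GB/s available -> {verdict}")
```

With these assumed numbers the workload fits comfortably on an x16 link but saturates an x8 link; plug in your own stream sizes before drawing conclusions.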

I don’t know if any of the above fit me…

If you are unable to analyze your job, we do have some secondary recommendations, offered with reservations. The least expensive CPU configuration providing enough lanes for PCI-Express x16 links to all your GPUs is, in our experience, the safest purchase. An inexpensive CPU SKU within a given CPU/platform series (e.g., no need to purchase an E5-2690v3 vs. an E5-2620v3) is fine if you don't need fast CPU performance. There are very notable exceptions.

4. PCI-Express switches always mean poor application performance

This myth is very common. In reality, performance is highly application-dependent, and sometimes switches yield superior performance.

Where switching matters

There's no question that PLX switches have capacity constraints: 16 PCI-E lanes are nearly always driving 2-4 PCI-Express x16 devices. But PLX switching also has one critical advantage: it fools each device into believing it has a full x16 link present, and it will deliver all 16 lanes of bandwidth to a device if available upstream. 2-4 GPUs attached to a single PLX switch at PCI-E x16 links will nearly always outperform 2-4 GPUs operating at PCI-E x8 speeds without one.

Furthermore, if you hide latency with staggered CUDA host-device transfers, the benefits of a denser compute platform (no MPI coding) could far outweigh the PCI-E bandwidth constraints. First, profile your code to learn more about it. Then optimize your transfers for the architecture.
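In practice, staggering transfers means putting host buffers in pinned memory, splitting the data into chunks, and issuing copy-compute-copy sequences on independent CUDA streams. The sketch below uses Numba's CUDA bindings purely as one way to illustrate the pattern; the framework, chunk count, and toy kernel are our assumptions, not a prescription:

```python
# Sketch: overlap host-device transfers with kernel work using CUDA streams.
# Assumes Numba and NumPy; sizes and the toy kernel are placeholders.
import numpy as np
from numba import cuda

@cuda.jit
def scale(chunk, factor):
    i = cuda.grid(1)
    if i < chunk.shape[0]:
        chunk[i] *= factor

n, n_chunks = 1 << 24, 4
host = cuda.pinned_array(n, dtype=np.float32)  # pinned memory allows asynchronous copies
host[:] = np.random.rand(n).astype(np.float32)

chunk = n // n_chunks
streams = [cuda.stream() for _ in range(n_chunks)]
for c, s in enumerate(streams):
    view = host[c * chunk:(c + 1) * chunk]
    dev = cuda.to_device(view, stream=s)             # H2D copy queued on this stream
    scale[(chunk + 255) // 256, 256, s](dev, 2.0)    # kernel queued on the same stream
    dev.copy_to_host(view, stream=s)                 # D2H copy overlaps work in other streams
cuda.synchronize()
```

Whether the overlap actually pays off is exactly the kind of thing a profiler will show you.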

Superior Performance in the Right Situation

In certain cases PLX switches deliver superior performance or additional features. A few examples:

a. In 2:1 configurations utilizing device-device transfers, full bandwidth is available between neighboring devices. AMBER is a great example of an application where this is of strong benefit.

[Figure: Microway Octoputer PCI-E block diagram]

b. Next, in applications leveraging GPU Direct RDMA, switches deliver superior performance. This feature enables a direct transfer between GPU and another device (typically an IB adapter).

[Figure: GPU Direct RDMA data path, courtesy of NVIDIA]

See this presentation for more information on this feature.

c. For multi-GPU configurations where maximum device-device bandwidth between pairs of GPUs at once is of paramount importance, 2 PCI-E switches off of a single socket are likely to offer higher performance vs. an unswitched dual socket configuration. This is due to the added latency and bandwidth constraint of a QPI-hop from CPU0 to CPU1 in a dual socket configuration. Our friends at Cirrascale have explored this bandwidth challenge skillfully.

d. For 4-GPU configurations where maximum device-device bandwidth between all GPUs at once is of paramount importance and host-device bandwidth is not, one switch off of a single socket may be even better.

[Figure: PCI-E configuration for 4 GPUs on a single switch]

4:1 designs with an appropriate switch offer full bandwidth device-device transfers for 48-96 total PCI-E lanes (including uplink lanes) on the same PCI-E tree. This is impossible with any switchless configuration, and it provides maximum bandwidth for P2P transfers.

However, please don't assume you can find your ideal PCI-E switch configuration for sale today. Switches are embedded down on motherboards, and designs take years to make it to market. For example, we have yet to see many devices with the PEX 8796 switch come to market.

Switching is…complex

We’re just starting to see data honestly assessing switched performance in the marketplace. What happens to a particular application’s performance if you sacrifice host-device for device-device bandwidth by pairing a CPU with weak PCI-E I/O capability with a healthy PCI-E switch? Is your total runtime faster? On which applications? How about a configuration that has no switch and simply restricts bandwidth?  Does either save you much in total system cost?

Studies of ARM + GPU platform performance (switched and unswitched) and the availability of more platforms with single-socket x86 CPUs + PCI-E switches are starting to tell us more. We’re excited to see the data, but we treat these dilemmas very conservatively until someone can prove to us that restricted bandwidth to the host will result in superior performance for an application.

Concluding thoughts

No one said GPU computing was easy. Understanding your application’s behavior during runs is critical to designing the proper system to run it. Use resource monitors, profilers, and any tool you can to assist. We have a whole blog post series that may help you. Take a GPU Test Drive with Microway to verify.

We encourage you to enlist an expert to help design your system once you know how your application behaves. Complete information about your application's behavior ensures we can design the system that will perform best for you. As an end user, you receive a system that is ready to do useful work immediately after delivery. This ensures you get the most complete value out of your hardware purchase.

Finally, we have guidance for when you are in doubt or when you have no data. In this case we recommend any Xeon E5-1600v3 or Xeon E5-2600v3 CPU: they deliver the most PCI-E lanes per socket (40). It’s one of the most robust configurations that keeps you out of trouble. Still, only comprehensive testing will determine the choice with the absolute best price-performance calculation for you. Understand these myths, test your code, let us guide you, and you will procure the best system for your needs!

Introduction to RAID for HPC Customers
Posted April 6, 2015 | https://www.microway.com/hpc-tech-tips/introduction-raid-hpc-customers/

There is a lot of material available on RAID, describing the technologies, the options, and the pitfalls.  However, there isn’t a great deal on RAID from an HPC perspective.  We’d like to provide an introduction to RAID, clear up a few misconceptions, share with you some best practices, and explain what sort of configurations we recommend for different use cases.

What is RAID?

Originally known as Redundant Array of Inexpensive Disks, the acronym is now more commonly considered to stand for Redundant Array of Independent Disks.  The main benefits to RAID are improved disk read/write performance, increased redundancy, and the ability to increase logical volume sizes.

RAID is able to perform these functions primarily through striping, mirroring, and parity. Striping is when files are broken down into segments, which are then placed on different drives. Because the files are spread across multiple drives that are running in parallel, performance is improved. Mirroring is when data is duplicated on the fly across drives. Parity, in the context of RAID, means that redundancy information is distributed across the drives so that if one or more drives fail (how many depends on the RAID level), the data can be reconstructed from the remaining drives.
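Parity in most RAID implementations is essentially a bitwise XOR across the data blocks in a stripe, which is what makes reconstruction possible. The following is a conceptual toy to show the idea, not how a RAID controller is actually implemented:

```python
# Toy illustration of RAID parity: XOR the data blocks in a stripe,
# then rebuild a single lost block from the survivors plus parity.
from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three drives
parity = xor_blocks(stripe)            # stored on a fourth drive (RAID 5 rotates this role)

# Drive holding the second block fails: rebuild it from the rest plus parity.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
print("rebuilt block:", rebuilt)
```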

RAID comes in a variety of flavors, with the most common being the following.

RAID 0 (striping)

Redundancy: none
Number of drives providing space: all drives

The riskiest RAID setup, RAID 0 stripes blocks of data across drives and provides no redundancy.  If one drive fails, all data is lost.  The benefits of RAID 0 are that you get increased drive performance and no storage space taken up by data parity.  Due to the real risk of total data loss, though, we normally do not recommend RAID 0 except for certain use cases. A fast, temporary scratch volume is one of these exceptions.

RAID 1 (mirroring)

Redundancy: typically 1 drive
Number of drives providing space: typically n-1

RAID 1 is in many ways the opposite of RAID 0. Data is mirrored on all disks in the RAID array so if one or more drives fail, as long as at least one half of the mirror is functioning, the data remains intact.  The main downside is that you realize only half the spindle count in usable space.  Some performance gains can be realized during drive read, but not during write.  We usually recommend that operating systems be housed on 2 drives in RAID 1.

RAID 5 (single parity)

Redundancy: 1 drive
Number of drives providing space: n-1

Next we have the very common but very misunderstood RAID 5.  In this case, data blocks are striped across disks like in RAID 0, but so are parity blocks.  These parity blocks allow the RAID array to still function in the event of a single drive failure. This redundancy comes at the cost of losing a single drive’s worth of space.  Due to the striping similarity that it shares with RAID 0, RAID 5 enjoys increased read/write performance.

The misconception concerning RAID 5 is that many people think single-drive parity is a robust safeguard against data loss. Single-drive parity becomes very risky once a drive has failed, whether you are waiting to rebuild the array or are already rebuilding it. The increased I/O activity of a rebuild is exactly the type of situation likely to trigger a second drive failure, at which point no protection remains.

RAID 6 (double parity)

Redundancy: 2 drives
Number of drives providing space: n-2

RAID 6 builds upon RAID 5 with a second parity block, tolerating two drive failures instead of one.  Generally we find that losing a second disk’s worth of capacity is a fair tradeoff for the increased redundancy and we often recommend RAID 6 for larger arrays still requiring strong performance.

RAID 10 (striped mirroring)

Redundancy: 1 drive per volume in the span
Number of drives providing space: n/2

As you can see from the diagram, RAID 10 is a combination of RAID 1 mirroring and RAID 0 striping.  It is, in essence, a stripe of mirrors, so creating a RAID 10 is possible with even drive-counts greater than 2 (i.e. 4, 6, 8, etc.).  RAID 10 can offer a very reasonable balance of performance and redundancy, with the primary concern for some users being the reduced storage space.  Since each volume in the RAID 0 span is made up of RAID 1 mirrors, a full half of the drives are used for redundancy.  Also, while the risk of data loss is greatly reduced, it is still possible.  If multiple drives within one volume fail at the same time, information could be lost.  In practice, though, this is uncommon, and RAID 10 is normally considered to be an extremely secure form of RAID.

RAID 60 (striped double parity)

Redundancy: 2 drives per volume in the span
Number of drives providing space: n - (2 × number of volumes)


RAID 60 is a less common configuration for many consumers but sees a lot of use in enterprise and HPC environments.  Similar in concept to RAID 10, RAID 60 is a group of RAID 6 volumes striped into a single RAID 0 span.  Unlike most common RAID 10 configurations, where each pair of drives may add another volume to the RAID 10 span, administrators have control over the number of RAID 6 volumes within the RAID 60 span.  For example, 24 drives can be arranged in four different RAID 60 configurations:

[Chart: usable capacity and redundancy for RAID 60 configurations of 24 drives]

Valid configurations are those that meet two criteria: 1) the number of drives must be evenly divisible by the number of volumes, and 2) each volume must contain no fewer than four drives (the minimum required for a RAID 6 volume). As the number of volumes increases, so do the redundancy and performance, but so does the amount of space given up to parity. Each use case is unique, but the rule of thumb for most users is between 8 and 12 drives per volume. As the chart indicates, the higher the number of striped volumes, the less usable capacity you have; more volumes do improve performance, however. Note that the example with six striped volumes has the same capacity as a RAID 10 array and thus would be better served by that configuration.
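The chart's capacity numbers follow directly from the arithmetic: each RAID 6 volume gives up two drives to parity, so usable capacity is (drives per volume - 2) × number of volumes × drive size. Here is a short sketch that enumerates the valid 24-drive layouts; the 4 TB drive size is an assumption purely for illustration:

```python
# Usable capacity and parity overhead for RAID 60 layouts of a fixed drive pool.
# The 4 TB drive size is an illustrative assumption.
def raid60_layouts(total_drives, drive_tb=4):
    for volumes in range(2, total_drives // 4 + 1):
        if total_drives % volumes:
            continue                    # drives must divide evenly into the volumes
        per_volume = total_drives // volumes
        if per_volume < 4:
            continue                    # RAID 6 needs at least 4 drives per volume
        usable = (per_volume - 2) * volumes * drive_tb
        print(f"{volumes} x RAID 6 volumes of {per_volume} drives: "
              f"{usable} TB usable, {2 * volumes} drives of parity")

raid60_layouts(24)
```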

RAID 50 (striped single parity)

Redundancy: 1 drive per volume in the span
Number of drives providing space: n - (1 × number of volumes)

It's worth mentioning that RAID 50 configurations have very similar structures to RAID 60, only with striped RAID 5 volumes instead of RAID 6. RAID 50 does improve slightly on RAID 5's redundancy characteristics, but we still don't always recommend it. Consequently, RAID 60 configurations are far more common for our customers.

Conclusion

If you prefer a visual review of these concepts, the Intel RAID group has produced a strong video on the subject.

There are other RAID configurations, but those listed above are the most common.  If you have other questions about storage or other HPC topics, be sure to contact us below.

AVX2 Optimization and Haswell-EP (Xeon E5-2600v3) CPU Features
Posted October 3, 2014 | https://www.microway.com/hpc-tech-tips/avx2-optimization-and-haswell-ep-cpu-features/

We’re very excited to be delivering systems with the new Xeon E5-2600v3 and E5-1600v3 CPUs. If you are the type who loves microarchitecture details and compiler optimization, there’s a lot to gain. If you haven’t explored the latest techniques and instructions for optimization, it’s never a bad time to start.

Many end users don’t always see instruction changes as consequential. However, they can be absolutely critical to achieving optimal application performance. Here’s a comparison of Theoretical Peak Performance of the latest CPUs with and without FMA3:
[Plot: Xeon E5-2600v3 theoretical peak performance (GFLOPS), with and without FMA3]

Only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Achieved performance for well-parallelized & optimized applications is likely to fall between the grey and colored bars. Still, without employing a compiler optimized for FMA3 instructions, you are leaving significant potential performance of your Xeon E5-2600v3-based hardware purchase on the table.
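The gap between those bars comes directly from per-core arithmetic throughput: with AVX2 and two FMA units, a Haswell core can retire 16 double-precision FLOPS per cycle, versus 8 without FMA. Here is the arithmetic as a short sketch; the core count and clock are illustrative assumptions rather than a specific SKU we are quoting:

```python
# Theoretical peak double-precision GFLOPS for a Haswell-class Xeon socket,
# with and without FMA3. Core count and clock are illustrative assumptions.
cores = 10
clock_ghz = 2.6          # sustained AVX clocks are often lower than the nominal clock
flops_per_cycle = {
    "AVX2 + FMA3": 16,   # 2 FMA units x 4 doubles x 2 ops (multiply + add)
    "AVX (no FMA)": 8,   # separate multiply and add, 4 doubles each
}

for mode, fpc in flops_per_cycle.items():
    peak = cores * clock_ghz * fpc
    print(f"{mode}: {peak:.0f} GFLOPS per socket (theoretical peak)")
```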

Know your CPUs, know your instructions

With that in mind, we would like to summarize and link to these new resources from Intel:

Intel: Xeon E5-2600v3 Technical Overview

  • A brief summary of Haswell-NI (Haswell New Instructions) that add dedicated instructions for signal processing, encryption, and math functions
  • Summary of power improvements in the Haswell architecture
  • Detailed comparison of C600 and C610 series chipsets
  • Virtualization improvements and new security features

Intel: How AVX2 Improves Performance on Server Applications

  • Instructions on how to recompile your code for AVX2 instructions and supported compilers
  • Other methods of employing AVX2: Intel MKL, coding with intrinsic instructions, and assembly
  • Summary of LINPACK performance gains delivered simply by using AVX2

Deliver the highest performance for your applications by taking advantage of the latest Intel architecture. For more information, contact a Microway HPC expert.

 

PCI-Express Root Complex Confusion?
Posted May 2, 2014 | https://www.microway.com/hpc-tech-tips/pci-express-root-complex-confusion/

I’ve had several customers comment to me that it’s difficult to find someone that can speak with them intelligently about PCI-E root complex questions. And yet, it’s of vital importance when considering multi-CPU systems that have various PCI-Express devices (most often GPUs or coprocessors).

First, please feel free to contact one of Microway’s experts. We’d be happy to work with you on your project to ensure your design will function correctly (both in theory and in practice). We also diagram most GPU platforms we sell, as well as explain their advantages, in our GPU Solutions Guide.

It is tempting to just look at the number of PCI-Express slots in the systems you’re evaluating and assume they’re all the same. Unfortunately, it’s not so simple, because each CPU only has a certain amount of bandwidth available. Additionally, certain high-performance features – such as NVIDIA’s GPU Direct technology – require that all components be attached to the same PCI-Express root complex. Servers and workstations with multiple processors have multiple PCI-Express root complexes. We dive deeply into these issues in our post about Common PCI-Express Myths.

To illustrate, let’s look at the PCI-Express design of Microway’s latest 8-GPU Octoputer server:
[Diagram: Microway's OctoPuter PCI-E tree]

It’s a bit difficult to parse, but the important points are:

  • Two CPUs are shown in blue at the bottom of the diagram. Each CPU contains one PCI-Express tree.
  • Each CPU provides 32 lanes of PCI-Express generation 3.0 (split as two x16 connections).
  • PCI-Express switches (the purple boxes labeled PEX8747) further expand each CPU’s tree out to four x16 PCI-Express gen 3.0 slots.
  • The remaining 8 lanes of PCI-E from each CPU (along with 4 lanes from the Southbridge chipset) provide connections for the remaining PCI-E slots. Although these slots are not compatible with accelerator cards, they are excellent for networking and/or storage cards.

    Having one additional x8 slot on each CPU allows the accelerators to communicate directly with storage or high-speed networks without leaving the PCI-E root complex. For technologies such as GPU Direct, this means rapid RDMA transfers between the GPUs and the network (which can significantly improve performance).

In total, you end up with eight x16 slots and two x8 slots evenly divided between two PCI-Express root complexes. The final x4 slot can be used for low-end devices.
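You can confirm how two GPUs are related (same switch, same root complex, or across the CPU interconnect) without opening the chassis. The sketch below uses NVML's topology query through the pynvml package; treat the exact constant names as an assumption to check against your pynvml version:

```python
# Sketch: report the closest common ancestor between GPU pairs
# (same PCI-E switch, same root complex, or across the CPU interconnect).
# Assumes the pynvml package; constant names may differ slightly between versions.
import pynvml

LEVELS = {
    pynvml.NVML_TOPOLOGY_INTERNAL:   "same board",
    pynvml.NVML_TOPOLOGY_SINGLE:     "single PCI-E switch",
    pynvml.NVML_TOPOLOGY_MULTIPLE:   "multiple PCI-E switches",
    pynvml.NVML_TOPOLOGY_HOSTBRIDGE: "same host bridge (root complex)",
    pynvml.NVML_TOPOLOGY_NODE:       "same CPU socket / NUMA node",
    pynvml.NVML_TOPOLOGY_SYSTEM:     "across the CPU interconnect",
}

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
    for a in range(count):
        for b in range(a + 1, count):
            level = pynvml.nvmlDeviceGetTopologyCommonAncestor(handles[a], handles[b])
            print(f"GPU {a} <-> GPU {b}: {LEVELS.get(level, level)}")
finally:
    pynvml.nvmlShutdown()
```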

While the layout above may not be ideal for all projects, it performs well for many applications. We have a variety of other options available (including designs that place a large number of devices on a single PCI-E root complex). We'd be happy to discuss further with you.

Take Care When Updating Your Cluster
Posted July 6, 2011 | https://www.microway.com/hpc-tech-tips/take-care-when-updating-your-cluster/

Although modern Linux distributions have made it very easy to keep your software packages up-to-date, there are some pitfalls you might encounter when managing your compute cluster.

Cluster software packages are usually not managed from the same software repository as the standard Linux packages, so a routine update can unknowingly break compatibility. In particular, upgrading or changing the Linux kernel on your cluster may require manual re-configuration – particularly for systems with large storage, InfiniBand, and/or GPU compute processor components. These types of systems usually require that kernel modules or other packages be recompiled against the new kernel.

Please keep in mind that updating the software on your cluster may break existing functionality, so don’t update just for the sake of updating! Plan an update schedule and notify users in case there is downtime from unexpected snags.

You may always contact Microway technical support before you update to find out what problems you should expect from running a software update.
