PCI-Express Root Complex Confusion?

I’ve had several customers comment to me that it’s difficult to find someone who can speak with them intelligently about PCI-E root complex questions. And yet, these questions are vitally important when considering multi-CPU systems with multiple PCI-Express devices (most often GPUs or coprocessors).

First, please feel free to contact one of Microway’s experts. We’d be happy to work with you on your project to ensure your design will function correctly (both in theory and in practice). We also diagram most GPU platforms we sell, as well as explain their advantages, in our GPU Solutions Guide.

It is tempting to just look at the number of PCI-Express slots in the systems you’re evaluating and assume they’re all the same. Unfortunately, it’s not so simple: each CPU provides only a limited number of PCI-Express lanes, and thus a fixed amount of bandwidth. Additionally, certain high-performance features – such as NVIDIA’s GPU Direct technology – require that all components be attached to the same PCI-Express root complex. Servers and workstations with multiple processors have multiple PCI-Express root complexes. We dive deeply into these issues in our post about Common PCI-Express Myths.
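
If you already have a multi-GPU system in hand, the CUDA runtime can report whether each pair of GPUs is able to communicate peer-to-peer, which generally requires (among other things, such as a supported GPU model and driver) that both devices share a root complex. Below is a minimal sketch using the standard cudaDeviceCanAccessPeer call; compile it with nvcc and run it on the system in question:

```cuda
// p2p_check.cu - list every GPU pair and report whether CUDA peer-to-peer
// access is possible between them. Compile with: nvcc p2p_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int a = 0; a < count; ++a) {
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int canAccess = 0;
            // Returns 1 only when device 'a' can directly address device
            // 'b'; in practice this requires a common PCI-E root complex.
            cudaDeviceCanAccessPeer(&canAccess, a, b);
            printf("GPU %d -> GPU %d : peer access %s\n",
                   a, b, canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}
```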

To illustrate, let’s look at the PCI-Express design of Microway’s latest 8-GPU OctoPuter server:

[Diagram: Microway’s OctoPuter PCI-E tree]

It’s a bit difficult to parse, but the important points are:

  • Two CPUs are shown in blue at the bottom of the diagram. Each CPU contains one PCI-Express tree.
  • Each CPU dedicates 32 of its PCI-Express generation 3.0 lanes to the accelerators (split as two x16 connections).
  • PCI-Express switches (the purple boxes labeled PEX8747) further expand each CPU’s tree out to four x16 PCI-Express gen 3.0 slots.
  • The remaining 8 lanes of PCI-E from each CPU (along with 4 lanes from the Southbridge chipset) provide connections for the remaining PCI-E slots. Although these slots are not compatible with accelerator cards, they are excellent for networking and/or storage cards.

    Having one additional x8 slot on each CPU allows the accelerators to communicate directly with storage or high-speed networks without leaving the PCI-E root complex. For technologies such as GPU Direct, this means rapid RDMA transfers between the GPUs and the network (which can significantly improve performance).

In total, you end up with eight x16 slots and two x8 slots evenly divided between two PCI-Express root complexes. The final x4 slot can be used for low-end devices.

While the layout above may not be ideal for all projects, it performs well for many applications. We have a variety of other options available (including designs that place a large number of devices on a single PCI-E root complex). We’d be happy to discuss further with you.

Parallel Code: Maximizing your Performance Potential

No matter what the purpose of your application is, one thing is certain: you want to get the most bang for your buck. Research papers are regularly published and presented claiming tremendous speed increases from running algorithms on a GPU (e.g. NVIDIA Tesla), in a cluster, or on a hardware accelerator (such as the Xeon Phi or Cell BE). These architectures allow for massively parallel execution of code that, done properly, can yield lofty performance gains.

Unlike most aspects of parallel programming, the actual writing of the programs is (relatively) simple. Most hardware accelerators support (or are very similar to) C-based programming languages, which makes hitting the ground running with parallel coding a genuinely doable task. Mastering the development of massively parallel code is an entirely different matter, but with a basic understanding of the principles behind efficient parallel code, you can obtain substantial performance increases over serial execution of the same algorithms.
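
To make that concrete, here is a minimal, self-contained CUDA sketch (a hypothetical vector addition, not taken from the posts linked below) showing how close the code stays to plain C:

```cuda
// vector_add.cu - allocate, copy in, launch, copy out. Compile with nvcc.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of the result.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("h_c[0] = %f\n", h_c[0]);  // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```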

In order to ensure that you’re getting the most bang for your buck in terms of performance increases, you need to be aware of the bottlenecks associated with coprocessor/GPU programming. Fortunately for you, I’m here to make this an easier task. By simply avoiding these programming “No-No’s” you can optimize the performance of your algorithm without having to spend hundreds of hours learning about every nook and cranny of the architecture of your choice. This series will discuss and demystify these performance-robbing bottlenecks, and provide simple ways to make these a non-factor in your application.

Parallel Thread Management – Topic #1

First and foremost, the most important aspect of parallel programming is the proper management of threads. A thread is the smallest sequence of programmed instructions that a scheduler can manage independently. Your application’s threads must be kept busy (not waiting) and non-divergent; scheduling and directing threads properly is imperative to avoid wasting precious computing time. The sketch below shows the kind of divergence to avoid.
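
On NVIDIA GPUs, for example, threads execute in groups of 32 called warps; if threads within one warp take different branches, the warp executes both paths serially. This hypothetical sketch shows the pattern to avoid and one way to restructure it:

```cuda
// Divergent: within each 32-thread warp, odd and even threads take
// different branches, so the warp executes both paths one after the other.
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] += 1.0f;
}

// Uniform: branching on the warp index (i / 32) means all 32 threads of a
// warp follow the same path, so no serialization occurs (this assumes
// blockDim.x is a multiple of 32).
__global__ void uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) data[i] *= 2.0f;
    else                   data[i] += 1.0f;
}
```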

Read: CUDA Parallel Thread Management, Divergence and Profiling

Host/Device Transfers and Data Movement – Topic #2

Transferring data between the host and device is a very costly operation. It is not uncommon for code to make multiple transactions between the host and device without the programmer’s knowledge, and cleverly restructuring that code can save tons of processing time! On top of that, it is imperative to understand the cost of these host/device transfers: in some cases, it may be more beneficial to run certain algorithms or pieces of code on the host, because the cost of farming the data out to the device outweighs the speedup.
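
Two easy wins, sketched below with hypothetical buffer names: batch many small copies into one large transfer (each transfer carries a fixed latency cost), and stage the data in page-locked “pinned” host memory, which the GPU’s DMA engines can read directly:

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Sketch: one large transfer from pinned memory instead of many small
// copies from ordinary pageable memory.
void transferBatched(const float *src, float *d_buf,
                     size_t chunks, size_t chunkLen) {
    size_t total = chunks * chunkLen * sizeof(float);

    // Page-locked ("pinned") staging buffer: the GPU's DMA engine can read
    // it directly, and it is a prerequisite for overlapping transfers with
    // compute via cudaMemcpyAsync.
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, total);
    memcpy(h_pinned, src, total);

    // A single large copy pays the fixed per-transfer latency once,
    // instead of once per chunk.
    cudaMemcpy(d_buf, h_pinned, total, cudaMemcpyHostToDevice);

    cudaFreeHost(h_pinned);
}
```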

Read: Profile CUDA Host-to-Device Transfers and Data Movement
Read: Optimize CUDA Host-to-Device Transfers

Cache and Shared Memory Optimizations – Topic #3

In addition to managing the threads running in your application, properly utilizing the various memory types available on your device is paramount to squeezing every drop of performance from your application. Shared memory, local memory, and register memory all have their advantages and disadvantages, and they must be used carefully to avoid wasting valuable clock cycles. Bank conflicts, register spilling (too much data placed in registers, spilling over into slower local memory), improper loop unrolling, and the amount of shared memory used per block all play pivotal roles in obtaining the greatest performance. The sketch below shows a common shared-memory access pattern.
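
A classic illustration is a tiled matrix transpose (a hypothetical sketch, not code from the posts linked below): staging each 32x32 tile in shared memory makes both the global read and the global write coalesced, and padding the tile by one column keeps the 32 threads of a warp out of the same shared-memory bank:

```cuda
#define TILE 32

// Tiled matrix transpose: stage each 32x32 tile in shared memory so that
// both the global-memory read and the global-memory write are coalesced.
__global__ void transpose(const float *in, float *out, int width, int height) {
    // The +1 padding column shifts each row into a different memory bank,
    // so the column-wise reads below avoid 32-way bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();  // the whole tile must be staged before reading it back

    // Swap the block indices to find this tile's position in the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

Launch it with one 32x32 thread block per tile, e.g. dim3 block(TILE, TILE); dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);.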

Read: GPU Memory Types and Memory Performance Comparison
Read: GPU Shared Memory Performance Optimization
Read: Avoiding GPU Memory Performance Bottlenecks

More to come…

All in all, utilizing devices like NVIDIA GPUs, the Cell BE, or the Intel Xeon Phi to increase the performance of your application doesn’t have to be a daunting task. Over the next several posts, this blog will outline and identify effective techniques for tracking down your application’s performance bottlenecks. Each of these common bottlenecks will be discussed in detail, in an effort to give programmers insight into how to make use of all the resources that these popular architectures provide.
