HPC Guidance Archives - Microway
https://www.microway.com/category/hpc-guidance/

Deploying GPUs for Classroom and Remote Learning
https://www.microway.com/hpc-tech-tips/deploying-gpus-for-classroom-and-remote-learning-2/
Fri, 22 May 2020

As one of NVIDIA’s Elite partners, we see a lot of GPU deployments in higher education. GPUs have been proving themselves in HPC for over a decade, and they are the de-facto standard for deep learning research. They’re also becoming essential for other types of machine learning and data science. But GPUs are not always available to students, particularly undergraduate students.

GPU-accelerated Classrooms at MSOE

Photo of the ROSIE cluster, with artwork featuring a rose tattoo
Photo of MSOE’s ROSIE cluster

One deployment I’m particularly proud of runs at the Milwaukee School of Engineering, where it is used for undergraduate education, as well as for faculty and industry research. This cluster combines NVIDIA’s Volta-generation DGX systems with NVIDIA Tesla T4 GPUs, Mellanox Ethernet, and NetApp storage.

Rather than having to learn a more arcane supercomputer interface, students are able to start GPU-accelerated Jupyter sessions with the click of a button in their web browser.

The cluster is connected to NVIDIA’s NGC hub, providing pre-built containers with the latest HPC & AI software stacks. The DGX systems do the heavy lifting and the Tesla T4 systems service less demanding needs (such as student sessions during class).

Microway’s team delivered all of this fully integrated and ready-to-run, allowing MSOE’s undergrads to get hands-on with the latest, highest-performing hardware and software tools. And they don’t have to dive into deeper levels of complexity until they’re ready.

Close up photo of the equipment in the ROSIE cluster
Close up photo of the DGX-1, servers, and storage in ROSIE

Multi-Instance GPU amplifies Remote Learning

DGX A100 Hero Image
What changed this month is that NVIDIA’s new DGX A100 simplifies your infrastructure. Institutions won’t need one set of systems for the most demanding work and a separate set of systems for less intensive classes/labs. Instead, DGX A100 wraps all these capabilities into one powerful and configurable HPC/AI system. It can handle anything from a huge neural network training job to a classroom of 56 students, or a combination of the two.

NVIDIA calls this capability Multi-Instance GPU (MIG). The details might sound a bit hairy, but think of MIG as providing the same kinds of flexibility that virtualization has been providing for years. You can use the whole GPU, or divide it up to support several different applications/users.

DGX A100 is currently the only system providing this capability, with anywhere from 8 to 56 GPU instances (other NVIDIA A100 GPU systems will be shipping later this year).

The diagram below depicts seven students/users each running their own GPU-accelerated workload on a single NVIDIA A100 GPU. Each of the eight GPUs in the DGX A100 supports up to seven GPU instances, for a total of 56 instances.

Diagram of NVIDIA Multi-Instance GPU demonstrating seven separate user instances on one GPU
NVIDIA Multi-Instance GPU supports seven separate user instances on one GPU
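
To make this concrete, here is a minimal sketch of how an administrator might carve one A100 into seven instances using NVIDIA’s MIG management commands. This assumes the nvidia-smi MIG syntax and the 1g.5gb profile name as we understand them from NVIDIA’s documentation; the exact profile list depends on the GPU model and driver version.

# Enable MIG mode on GPU 0 (workloads must be drained first; a GPU reset may be required)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
sudo nvidia-smi mig -lgip

# Create seven of the smallest instances, plus their compute instances
sudo nvidia-smi mig -i 0 -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb -C

# Confirm the new MIG devices are visible
nvidia-smi -L

Repeated across all eight GPUs, this yields the 56 instances described above.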

Consider how these new capabilities might enable your institution: for example, by offering GPU-accelerated sessions to each student in a remote learning course. The traditional classroom of lab PCs might be replaced by a single DGX system.

Each DGX A100 system can serve 56 separate Jupyter notebooks, each with GPU performance similar to a Tesla T4. Microway deploys these systems with a workload manager that supports resource sharing between classroom requests and other types of work, so the full horsepower of the DGX can be leveraged for faculty research when class is not in session. Further, your IT team no longer needs to support dozens of physical workstations – the computer resources are centralized and can be managed from a single location.
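
As a rough sketch of how a classroom session might be requested under a scheduler, here is what a student job could look like if the workload manager is Slurm (one common choice; the partition name, resource syntax, and module name below are placeholders for illustration):

#!/bin/bash
#SBATCH --job-name=class-notebook
#SBATCH --partition=classroom    # hypothetical partition reserved for coursework
#SBATCH --gres=gpu:1             # one GPU or one MIG slice, per cluster policy
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

# Start a Jupyter notebook server on the allocated resources
module load anaconda             # assumes an environment-modules software stack
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

In practice, a web portal like the one described above would generate and submit a job along these lines on the student’s behalf.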

Flexible Platforms Support Diverse Workloads

These types of high-performance computer labs are likely familiar to curricula in traditionally compute-demanding fields (e.g., computer science, engineering, computational chemistry). However, we hear increasing calls for these computational resources from other departments across campuses. As the power of data analytics and machine learning is put to use in other fields, this type of deployment might even be an opportunity for cost-sharing between traditionally disconnected departments.

This year, we’re all being challenged to conceive of new, seamless methods for remote access, collaboration, and instruction. Our team would be thrilled to be a part of the transformation at your institution. The first DGX A100 units in academia will be at the University of Florida next month, where Microway will be performing the integration. I know NVIDIA’s DGX A100 systems will prove invaluable to current GPU users, and I hope they will also extend into the hands of graduate and even undergraduate students. Let’s talk about what’s possible now.

2nd Gen AMD EPYC “Rome” CPU Review: A Groundbreaking Leap for HPC
https://www.microway.com/hpc-tech-tips/amd-epyc-rome-cpu-review/
Wed, 07 Aug 2019


The 2nd Generation AMD EPYC “Rome” CPUs are here! Rome brings greater core counts, faster memory, and PCI-E Gen4 all to deliver what really matters: up to a 2X increase in HPC application performance. We’re excited to present our thoughts on this advancement, and the return of x86 server CPU competition, in our detailed AMD EPYC Rome review. AMD is unquestionably back to compete for the performance crown in HPC.

2nd Generation AMD EPYC “Rome” CPUs are offered in 8-64 cores and clock speeds from 2.2-3.2GHz. They are available in dual socket as well as a select number of single-socket-only SKUs.

Important changes in AMD EPYC “Rome” CPUs include:

  • Up to 64 cores, 2X the max in the previous generation for a massive advancement in aggregate throughput
  • PCI-E Gen 4 support (a first for an x86 server CPU) for 2X the I/O bandwidth of the x86 competition
  • 2X the FLOPS per core of the previous generation EPYC CPUs with the new Zen2 architecture
  • DDR4-3200 support for improved memory bandwidth across 8 channels, reaching up to 208GB/sec per socket
  • Next Generation Infinity Fabric with higher bandwidth for intra and inter-die connection, with roots in PCI-E Gen4
  • New 14nm + 7nm chiplet architecture that separates the 14nm IO and 7nm compute core dies to yield the performance per watt benefits of the new TSMC 7nm process node

Leadership HPC Performance

There’s no other way to say it: the 2nd Generation AMD EPYC “Rome” CPUs (EPYC 7xx2) break new ground for HPC performance. In our experience, we haven’t seen this type of advancement in CPU performance in many years, at least not without exotic architectural changes. This leap applies across floating point and integer applications.

Note: This article focuses on SPEC benchmark performance (which is rooted in real integer and floating point applications). If you’re hunting for a more raw FLOPS/dollar calculation, please visit our Knowledge Center Article on AMD EPYC 7xx2 “Rome” CPUs.

Floating Point Benchmark Performance

In short: at the top bin, you may see up to 2.12X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_fp_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.79X the performance.
AMD Rome SPECfp 2017 vs Xeon CPUs - Top Bin

Integer Benchmark Performance

Integer performance largely mirrors the same story. At the top bin, you may see up to 2.49X the performance of the competition. This is compared to the top-bin Xeon Gold processor (Xeon Gold 6252) on SPECrate2017_int_base.

Compared to the top Xeon Platinum 8200 series SKU (Xeon Platinum 8280), up to 1.90X the performance.
AMD Rome SPECint 2017 vs Xeon CPUs - Top Bin

What Makes EPYC 7xx2 Series Perform Strongly?

Contributions towards this leap in performance come from a combination of:

  • 2X the FLOPS per core available in the new architecture
  • Improved performance of the Zen2 microarchitecture
  • Moderate increases in clock speeds
  • Most importantly, dramatic increases in core count

These last 2 items are facilitated by the new 7nm process node and the chiplet architecture of EPYC. Couple that with the advantages in memory bandwidth, and you have a recipe for HPC performance.

Performance Outlook


The dramatic increase in core count coupled with Zen2 means we predict that most of the 32-core models and above (about half of AMD’s SKU stack) are likely to outperform the top Xeon Platinum 8200 series SKU. Stay tuned for the SPEC benchmarks that confirm this assertion.

If you’re comparing against more modest Xeon Gold 62xx or Silver 52xx/42xx SKUs, we predict an even more dramatic performance uplift. This is the first time in many years we’ve seen such an incredibly competitive product from the AMD Server Group.

Class Leading Price/Performance

AMD EPYC 7xx2 series isn’t just impressive from an absolute performance perspective. It’s also a price performance machine.

Examine these same two top-bin SKUs once again:
AMD Rome SPECfp 2017 vs Xeon CPUs - Price Performance

The top-bin AMD SKU does 1.79X the floating point work of the Xeon Platinum 8280 at approximately 2/3 the price. It delivers 2.13X the floating point performance of the Xeon Gold 6252 at roughly similar price/performance.

Should you be willing to accept more modest core counts with the lower cost SKUs, these comparisons just get better.

Finally, if you’re looking to roughly match or exceed the performance of the top-bin Xeon Gold 6252 SKU, we predict you’ll be able to do so with the 24-core EPYC 7352. This will be at just over 1/3 the price of the Xeon socket.

This much more typical comparison is emblematic of the price-performance advantage AMD has delivered in the new generation of CPUs. Stay tuned for more benchmark results and charts to support the prediction.

A Few Caveats: Performance Tuning & Out of the Box

Application Performance Engineers have spent years optimizing applications for the most widely available x86 server CPU. For a number of years now, that has meant Intel’s Xeon processors. The benchmarks presented here represent performance-tuned results.

We don’t yet have great data on how easy it is to achieve optimized performance with these new AMD “Rome” CPUs. For those of us who have been in HPC for some time, we know that out-of-the-box performance and optimized performance can often mean very different things.

AMD does recommend specific compilers (AOCC, GCC, LLVM) and libraries (BLIS over BLAS and FLAME over LAPACK) to achieve optimized results with all EPYC CPUs. We don’t yet have a complete understanding of how much these help end users achieve these superior results. Does it require a lot of tuning for the most exceptional performance?

AMD has, however, released a new Compiler Options Quick Reference Guide for the new CPUs. We strongly recommend using these flags and options when tuning your application.
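
As a concrete starting point, here is a minimal sketch of a Zen 2-targeted build based on our reading of AMD’s guidance; the source file names and the BLIS install path are placeholders:

# GCC 9 and later know the Zen 2 target directly
gcc -O3 -march=znver2 -mtune=znver2 -fopenmp mycode.c -o mycode

# AOCC (AMD's clang-based compiler) accepts the same target name
clang -O3 -march=znver2 -fopenmp mycode.c -o mycode

# Link AMD's BLIS library in place of a generic BLAS
gcc -O3 -march=znver2 myapp.c -o myapp -L/opt/AMD/blis/lib -lblis -lm -lpthread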

Chiplet and Multi-Die Architecture: IO and Compute Dies

AMD EPYC Rome Die

One of the chief innovations in the 2nd Generation AMD EPYC CPUs is in the evolution of the multi-die architecture pioneered in the first EPYC CPUs.

Rather than create one monolithic, hard-to-yield die, AMD has opted to lash “chiplets” together in a single socket with Infinity Fabric technology.

Compute Dies (now in 7nm)

Eight compute chiplets (formally, Core Complex Dies or CCDs) are brought together to create a single socket. These CCDs take advantage of the latest 7nm TSMC process node. By using 7nm for the compute cores in 2nd Generation EPYC, AMD takes advantage of the space and power efficiencies of the latest process, without the yield issues of a single monolithic die.

What does it mean for you? More cores than anticipated in a single socket, a reasonable power efficiency for the core count, and a less costly CPU.

The 14nm IO Die

In 2nd Generation EPYC CPUs, AMD has gone a step further with the chiplet architecture. These chiplets are now complemented by a separate I/O die. The I/O die contains the memory controllers, PCI-Express controllers, and the Infinity Fabric connection to the remote socket. This design also resolves the NUMA affinity quirks of the 1st generation EPYC processors.

Moreover, the I/O die is created on the established 14nm process node. It’s less important that it utilize the same 7nm power efficiencies.

DDR4-3200 and Improved Memory Bandwidth

AMD EPYC 7xx2 series improves its theoretical memory bandwidth when compared to both its predecessor and the competition.

DDR4-3200 DIMMs are supported, and they are clocked 20% faster than DDR4-2666 and 9% faster than DDR4-2933.
In summary, the platform offers:

  • Compared to Cascade Lake-SP (Xeon Platinum/Gold 82xx, 62xx): Up to a 45% improvement in memory bandwidth
  • Compared to Skylake-SP (Xeon Platinum/Gold 81xx, 61xx): Up to a 60% improvement in memory bandwidth
  • Compared to AMD EPYC 7xx1 Series (Naples): Up to a 20% improvement in memory bandwidth



These comparisons are created for a system where only the first DIMM per channel is populated. Part of this memory bandwidth advantage is derived from the increase in DIMM speeds (DDR4-3200 vs 2933/2666); part of it is derived from EPYC’s 8 memory channels (vs 6 on Xeon Skylake/Cascade Lake-SP).

While we’ve yet to see final STREAM testing numbers for the new CPUs, we do anticipate them largely reflecting the changes in theoretical memory bandwidth.
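
If you want to check memory bandwidth on your own nodes, the standard STREAM benchmark is a quick way to do it; a minimal sketch (the array size and thread count are examples to adjust for your system) looks like:

# Build STREAM with OpenMP and a working set much larger than the caches
gcc -O3 -march=znver2 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 stream.c -o stream

# Spread threads across both sockets (set the thread count to your core count)
export OMP_NUM_THREADS=128
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
./stream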

PCI-E Gen4 Support: 2X the I/O bandwidth

EPYC “Rome” CPUs have an integrated PCI-E generation 4.0 controller on the I/O die. Each PCI-E lane doubles in maximum theoretical bandwidth to 4GB/sec (bidirectional).

A 16 lane connection (PCI-E x16 4.0 slot) can now deliver up to 64GB/sec of bidirectional bandwidth (32GB/uni). That’s 2X the bandwidth compared to first generation EPYC and the x86 competition.

Broadening Support for High Bandwidth I/O Devices

Mellanox ConnectX-6 Adapter
The new support allows for higher bandwidth connections to InfiniBand and other fabric adapters, storage adapters, NVMe SSDs, and, in the future, GPU accelerators and FPGAs.

Some of these devices, like Mellanox ConnectX-6 200Gb HDR InfiniBand adapters, were unable to realize their maximum bandwidth in a PCI-E Gen3 x16 slot. Their performance should improve in PCI-E Gen4 x16 slot with 2nd Generation AMD EPYC Processors.

2nd Generation AMD EPYC “Rome” is the only x86 server CPU with PCI-E Gen4 support at its launch in 3Q 2019. However, we have seen PCI-E Gen4 support before in the POWER9 platform.

System Support for PCI-E Gen4

Unlike in the previous generation AMD EPYC “Naples” CPUs, there is no strong affinity of PCI-E lanes to a particular chiplet inside the processor. In Rome, all I/O traffic routes through the I/O die and all chiplets reach PCI-E devices through this die.

In order to support PCI-E Gen4, server and motherboard manufacturers are producing brand new versions of their platforms. Not every Rome-ready platform supports Gen4, so if this is a requirement, be sure to specify it to your hardware vendor. Our team can help you select a server with full Gen4 capability.
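
Once a Gen4-capable system and device are in hand, it is easy to confirm that the link actually trained at Gen4 speed; for example (the PCI address below is just an illustration):

# Find the adapter's PCI address
lspci | grep -i mellanox

# "Speed 16GT/s" in LnkSta indicates a PCI-E Gen4 link; 8GT/s would be Gen3
sudo lspci -vv -s 41:00.0 | grep -E 'LnkCap|LnkSta'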

Infinity Fabric

AMD Infinity Fabric Diagram
Deeply interrelated with PCI-Express Gen4, AMD has also improved the Infinity Fabric link between chiplets and sockets with the new generation of EPYC CPUs.

AMD’s Infinity Fabric has many commonalities with PCI-Express used to connect I/O devices. With 2nd Generation AMD EPYC “Rome” CPUs, the link speed of Infinity Fabric has doubled. This allows for higher bandwidth communication between dies on the same socket and to dies on remote sockets.

The result should be improved application performance for NUMA-aware and especially non-NUMA-aware applications. The increased bandwidth should help hide any transport bandwidth issues to I/O devices on a remote socket as well. The overall result is “smoother” performance when applications scale across multiple chiplets and sockets.

SKUs and Strategies to Consider for HPC Clusters

Here is the complete list of SKUs and 1KU (1000 unit) prices (Source: AMD). Please note that these are the prices for CPUs sold to channel integrators, not for fully integrated systems built with these CPUs.

Dual Socket SKUs

SKU  | Cores | Base Clock | Boost Clock | L3 Cache | TDP  | Price
7742 | 64    | 2.25 GHz   | 3.4 GHz     | 256MB    | 225W | $6950
7702 | 64    | 2.0 GHz    | 3.35 GHz    | 256MB    | 200W | $6450
7642 | 48    | 2.3 GHz    | 3.3 GHz     | 256MB    | 225W | $4775
7552 | 48    | 2.2 GHz    | 3.3 GHz     | 192MB    | 200W | $4025
7542 | 32    | 2.9 GHz    | 3.4 GHz     | 128MB    | 225W | $3400
7502 | 32    | 2.5 GHz    | 3.35 GHz    | 128MB    | 180W | $2600
7452 | 32    | 2.35 GHz   | 3.35 GHz    | 128MB    | 155W | $2025
7402 | 24    | 2.8 GHz    | 3.35 GHz    | 128MB    | 180W | $1783
7352 | 24    | 2.3 GHz    | 3.2 GHz     | 128MB    | 155W | $1350
7302 | 16    | 3.0 GHz    | 3.3 GHz     | 128MB    | 155W | $978
7282 | 16    | 2.8 GHz    | 3.2 GHz     | 64MB     | 120W | $650
7272 | 12    | 2.9 GHz    | 3.2 GHz     | 64MB     | 120W | $625
7262 | 8     | 3.2 GHz    | 3.4 GHz     | 128MB    | 155W | $575
7252 | 8     | 3.2 GHz    | 3.4 GHz     | 64MB     | 120W | $475

EPYC 7742 or 7702 (64c): Select a High-End SKU, yield up to 2X the performance

Assuming your application scales with core count and maximum performance at a premium cost fits your budget, you can’t beat the top 64-core EPYC 7742 or 7702 SKUs. These will deliver greater throughput on a wide variety of multi-threaded applications.

Anything above EPYC 7452 (32c, 48c): Select a Mid-High Level SKU, reach new performance heights

While these SKUs aren’t inexpensive, they take application performance to new heights and break new benchmark ground. You can realize that performance advantage if your application is multi-threaded. From a price/performance perspective, these SKUs may also be attractive.

EPYC 7452 (32c): Select a Mid Level SKU, improve price performance vs previous generation EPYC

Previous generation AMD EPYC 7xx1 Series CPUs also featured 32 cores. However, the 32 core entrant in the new 7xx2 stack is far less costly than the prior generation while delivering greater memory bandwidth and 2X the FLOPS per core.

EPYC 7452 (32c): Select a Mid Level SKU, match top Xeon Gold and Platinum with far better price/performance

If you’re optimizing for price/performance compared to the top Intel Xeon Platinum 8200 or Xeon Gold 6200 series SKUs, consider this SKU or ones near it. We predict this to be at or near the price/performance sweet-spot for the new platform.

EPYC 7402 (24c): Select a Mid Level SKU, come close to top Xeon Gold and Platinum SKUs

The higher clock speed of this SKU also means it is well suited to some applications.

EPYC 7272-7402 (12, 16, 24c): Select an affordable SKU, yield better performance and price/performance

Treat these SKUs as much more affordable alternatives to most Xeon Gold or Silver CPUs. We’ll await further benchmarks to see exactly where the further sweet-spots are compared to these SKUs. They also compare favorably from a price/performance standpoint to 1st Generation EPYC 7xx1 processors with 12, 16, or 24 cores. Same performance, fewer dollars!

Single Socket Performance

As with the previous generation, AMD is heavily promoting the concept of replacing dual socket Intel Xeon servers with single sockets of 2nd Generation AMD EPYC “Rome.” They are producing discounted “P” SKUs, with support for single socket platforms only, to help further boost the price-performance advantage of these systems.

Single Socket SKUs

SKU   | Cores | Base Clock | Boost Clock | L3 Cache | TDP  | Price
7702P | 64    | 2.0 GHz    | 3.35 GHz    | 256MB    | 200W | $4425
7502P | 32    | 2.5 GHz    | 3.35 GHz    | 128MB    | 180W | $2300
7402P | 24    | 2.8 GHz    | 3.35 GHz    | 128MB    | 180W | $1250
7302P | 16    | 3.0 GHz    | 3.3 GHz     | 128MB    | 155W | $825
7232P | 8     | 3.1 GHz    | 3.2 GHz     | 32MB     | 120W | $450

Due to the boosted capability of the new CPUs, a single socket configuration may be an increasingly viable alternative to a dual socket Xeon platform for many workloads.

Next Steps: get started today!

Read More

If you’d like to read more speeds and feeds about these new processors, check out our article with detailed specifications of the 2nd Gen AMD EPYC “Rome” CPUs. We summarize and compare the specifications of each model, and provide guidance above and beyond what you’ve seen here.

Try 2nd Gen AMD EPYC CPUs for Yourself

Groups that prefer to verify performance before committing to a design are encouraged to sign up for a Test Drive, which will provide you with access to bare-metal hardware with AMD EPYC CPUs, large memory, and more.

Browse Our Navion AMD EPYC Product Line

WhisperStation

Ultra-Quiet AMD EPYC workstations

Learn More

Servers

High performance AMD EPYC rackmount servers

Learn More

Clusters

Leadership performance clusters from 5-500 nodes

Learn More

Improvements in scaling of Bowtie2 alignment software and implications for RNA-Seq pipelines
https://www.microway.com/hpc-tech-tips/improvements-in-scaling-of-bowtie2-alignment-software-and-implications-for-rna-seq-pipelines/
Fri, 28 Jun 2019

This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries.

What is Bowtie2?

Bowtie2 is a commonly used, open-source, fast, and memory-efficient application used as part of a Next Generation Sequencing (NGS) workflow. It aligns sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq Sequencer, to a reference genome. Applications like Bowtie2 are used as the first step in pipelines such as those for variant determination and, an area of continuously growing research interest, RNA-Seq.
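
For readers unfamiliar with the tool, a typical paired-end alignment run looks roughly like the sketch below (the index and read file names are placeholders):

# Build an index of the reference genome (done once per reference)
bowtie2-build reference.fa reference_index

# Align paired-end reads with 16 threads, writing SAM output
bowtie2 -p 16 -x reference_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -S sample.sam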

What is RNA-Seq?

RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.

There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.

Plot of the number of RNA-Seq research publications accepted each year
Figure 1. RNA-Seq research publications published per year as of April 2019. Note the continuous growth. At the current rate, there will be 60% more publications in 2019 as compared to 2018. Source: NCBI PubMed

RNA-Seq is being applied to many research areas and diseases, and a few notable examples of using the technology include:

  • Oral Cancer: Researchers used an RNA-Seq approach to identify differences in gene expression between oral cancer and normal tissue samples.
  • Alzheimer’s Disease: Researchers compared the gene expression of different lobes of deceased Alzheimer’s Disease patients’ brains with the brains of healthy individuals. They were able to identify genomic differences between the diseased and unaffected individuals.
  • Diabetes: Researchers identified novel gene expression information from pancreatic beta-cells, which are cells critical for glycemic control.

Compute Infrastructure for aligning with Bowtie2

Designing a compute resource to meet the sequence analysis needs of Bioinformatics researchers can be a daunting task for IT staff. Limited information is available about multithreading and performance increases in the diverse portfolio of software related to NGS analysis. To further complicate things, processors are now available in a variety of models, with a large range of core counts and clock speeds, from both AMD and Intel. See, for example, the latest Intel Xeon “Cascade Lake” CPUs: Intel Xeon Scalable “Cascade Lake SP” Processor Review

Though many sequence analysis tools have multithreading options, the ability to scale is often limited, and rarely linear. In some cases, performance can decrease as more threads are added. Multithreading an application does not guarantee a performance improvement.

Threads | Run Time (seconds)
8       | 620
16      | 340
32      | 260
48      | 385
64      | 530

Table 1. Research data showing the previous version of Bowtie2 scaling with thread count. Performance would decrease above 32 threads.

Plot of Bowtie2 run time as the number of threads increases
Figure 2. Plot of thread scaling of the previous version of Bowtie2. Performance decreases after 32 threads due to a variety of factors. Non-linear scaling and performance decreases with core count have been shown in other scientific applications as well.

However, researchers recently greatly improved the thread scaling of Bowtie2. Original versions of this tool did not scale linearly, and demonstrated reduced performance when using more than 32 threads. Aware of these problems, the developers of Bowtie2 have implemented superior multithread scaling in their applications. Depending on processor type, their results show:

  • Removal of performance decreases over 32 threads
  • An increase in read throughput of up to 44%
  • Reduced memory usage with thread scaling
  • Up to a 4 hour reduction in the time to align a 40x coverage human genome

This new version of the software is open-source and available for download.
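
A practical way to see how a given Bowtie2 build scales on your own hardware and data is a simple timing sweep over thread counts, along the lines of this sketch (the index and read file names are placeholders, and GNU time is assumed):

# Time the same alignment at increasing thread counts
for t in 8 16 32 48 64; do
    /usr/bin/time -f "$t threads: %e seconds" \
        bowtie2 -p $t -x reference_index -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -S /dev/null
done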

Right Sizing your NGS Cluster

With the recent release of Intel’s Cascade Lake-AP Xeons providing up to 112 threads per socket, as well as high density AMD EPYC processors, it can be tempting to assume that more cores will result in more performance for NGS applications. However, this is not always the case, and some applications will show reduced performance with higher thread count.

When selecting compute systems for NGS analysis, researchers and IT staff need to evaluate which software products will be used, and how they scale with threads. Depending on the use cases, more nodes with fewer, faster threads could provide better performance than high thread density nodes. Unfortunately there is no “one size fits all” solution, and applications are in constant development, so research into the most recent versions of analysis software is always required.

References

[1] https://www.ncbi.nlm.nih.gov/pubmed/
[2] https://doi.org/10.1371/journal.pone.0016266
[3] https://doi.org/10.1101/205328
[4] https://link.springer.com/article/10.1007/s10586-017-1015-0


If you are interested in testing your NGS workloads on the latest Intel and AMD HPC systems, please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

CryoEM takes center stage: how compute, storage, and networking needs are growing with CryoEM research
https://www.microway.com/hpc-tech-tips/cryoem-takes-center-stage-how-compute-storage-networking-needs-growing/
Thu, 11 Apr 2019

This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries.

Background and history

Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques, X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR is that structures could be determined at very high resolution, severalfold better than historical CryoEM results.

However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray Crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of Cryo-EM over X-ray Crystallography and NMR are:

  • Much larger structures can be determined than by X-ray or NMR
  • Structures can be determined in a more native state than by using X-ray

The ability to generate these high resolution large molecular structures through CryoEM enables better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful that the inventors won the 2017 Nobel Prize in Chemistry.

CryoEM structure and publication growth

While the number of molecular structures determined by CryoEM is much lower than those determined by X-ray crystallography and NMR, the rate at which these structures are released has greatly increased in the past decade. In 2016, the number of CryoEM structures deposited in the Protein Data Bank (PDB) exceeded those of NMR for the first time.

The importance of CryoEM in research is growing, as shown by the steady increase in publications over the past decade (Table 1, Figure 1). Interestingly, though there are ~10 times as many X-ray crystallography publications per year, the number of related publications per year has decreased consistently since 2013.

Experimental Structure Type | Approximate total number of structures as of March 2019
X-ray Crystallography       | 134,000
NMR                         | 12,500
Cryo-EM                     | 3,000

Table 1. Approximate total number of structures available in the publicly accessible Protein Data Bank (PDB), by experimental technique. Source: RCSB

Chart showing the growth of CryoEM structures available in the Protein Data Bank
Figure 1. Total number of structures available in the Protein Data Bank (PDB) by year.
Note the rapid and consistent growth. Source: RCSB
Plot of the number of CryoEM publications accepted each year
Figure 2. Number of CryoEM publications accepted each year.
Note the rapid increase in publications. Source: NCBI PubMed
Plot showing the declining number of X-ray crystallography publications per year
Figure 3. Number of X-ray crystallography publications per year. Note the steady decline in publications. While publications related to X-ray crystallography may be decreasing, opportunities exist for integrating both CryoEM and X-ray crystallography data to further our understanding of molecular structure. Source: NCBI PubMed

CryoEM is part of our research – do I need to add GPUs to my infrastructure?

A major challenge facing researchers and IT staff is how to appropriately build out infrastructure for CryoEM demands. There are several software products that are used for CryoEM analysis, with RELION being one of the most widely used open source packages. While GPUs can greatly accelerate RELION workflows, support for them has only existed since Version 2 (released in 2016). Worldwide, the vast majority of individual servers and centralized resources available to researchers are not GPU accelerated. Those systems that do have professional grade GPUs are often oversubscribed and can have considerable queue wait times. The relatively high cost of server-grade GPU systems can put those devices out of the reach of many individual research labs.

While advanced GPU hardware like the DGX-1 continues to give the best analysis times, not every GPU system provides the same throughput. Large datasets can create issues with consumer grade GPUs, in that the dataset must fit within the GPU memory to fully take advantage of the acceleration. Though RELION can parallelize the datasets, GPU memory is still limited when compared to the large amounts of system memory available to the CPUs in a single device (DGX-1 provides 256GB GPU memory; DGX-2 provides 512GB). This problem is amplified if the researcher has access to only a single consumer-grade graphics card (e.g., an NVIDIA GeForce GTX 1080 Ti GPU with 11GB memory).

With the Version 3 release of the software (late 2018), the RELION authors have implemented CPU acceleration to broaden the usable hardware for efficient CryoEM reconstruction. The authors have shown a 1.5x improvement on Broadwell processors and a 2.5x improvement on Skylake over the previous code. However, taking advantage of AVX instructions during compilation can further improve performance, with the authors demonstrating a 5.4x improvement on Skylake processors. This improvement approaches the performance increases of professional grade GPUs without the additional cost.
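
For reference, a CPU-accelerated RELION 3 build might be configured roughly as follows. This is a sketch based on our reading of the RELION 3 build options; the CMake flags and the AVX target should be checked against your RELION version and hardware:

git clone https://github.com/3dem/relion.git
cd relion && mkdir build && cd build

# Build the CPU-accelerated (ALTCPU) kernels rather than the CUDA path,
# and let the compiler target the host's AVX-512 units
cmake -DALTCPU=ON -DCUDA=OFF \
      -DCMAKE_C_FLAGS="-O3 -march=skylake-avx512" \
      -DCMAKE_CXX_FLAGS="-O3 -march=skylake-avx512" ..
make -j 32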

Additional infrastructure considerations

CryoEM datasets are being generated at a higher rate and with larger data sizes than ever before. Currently, the largest raw dataset in the Electron Microscopy Public Image Archive (EMPIAR) is 12.4TB, with a median dataset size of approximately 2TB. Researchers and IT staff can expect datasets of this order of magnitude to become the norm as CryoEM continues to grow as an experimental resource in the life sciences space.

Many CryoEM labs function as microscopy cores, where they provide the service of generating the 2D datasets for different researchers, which are then analyzed by individual labs. Given the high cost of professional GPUs as compared to the ubiquitous availability of multicore CPU systems, researchers may consider modern multicore servers or centralized clusters to meet their CryoEM analysis needs. This is with the caveat that they use Version 3 of the RELION software with appropriate compilation flags.

Dataset transfer is also a concern, and organizations that have a centralized Cryo-EM core would greatly benefit from upgraded networking (10Gbps+) from the core location to centralized compute resources, or to individual labs.

Visualization of the structure of beta-galactosidase from the EMPIAR database
Figure 4. A 2.2 angstrom resolution CryoEM structure of beta-galactosidase. This is currently the largest dataset in the EMPIAR database, totaling 12.4 TB. Source: EMPIAR

CryoEM takes center stage

The increase in capabilities, interest, and research related to CryoEM shows it is now a mainstream experimental technique. IT staff and scientists alike are rapidly becoming aware of this fact as they face the data analysis, transfer, and storage challenges associated with this technique. Careful consideration must be given to the infrastructure of an organization that is engaging in CryoEM research.

In an organization that is performing exclusively CryoEM experiments, a GPU cluster would be the most cost-effective solution for rapid analysis. Researchers with access to advanced professional-grade GPU systems, such as a DGX-1, will see analysis times that are even faster than modern CPU-optimized RELION. While these professional GPUs can greatly accelerate CryoEM analysis, it is unlikely in the short term that all researchers wanting to use CryoEM data will have access to such high-spec GPU hardware, as compared to mixed-use commodity clusters, which are ubiquitous at all life science organizations. A large multicore CPU machine, when properly configured, can give better performance than a low-core-count workstation or server with a single consumer-grade GPU (e.g., an NVIDIA GeForce GPU).

IT departments and researchers must work together to define expected turnaround time, analysis workflow requirements, budget, and configuration of existing hardware. In doing so, researcher needs will be met and IT can implement the most effective architecture for CryoEM.

References

[1] https://doi.org/10.7554/eLife.42166.001
[2] https://febs.onlinelibrary.wiley.com/doi/10.1111/febs.12796
[3] https://www.ncbi.nlm.nih.gov/pubmed/
[4] https://www.ebi.ac.uk/pdbe/emdb/empiar/
[5] https://www.rcsb.org/


If you are interested in trying out RELION performance on some of the latest CPU and GPU-accelerated systems (including NVIDIA DGX-1), please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

Designing A Production-Class AI Cluster
https://www.microway.com/hpc-tech-tips/designing-a-production-class-ai-cluster/
Fri, 27 Oct 2017

Artificial Intelligence (AI) and, more specifically, Deep Learning (DL) are revolutionizing the way businesses utilize the vast amounts of data they collect and how researchers accelerate their time to discovery. Some of the most significant examples come from the way AI has already impacted life as we know it such as smartphone speech recognition, search engine image classification, and cancer detection in biomedical imaging. Most businesses have collected troves of data or incorporated new avenues to collect data in recent years. Through the innovations of deep learning, that same data can be used to gain insight, make accurate predictions, and pave the path to discovery.

Developing a plan to integrate AI workloads into an existing business infrastructure or research group presents many challenges. However, there are two key elements that will drive the decisions when customizing an AI cluster. First, understanding the types and volumes of data is paramount to understanding the computational requirements of training the neural network. Secondly, understanding the business expectation for time to result is equally important. Each of these factors influences the first and second stages of the AI workload, respectively. Underestimating the data characteristics will result in insufficient computational and infrastructure resources to train the networks in a reasonable timeframe. Moreover, underestimating the value and requirement of time-to-results can fail to deliver ROI to the business or hamper research results.

Below are summaries of the different features of system design that must be evaluated when configuring an AI cluster in 2017.

System Architectures

AI workloads are very similar to HPC workloads in that they require massive computational resources combined with fast and efficient access to giant datasets. Today, there are systems designed to serve the workload of an AI cluster. These systems, outlined in the sections below, generally share similar characteristics: high-performance CPU cores, large-capacity system memory, multiple NVLink-connected GPUs per node, 10G Ethernet, and EDR InfiniBand. However, there are nuanced differences with each platform. Read below for more information about each.

Microway GPU-Accelerated NumberSmashers

Microway demonstrates the value of experience with every GPU cluster deployment. The company’s long history of designing and deploying state of the art GPU clusters for HPC makes our expertise invaluable when custom configuring full-scale, production-ready AI clusters. One of the most common GPU nodes used in our AI offerings is the NumberSmasher 1U with NVLink. The system features dense compute performance in a small footprint, making it a building block for scale-out cluster design. Alternatively, the Octoputer with Single Root Complex offers the most GPUs per system to maximize the total throughput of a single system.

To ensure maximum performance and field reliability, our system integrators test and tune every node built. Clusters, once integrated, undergo total system testing to ensure peak system operability. We offer AI integration services for installation and testing of AI frameworks in addition to the full suite of cluster management utilities and software. Additionally, all Microway systems come complete with Lifetime Technical Support.

To learn more about Microway’s GPU clusters and systems, please visit Tesla GPU clusters.

NVIDIA DGX Systems

NVIDIA’s DGX-1 and DGX Station systems deliver not only dense computational power per system, they also include access to the NVIDIA GPU Cloud and Container Registry. These NVIDIA resources provide optimized container environments for the host of libraries and frameworks typically running on an AI cluster. This allows researchers and data scientists to focus on delivering results instead of worrying about software maintenance and tuning. As an Elite Solutions Provider of NVIDIA products, Microway offers DGX systems as either a full system solution or as part of a custom cluster design.

IBM Power Systems with PowerAI

IBM’s commitment to innovative chip and system design for HPC and AI workloads has created a platform for next-generation computing. Through collaboration with NVIDIA, the IBM Power Systems are the only available GPU platforms that integrate NVLink connectivity between the CPU and GPU. IBM’s latest AC922 Power System release delivers 10x the throughput over traditional x86 systems. Additionally, Microway integrates IBM PowerAI to provide faster time to deployment with their optimized software distribution.

Professional vs. Consumer GPUs

NVIDIA GPUs are the primary element to designing a world class AI deployment. In fact, NVIDIA’s commitment to delivering AI to everyone has led them to produce a multi-tiered array of GPU accelerators. Microway’s engineers often face questions about the difference between NVIDIA’s consumer GeForce and professional Tesla GPU accelerators. Although at first glance the higher-end GeForce GPUs seem to mimic the computational capabilities of the professional Tesla products, this is not always the case. Upon further inspection, the differences become quite evident.

When determining which GPU to use, raw performance numbers are typically the first technical specifications to review. In specific regard to AI workloads, a Tesla GPU has up to 1000X the performance of a high end GeForce card running half precision floating point calculations (FP16). The GeForce cards also do not support INT8 instructions used in Deep Learning inferencing. Although it is possible to use consumer GPUs for AI work, it is not recommended for large-scale production deployments. Aside from raw throughput, there are many other features that we outline in our article at the link below.

The price of the consumer cards allows businesses and researchers to understand the potential impact of AI and develop code on single systems without investing in a larger infrastructure. Microway recommends that the use of consumer cards be limited to development workstations during the investigatory and development process.

Our knowledge center provides a detailed article on the differences between Tesla and GeForce.

Training and Inferencing

There is a stark contrast between the resources needed for efficient training versus efficient inferencing. Training neural networks requires significant GPU resources for computation, host system resources for data passing, reliable and fast access to entire datasets, and a network architecture to support it all. The resource requirement for inferencing, however, depends on how the new data will be inferenced in production. Real-time inferencing has a far lower computational requirement because the data is fed to the neural network as it occurs in real time. This is very different from bulk inference where entire new data sets are fed into the neural network at the same time. Also, going back to the beginning, understanding the expectation for time-to-result will likely impact the overall cluster design regardless of inference workload.

Storage Architecture

The type of storage architecture used with an AI cluster can and will have a significant impact on efficiency of the cluster. Although storage can seem a rather nebulous topic, the demands of an AI workload are a mostly known factor. During training, the nodes of the cluster will need access to entire data sets because the data will be accessed often and in succession throughout the training process. Many commercial AI appliances, such as the DGX-1, leverage large high-speed cache volumes in each node for efficiency.

Standard and High-Performance Network File Systems are sufficient for small to medium sized AI cluster deployments. If the nodes have been configured properly to each have sufficient cache space, the file system itself does not need to be exceptionally performant as it is simply there for long-term storage. However, if the nodes do not have enough local cache space for the dataset, the need for performant storage increases. There are component features that can increase the performance of an NFS without moving to a parallel file system, but this is not a common scenario for this workload. The goal should always be to have enough local cache space for optimal performance.

Parallel File Systems are known for their performance and sometimes price. These storage systems should be reserved for larger cluster deployments where it will provide the best benefit per dollar spent.

Network Infrastructure

Deploying the right kind of network infrastructure will reduce bottlenecks and improve the performance of the AI cluster. The guidelines for networking will change depending on the size/type of data passing through the network as well as the nature of the computation. For instance, small text files will not need as much bandwidth as 4K video files, but Deep Learning training requires access to the entire data pool which can saturate the network. Going back to the beginning of this article, understanding data sets will help identify and prevent system bottlenecks. Our experts can help walk you through that analysis.

All GPU cluster deployments, regardless of workload, should utilize a tiered networking system that includes a management network and data traffic network. Management networks are typically a single Gigabit or 10Gb Ethernet link to support system management and IPMI. Data traffic networks, however, can require more network bandwidth to accommodate the increased amount of traffic as well as lower latency for increased efficiency.

Common data networks use either Ethernet (10G/25G/40G/50G) or InfiniBand (200Gb or 100Gb). There are many cases where 10G~50G Ethernet will be sufficient for the file sizes and volume of data passing through the network at the same time. These types of networks are often used in workloads with smaller file sizes such as still images or where computation happens within a single node. They can also be a cost-effective network for a cluster with a small number of nodes.

However, for larger files and/or multi-node GPU computation such as DL training, 100Gb EDR InfiniBand is the network fabric of choice for increased bandwidth and lower latency. InfiniBand enables Peer-to-Peer GPU communication between nodes via Remote Direct Memory Access (RDMA) which can increase the efficiency of the overall system.
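
When validating such a fabric, the perftest utilities that ship with most InfiniBand software stacks give a quick sanity check of node-to-node bandwidth and latency; a typical sequence (the device and host names are examples) is:

# On the server node: start the bandwidth test listener
ib_write_bw -d mlx5_0 --report_gbits

# On the client node: point at the server and report results in Gbit/s
ib_write_bw -d mlx5_0 --report_gbits node01

# Latency check between the same pair of nodes
ib_write_lat node01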

To compare network speeds and latencies, please visit Performance Characteristics of Common Network Fabrics

Managing a Linux Software RAID with MDADM
https://www.microway.com/hpc-tech-tips/managing-a-linux-software-raid-with-mdadm/
Tue, 30 Aug 2011

There are several advantages to assembling hard drives into a RAID: performance, redundancy and capacity. Microway workstations and servers are most commonly outfitted with software RAID to prevent a single drive failure from destroying your operating system installation. In most cases, the RAID is built from two hard drives, but you may also find software RAID on systems with up to six drives. If you have a larger storage server, a hardware RAID manages the hard drives.

Linux provides a robust software RAID implementation which costs nothing and offers great performance for lower array levels (e.g. 0, 1, 10). It is flexible and powerful, but array monitoring and management can be opaque if you’ve not previously worked with a Linux software RAID.

Software RAID Introduction

Linux software RAID depends on two components:

  1. The Linux kernel, which operates the arrays
  2. The mdadm utility, which creates and manages the arrays

As a user, you need not worry much about #1. The only fact you need to know is that the kernel keeps a live printout of array status in the dynamic text file /proc/mdstat. You may check the status of all arrays by checking the contents of that file – either with your favorite text editor or a simple cat /proc/mdstat.

To properly maintain your arrays, you’ll need to learn some basics of the mdadm RAID management utility. Most commands should be fairly straightforward, but check the mdadm man page for full details. Microway customers are welcome to contact technical support for assistance at any point.

Traditional hardware RAIDs reserve the full capacity of each hard drive in the array. However, Linux software RAIDs may be built using either an entire drive or individual partitions of a hard drive. This allows us more flexibility, such as creating a redundant RAID1 mirror for the /home partition while using a faster RAID0 stripe for /tmp. You will typically see up to 10 partitions on each drive, such as sda1/sdb1, sda2/sdb2, ..., sda9/sdb9, sda10/sdb10. These are used to build the corresponding RAID devices md1, md2, ..., md9, md10. By default, Microway installations use partitions 1, 5, 6, 7, 8, 9, 10.

The following examples assume a software RAID1 mirror of two hard drives, which is the most common configuration. Only minor changes should be needed to perform maintenance on other arrays, but take care. Dangerous commands (which could cause data loss) are marked in red.

Checking Array Health

To be certain you are alerted to drive failures, set up automated alerts for hard drive and array failures. As mentioned above, manually checking the status of a software array is easy. Simply check the contents of /proc/mdstat:

eliot@penguin:~$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
 194496 blocks [2/2] [UU]
unused devices: <none>

Breaking down the contents by line, we see:

  1. Personalities: reports which RAID levels are running (e.g. raid0, raid1, raid10, etc). It’s safe to ignore this line.
  2. md1: the status of the first array, the type of the array, and each member of the array.
  3. md1: the size of the first array, # of members active/# of members total, and the status of each member. If a drive has failed, it will be shown on this line. You will see one of the following listed: [1/2] [_U] [U_]. These indicate that one of the two members is no longer operating.
  4. List of unused devices (drives or partitions). There’s usually nothing interesting here.

If your system has experienced a drive failure, Linux kernel error messages will be logged. You will see them in the /var/log/messages file or by running dmesg. Be certain you know which drive has failed before taking further steps.

Replacing a Failed Hard Drive

Because RAID offers redundancy, it is not necessary to take the storage offline to repair the RAID. The commands below may be issued while the system is operating normally. However, a heavily-loaded system will take much more time to complete the repair.

Before installing the new drive, you will need to remove the failed drive. You can see which drive failed by looking at the contents of /proc/mdstat or consulting Linux kernel message logs. If you are working on a live system, be absolutely certain you remove the correct failed drive. Microway customers should contact tech support with any questions.

I’m assuming that /dev/sda and /dev/sdb are the two drives currently running. If this system does not use Microway’s default partitioning (partitions 1, 5, 6, 7, 8, 9, 10) you will need to adjust the commands.
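
If the kernel has not already kicked the failed drive out of the arrays, it is good practice to explicitly mark its partitions as failed and remove them before pulling the hardware. Be absolutely certain you are targeting the failed drive (sdb in this example):

mdadm /dev/md1 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md5 --fail /dev/sdb5 --remove /dev/sdb5

(repeat for the remaining partitions: md6, md7, md8, md10)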

The software RAID operates at a level below the filesystem, so you do not need to re-create any filesystems. However, you do have to get the partitioning right. Once the replacement drive is installed, the partitioning can be copied from the working drive. This example assumes that sda is the operating drive and sdb is a replacement for the drive that failed:

sfdisk -d /dev/sda | sfdisk /dev/sdb

Once the partitions of the two drives match, you can add the new drive into the mirror. This has to be done partition by partition. Microway’s defaults are below (assuming sdb was the failed drive):

mdadm /dev/md1 --add /dev/sdb1
mdadm /dev/md5 --add /dev/sdb5
mdadm /dev/md6 --add /dev/sdb6
mdadm /dev/md7 --add /dev/sdb7
mdadm /dev/md8 --add /dev/sdb8
mdadm /dev/md10 --add /dev/sdb10

(you can check the status of the sync in /proc/mdstat)

One partition is used for Linux swap:

mkswap /dev/sdb9
swapon /dev/sdb9

To make the replacement drive bootable, the GRUB bootloader installer will need to be run:

root@penguin:~# grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
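
The example above uses legacy GRUB. On newer distributions that ship GRUB 2, the equivalent step is typically a single command (substitute the device name of your replacement drive):

grub-install /dev/sdb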

Continue to check /proc/mdstat as the arrays sync. The time required to sync is usually several hours per terabyte, although a heavily-loaded system will take longer.

To make your job easier, set up automated alerts for hard drive and array failures.
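
A minimal way to do that with mdadm itself is to set a notification address and run the monitor daemon; many distributions provide a service that does this for you, and the mail address and config path below are examples:

# /etc/mdadm.conf (or /etc/mdadm/mdadm.conf on Debian/Ubuntu)
MAILADDR admin@example.com

# Run the monitor daemon, which emails on failure and degraded-array events
mdadm --monitor --scan --daemonise --delay=300

# Send a one-time test alert to confirm mail delivery works
mdadm --monitor --scan --oneshot --test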
