Software Archives - Microway

What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science

NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive PubMed literature search project on a Data Science WhisperStation in hours versus weeks.

Kyle Gallatin, an engineer at Pfizer, has deep data science credentials. He’s been working on projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:

When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

Experimentation, Performance, and GPU Accelerated Data Science Tooling

Gallatin used a Data Science WhisperStation to rapidly create an accelerated data science workflow for a healthcare use case—and told us about his experience. It was a remarkable one.

Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!

The Data Science Workstation allowed him to design a Pubmed healthcare article search engine where he:

  1. Ingested a larger database than ever imagined (30,000,000 research article abstracts!)
  2. Didn’t require massive code changes to GPU accelerate the algorithm
  3. Used familiar looking tools for his workflow
  4. Had unsurpassed agility—he could search large portions of the abstract database in 0.1 seconds!

This last point is really critical and shows why we believe the NVIDIA Data Science Workstation Platform and its RAPIDS tools are so special. As Kyle put it:

Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally-intensive process, only to be blindsided by an unforeseen error literal hours into running it. Then you have to correct it and start all over again.

Walkthrough with Step-by-Step Instructions

The new article is available on Medium. It provides a complete step-by-step walkthrough of how NVIDIA RAPIDS tools and the NVIDIA Quadro RTX 6000 with NVLink were utilized to revolutionize this process.

A short set of Kyle’s key findings about the environment and the hardware are below. We’re excited about how this kind of rapid development could change healthcare:

Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run

Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDS, TensorFlow, PyTorch and everything else installed and configured out-of-the-box saved hours in setup time

…

With these general-purpose data science libraries offering massive computational enhancements for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc…), the path is paved to an entirely new frontier of data science.
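
To make that concrete, here is a minimal sketch of what such a GPU-accelerated workflow can look like with RAPIDS cuDF and cuML. The file name and column names are hypothetical placeholders; the point is that the familiar pandas/scikit-learn-style calls run on the GPU.

# Minimal sketch: GPU-accelerated data loading, cleansing, and a linear model
# with RAPIDS cuDF/cuML. File name and column names are hypothetical.
import cudf
from cuml import LinearRegression

# Data loading and cleansing on the GPU
df = cudf.read_csv("abstract_features.csv")   # hypothetical feature table
df = df.dropna()                              # drop incomplete rows

X = df[["feature_1", "feature_2", "feature_3"]]  # hypothetical feature columns
y = df["target"]

# Fit a linear model entirely on the GPU
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:5])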

Read on at Medium.com

Improvements in scaling of Bowtie2 alignment software and implications for RNA-Seq pipelines

This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

What is Bowtie2?

Bowtie2 is a commonly used, open-source, fast, and memory efficient application used as part of a Next Generation Sequencing (NGS) workflow. It aligns the sequencing reads, which are the genomic data output from an NGS device such as an Illumina HiSeq Sequencer, to a reference genome. Applications like Bowtie2 are used as the first step in pipelines such as those for variant determination, and an area of continuously growing research interest, RNA-Seq.

What is RNA-Seq?

RNA Sequencing (RNA-Seq) is a type of NGS that seeks to identify the presence and quantity of RNA in a sample at a given point in time. This can be used to quantify changes in gene expression, which can be a result of time, external stimuli, healthy or diseased states, and other factors. Through this quantification, researchers can obtain a unique snapshot of the genomic status of the organism to identify genomic information previously undetectable with other technologies.

There is considerable research effort being put into RNA-Seq, and the number of publications has grown steadily since its first use in 2009.

Figure 1. RNA-Seq research publications published per year as of April 2019. Note the continuous growth. At the current rate, there will be 60% more publications in 2019 as compared to 2018. Source: NCBI PubMed

RNA-Seq is being applied to many research areas and diseases, and a few notable examples of using the technology include:

  • Oral Cancer: Researchers used an RNA-Seq approach to identify differences in gene expression between oral cancer and normal tissue samples.
  • Alzheimer’s Disease: Researchers compared the gene expression of different lobes of the brains of deceased Alzheimer’s Disease patients with the brains of healthy individuals. They were able to identify genomic differences between the diseased and unaffected individuals.
  • Diabetes: Researchers identified novel gene expression information from pancreatic beta-cells, which are cells critical for glycemic control.

Compute Infrastructure for aligning with Bowtie2

Designing a compute resource to meet the sequence analysis needs of Bioinformatics researchers can be a daunting task for IT staff. Limited information is available about multithreading and performance increases in the diverse portfolio of software related to NGS analysis. To further complicate things, processors are now available in a variety of models, with a large range of core counts and clock speeds, from both AMD and Intel. See, for example, the latest Intel Xeon “Cascade Lake” CPUs: Intel Xeon Scalable “Cascade Lake SP” Processor Review

Though many sequence analysis tools have multithreading options, the ability to scale is often limited, and rarely linear. In some cases, performance can decrease as more threads are added. Multithreading an application does not guarantee a performance improvement.

Threads | Run Time (seconds)
8 | 620
16 | 340
32 | 260
48 | 385
64 | 530

Table 1. Research data showing a previous version of Bowtie2 scaling with thread count. Performance would decrease above 32 threads.
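
For reference, here is a rough sketch of how such a thread-scaling measurement could be scripted. The index prefix and read file names are placeholders; it assumes bowtie2 and a prebuilt index are available on the system, and uses standard bowtie2 options (-p for thread count, -x/-1/-2/-S for the index, paired reads, and SAM output).

# Sketch: measure Bowtie2 wall-clock run time at several thread counts.
# Index prefix and read file names are placeholders; adjust for your data.
import subprocess
import time

index_prefix = "GRCh38_index"     # built beforehand with bowtie2-build
reads_1, reads_2 = "sample_R1.fastq", "sample_R2.fastq"

for threads in (8, 16, 32, 48, 64):
    start = time.time()
    subprocess.run(
        ["bowtie2", "-p", str(threads),
         "-x", index_prefix,
         "-1", reads_1, "-2", reads_2,
         "-S", f"aligned_{threads}.sam"],
        check=True,
    )
    elapsed = time.time() - start
    print(f"{threads} threads: {elapsed:.0f} seconds")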

Figure 2. Thread scaling of a previous version of Bowtie2. Performance decreases after 32 threads due to a variety of factors. Non-linear scaling and performance decreases with core count have been shown in other scientific applications as well.

However, researchers recently greatly improved the thread scaling of Bowtie2. Original versions of this tool did not scale linearly, and demonstrated reduced performance when using more than 32 threads. Aware of these problems, the developers of Bowtie2 have implemented superior multithread scaling in their applications. Depending on processor type, their results show:

  • Removal of performance decreases over 32 threads
  • An increase in read throughput of up to 44%
  • Reduced memory usage with thread scaling
  • Up to a 4-hour reduction in time to align a 40x coverage human genome

This new version of the software is open-source and available for download.

Right Sizing your NGS Cluster

With the recent release of Intel’s Cascade Lake-AP Xeons providing up to 112 threads per socket, as well as high density AMD EPYC processors, it can be tempting to assume that more cores will result in more performance for NGS applications.However, this is not always the case, and some applications will show reduced performance with higher thread count.

When selecting compute systems for NGS analysis, researchers and IT staff need to evaluate which software products will be used, and how they scale with threads. Depending on the use cases, more nodes with fewer, faster threads could provide better performance than high thread density nodes. Unfortunately there is no “one size fits all” solution, and applications are in constant development, so research into the most recent versions of analysis software is always required.

References

[1] https://www.ncbi.nlm.nih.gov/pubmed/
[2] https://doi.org/10.1371/journal.pone.0016266
[3] https://doi.org/10.1101/205328
[4] https://link.springer.com/article/10.1007/s10586-017-1015-0


If you are interested in testing your NGS workloads on the latest Intel and AMD HPC systems, please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

CryoEM takes center stage: how compute, storage, and networking needs are growing with CryoEM research

This is a guest post by Adam Marko, an IT and Biotech Professional with 10+ years of experience across diverse industries

Background and history

Cryogenic Electron Microscopy (CryoEM) is a type of electron microscopy that images molecular samples embedded in a thin layer of non-crystalline ice, also called vitreous ice. Though CryoEM experiments have been performed since the 1980s, the majority of molecular structures have been determined with two other techniques: X-ray crystallography and Nuclear Magnetic Resonance (NMR). The primary advantage of X-ray crystallography and NMR has been that structures could be determined at very high resolution, several fold better than historical CryoEM results.

However, recent advancements in CryoEM microscope detector technology and analysis software have greatly improved the capability of this technique. Before 2012, CryoEM structures could not achieve the resolution of X-ray crystallography and NMR structures. The imaging and analysis improvements since that time now allow researchers to image structures of large molecules and complexes at high resolution. The primary advantages of CryoEM over X-ray crystallography and NMR are:

  • Much larger structures can be determined than by X-ray or NMR
  • Structures can be determined in a more native state than by using X-ray

The ability to generate these high resolution large molecular structures through CryoEM enables better understanding of life science processes and improved opportunities for drug design. CryoEM has been considered so impactful that the inventors won the 2017 Nobel Prize in Chemistry.

CryoEM structure and publication growth

While the number of molecular structures determined by CryoEM is much lower than the number determined by X-ray crystallography and NMR, the rate at which these structures are released has greatly increased in the past decade. In 2016, the number of CryoEM structures deposited in the Protein Data Bank (PDB) exceeded those of NMR for the first time.

The importance of CryoEM in research is growing, as shown by the steady increase in publications over the past decade (Table 1, Figure 1). Interestingly, though there are roughly 10 times as many X-ray crystallography publications per year, that number has decreased consistently since 2013.

Experimental Structure Type | Approximate total number of structures as of March 2019
X-ray Crystallography | 134,000
NMR | 12,500
CryoEM | 3,000

Table 1. Approximate total number of structures available in the publicly accessible Protein Data Bank (PDB), by experimental technique. Source: RCSB

Figure 1. Total number of CryoEM structures available in the Protein Data Bank (PDB) by year. Note the rapid and consistent growth. Source: RCSB

Figure 2. Number of CryoEM publications accepted each year. Note the rapid increase in publications. Source: NCBI PubMed

Figure 3. Number of X-ray crystallography publications per year. Note the steady decline in publications. While publications related to X-ray crystallography may be decreasing, opportunities exist for integrating both CryoEM and X-ray crystallography data to further our understanding of molecular structure. Source: NCBI PubMed

CryoEM is part of our research – do I need to add GPUs to my infrastructure?

A major challenge facing researchers and IT staff is how to appropriately build out infrastructure for CryoEM demands. There are several software products that are used for CryoEM analysis, with RELION being one of the most widely used open-source packages. While GPUs can greatly accelerate RELION workflows, support for them has only existed since Version 2 (released in 2016). Worldwide, the vast majority of individual servers and centralized resources available to researchers are not GPU accelerated. Those systems that do have professional-grade GPUs are often oversubscribed and can have considerable queue wait times. The relatively high cost of server-grade GPU systems can put those devices out of the reach of many individual research labs.

While advanced GPU hardware like the DGX-1 continues to give the best analysis times, not every GPU system provides the same throughput. Large datasets can create issues with consumer-grade GPUs, in that the dataset must fit within GPU memory to take full advantage of the acceleration. Though RELION can parallelize the datasets, GPU memory is still limited when compared to the large amounts of system memory available to the CPUs installed in a single device (a DGX-1 provides 256GB of GPU memory; a DGX-2 provides 512GB). This problem is amplified if the researcher has access to only a single consumer-grade graphics card (e.g., an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of memory).

With the Version 3 release of the software (late 2018), the RELION authors have implemented CPU acceleration to broaden the hardware usable for efficient CryoEM reconstruction. The authors have shown a 1.5x improvement on Broadwell processors and a 2.5x improvement on Skylake processors over the previous code. However, taking advantage of AVX instructions during compilation can further improve performance, with the authors demonstrating a 5.4x improvement on Skylake processors. This improvement approaches the performance increases of professional-grade GPUs without the additional cost.

Additional infrastructure considerations

CryoEM datasets are being generated at a higher rate and with larger data sizes than ever before. Currently, the largest raw dataset in the Electron Microscopy Public Image Archive (EMPIAR) is 12.4TB, with a median dataset size of approximately 2TB. Researchers and IT staff can expect datasets of this order of magnitude to become the norm as CryoEM continues to grow as an experimental resource in the life sciences space.

Many CryoEM labs function as microscopy cores, where they provide the service of generating the 2D datasets for different researchers, which are then analyzed by individual labs. Given the high cost of professional GPUs as compared to the ubiquitous availability of multicore CPU systems, researchers may consider modern multicore servers or centralized clusters to meet their CryoEM analysis needs, with the caveat that they use Version 3 of the RELION software with appropriate compilation flags.

Dataset transfer is also a concern, and organizations that have a centralized CryoEM core would greatly benefit from upgraded networking (10Gbps+) from the core location to centralized compute resources, or to individual labs.
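
A quick back-of-the-envelope estimate illustrates why. The sketch below uses the EMPIAR dataset sizes quoted above and assumes ideal line-rate transfers; real-world throughput will be lower due to protocol and storage overhead.

# Back-of-the-envelope transfer times for CryoEM datasets at different link speeds.
# Assumes ideal line rate; real transfers will be slower.
dataset_sizes_tb = {"median EMPIAR dataset": 2.0, "largest EMPIAR dataset": 12.4}
link_speeds_gbps = {"1 Gbps": 1, "10 Gbps": 10, "40 Gbps": 40}

for name, size_tb in dataset_sizes_tb.items():
    size_bits = size_tb * 1e12 * 8            # terabytes -> bits
    for link, gbps in link_speeds_gbps.items():
        hours = size_bits / (gbps * 1e9) / 3600
        print(f"{name} ({size_tb} TB) over {link}: ~{hours:.1f} hours")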

Figure 4. A 2.2 angstrom resolution CryoEM structure of beta-galactosidase. This is currently the largest dataset in the EMPIAR database, totaling 12.4 TB. Source: EMPIAR

CryoEM takes center stage

The increase in capabilities, interest, and research related to CryoEM shows it is now a mainstream experimental technique. IT staff and scientists alike are rapidly becoming aware of this fact as they face the data analysis, transfer, and storage challenges associated with this technique. Careful consideration must be given to the infrastructure of an organization that is engaging in CryoEM research.

In an organization that is performing exclusively CryoEM experiments, a GPU cluster would be the most cost-effective solution for rapid analysis. Researchers with access to advanced professional-grade GPU systems, such as a DGX-1, will see analysis times that are even faster than modern CPU-optimized RELION. While these professional GPUs can greatly accelerate CryoEM analysis, it is unlikely in the short term that all researchers wanting to use CryoEM data will have access to such high-spec GPU hardware, as compared to mixed-use commodity clusters, which are ubiquitous at life science organizations. A large multicore CPU machine, when properly configured, can give better performance than a low-core-count workstation or server with a single consumer-grade GPU (e.g., an NVIDIA GeForce GPU).

IT departments and researchers must work together to define expected turnaround time, analysis workflow requirements, budget, and configuration of existing hardware. In doing so, researcher needs will be met and IT can implement the most effective architecture for CryoEM.

References

[1] https://doi.org/10.7554/eLife.42166.001
[2] https://febs.onlinelibrary.wiley.com/doi/10.1111/febs.12796
[3] https://www.ncbi.nlm.nih.gov/pubmed/
[4] https://www.ebi.ac.uk/pdbe/emdb/empiar/
[5] https://www.rcsb.org/


If you are interested in trying out RELION performance on some of the latest CPU and GPU-accelerated systems (including NVIDIA DGX-1), please consider our free HPC Test Drive. We provide bare-metal benchmarking access to HPC and Deep Learning systems.

NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management

Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings even more power, yet also new levels of granularity. Luckily, there’s a powerful and effective tool available for managing multiple servers or a cluster of GPUs: NVIDIA Datacenter GPU Manager (DCGM).

Executing hardware or health checks

DCGM’s power comes from its ability to access all kinds of low level data from the GPUs in your system. Much of this data is reported by NVML (NVIDIA Management Library), and it may be accessible via IPMI on your system. But DCGM helps make it far easier to access and use the following:

Report what GPUs are installed, in which slots and PCI-E trees and make a group

Build a group of GPUs once you know which slots your GPUs are installed in and on which PCI-E trees and NUMA nodes they are on. This is great for binding jobs and linking them to available capabilities.

Determine GPU link states, bandwidths

Provide a report of the PCI-Express link speed each GPU is running at. You may also perform device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system (and take action on the reports).

Read temps, boost states, power consumption, or utilization

Deliver data on the energy usage and utilization of your GPUs. This data can be used to control the cluster

Driver versions and CUDA versions

Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system

Run sample jobs and integrated validation

Run basic diagnostics and sample jobs that are built into the DCGM package.

Set policies

DCGM provides a mechanism to set policies for a group of GPUs.

Policy driven management: elevating from “what’s happening” to “what can I do”

Simply accessing data about your GPUs is only of modest use. The power of DCGM is in how it arms you to act upon that data. DCGM allows administrators to take programmatic or preventative action when “it’s not right.”

Here are a few scenarios where data provided by DCGM allows for both powerful control of your hardware and action:

Scenario 1: Healthchecks – periodic or before the job

Run a check before each job, after a job, or daily/hourly to ensure a cluster is performing optimally.

This allows you to preemptively stop a run if diagnostics fail or move GPUs/nodes out of the scheduling queue for the next job.

Scenario 2: Resource Allocation

Jobs often need a certain class of node (ex: with >4 GPUs or with IB & GPUs on the same PCI-E tree). DCGM can be used to report on the capabilities of a node and help identify appropriate resources.

Users/schedulers can subsequently send jobs only where they are capable of being executed

Scenario 3: “Personalities”

Some codes request specific CUDA or NVIDIA driver versions. DCGM can be used to probe the CUDA version/NVIDIA GPU driver version on a compute node.

Users can then script the deployment of alternate versions or the launch of containerized apps to support non-standard versions.

Scenario 4: Stress tests

Periodically stress test GPUs in a cluster with integrated functions

Stress tests like Microway GPU Checker can tease out failing GPUs, and reading data via DCGM during or after can identify bad nodes to be sidelined.

Scenario 5: Power Management

Programmatically set GPU Boost or max TDP levels for an application or run. This allows you to eke out extra performance.

Alternatively, set your GPUs to stay within a certain power band to reduce electricity costs when rates are high or lower total cluster consumption when there is insufficient generation capacity

Scenario 6: Logging for Validation

Script the pull of error logs and take action with that data.

You can accumulate error logging over time, and determine tendencies of your cluster. For example, a group of GPUs with consistently high temperatures may indicate a hotspot in your datacenter

Getting Started with DCGM: Starting a Health Check

DCGM can be used in many ways. We won’t explore them all here, but it’s important to understand the ease of use of these capabilities.

Here’s the code for a simple health check and also for a basic diagnostic:

dcgmi health --check -g 1
dcgmi diag -g 1 -r 1

The syntax is very standard and includes dcgmi, the command, and the group of GPUs (you must set up a group first). In the diagnostic, you also include the level of diagnostics requested (-r 1, the lowest level, here).
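
As a simple illustration of Scenario 1, the same diagnostic command can be wrapped in a scheduler prolog script. This is a minimal sketch, assuming GPU group 1 already exists (as in the examples above) and that dcgmi returns a non-zero exit code when the diagnostic fails.

# Sketch of a scheduler prolog check (Scenario 1): run a quick DCGM diagnostic
# on GPU group 1 and refuse to start the job if it fails.
import subprocess
import sys

result = subprocess.run(
    ["dcgmi", "diag", "-g", "1", "-r", "1"],
    capture_output=True, text=True,
)
print(result.stdout)

if result.returncode != 0:
    # A non-zero exit code is treated as a failed diagnostic; signal the
    # scheduler to hold this node out of the job.
    sys.exit("DCGM diagnostics failed -- removing node from consideration")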

DCGM and Cluster Management

While the Microway team loves advanced scripting, you may prefer integrating DCGM or its capabilities with your existing schedulers or cluster managers; a number of popular schedulers and cluster managers are supported or already leverage DCGM today.

What’s Next

What will you do with DCGM or DCGM-enabled tools? We’ve only scratched the surface. There are extensive resources on how to use DCGM and/or how it is integrated with other tools. We recommend this blog post.

One-shot Learning Methods Applied to Drug Discovery with DeepChem

Experimental data sets for drug discovery are sometimes limited in size, due to the difficulty of gathering this type of data. Drug discovery data sets are expensive to obtain, and some are the result of clinical trials, which might not be repeatable for ethical reasons. The ClinTox data set, for example, is comprised of data from FDA clinical trials of drug candidates, where some data sets are derived from failures due to toxic side effects [2]. For cases where training data is scarce, application of one-shot learning methods has demonstrated significantly improved performance over methods consisting only of graphical convolution networks. The performance of one-shot network architectures will be discussed here for several drug discovery data sets, which are described in Table 1.

These data sets, along with one-shot learning methods, have been integrated into the DeepChem deep learning framework, as a result of research published by Altae-Tran, et al. [1]. While data remains scarce for some problem domains, such as drug discovery, one-shot learning methods could pose an important alternative network architecture, which can possibly far outperform methods which use only graphical convolution.

Dataset | Category | Description | Network Type | Number of Tasks | Compounds
Tox21 | Physiology | toxicity | Classification | 12 | 8,014
SIDER | Physiology | side reactions | Classification | 27 | 1,427
MUV | Biophysics | bioactivity | Classification | 17 | 93,127

Table 1. DeepChem drug discovery data sets investigated with one-shot learning.

One-Shot Network Architectures Produce Most Accurate Results When Applied to Small Population Training Sets

The original motivation for investigating one-shot neural networks arose from the fact that humans can learn sufficient representations, given small amounts of data, and then later apply a learned representation to correctly distinguish between objects which have been observed only once. The one-shot network architecture has previously been developed, and applied to image data, with this motivational context in mind [3, 5].

The question arose, as to whether an artificial neural network, given a small data set, could similarly learn a sufficient number of features through training, and perform at a satisfactory level. After some period of development, one-shot learning methods have emerged to demonstrate good success [3,4,5,6].

The description provided here of the one-shot approach focuses mostly on methodology, and less on the theoretical and experimental results which support the method. The simplest one-shot method computes a distance weighted combination of support set labels. The distance metric can be defined using a structure called a Siamese network, where two identical networks are used. The first twin produces a vector output for the molecule being queried, while the other twin produces a vector representing an element of the support set. Any difference between the outputs can be interpreted as a dissimilarity measure between the query structure and any particular structure in the support set. A separate dissimilarity measure can be computed for each element in the support set, and then a normalized, weighted representation of the query structure can be determined. For example, if the query structure is significantly less dissimilar to two structures, out of, say, twenty, in the support set, then the weighted representation will be nearly the average of the vectors which represent the two support structures which most resemble the queried structure.
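
The toy sketch below illustrates that idea in isolation: a query label is estimated as a similarity-weighted combination of support-set labels. The random vectors stand in for the Siamese network’s embeddings; this is not the DeepChem implementation, just the core calculation.

# Toy sketch of the simplest one-shot idea: predict a query label as a
# similarity-weighted combination of support-set labels. Random embeddings
# stand in for the Siamese network outputs.
import numpy as np

rng = np.random.default_rng(0)
support_embeddings = rng.normal(size=(20, 64))   # 20 support structures
support_labels = rng.integers(0, 2, size=20)     # binary assay outcomes
query_embedding = rng.normal(size=64)

# Cosine similarity between the query and each support structure
sims = support_embeddings @ query_embedding
sims /= np.linalg.norm(support_embeddings, axis=1) * np.linalg.norm(query_embedding)

# Softmax-normalized attention weights, then a weighted label estimate
weights = np.exp(sims) / np.exp(sims).sum()
query_label_estimate = float(weights @ support_labels)
print(f"estimated P(active) ~ {query_label_estimate:.2f}")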

There are two other one-shot methods which take more complex considerations into account. In the Siamese network one-shot approach, the vector embeddings of both the query structure and each individual support structure are computed independently of the support set. However, it has been shown empirically that, by taking into account the context of all support set elements when computing the vector embeddings of the query and each individual support structure, better one-shot network performance can be realized. This approach is called full context embedding, since the full context of the support set is taken into account when computing every vector embedding. In the full context embedding approach, the embeddings for every support structure are allowed to influence the embedding of the query structure.

The full context embedding approach uses Siamese, i.e. matching, networks like before, but once the embeddings are computed, they are then further processed by Long Short-Term Memory (LSTM) structures. The embeddings, before processing by LSTM structures, will be referred to here as pre-contextualized vectors. The full contextual embeddings for the support structures are produced using an LSTM structure called a bidirectional LSTM (biLSTM), while the full contextual embedding for the query structure is produced by an LSTM structure called an attentional LSTM (attLSTM). An LSTM is a type of recurrent neural network, which can process sequences of input. With the biLSTM, the support set is viewed as a sequence of vectors. A bidirectional LSTM is used, instead of just an LSTM, in order to reduce dependence on the sequence order. This improves model performance because the support set has no natural order. However, not all dependence on sequence order is removed with the biLSTM.

The attLSTM constructs an order-independent full contextual embedded vector of the query structure. The full details of the attLSTM will not be discussed here, beyond saying that both the biLSTM and attLSTM are network elements which interpret some set of structures as a sequence of pre-contextualized vectors, and convert that sequence into a single full context embedded vector. One full context embedded vector is produced for the support set of structures, and one is produced for the query structure.

A further improvement has been made to the one-shot model described here. As mentioned, the biLSTM does not produce an entirely order-independent full context embedding for each pre-contextualized vector, corresponding to a support structure. As mentioned, the support set does not contain any natural order to it, so any sequence order dependence present in the full context embedded support vector is an unwanted artifact, and will lead to reduced model performance. There is another problem, which is that, in the way they have been defined, the full context embedded vectors of the support structures depend only on the pre-contextualized support vectors, and not on the pre-contextualized query vector. On the other hand, the full context embedded vector of the query structure depends on both its own pre-contextualized vector, and the pre-contextualized vectors of the support set. This asymmetry indicates that some additional information is not accounted for in the model, and that performance could be improved if this asymmetry could be removed, and if the order dependence of the full context embedded support vectors could also be removed.

To address this problem, a new LSTM model was developed by Altae-Tran, et al., called the Iteratively Refined LSTM (IterRefLSTM). The full details of how the IterRefLSTM model operates are beyond the scope of this discussion. A full explanation can be found in Altae-Tran, et al. Put briefly, the full contextual embedded vectors of the support and query structures are co-evolved, in an iterative process which uses an attLSTM element, and results in removal of order-dependence in the full contextual embedding for the support, as well as removal of the asymmetry in dependency between the full context embedded vectors of the support and query structures.

A brief summary of the one-shot network architectures discussed is presented in Table 2.

Architecture | Description
Siamese Networks | score comparison, dissimilarity measure
Attention LSTM (attLSTM) | better extraction of prior data, contains order-dependence of input data
Iterative Refinement LSTMs (IterRefLSTM) | similar to attLSTM, but removes all order dependence of data by iteratively evolving the query and support embeddings simultaneously in an iterative loop

Table 2. One-shot networks used for investigating low-population biological assay data sets.

Computed One-Shot Performance Metrics Compared to Published Values

A comparison of independently computed values is made here with published values from Altae-Tran, et al. [1]. Quantitative results for classification tasks associated with the Tox21, SIDER, and MUV datasets were obtained by evaluating the area under the receiver operating characteristic curve (read more on AUROC). For datasets having more than one task, the median of the performance metric over all tasks in the held-out data sets is reported. A k-fold cross-validation was then done, with k=4. The mean of performances across all cross-validations was then taken, and reported as the performance measure for the data set. A discussion of the standard deviation is given further below.

Since the tasks for Tox21, SIDER, and MUV are all classification tasks for binary assay data, with positive and negative results from a clinical trial, for example, the performance values, as mentioned, are reported with the AUROC metric. With AUROC, a value of 0.5 indicates no predictive power, while a result of 1.0 indicates that every outcome in the held-out data set has been predicted correctly [Kennis Research, 9]. A value less than 0.5 can be interpreted as a value of 1.0 minus the metric value. This operation corresponds to inverting the model, where True is now False, and vice versa. This way, a metric value between 0.5 and 1.0 can always be realized. Each data set performance is reported with a standard deviation, containing dependence on the dispersion of metric values across classification tasks, and then k cross-validations.
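
The sketch below mirrors that reporting protocol on placeholder predictions: per-task AUROC (inverting any value below 0.5), the median over tasks, and then the mean and standard deviation over k=4 folds. It is only meant to make the aggregation explicit, not to reproduce the published numbers.

# Sketch of the reported metric: per-task AUROC (inverted if below 0.5),
# median over tasks, then mean/std over k=4 cross-validation folds.
# y_true/y_score arrays are random placeholders for held-out predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
k_folds, n_tasks, n_samples = 4, 12, 200
fold_scores = []

for fold in range(k_folds):
    task_aucs = []
    for task in range(n_tasks):
        y_true = rng.integers(0, 2, size=n_samples)
        y_score = rng.random(size=n_samples)
        auc = roc_auc_score(y_true, y_score)
        task_aucs.append(max(auc, 1.0 - auc))   # invert models worse than chance
    fold_scores.append(np.median(task_aucs))    # median over tasks

print(f"{np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")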

Our computed values match well with those published by Altae-Tran, et al. [1], and essentially confirm their published performance metric values from their ACS Central Science publication. The first and second columns in Table 3 show classification performance for GC and RF, respectively, as computed by Altae-Tran, et al. Single task GC and RF results are presented as a baseline of comparison to one-shot methods.

The use of k-fold cross-validation improves the estimate of how the model would perform if trained on all of the data, and not just a training subset with a portion reserved for testing. Since we cannot directly measure the performance of a model trained on the full data set (since no testing data would remain), k-fold cross-validation is used to provide a best guess of a performance estimate we cannot see (until final deployment), where a deployed network would be trained on all of the data.

Method | Tox21 | SIDER | MUV
Random Forests‡,⁑ | 0.539 ± 0.049 | 0.557 ± 0.059 | 0.751 ± 0.062Ω
Graphical Convolution‡,⁑ | 0.625 ± 0.036 | 0.482 ± 0.038 | 0.583 ± 0.061
Siamese Networks | 0.783 ± 0.009 | 0.660 ± 0.088 | 0.500 ± 0.043
AttLSTM | 0.759 ± 0.007 | 0.607 ± 0.080 | 0.500 ± 0.058
IterRefLSTM | 0.807 ± 0.003Ω | 0.751 ± 0.002Ω | 0.533 ± 0.051

Table 3. AUROC performance metric values for each one-shot method, plus the random forests (RF) and graphical convolution (GC) methods. Metric values were measured across Tox21, SIDER, and MUV test data sets, using a trained modelΦ. Randomness arises from using a trained model to evaluate the AUROC metric on a test set. First a support setΨ, S, of 20 data points is chosen from the set of data points for a test task. The metric is then evaluated over the remaining points in a test task data set. This process is repeated 20 times for every test task in the data set. The mean and standard deviation for all AUROC measures generated in this way are computed.

Finally, for each data set (Tox21, SIDER, and MUV), the reported performance result is actually the median performance value across all test tasks for a data set. This indirectly implies that the individual metric performances on individual tasks are unimportant, and that they more or less tend to all do well or poorly together, without too much variance across tasks. However, a median measure can mask outliers, where performance on one task might be very bad. If outliers can be removed for rational reasons, then using the median across task performance can be an effective way of removing the influence of outliers.


The performance measures for RF and GC were computed with one-fold cross validation (i.e. no cross-validation). This is because the RF and GC scripts available with our current version of DeepChem (July 2017) are written for performing only one-fold validation with these models.

The variances of k-fold cross validation performance estimates were determined by pooling all performance values and then finding the median variance of the entire pool. More complex techniques exist for estimating the variance from a cross-validated set, and the reader is invited to investigate other methods [Nadeau, et al.].

Ω This performance measure by IterRefLSTM on the Tox21 data set is the only performance which rates as good. IterRefLSTM performance on the SIDER dataset rates as fair, while RF on MUV also rates as only fair.

Φ Since network inference (predicting outcomes) can be done much faster than network training, due to the computationally expensive backpropagation algorithm, only a batch, B, of data points, and not the entire training data (excluding support data), is selected for training. A support set, S, of 20 data points, along with a batch of queries, B, of 128 data points, is selected for each training set task, in each of the held-out training sets, for a given episode of training.

A number of training episodes equal to 2000 * ntrain is performed, with one step of minimization performed by the ADAM optimizer per episode [11]. Here, ntrain is the number of training tasks. After the total number of training episodes has been completed, an intermediate information structure for the attLSTM and IterRefLSTM models, called the embedding vector set, described earlier, is produced. It should be noted that the same size of support set, S, is also used during model testing on the held-out testing tasks.

Ψ Every support set, S, whether selected during training or testing, was chosen so that it contained 10 positive and 10 negative samples for the task in question. In the full study done in [1], however, variations on the number of positive and negative samples are explored for the support set, S. The investigators found that sampling more data points in S, rather than increasing the number of backpropagation iterations, resulted in better model performance.


It should be noted that, for a support set of 10 positive and 10 negative assay results, our computed results for the Siamese method on MUV do not show any predictive performance. The results published by Altae-Tran, however, indicate marginally predictive, but poor, predictability, with an AUROC metric value of 0.601 ± 0.041.

Our metric was computed several times on both a Tesla P100 16GB GPU and a Tesla M40 GPU, but we found that, with this particular support set, the Siamese model has no predictive power, with a metric value of 0.500 ± 0.043 (see Table 3). Our other computed results for the AttLSTM and IterRefLSTM concur with published results, which show that neither one-shot learning method has predictive power on MUV data with a support set containing 10 positive and 10 negative assay results.

The Iterative Refinement LSTM shows a narrower dispersion of scores than the other one-shot learning models. This result agrees with published standard deviation values for the LSTM methods in Altae-Tran, et al. [1].

Speedup factors are determined by comparing runtimes on the NVIDIA Tesla P100 GPU to runtimes on the Tesla M40 GPU, and are presented in Tables 4 and 5. Speedup factors are found to be less pronounced for one-shot methods, and an explanation of the speedup results is presented. The approach for training and testing one-shot methods is described, as they involve some extra considerations which do not apply to graphical convolution.

Model | Tesla P100: Tox21 | Tesla P100: SIDER | Tesla P100: MUV | Tesla M40: Tox21 | Tesla M40: SIDER | Tesla M40: MUV
Random Forests | 25 | 37 | 84 | 24 | 37 | 83
Graphical Convolution | 38 | 79 | 64 | 41 | 100 | 720
Siamese | 857 | 2,180 | 1,464 | 956 | 2,407 | 1,617
AttLSTM | 933 | 2,405 | 1,591 | 1,041 | 2,581 | 1,725
IterRefLSTM | 1,006 | 2,511 | 1,680 | 1,101 | 2,721 | 1,834

Table 4. Runtimes for each model on the NVIDIA Tesla M40 and Tesla P100 16GB PCIe GPUs. All runtimes are in seconds.

RF runs entirely on CPU, and its times reflect CPU runtimes. Its values are listed for reference only and are not to be considered for determining GPU speedup factors.

A quick inspection of the results in Table 4 shows that the one-shot methods perform better on the Tox21 and SIDER data sets, but not on the MUV data. A reason for the poor performance of one-shot methods on MUV data is proposed below.

Limitations of One-Shot Networks

Compared to previous methods, one-shot networks demonstrate extraction of more information from the prior (support) data than RF or GC, but with a limitation. One-shot methods are only successful when data in the held-out testing set is sufficiently similar to data seen during training. Networks trained using one-shot methods do not perform well when trying to classify data that is too dissimilar from the sample data used for training. In the context of drug discovery, this problem is encountered when trying to apply one-shot learning to the Maximum Unbiased Validation (MUV) dataset, for example [10]. Benchmark results show that all three one-shot learning methods explored here do little better than pure chance when making classification predictions with MUV data (see Table 3).

The MUV dataset contains around 93,000 compounds, and represents a diverse collection of molecular scaffolds compared to Tox21 and SIDER. One-shot methods do not perform as well on this data set, probably because there is less structural similarity between the elements of the MUV dataset compared to Tox21 and SIDER. One-shot networks require some amount of structural similarity within the data set in order to extrapolate from limited data and correctly classify new, but similar, compounds.

A metric of self-similarity within a data set could be computed as a data set size-independent, extensive measure, where every element is compared to every other element and some attention measure, such as a cosine distance, is evaluated. The attention measure can be summed over all unique comparisons and then normalized, by dividing by the number of unique comparisons between the N elements in the set.
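
A minimal sketch of such a measure, using cosine similarity over randomly generated stand-in feature vectors, might look like the following.

# Sketch of the proposed self-similarity measure: mean cosine similarity over
# all unique pairs of element feature vectors in a data set (random stand-ins here).
import numpy as np

rng = np.random.default_rng(2)
features = rng.normal(size=(100, 128))            # N elements, fixed-length features
normed = features / np.linalg.norm(features, axis=1, keepdims=True)

cosine = normed @ normed.T                        # all pairwise cosine similarities
n = cosine.shape[0]
upper = cosine[np.triu_indices(n, k=1)]           # unique pairs only
self_similarity = upper.mean()                    # normalize by number of unique pairs
print(f"data set self-similarity: {self_similarity:.3f}")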

Model | Tox21 | SIDER | MUV
GC | 1.079 | 1.266 | 11.25α
Siamese | 1.116λ | 1.104 | 1.105
AttLSTM | 1.116λ | 1.116 | 1.084
IterRefLSTM | 1.094λ | 1.084 | 1.092

Table 5. Speedup factors, comparing the Tesla P100 16GB GPU to the Tesla M40. All speedups are Tesla M40 runtimes divided by Tesla P100 runtimes.


α The greatest speedup is observed with GC on the MUV data set (Table 5).

GC also exhibits the most precipitous drop in speedup in the transition to the one-shot models. Table 4 indicates that the GC model runs faster across all data sets, compared to one-shot methods. This is not surprising, because the purely graphical model is more amenable to GPU acceleration. However, it is crucial to note that GC models perform worse than one-shot models on Tox21 and SIDER, but not on MUV. On MUV, GC has nearly no predictive ability (Table 3), compared to the one-shot models, which have absolutely no predictability with MUV.

λ The one-shot networks, while providing substantial improvements in predictive performance, do not show a significant speedup, as seen in the values for the Siamese, attLSTM, and IterRefLSTM rows. The nearly absent speedup could arise from high GPU-to-system-memory transfers. Note, however, that although small, there is a slight but consistent improvement in speedup for the one-shot networks on the Tox21 set. The Tox21 data set may therefore require fewer transfers to system memory. The generally flat speedup for one-shot methods may stem from their LSTM elements.


Generally, deep convolutional network models, such as GC, or models which benefit from having a large data set containing structurally diverse groups, such as RF and GC, perform better on the MUV data. RF, for example, shows the best performance, even if it is still poor. Deep networks have demonstrated that, provided enough layers, they have the information-holding capacity required to learn and retain representations for the MUV data. Their information-holding capacity is what enables them to classify between the large number of structurally diverse classes in MUV. It may be the case that the hyperparameters for the graphical convolutional network were simply not set such that the GC model could yield a poor-to-fair level of performance on MUV. In their paper, Altae-Tran, et al. stated that hyperparameters for the convolutional networks were not optimized, and that there may be an opportunity to improve performance there [1].

Remarks on Neural Network Information Structure, and How One-Shot Networks are Different

All neural networks require training data in order to develop structure under training pressure. Feature complexity, in image classification networks, becomes stratified, under training pressure, through the network’s layers. The lowest layers emerge as edge detectors, with successive layers building upon features from previous layers. The second layer, for example, can build corner detectors, or curved edge detectors, by detecting combinations of simpler edges. Through a buildup of feature complexity, eventually, higher layers can emerge which can detect complex, high-level features such as faces. The combinatorial size of the detectable feature space grows with the number of viable filters (kernels) connecting each layer to the preceding layer. With Natural Language Processing (NLP) networks, layer complexity progresses from sentence features, to paragraphs, then chapters, and finally whole book vector representations, which consist of succinct thematic summaries of written works.

To reiterate, all networks require information structure, acquired under training pressure, to develop some inner representation, or “belief,” about data. Deep networks allow for more diverse structures to be learned, compared to one-shot networks, which are limited in their ability to learn diverse representations. One-shot structural features are designed to improve extraction of information from support data, in order to learn a representation which can be used to extrapolate from a smaller group of similar classes. One-shot methods do not perform as well as RF on MUV because they are not designed to produce a useful network from a data set with the level of dissimilarity between molecular scaffolds found in MUV.

Transfer Learning with One-Shot Learning Network Architecture

A network trained on the Tox21 data set was evaluated on the SIDER data set. The results, given by the performance metric values shown in Table 6, indicate that the network trained on Tox21 has nearly no predictive power on the SIDER data. This indicates that the performance does not generalize well to new chemical scaffolds, which supports the explanation for why one-shot methods do poorly at predicting the results for the MUV dataset.

Transfer | Siamese | attLSTM | IterRefLSTM
To SIDER from Tox21 | 0.505 | 0.502 | 0.504

Table 6. Transfer Learning to SIDER from Tox21. These results agree with the performance metric values reported for transfer learning in [1], and support the conclusion that transfer learning between data sets will result in no predictive capability, unless the data sets are significantly similar.

Conclusion

For binary classification tasks associated with small-population data sources, one-shot learning methods may provide significantly better results compared to baseline performances of graphical convolution and random forests. The results show that the performance of one-shot learning methods may depend on the diversity of molecular scaffolds in a data set. With MUV, for example, one-shot methods did not extrapolate well to unseen molecular scaffolds. Also, the failure of transfer learning from the Tox21 network to correctly predict SIDER assay outcomes indicates that training may not be easily generalized with one-shot networks.

The Iterative Refinement LSTM method developed in [1] demonstrates that LSTMs can generalize to similar experimental assays which are not identical to assays in the data set, but which have some common relation.

References

1.) Altae-Tran, Han, Ramsundar, Bharath, Pappu, Aneesh S., and Pande, Vijay. “Low Data Drug Discovery with One-Shot Learning.” ACS Central Science 3.4 (2017): 283-293.
2.) Wu, Zhenqin, et al. “MoleculeNet: A Benchmark for Molecular Machine Learning.” arXiv preprint arXiv:1703.00564 (2017).
3.) Hariharan, Bharath, and Ross Girshick. “Low-shot visual object recognition.” arXiv preprint arXiv:1606.02819 (2016).
4.) Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition.” ICML Deep Learning Workshop. Vol. 2. 2015.
5.) Vinyals, Oriol, et al. “Matching networks for one shot learning.” Advances in Neural Information Processing Systems. 2016.
6.) Wang, Peilu, et al. “A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding.” arXiv preprint arXiv:1511.00215 (2015).
7.) Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in Neural Information Processing Systems. 2015.
8.) Lusci, Alessandro, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules.” Journal of chemical information and modeling 53.7 (2013): 1563-1575.
9.) Receiver Operating Curves Applet, Kennis Research, 2016.
10.) Maximum Unbiased Validation Chemical Data Set
11.) Kingma, D. and Ba, J. “Adam: a Method for Stochastic Optimization.” arXiv preprint: arxiv.org/pdf/1412.6980v8.pdf.
12.) University of Nebraska Medical Center online information resource, AUROC
13.) Inference for the Generalization of Error

DeepChem – a Deep Learning Framework for Drug Discovery

A powerful new open source deep learning framework for drug discovery is now available for public download on GitHub. This new framework, called DeepChem, is Python-based, and offers a feature-rich set of functionality for applying deep learning to problems in drug discovery and cheminformatics. Machine learning frameworks such as scikit-learn have previously been applied to cheminformatics, but DeepChem is the first to accelerate computation with NVIDIA GPUs.

The framework uses Google TensorFlow, along with scikit-learn, for expressing neural networks for deep learning. It also makes use of the RDKit Python framework for performing more basic operations on molecular data, such as converting SMILES strings into molecular graphs. The framework is now in the alpha stage, at version 0.1. As the framework develops, it will move toward implementing more models in TensorFlow, which use GPUs for training and inference. This new open source framework is poised to become an accelerating factor for innovation in drug discovery across industry and academia.
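
As a small illustration of that preprocessing step, the sketch below uses RDKit directly to parse a SMILES string and compute an ECFP-style Morgan fingerprint. DeepChem wraps comparable featurizers, but the exact DeepChem API calls depend on the framework version, so only the RDKit portion is shown here.

# Minimal sketch: parse a SMILES string into a molecular graph with RDKit and
# compute an ECFP-style (Morgan) fingerprint as a feature vector.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"          # aspirin
mol = Chem.MolFromSmiles(smiles)             # SMILES string -> molecular graph

# 1024-bit extended-connectivity (Morgan) fingerprint, radius 2
fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
features = np.array(fingerprint)
print(features.shape, int(features.sum()), "bits set")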

Another unique aspect of DeepChem is that it has incorporated a large amount of publicly-available chemical assay datasets, which are described in Table 1.

DeepChem Assay Datasets

Dataset | Category | Description | Task Type | Compounds
QM7 | Quantum Mechanics | orbital energies, atomization energies | Regression | 7,165
QM7b | Quantum Mechanics | orbital energies | Regression | 7,211
ESOL | Physical Chemistry | solubility | Regression | 1,128
FreeSolv | Physical Chemistry | solvation energy | Regression | 643
PCBA | Biophysics | bioactivity | Classification | 439,863
MUV | Biophysics | bioactivity | Classification | 93,127
HIV | Biophysics | bioactivity | Classification | 41,913
PDBBind | Biophysics | binding activity | Regression | 11,908
Tox21 | Physiology | toxicity | Classification | 8,014
ToxCast | Physiology | toxicity | Classification | 8,615
SIDER | Physiology | side reactions | Classification | 1,427
ClinTox | Physiology | clinical toxicity | Classification | 1,491

Table 1: The current v0.1 DeepChem framework includes the datasets in this table, along with others that will be added in future versions.

Metrics

The squared Pearson correlation coefficient is used to quantify the quality of a model trained on any of these regression datasets. Models trained on classification datasets have their predictive quality measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (AUC-ROC). Some datasets have more than one task, in which case the mean over all tasks is reported by the framework.
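To make these metrics concrete, here is a minimal sketch that computes the squared Pearson correlation for a regression task and AUC-ROC for a classification task using NumPy and scikit-learn. It illustrates the metrics themselves rather than DeepChem's internal API, and the arrays are hypothetical stand-ins for model predictions and measured labels.
[sourcecode language=”python”]
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical regression predictions vs. measured values (e.g., solubility)
y_true = np.array([0.5, 1.2, 3.4, 2.2, 0.9])
y_pred = np.array([0.6, 1.0, 3.1, 2.5, 1.1])

# Squared Pearson correlation coefficient
r = np.corrcoef(y_true, y_pred)[0, 1]
print("Pearson r^2:", r ** 2)

# Hypothetical binary classification labels and predicted probabilities
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70])

# Area under the receiver operating characteristic curve (AUC-ROC)
print("AUC-ROC:", roc_auc_score(labels, scores))

# For multitask datasets, the framework reports the mean metric over tasks,
# which amounts to averaging per-task values computed as above.
[/sourcecode]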

Data Splitting

DeepChem provides a number of methods for splitting and reordering datasets, so that models are trained and validated on subsets that are free of sampling bias. These methods are summarized in Table 2, and a short conceptual sketch follows the table.

DeepChem Dataset Splitting Methods

Split Type | Use Cases
Index Split | default index is sufficient as long as it contains no built-in bias
Random Split | if there is some bias to the default index
Scaffold Split | if chemical properties of the dataset depend on the molecular scaffold
Stratified Random Split | where one needs to ensure that each dataset split contains a full range of some real-valued property

Table 2: Various methods are available for splitting the dataset in order to avoid sampling bias.
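To see why a scaffold split matters, the following conceptual sketch groups molecules by their Bemis-Murcko scaffold using RDKit, so that structurally related compounds land in the same split. This is an illustration of the idea rather than DeepChem's own splitter implementation, and the SMILES strings are arbitrary examples.
[sourcecode language=”python”]
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Arbitrary example molecules (SMILES)
smiles_list = ["c1ccccc1CC(=O)O", "c1ccccc1CCN", "CCO", "CCCO"]

# Group compounds by their Bemis-Murcko scaffold
groups = defaultdict(list)
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
    groups[scaffold].append(smi)

# A scaffold split assigns whole scaffold groups to train/valid/test,
# so the model is evaluated on scaffolds it never saw during training.
for scaffold, members in groups.items():
    print(scaffold if scaffold else "(acyclic)", "->", members)
[/sourcecode]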

Featurizations

DeepChem offers a number of featurization methods, summarized in Table 3 (a short fingerprint example follows the table). SMILES strings are compact text representations of molecules, and can themselves be used as a molecular feature. The use of SMILES strings as features has been explored in recent work, and SMILES featurization will likely become a part of future versions of DeepChem.

Most machine learning methods, however, require more feature information than can be extracted from a SMILES string alone.

DeepChem Featurizers

Featurizer | Use Cases
Extended-Connectivity Fingerprints (ECFP) | for molecular datasets not containing large numbers of non-bonded interactions
Graph Convolutions | Like ECFP, graph convolution produces granular representations of molecular topology. Instead of applying fixed hash functions, as with ECFP, graph convolution uses a set of parameters which can be learned by training a neural network associated with a molecular graph structure.
Coulomb Matrix | Coulomb matrix featurization captures information about the nuclear charge state and internuclear electric repulsion. This featurization is less granular than ECFP or graph convolutions, and may perform better where intramolecular electrical potential plays an important role in chemical activity.
Grid Featurization | datasets containing molecules interacting through non-bonded forces, such as docked protein-ligand complexes

Table 3: Featurization methods available in DeepChem, with typical use cases for each.
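As a concrete example of fingerprint featurization, the sketch below computes a circular (Morgan) fingerprint with RDKit; Morgan fingerprints are the same family of descriptors as ECFP. This is a minimal illustration rather than the DeepChem featurizer API, and the example molecule and parameters (radius 2, 1024 bits) are arbitrary choices.
[sourcecode language=”python”]
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin, used here only as an arbitrary example molecule
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Morgan (circular) fingerprint with radius 2, comparable to ECFP4
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

# Convert the bit vector into a NumPy feature vector for a model
features = np.array([int(b) for b in fp.ToBitString()], dtype=np.int8)
print(features.shape, "bits set:", int(features.sum()))
[/sourcecode]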

Supported Models

Supported Models as of v0.1

Model Type | Possible Use Case
Logistic Regression | classification tasks with categorical outcomes
Random Forest | classification or regression
Multitask Network | If several prediction types are required, a multitask network is a good choice. An example would be a continuous real-valued prediction, along with one or more categorical predictions, as predicted outcomes.
Bypass Network | classification and regression
Graph Convolution Model | same as Multitask Networks

Table 4: Model types supported by DeepChem 0.1

A Glimpse into the Tox21 Dataset and Deep Learning

The Toxicology in the 21st Century (Tox21) research initiative led to the creation of a public dataset which includes measurements of the activation of stress response and nuclear receptor response pathways by 8,014 distinct molecules. Twelve response pathways were observed in total, each having some association with toxicity. Table 5 summarizes the pathways investigated in the study.

Tox21 Assay Descriptions

Biological Assay | Description
NR-AR | Nuclear Receptor Panel, Androgen Receptor
NR-AR-LBD | Nuclear Receptor Panel, Androgen Receptor, luciferase
NR-AhR | Nuclear Receptor Panel, aryl hydrocarbon receptor
NR-Aromatase | Nuclear Receptor Panel, aromatase
NR-ER | Nuclear Receptor Panel, Estrogen Receptor alpha
NR-ER-LBD | Nuclear Receptor Panel, Estrogen Receptor alpha, luciferase
NR-PPAR-gamma | Nuclear Receptor Panel, peroxisome proliferator-activated receptor gamma
SR-ARE | Stress Response Panel, nuclear factor (erythroid-derived 2)-like 2 antioxidant responsive element
SR-ATAD5 | Stress Response Panel, genotoxicity indicated by ATAD5
SR-HSE | Stress Response Panel, heat shock factor response element
SR-MMP | Stress Response Panel, mitochondrial membrane potential
SR-p53 | Stress Response Panel, DNA damage p53 pathway

Table 5: Biological pathway responses investigated in the Tox21 Machine Learning Challenge.

We used the Tox21 dataset to make predictions on molecular toxicity in DeepChem using the variations shown in Table 6.

Model Construction Parameter Variations Used

Parameter | Variations Used
Dataset Splitting | Index, Scaffold
Featurization | ECFP, Molecular Graph Convolution

Table 6: Model construction parameter variations used in generating our predictions, as shown in Figure 1.

A .csv file containing SMILES strings for 8,014 molecules was used to featurize each molecule with either ECFP or molecular graph convolution. IUPAC names for each molecule were queried from NIH Cactus, and toxicity predictions were made, using a trained model, on a set of nine molecules randomly selected from the full Tox21 dataset. The nine results, showing molecular structure (rendered by RDKit), IUPAC names, and predicted toxicity scores across all 12 biochemical response pathways described in Table 5, are shown in Figure 1.
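For readers who want to reproduce the general flow, the sketch below is a simplified stand-in for the pipeline described above, substituting RDKit Morgan fingerprints and a scikit-learn random forest for DeepChem's featurizers and models. The file name tox21.csv and its column names are hypothetical placeholders, and handling of invalid SMILES is omitted for brevity.
[sourcecode language=”python”]
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

# Hypothetical CSV with a 'smiles' column and a binary 'NR-AR' label column
df = pd.read_csv("tox21.csv").dropna(subset=["smiles", "NR-AR"])

def featurize(smiles):
    """ECFP-like Morgan fingerprint as a NumPy bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    return np.array([int(b) for b in fp.ToBitString()], dtype=np.int8)

X = np.stack([featurize(s) for s in df["smiles"]])
y = df["NR-AR"].astype(int).values

# Train a simple classifier, then score nine randomly selected molecules
clf = RandomForestClassifier(n_estimators=200).fit(X, y)
sample = df.sample(9, random_state=0)
probs = clf.predict_proba(np.stack([featurize(s) for s in sample["smiles"]]))[:, 1]
for smi, p in zip(sample["smiles"], probs):
    print(smi, "-> predicted toxicity score", round(float(p), 2))
[/sourcecode]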

Figure 1. Tox21 predictions with DeepChem for nine randomly selected molecules from the Tox21 dataset

Expect more from DeepChem in the Future

The DeepChem framework is undergoing rapid development and is currently at the 0.1 release version. New models and features will be added, along with more datasets, in future versions. You can download the DeepChem framework from GitHub. There is also a website for framework documentation at deepchem.io.

Microway offers DeepChem pre-installed on our line of WhisperStation products for Deep Learning. Researchers interested in exploring deep learning applications with chemistry and drug discovery can browse our line of WhisperStation products.

References

1.) Subramanian, Govindan, et al. “Computational Modeling of β-secretase 1 (BACE-1) Inhibitors using Ligand Based Approaches.” Journal of Chemical Information and Modeling 56.10 (2016): 1936-1949.
2.) Altae-Tran, Han, et al. “Low Data Drug Discovery with One-shot Learning.” arXiv preprint arXiv:1611.03199 (2016).
3.) Wu, Zhenqin, et al. “MoleculeNet: A Benchmark for Molecular Machine Learning.” arXiv preprint arXiv:1703.00564 (2017).
4.) Gomes, Joseph, et al. “Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity.” arXiv preprint arXiv:1703.10603 (2017).
5.) Gómez-Bombarelli, Rafael, et al. “Automatic chemical design using a data-driven continuous representation of molecules.” arXiv preprint arXiv:1610.02415 (2016).
6.) Mayr, Andreas, et al. “DeepTox: toxicity prediction using deep learning.” Frontiers in Environmental Science 3 (2016): 80.

The post DeepChem – a Deep Learning Framework for Drug Discovery appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/deepchem-deep-learning-framework-for-drug-discovery/feed/ 0
GPU-accelerated HPC Containers with Singularity https://www.microway.com/hpc-tech-tips/gpu-accelerated-hpc-containers-singularity/ https://www.microway.com/hpc-tech-tips/gpu-accelerated-hpc-containers-singularity/#respond Tue, 11 Apr 2017 16:44:44 +0000 https://www.microway.com/?p=8673 Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow. This is a problem that containers can help to solve. HPC […]

The post GPU-accelerated HPC Containers with Singularity appeared first on Microway.

]]>
Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow.

This is a problem that containers can help to solve. HPC groups have had some success with more traditional containers (e.g., Docker), but there are security concerns that have made them difficult to use on HPC systems. Singularity, the new tool from the creator of CentOS and Warewulf, aims to resolve these issues.

Singularity helps you to step away from the complex dependencies of your software apps. It enables you to assemble these complex toolchains into a single unified tool that you can use just as simply as you’d use any built-in Linux command. A tool that can be moved from system to system without effort.

Surprising Simplicity

Of course, HPC tools are traditionally quite complex, so users seem to expect Singularity containers to also be complex. Just as virtualization is hard for novices to wrap their heads around, the operation of Singularity containers can be disorienting. For that reason, I encourage you to think of your Singularity containers as a single file; a single tool. It’s an executable that you can use just like any other program. It just happens to have all its dependencies built in.

This means it’s not doing anything tricky with your data files. It’s not doing anything tricky with the network. It’s just a program that you’ll be running like any other. Just like any other program, it can read data from any of your files; it can write data to any local directory you specify. It can download data from the network; it can accept connections from the network. InfiniBand, Omni-Path and/or MPI are fully supported. Once you’ve created it, you really don’t think of it as a container anymore.

GPU-accelerated HPC Containers

When it comes to utilizing the GPUs, Singularity will see the same GPU devices as the host system. It will respect any device selections or restrictions put in place by the workload manager (e.g., SLURM). You can package your applications into GPU-accelerated HPC containers and leverage the flexibilities provided by Singularity. For example, run Ubuntu containers on an HPC cluster that uses CentOS Linux; run binaries built for CentOS on your Ubuntu system.

As part of this effort, we have contributed a Singularity image for TensorFlow back to the Singularity community. This image is available pre-built for all users on our GPU Test Drive cluster. It’s a fantastically easy way to compare the performance of CPU-only and GPU-accelerated versions of TensorFlow. All one needs to do is switch between executables:

Executing the pre-built TensorFlow for CPUs

[eliot@node2 ~]$ tensorflow_cpu ./hello_world.py
Hello, TensorFlow!
42

Executing the pre-built TensorFlow with GPU acceleration

[eliot@node2 ~]$ tensorflow_gpu ./hello_world.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:06:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:07:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB

[...]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
Hello, TensorFlow!
42

As shown above, the tensorflow_cpu and tensorflow_gpu executables include everything that’s needed for TensorFlow. You can just think of them as ready-to-run applications that have all their dependencies built in. All you need to know is where the Singularity container image is stored on the filesystem.
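The hello_world.py script itself is not reproduced in this post, but a minimal script consistent with the output above, written against the TensorFlow 1.x API that was current at the time, might look like the following (the constants are assumptions inferred from the printed output):
[sourcecode language=”python”]
import tensorflow as tf

# Build a tiny graph: a greeting string and a simple sum
hello = tf.constant("Hello, TensorFlow!")
answer = tf.constant(40) + tf.constant(2)

# TensorFlow 1.x runs graph operations inside a session
with tf.Session() as sess:
    print(sess.run(hello))   # the greeting (printed as bytes under Python 3)
    print(sess.run(answer))  # 42
[/sourcecode]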

Caveats of GPU-accelerated HPC containers with Singularity

In earlier versions of Singularity, the nature of NVIDIA GPU drivers required a couple of extra steps during the configuration of GPU-accelerated containers. Although GPU support is still listed as experimental, Singularity now offers a --nv flag which passes through the appropriate driver/library files. In most cases, you will find that no additional steps are needed to access NVIDIA GPUs with a Singularity container. Give it a try!
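For example, assuming the container is an image file named tensorflow.img in the current directory (the image name here is only a placeholder), a GPU-enabled run would look like:

[eliot@node2 ~]$ singularity exec --nv tensorflow.img python ./hello_world.py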

Taking the next step on GPU-accelerated HPC containers

There are still many use cases left to be discovered. Singularity containers open up a lot of exciting capabilities. As an example, we are leveraging Singularity on our OpenPOWER systems (which provide full NVLink connectivity between CPUs and GPUs). All the benefits of Singularity are just as relevant on these platforms. Singularity images cannot be directly transferred between x86 and POWER8 CPUs, but the same style of Singularity recipes may be used. Users can run a pre-built TensorFlow image on x86 nodes and a complementary image on POWER8 nodes. They don't have to keep all the internals and dependencies in mind as they build their workflows.

Generating reproducible results is another anticipated benefit of Singularity. Groups can publish complete and ready-to-run containers alongside their results. Singularity’s flexibility will allow those containers to continue operating flawlessly for years to come – even if they move to newer hardware or different operating system versions.

If you’d like to see Singularity in action for yourself, request an account on our GPU Test Drive cluster. For those looking to deploy systems and clusters leveraging Singularity, we provide fully-integrated HPC clusters with Singularity ready-to-run. We can also assist by building optimized libraries, applications, and containers. Contact an HPC expert.

This post was updated 2017-06-02 to reflect recent changes in GPU support.

The post GPU-accelerated HPC Containers with Singularity appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/gpu-accelerated-hpc-containers-singularity/feed/ 0
More Tips on OpenACC Acceleration https://www.microway.com/hpc-tech-tips/more-tips-on-openacc-acceleration/ https://www.microway.com/hpc-tech-tips/more-tips-on-openacc-acceleration/#respond Mon, 25 Jul 2016 14:48:54 +0000 https://www.microway.com/?p=6933 One blog post may not be enough to present all tips for performance acceleration using OpenACC.So here, more tips on OpenACC acceleration are provided, complementing our previous blog post on accelerating code with OpenACC. Further tips discussed here are: More Runtime Enhancements Using a Linearized Array Instead of a 2D Array Using a linearized array […]

The post More Tips on OpenACC Acceleration appeared first on Microway.

]]>
One blog post may not be enough to present all tips for performance acceleration using OpenACC. So here, more tips on OpenACC acceleration are provided, complementing our previous blog post on accelerating code with OpenACC.

Further tips discussed here are:

  • linearizing a 2D array
  • usage of contiguous memory
  • parallelizing loops
  • PGI compiler information reports
  • OpenACC general guidelines
  • the OpenACC runtime library

More Runtime Enhancements

Using a Linearized Array Instead of a 2D Array

Using a linearized array for dynamically allocated data structures having more than one dimension is a common approach, and it requires a simpler implementation than non-linearized 2D arrays. However, with OpenACC, the mixing of 2D indices required for linearizing the array index can lead to problems with compilation. The following code, for example, will lead to the warning messages shown below:
[sourcecode language=”C”]
float* MatrixMult_linearIdx(int size, float *restrict A, float
*restrict B, float *restrict C) {
int idx1, idx2, idx3;
float tmp;

for (int i=0; i<size; ++i) {
for (int j=0; j<size; ++j) {
tmp = 0.;
for (int k=0; k<size; ++k) {
idx1 = i*size + k;
idx2 = k*size + j;
tmp += A[idx1] * B[idx2];
}
idx3 = i*size + j;
C[idx3] = tmp;
}
}
return C;
}
float* MakeMatrix_linearIdx(int size, float *restrict arr) {
int i, j, idx;
arr = (float *)malloc( sizeof(float) * size * size);
for (i=0; i<size; i++){
for (j=0; j<size; j++){
idx = i*size + j;
arr[idx] = ((float)i);
}
}
return arr;
}
void fillMatrix_linearIdx(int size, float *restrict A) {
int idx;
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
idx = i*size + j;
A[idx] = ((float)i);
}
}
}
[/sourcecode]
During compilation, the compiler will print this message:
Complex loop carried dependence of '*(A)' prevents parallelization
Parallelization would require privatization of array 'A[:size*size-1]'

To correct this, one can either: 1.) choose to not linearize the array index, and use arrays with two indices, or 2.) use the independent clause to tell the compiler that there really is no dependence, even if it detects otherwise. In the case of the linearized index presented here, the dependence detected by the compiler turns out to be a removable obstacle to parallelization, since the loops are actually independent. The two code samples below reflect these two solutions.

1.) choosing to not linearize the array index, and instead use two array indices
[sourcecode language=”C”]
float** MatrixMult(int size, float **restrict A, float **restrict B,
float **restrict C) {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
pcopyout(C[0:size][0:size])
{
float tmp;
for (int i=0; i<size; ++i) {
for (int j=0; j<size; ++j) {
tmp = 0.;
for (int k=0; k<size; ++k) {
tmp += A[i][k] * B[k][j];
}
C[i][j] = tmp;
}
}
}
return C;
}
[/sourcecode]
2.) using the independent clause to tell the compiler that there really is no dependence between loops. When selecting this option, the programmer must be confident that the loops are independent, or else the program will behave unexpectedly.
[sourcecode language=”C”]
float* MatrixMult_linearIdx(int size, float *restrict A,
float *restrict B, float *restrict C) {
// uses linearized indices for matrices
int idx1, idx2, idx3;
float tmp;

#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
pcopyout(C[0:size][0:size])
{
#pragma acc for independent
for (int i=0; i<size; ++i) {
#pragma acc for independent
for (int j=0; j<size; ++j) {
tmp = 0.;
#pragma acc for independent
for (int k=0; k<size; ++k) {
idx1 = i*size + k;
idx2 = k*size + j;
tmp += A[idx1] * B[idx2];
}
idx3 = i*size + j;
C[idx3] = tmp;
}
}
} // pragma acc kernels
return C;
} // MatrixMult_linearIdx()
[/sourcecode]

The Use of Contiguous Memory for Dynamically Allocated Data Structures

The dynamic allocation of data structures, such as multi-dimensional arrays, to heap memory is an important memory management feature of the C language. Through the use of pointers, pointers to pointers, and so on, one- and two-dimensional data structures can be allocated chunks of memory in an adjustable region of memory managed by the OS, called the heap. This flexible memory region allows for dynamic allocation and removal of data structures as a program executes, giving the programmer a finer level of control over what memory is used by a program.

With OpenACC, the relevant consideration is with 2D or higher-dimensional data structures, such as arrays. The data transfer for 2D data structures can be slowed down considerably if the second-order pointers are allocated using separate calls to malloc(). Put another way, each call to malloc() is assigned some region of heap memory which is unrelated in position to previously assigned heap memory. Therefore, a memory assignment for a 2D array such as in the following procedure will result in a collection of non-contiguous blocks of memory comprising the total array.
[sourcecode language=”C”]
float** MakeMatrix_nonContiguous(int size, float **restrict arr) {
// assigns memory non-contiguously in a 2D array
int i;
arr = (float **)malloc( sizeof(float *) * size);
for (i=0; i<size; i++){
arr[i] = (float *)malloc( sizeof(float) * size );
}
return arr;
}
[/sourcecode]
In the code snippet above, with each iteration of the i loop, an entirely new call to malloc is made.

By making one call to malloc(), allocating the entire space needed for the 2D array, and then setting pointer positions explicitly in a loop, such as in the following procedure, a contiguous memory assignment is made.
[sourcecode language=”C”]
float** MakeMatrix(int size, float **restrict arr) {
// assigns memory contiguously in a 2D array
int i;
arr = (float **)malloc( sizeof(float *) * size);
arr[0] = (float *)malloc( sizeof(float) * size * size);
for (i=1; i<size; i++){
arr[i] = (float *)(arr[i-1] + size);
}
return arr;
}
[/sourcecode]
Because this 2D array is assigned to a contiguous memory block, there will just be one memory transfer across the PCIe bus, when sending it to or from the GPU.

In the non-contiguous implementation, however, each row is assigned memory with a separate malloc() statement, and there are N rows. This will result in N copies being sent across the PCIe bus, instead of just one. The data transfer to and from the device will therefore be faster for memory-contiguous arrays.

To get a sense of how much faster the memory transfer is, the transfer time for square 1000x1000 matrices was measured over five iterations with contiguous and non-contiguous arrays using a Tesla M40. The transfer times are shown below in Table 1.

Memory Assignment | PCIe xfer time (data copyin) | PCIe xfer time (data copyout)
contiguous | 702 us | 316 us
non-contiguous | 8,302 us | 5,551 us

Table 1. PCIe transfer of contiguous vs. non-contiguous arrays results in faster data transfers

A speedup of at least 11x was achieved by assigning matrix data to contiguous arrays.

Runtime output with data copy times for contiguous memory assigned data
[john@node6 MD_openmp]$ ./matrix_ex_float 1000 5
./matrix_ex_float total runtime 0.055342

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 52,456
    19: compute region reached 5 times
        26: kernel launched 5 times
            grid: [63x16]  block: [16x32]
             device time(us): total=52,456 max=10,497 min=10,488 avg=10,491
            elapsed time(us): total=52,540 max=10,514 min=10,504 avg=10,508
    19: data region reached 5 times
    35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float.c
  main  NVIDIA  devicenum=0
    time(us): 1,037
    96: data region reached 1 time
        31: data copyin transfers: 2
             device time(us): total=702 max=358 min=344 avg=351
        31: kernel launched 3 times
            grid: [8]  block: [128]
             device time(us): total=19 max=7 min=6 avg=6
            elapsed time(us): total=489 max=422 min=28 avg=163
    128: data region reached 1 time
        128: data copyout transfers: 1
             device time(us): total=316 max=316 min=316 avg=316

Runtime output with data copy times for non-contiguous memory assigned data
[john@node6 MD_openmp]$ ./matrix_ex_float_non_contig 1000 5
./matrix_ex_float_non_contig total runtime 0.059821

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float_non_contig.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 51,869
    19: compute region reached 5 times
        26: kernel launched 5 times
            grid: [63x16]  block: [16x32]
             device time(us): total=51,869 max=10,378 min=10,369 avg=10,373
            elapsed time(us): total=51,967 max=10,398 min=10,389 avg=10,393
    19: data region reached 5 times
    35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float_non_contig.c
  main  NVIDIA  devicenum=0
    time(us): 31,668
    113: data region reached 1 time
        31: data copyin transfers: 2000
             device time(us): total=8,302 max=27 min=3 avg=4
        31: kernel launched 3000 times
            grid: [1]  block: [128]
             device time(us): total=17,815 max=7 min=5 avg=5
            elapsed time(us): total=61,892 max=422 min=19 avg=20
    145: data region reached 1 time
        145: data copyout transfers: 1000
             device time(us): total=5,551 max=28 min=5 avg=5

Tips for Parallelizing Loops

Parallelizing iterative structures can trigger warnings, and re-expressing loop code is sometimes required. For example, if the programmer uses a directive such as kernels, parallel, or region, and the compiler sees any dependency between loops, then the compiler will not parallelize that section of code. By expressing the same iteration differently, it may be possible to avoid warnings and get the compiler to accelerate the loops. The code samples below illustrate loops which will cause the compiler to complain.
[sourcecode language=”C”]
#pragma acc region
{
while (i<N && found == -1) {
if (A[i] >= 102.0f) {
found = i;
}
++i;
}
}
[/sourcecode]
Compiling the above code will generate the following warning from the compiler:
Accelerator restriction: loop has multiple exits
Accelerator region ignored

The problem here is that i could take on different values when the while loop is exited, depending on when an executing thread samples a value of A[i] greater than or equal to 102.0. The value of i would vary from run to run, and would not produce the result the programmer intended.

When the iteration is re-expressed as the for loop below, containing branching logic, the compiler will now see the first loop as being parallelizable.
[sourcecode language=”C”]
#pragma acc region
{
for (i=0; i<N; ++i) {
if (A[i] >= 102.0f) {
found[i] = i;
}
else {
found[i] = -1;
}
}
}
i=0;
while (i < N && found[i] < 0) {
++i;
}
[/sourcecode]
Although this code is slightly longer, with two loops, accelerating the first loop makes up for the separation of one loop into two. Normally, separating one loop into two is bad for performance, but when expressing parallel loops, the old rules might no longer apply.

Another potential problem with accelerating loops is that inner loop variables may sometimes have to be declared as private. The loops below will trigger a compiler warning:
[sourcecode language=”C”]
#pragma acc region
{
for (int i=0; i<N; ++i) {
for (int j=0; j<M; ++j) {
for (int k=0; k<10; ++k) {
tmp[k] = k;
}
sum=0;
for (int k=0; k<10; ++k) {
sum+=tmp[k];
}
A[i][j] = sum;
}
}
}
[/sourcecode]
The warning message from the compiler will be:
Parallelization would require privatization of array tmp[0:9]
Here, the problem is that the array tmp is not declared within the parallel region, nor is it copied into the parallel region. The outer i loop will run in parallel, but the inner j loop will run sequentially because the compiler does not know how to handle the array tmp.

Since tmp is not initialized before the parallel region, and it is re-initialized before re-use in the region, it can be declared as a private variable in the region. This means that every thread will retain its own copy of tmp, and that the inner j loop can now be parallelized:
[sourcecode language=”C”]
#pragma acc region
{
for (int i=0; i<N; ++i) {
#pragma acc for private(tmp[0:9])
for (int j=0; j<M; ++j) {
for (int ii=0; ii<10; ++ii) {
tmp[ii] = ii;
}
sum=0;
for (int ii=0; ii<10; ++ii) {
sum += tmp[ii];
}
A[i][j] = sum;
}
}
}
[/sourcecode]

Compiler Information Reports

Compiling with Time Target (target=nvidia,time)

Compiling the program with time as a target causes the executable to report data transfer times to and from the device, as well as kernel execution times. Data transfer times can be helpful in detecting whether large portions of runtime might be spent on needless data transfers to and from the host. This is a common loss of speedup when first learning OpenACC. To compile with the “time” target option, use the compiler syntax:
pgcc -ta=nvidia,time -acc -Minfo -o test test.c

The data transfer times reported below are shown for one version of the program, where no data region is established around the loop in main(), and only the kernels directive is used in MatrixMult().

Runtime output with kernel times, grid size, block size
[john@node6 openacc_ex]$ ./matrix_ex_float 1000 5
./matrix_ex_float total runtime  0.46526

Accelerator Kernel Timing data
/home/john/openacc_ex/./matrix_ex_float.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 114,979
    29: compute region reached 5 times
        32: kernel launched 5 times
            grid: [8x1000]  block: [128]
             device time(us): total=109,238 max=21,907 min=21,806 avg=21,847
            elapsed time(us): total=109,342 max=21,922 min=21,825 avg=21,868
    29: data region reached 5 times
        31: data copyin transfers: 10
             device time(us): total=3,719 max=394 min=356 avg=371
        31: kernel launched 15 times
            grid: [8]  block: [128]
             device time(us): total=76 max=6 min=5 avg=5
            elapsed time(us): total=812 max=444 min=23 avg=54
    40: data region reached 5 times
        40: data copyout transfers: 5
             device time(us): total=1,946 max=424 min=342 avg=389

With kernels used in MatrixMult() and data region declared around iterative loop in main():

Runtime output
[john@node6 openacc_ex]$ ./matrix_ex_float 1000 5
./matrix_ex_float total runtime  0.11946

Accelerator Kernel Timing data
/home/john/openacc_ex/./matrix_ex_float.c
  MatrixMult  NVIDIA  devicenum=0
    time(us): 111,186
    29: compute region reached 5 times
        32: kernel launched 5 times
            grid: [8x1000]  block: [128]
             device time(us): total=109,230 max=21,903 min=21,801 avg=21,846
            elapsed time(us): total=109,312 max=21,918 min=21,818 avg=21,862
    29: data region reached 5 times
        31: kernel launched 5 times
            grid: [8]  block: [128]
             device time(us): total=25 max=5 min=5 avg=5
            elapsed time(us): total=142 max=31 min=27 avg=28
    40: data region reached 5 times
        40: data copyout transfers: 5
             device time(us): total=1,931 max=398 min=372 avg=386
/home/john/openacc_ex/./matrix_ex_float.c
  main  NVIDIA  devicenum=0
    time(us): 790
    176: data region reached 1 time
        31: data copyin transfers: 2
             device time(us): total=779 max=398 min=381 avg=389
        31: kernel launched 2 times
            grid: [8]  block: [128]
             device time(us): total=11 max=6 min=5 avg=5
            elapsed time(us): total=522 max=477 min=45 avg=261
    194: data region reached 1 time

When the data region is established around the iterative loop in main(), the boundary of the parallel data context is pushed out, so that the A and B matrices are only copied into the larger region once, for a total of two copyin transfers. In the former case, with no data region, the A and B matrices are copied at each of the five iterations, resulting in a total of 10 copyin transfers.

Compiling with PGI_ACC_* Environment Variables

The environment variables PGI_ACC_NOTIFY and PGI_ACC_TIME provide timing information in the shell of execution.

PGI_ACC_NOTIFY
If set to 1, the runtime prints a line of output each time a kernel is launched on the GPU.
If set to 2, the runtime prints a line of output about each data transfer.
If set to 3, the runtime prints out kernel launching and data transfer.

PGI_ACC_TIME
Setting PGI_ACC_TIME to 1 causes the runtime to print a summary of the time taken for data movement between the host and the GPU, as well as for kernel computation on the GPU.

Other OpenACC General Guidelines

1.) Avoid using pointer arithmetic. Instead, use subscripted arrays rather than pointer-indexed arrays. Pointer arithmetic can confuse compilers.

2.) When accelerating C code, compilers will not parallelize a loop containing data structures which are referenced by a pointer, unless that pointer is declared as restricted. This can be done in the input arguments of a routine. If the loop appears in the main body, instead of a routine, the original pointer declaration must be restricted. This way, the compiler can be certain that two pointers do not point to the same memory and that the loop can be parallelized without producing unpredictable results. If pointers are not restricted, the compiler will report the warning loop not parallelizable. The code will compile, but the parallel region declared around the unrestricted pointers will be ignored.

The example below, taken from the application code, shows a matrix multiplication routine in which three of the pass-in arguments are declared as float **restrict, that is, as restricted pointers to pointers. Without the restrict qualifier in the list of routine arguments, the compiler would report that the loops are not parallelizable.
[sourcecode language=”C”]
float** MatrixMult(int size, float **restrict A, float **restrict B,
float **restrict C) {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
pcopyout(C[0:size][0:size])
{
float tmp;
for (int i=0; i<size; ++i) {
for (int j=0; j<size; ++j) {
tmp = 0.;
for (int k=0; k<size; ++k) {
tmp += A[i][k] * B[k][j];
}
C[i][j] = tmp;
}
}
} // #pragma
return C;
}
[/sourcecode]

3.) It is possible to do conditional compilation with the _OPENACC macro. _OPENACC is defined by the compiler when OpenACC support is enabled, and its value corresponds to the version of the OpenACC specification in use.

4.) Routines called within a directive region must be inlined when the program is compiled from the command line. Commonplace C library functions, such as rand() from stdlib.h, for example, when present in a loop, will present a problem if the loop needs to be parallelized. The only solution, other than removing the function call and replacing it with something else which can be parallelized, is to inline the function when compiling at the command line. This identifies it as parallelizable, but under the proviso that it be run sequentially. For example, the following code region, if identified as a parallel region using compiler directives, would require inlining of the rand() routine call:
[sourcecode language=”C”]
void initialize(int np, int nd, vnd_t box, vnd_t *restrict pos) {
int i, j;
double x;
srand(4711L);
#pragma acc kernels
for (i = 0; i<np; i++) {
for (j = 0; j<nd; j++) {
x = rand() % 10000 / (double)10000.0;
pos[i][j] = box[j] * x;
}
}
}
[/sourcecode]
Here, the inner j loop would run sequentially and slow, but the outer i loop would run in parallel.
The inlining of rand() requires a command-line flag of the form:
pgcc -acc -ta=nvidia -Minfo -Minline=func1,func2 -o ./a.out ./a.c

Since we are inlining only one function, rand(), the inlining would be:
pgcc -acc -ta=nvidia -Minfo -Minline=rand -o ./a.out ./a.c

If, however, the function to be inlined is in a different file than the one with the accelerated code region, then the inter-procedural optimizer must be used. Automatic inlining is enabled by specifying -Mipa=inline on the command line.

5.) Reduction operations are used across threads to resolve global operations, such as finding the minimum value in a loop, or computing a sum total, for example. When a reduction is done on a parallel region, the compiler will usually detect it automatically. Explicitly applying a reduction operation requires that a clause be placed on a loop or parallel directive:
[sourcecode language=”C”]
#pragma acc parallel loop pcopyin(m, n, Anew[0:n][0:m], A[0:n][0:m]) reduction(max:error)
for( int j=1; j<n-1; j++){
for( int i=1; i<m-1; i++){
Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
}
}
[/sourcecode]
Here, the reduction(max:error) tells the compiler to find the maximum value of error across executing kernels. Reduction operators in OpenACC are: +, *, &&, ||, &, |, ^, max, min.

OpenACC Runtime Library Routines

OpenACC features a number of routines from the openacc.h header file. An exhaustive list will not be given here, but some functionalities provided by the library include setting which device to use, getting the current device type, and dynamically allocating memory on the device. There are approximately two dozen OpenACC runtime routines, which are described in further detail in the OpenACC Reference Guide [ref. 1].

Background Reading

1.) OpenACC 2.6 Quick Reference
2.) OpenACC 1.0 Quick Reference
3.) The PGI Accelerator Programming Model on NVIDIA GPUs
4.) 11 Tips for Maximizing Performance with OpenACC Directives in Fortran
5.) 12 Tips for Maximum Performance with PGI Directives in C
6.) The OpenACC Application Programming Interface Version 2.0 (July, 2013)

The post More Tips on OpenACC Acceleration appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/more-tips-on-openacc-acceleration/feed/ 0
Can I use Deep Learning? https://www.microway.com/hpc-tech-tips/can-i-use-deep-learning/ https://www.microway.com/hpc-tech-tips/can-i-use-deep-learning/#respond Thu, 30 Jun 2016 07:30:21 +0000 https://www.microway.com/?p=7905 If you’ve been reading the press this year, you’ve probably seen mention of deep learning or machine learning. You’ve probably gotten the impression they can do anything and solve every problem. It’s true that computers can be better than humans at recognizing people’s faces or playing the game Go. However, it’s not the solution to […]

The post Can I use Deep Learning? appeared first on Microway.

]]>
If you’ve been reading the press this year, you’ve probably seen mention of deep learning or machine learning. You’ve probably gotten the impression they can do anything and solve every problem. It’s true that computers can be better than humans at recognizing people’s faces or playing the game Go. However, it’s not the solution to every problem. We want to help you understand if you can use deep learning. And if so, how it will help you.

Just as they have for decades, computers performing deep learning are running a specific set of instructions specified by their programmers. Only now, we have a method which allows them to learn from their mistakes until they’re doing the task with high accuracy.

If you have a lot of data (images, videos, text, numbers, etc), you can use that data to train your computers on what you want done with the information. The result, an artificial neural network trained for this specific task, can then process any new data you provide.

We’ve written a detailed post on recent developments in Deep Learning applications. Below is a brief summary.

What types of problems are being solved using Deep Learning?

Computer Vision

If you have a lot of imaging data or photographs, then deep learning should certainly be considered. Deep learning has been used extensively in the field of computer vision. Examples include image classification (describing the items in a picture) and image enhancement (removing defects or fog from photographs). It is also vital to many of the self-driving car projects.

Written Language and Speech

Deep Learning has also been used extensively with language. Certain types of networks are able to pick up clues and meaning from written text. Others have been created to translate between different languages. You may have noticed that smartphones have recently become much more accurate at recognizing spoken language – a clear demonstration of the ability of deep learning.

Scientific research, engineering, and medicine

Materials scientists have used deep learning to predict how alloys will perform – allowing them to investigate 800,000 candidates while conducting only 36 actual, real-world tests. Such success promises dramatic improvements in the speed and efficiency of such projects in the future.

Physicists researching the Higgs boson have used deep learning to clean up their data and better understand what happens when they witness one of these particles. Simply dealing with the data from CERN’s Large Hadron Collider has been a significant challenge for these scientists.

Those studying life science and medicine are looking to use these methods for a variety of tasks, such as:

  • determining the shape of correctly-folded proteins (some diseases are caused by proteins that are not shaped correctly)
  • processing large quantities of bioinformatics data (such as the genomes in DNA)
  • categorizing the possible uses of drugs
  • detecting new information simply by examining blood

If you have large quantities of data, consider using deep learning

Meteorologists are working to predict thunderstorms by sending weather data through a specialized neural network. Astronomers may be able to get a handle on the vast quantities of images and data that are captured by modern telescopes. Hospitals are expected to be using deep learning for cancer detection. There are many other success stories, and new papers are being published every month.

For details on recent projects, read our blog post on deep learning applications.

Want to use Deep Learning?

If you think you could use deep learning, Microway’s experts will design and build a high-performance deep learning system for you. We’d love to talk with you.

The post Can I use Deep Learning? appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/can-i-use-deep-learning/feed/ 0
Deep Learning Applications in Science and Engineering https://www.microway.com/hpc-tech-tips/deep-learning-applications/ https://www.microway.com/hpc-tech-tips/deep-learning-applications/#respond Wed, 29 Jun 2016 15:44:51 +0000 https://www.microway.com/?p=7385 Over the past decade, and particularly over the past several years, Deep learning applications have been developed for a wide range of scientific and engineering problems. For example, deep learning methods have recently increased the level of significance of the Higgs Boson detection at the LHC. Similar analysis is being used to explore possible decay […]

The post Deep Learning Applications in Science and Engineering appeared first on Microway.

]]>
Over the past decade, and particularly over the past several years, deep learning applications have been developed for a wide range of scientific and engineering problems. For example, deep learning methods have recently increased the level of significance of the Higgs boson detection at the LHC. Similar analysis is being used to explore possible decay modes of the Higgs. Deep learning methods fall under the larger category of machine learning, which includes various approaches, such as Support Vector Machines (SVMs), kernel methods, Hidden Markov Models (HMMs), and Bayesian methods, along with regression techniques, among others.

Deep learning is a methodology which involves computation using an artificial neural network (ANN). Training deep networks was not always within practical reach, however. The main difficulty arose from the vanishing/exploding gradient problem. See a previous blog on Theano and Keras for a discussion of this. Training deep networks has become feasible with the development of GPU parallel computation, better error minimization algorithms, careful network weight initialization, the application of regularization or dropout methods, and the use of Rectified Linear Units (ReLUs) as artificial neuron activation functions. ReLUs help keep the backpropagated gradient signal from becoming too attenuated as it passes through many layers.

Application of Trained Deep Networks

Trained deep networks are now being applied to a wide range of problems. Some areas of application could instead be solved using a numerical model comprised of discrete differential equations. For instance, deep learning is being applied to the protein folding problem, which could be modeled as a physical system, using equations of motion (for very small proteins), or energy-based minimization methods (for larger systems). Used as an alternative approach, a deep network can be trained on correctly folded tertiary protein structures, given primary and secondary structure, as input data. The trained network could then predict a protein’s tertiary structure [Lena, P.D., et al.].

Neural networks offer an alternative, data-based method to solve problems which were previously approached using physical numerical models, or other machine learning methods. A distinction between data-based models and physical models is that data-based models can be applied to problems for which no well-accepted, or practical, predictive theoretical framework exists.

Deep Neural Networks as Biological Analogs

Aside from providing an alternative data-based approach to problems for which no discrete physical model may exist, deep learning applications can reproduce some function of a real-world biological neural network analog, such as vision, or hearing.

In both biological and artificial visual networks, the lower convolutional layers detect the most basic features, such as edges. Convolutional layers are separated by pooling layers, which add some robustness to feature detection, so that a feature which is translated slightly, or rotated a bit, will still be detected. Successive convolutional layers build from edges to form features with multiple edges, or curves. The highest convolutional layers combine features from previous layers to form the most complex feature detectors; through training pressure, their weights become detectors for complex shapes, such as faces, chairs, tires, houses, doors, etc. The layers in a deep visual classification network thus separate out image features from lowest to highest complexity. If a network does not have sufficient depth, there will not be good separation and the classifications will be too blurred and unfocused.

Deep Learning Applications in Science and Engineering

Despite the advances of the past decade, deep learning cannot presently be applied to just any sort of research problem. Some problems still have either not been expressed in an information framework that is compatible with deep learning, or there are not yet deep network architectures that exist which can perform the kinds of functions needed. Deep learning has shown surprising progress, however. For example, a recent advance in deep learning surpassed nearly everyone’s expectation, when Google DeepMind’s “AlphaGo” AI player defeated the World Champion, Lee Sedol, in the game of Go. This milestone achievement was thought to be decades away, not mere months.

The following sections are not meant to be a complete description of deep learning applications, but rather to demonstrate the wide range of scientific research problems to which deep learning can be applied. Recent major developments in algorithms, methods, and parallel computation with GPUs have created the right conditions for the recent succession of major advances in the field of artificial intelligence.

Deep Learning in Image Classification

Image Classification uses a particular type of deep neural network, called a convolutional neural network (CNN). Figure 1 illustrates the basic organization of a CNN for visual classification. The actual network for this sort of task would have more neurons per layer.

Deep Learning Applications to Visual Recognition
Figure 1. Convolutional Neural Network for Facial Recognition

It is, however, possible to scale an image down to some level without losing the essential features for detection. If the features in an image are too large or too small for the filter sizes, however, then the features will not be detected. This is a subtle point which presents a problem for visual recognition. Recent approaches have addressed this problem by having the network construct various filters of the same feature but at different size scales. For a given trained neural network, images must be scaled such that their feature sizes match those in the highest layer convolutional filters. The pooling layers impart some robustness for feature detection, which allow for some small amount of feature rotation and translations. However, if the face is flipped upside down, nothing will work, and the network will not correctly identify the face. This is in fact a methodological difficulty, and in order to address it, the network must develop feature detectors for the same feature, but at different rotations. This can be done by including rotations of the image into the training set. A similar problem arises if a face is rotated not in the plane, but out of the plane. This introduces distortions of key facial features, which would once again foil a network trained only on forward facing faces. These problems of rotational and scale invariance are active areas of research in the area of object recognition.

Looking at the network in Figure 1, the three output neurons indicate, in coded form, the name of the person whose face is presented to the network. The grayscale values shown in the images of Figure 1 indicate connection weight values.

Deep Learning Application for Autonomous Vehicles
Figure 2. NVIDIA DRIVE PX2 for Autonomous Cars (image re-used with permission by NVIDIA Automotive)

In one research development, de-noising autoencoders were used to remove fog from images taken live from autonomous land vehicles. A similar solution was used for enhancing low-light images [Lore, K., et al.]. Each square tile represents a different convolutional filter, which is formed under training pressure to extract certain features. The convolutional filters in the lowest layers pick out edges. Higher layers detect more complex features, which could consist of combinations of edges, to form a nose or a chin, for example.

Recent major advances in machine vision research include image content tagging (Region-based Convolutional Neural Networks, or R-CNNs), along with the development of more robust recognition of objects in the presence of noise, applied rotation, size variation, etc. Scene recognition deep networks are currently being used in self-driving cars (Figure 2).

Deep Learning in Natural Language Processing

Significant progress has also been made in Natural Language Processing (NLP). Using word and sentence vector representations along with syntactic tree parsing, NLP ANNs have been able to identify complex variations in written form, such as sarcasm, where a seemingly positive sentence takes on a sudden negative meaning [Socher, R., et al.]. The meaning of large groups or bodies of sentences, such as articles, or chapters from books, can be resolved to a group of vectors, summarizing the meaning of the text.

In a different NLP application, a collection of IMDB movie reviews were used to train a deep network to evaluate the sentiment of movie reviews. An approach for this was examined using Keras in a previous Microway Tech Tips blog post. Primary applications of NLP deep networks include language translation, and sentiment analysis.

Transcription of video to text, in the absence of audio, is an active area of research which involves both image classification and NLP. Deep networks for NLP are usually recurrent neural networks (RNNs), with the output fed back into the input layer. For NLP tasks, the previous context partly determines the best vector for representing the next sentence or word. For a review of RNNs, see, for example, The Unreasonable Effectiveness of Recurrent Neural Networks, by Andrej Karpathy.

Language Translation

Encoder-decoder frameworks have been developed for encoding English words, for example, into reduced vector representations, and then decoding those reduced representations into French words, using a French decoder. The encoding/decoding can be done between any two languages. The reduced representation can be thought of as a universal encoding, which encodes the word into a distributed pattern in the network [Cho, K., et al.]. This sort of distributed encoded pattern has been referred to as constituting a “thought”, or an internal encoded representation of data.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is another research area where deep networks are producing results better than any method previously used, including Hidden Markov Models. Previously, ASR used Hidden Markov Models mixed with ANNs in a hybrid method. Because deep networks for ASR can now be trained within practical timescales, deep learning is now producing the best results. Long Short-Term Memory (LSTM) [Song, W., et al. & Sak, H., et al.] and Gated Recurrent Units (GRUs) play an important role in improving ASR deep networks, by helping to retain information from more than several iterations ago. The TIMIT speech dataset is used as a primary data source for training ASR deep networks.

Deep Learning in Scientific Experiment Design

Using a deep learning approach, machines can now provide direction on the design of scientific experiments. Consider, for example, a recent deep learning approach taken by materials scientists, in which new NiTi-based shape memory alloys were explored for lower thermal diffusivity [Xue, D., et al.]. From a dataset of 22 known NiTi-based alloys, a deep network was trained to predict their 22 measured thermal diffusivity values. Particular physical properties of the 22 alloys were used as input parameters for training the network.

With the deep network trained on the known alloys, it was used to determine the diffusivity values for a large number of theoretical alloys. Four alloys were selected from the predicted set which showed the lowest estimated thermal diffusivity values. Real experiments were then carried out on these four theoretical alloys, and their thermal diffusivity values were measured. The data for these four new alloys, with known thermal diffusivities were then added to the training set, and the network was re-trained in order to improve the accuracy of the deep network. After the experiment proceeded in iterations of four unexplored alloys, the final remarkable result was reached, where 14 of the 36 new alloys had a smaller thermal diffusivity than any of the 22 known alloys in the original data set.

High Throughput Screening Experimentation will be improved with Deep Learning

Research problems which have a large combinatorial space of possible experiments, such as the investigation of new NiTi shape memory alloys, are likely to be expressible in an information framework conducive to deep learning. Once trained, the deep networks will help the investigator sort through the vast combinatorial landscape of experiment design possibilities. Trained deep networks will estimate which experiments will result in the best value of the property being sought. Once the best candidate experiments are performed, and the property of interest is measured, the deep network can be re-trained with the new data. Instead of starting with a total of twenty 384-well plates, for example, the researcher may only need one quarter of this amount, or may instead fill the twenty plates with more promising molecular candidates.

Deep Learning in High Energy Physics

The discovery of the Higgs boson marked a major achievement for the Standard Model of high energy particle physics. First detected in 2011/2012 at the CERN LHC, the elusive particle was hypothesized to be responsible for imparting the property of mass onto other particles (except for massless particles). Detecting the Higgs boson with a high enough level of certainty to declare it an actual discovery required examining its decay modes in millions of high energy particle collisions, in which two protons collided at sufficiently high energy to create two heavy tau leptons, which then spontaneously decayed into lighter leptons, the muon and the electron. Through the course of these spontaneous decays, tell-tale signatures could be discerned in the data, indicating that the resulting particles and momenta were very likely to have come from the decay of a Higgs boson.

Machine learning techniques have been used in particle physics data analysis since their earliest development. The application of deep networks and deep learning is a natural extension of the machine learning methods already widely used for this kind of analysis [Sadowski, P., et al. & Sadowski, P., et al.].
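
In the spirit of the cited work, the sketch below shows a small fully connected classifier that separates signal from background events using low-level kinematic features; the feature count (28, as in the public HIGGS benchmark) and the random stand-in events are assumptions for illustration, not the published network or data.

```python
# Minimal sketch: a fully connected network classifying collision events
# as signal vs. background from kinematic features. The feature count and
# random data are stand-ins for real events.
import torch
import torch.nn as nn

class EventClassifier(nn.Module):
    def __init__(self, n_features=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 1),            # one logit: signal vs. background
        )

    def forward(self, x):
        return self.net(x)

model = EventClassifier()
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random stand-in events.
events = torch.randn(256, 28)
labels = torch.randint(0, 2, (256, 1)).float()
optimizer.zero_grad()
loss = loss_fn(model(events), labels)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```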

Deep Learning in Drug Discovery

Figure 3. A DNN is trained on gene expression levels and pathway activation scores to predict therapeutic-use categories

Deep learning is beginning to see applications in pharmacology, processing large amounts of genomic, transcriptomic, proteomic, and other "-omic" data [Mamoshina, P., et al.]. Recently, a deep network was trained to categorize drugs by therapeutic use from the transcriptional profiles observed in cells after treatment with each drug for a period of time [Aliper, A., et al.] (Figure 3). Deep learning has also been used to identify blood-based biomarkers that are strong indicators of age [Putin, E., et al.].
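
To make the idea in Figure 3 concrete, the sketch below shows a small classifier mapping a drug-treatment profile (gene expression levels concatenated with pathway activation scores) to a therapeutic-use category; the input dimensions, category count, and random data are illustrative assumptions, not the published model.

```python
# Minimal sketch of the idea in Figure 3 (not the published model): a DNN
# mapping gene expression levels plus pathway activation scores to a
# therapeutic-use category. Dimensions are assumed for illustration.
import torch
import torch.nn as nn

N_GENES, N_PATHWAYS, N_CATEGORIES = 1000, 200, 12   # assumed sizes

model = nn.Sequential(
    nn.Linear(N_GENES + N_PATHWAYS, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, N_CATEGORIES),        # logits over therapeutic-use classes
)

# One forward/backward pass on a random stand-in batch of treated samples.
profiles = torch.randn(64, N_GENES + N_PATHWAYS)
categories = torch.randint(0, N_CATEGORIES, (64,))
loss = nn.CrossEntropyLoss()(model(profiles), categories)
loss.backward()
print(f"cross-entropy on random data: {loss.item():.3f}")
```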

Deep Learning is Just Getting Started

Figure 4. Object recognition by NVIDIA DRIVE PX, an onboard scene-processing neural network (image re-used with permission from NVIDIA Automotive)

In addition to the applications mentioned here, there are numerous others, including robotics, autonomous vehicles (see Figure 4), genomics, bioinformatics [Alipanahi, B., et al.], and cancer screening. The 21st International Conference on Pattern Recognition (ICPR2012), for example, hosted a challenge on detecting breast cancer cell mitosis in histological images. In April 2016, Massachusetts General Hospital (MGH) announced a major research effort to explore ways of improving health care and disease management by applying artificial intelligence and deep learning to a vast and growing volume of personal health data. MGH will use the NVIDIA DGX-1 Deep Learning Appliance as the hardware platform for this research initiative.

Want to use Deep Learning?

Microway’s Sales Engineers are excited about deep learning, and we are happy to help you find the best solution for your research. Let us know what you’re working on and we’ll help you put together the right configuration.

References

1. Lena, Pietro D., Ken Nagata, and Pierre F. Baldi. “Deep spatio-temporal architectures and learning for protein structure prediction.” Advances in Neural Information Processing Systems. 2012.
2. Lee, Honglak, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.” In Proceedings of the 26th annual international conference on machine learning, pp. 609-616. ACM, 2009.
3. Lore, Kin Gwn, Adedotun Akintayo, and Soumik Sarkar. “LLNet: A Deep Autoencoder Approach to Natural Low-light Image Enhancement.” arXiv preprint arXiv:1511.03995 (2015).
4. Socher, Richard, et al. “Recursive deep models for semantic compositionality over a sentiment treebank.” Proceedings of the conference on empirical methods in natural language processing (EMNLP). Vol. 1631. 2013.
5. Cho, Kyunghyun, et al. “On the properties of neural machine translation: Encoder-decoder approaches.” arXiv preprint arXiv:1409.1259 (2014).
6. Song, William, and Jim Cai. “End-to-End Deep Neural Network for Automatic Speech Recognition.”
7. Sak, Haşim, Andrew Senior, and Françoise Beaufays. “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition.” arXiv preprint arXiv:1402.1128 (2014).
8. Xue, Dezhen, et al. “Accelerated search for materials with targeted properties by adaptive design.” Nature Communications (2016). DOI: 10.1038/ncomms11241
9. Sadowski, Peter J., Daniel Whiteson, and Pierre Baldi. “Searching for higgs boson decay modes with deep learning.” Advances in Neural Information Processing Systems. 2014.
10. Sadowski, Peter, Julian Collado, Daniel Whiteson, and Pierre Baldi. “Deep Learning, Dark Knowledge, and Dark Matter.” JMLR: Workshop and Conference Proceedings 42 (2015): 81-97.
11. Mamoshina, Polina, et al. “Applications of deep learning in biomedicine.” Molecular pharmaceutics 13.5 (2016): 1445-1454.
12. Aliper, Alexander, et al. “Deep learning applied to predicting pharmacological properties of drugs and drug repurposing using transcriptomic data.” Molecular pharmaceutics (2016).
13. Putin, Evgeny, et al. “Deep biomarkers of human aging: Application of deep neural networks to biomarker development.” Aging 8.5 (2016).
