What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science

NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. You can recreate a massive Pubmed literature search project on a Data Science WhisperStation in hours versus weeks.

Kyle Gallatin, an engineer at Pfizer, has deep data science credentials. He’s been working on projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:

When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

Experimentation, Performance, and GPU Accelerated Data Science Tooling

Gallatin used the Data Science WhisperStation to rapidly build an accelerated data science workflow for healthcare—and told us about his experience. It was a remarkable one.

Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!

The Data Science Workstation allowed him to design a Pubmed healthcare article search engine where he:

  1. Ingested a larger database than ever imagined (30,000,000 research article abstracts!)
  2. Didn’t require massive code changes to GPU accelerate the algorithm
  3. Used familiar-looking tools for his workflow
  4. Had unsurpassed agility—he could search large portions of the abstract database in 0.1 seconds!

This last point is really critical and shows why we believe the NVIDIA Data Science Workstation Platform and its RAPIDS tools are so special. As Kyle put it:

Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally-intensive process, only to be blindsided by an unforeseen error literal hours into running it. Then you have to correct it and start all over again.

Walkthrough with Step-by-Step Instructions

The new article is available on Medium. It provides a complete step-by-step walkthrough of how NVIDIA RAPIDS tools and the NVIDIA Quadro RTX 6000 with NVLink were utilized to revolutionize this process.

A short set of Kyle’s key findings about the environment and the hardware are below. We’re excited about how this kind of rapid development could change healthcare:

Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run

Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDS, TensorFlow, PyTorch, and everything else installed and configured out-of-the-box saved hours in setup time

…

With these general-purpose data science libraries offering massive computational enhancements for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc…), the path is paved to an entirely new frontier of data science.
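To give a concrete sense of what this looks like in practice, below is a minimal sketch of a RAPIDS cuDF workflow. The file name and column names are hypothetical stand-ins; the point is that the API mirrors pandas while the heavy lifting runs on the GPU.

```python
# Minimal sketch of a GPU-accelerated dataframe workflow with RAPIDS cuDF.
# The file "abstracts.csv" and its columns are hypothetical examples.
import cudf

# Load a large CSV directly into GPU memory
df = cudf.read_csv("abstracts.csv")   # assumed columns: "journal", "year", "abstract"

# Familiar pandas-style operations execute on the GPU with the same syntax
recent = df[df["year"] >= 2015]
counts = recent.groupby("journal").size().sort_values(ascending=False)
print(counts.head(10))
```

Because the interface follows pandas so closely, existing CPU-bound data preparation code often needs only small changes to run on the GPU.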

Read on at Medium.com

One-shot Learning Methods Applied to Drug Discovery with DeepChem

Experimental data sets for drug discovery are sometimes limited in size, due to the difficulty of gathering this type of data. Drug discovery data sets are expensive to obtain, and some are the result of clinical trials, which might not be repeatable for ethical reasons. The ClinTox data set, for example, is comprised of data from FDA clinical trials of drug candidates, where some data sets are derived from failures due to toxic side effects [2]. For cases where training data is scarce, the application of one-shot learning methods has demonstrated significantly improved performance over methods consisting only of graphical convolution networks. The performance of one-shot network architectures is discussed here for several drug discovery data sets, which are described in Table 1.

These data sets, along with one-shot learning methods, have been integrated into the DeepChem deep learning framework as a result of research published by Altae-Tran, et al. [1]. While data remains scarce for some problem domains, such as drug discovery, one-shot learning methods could pose an important alternative network architecture, which can possibly far outperform methods which use only graphical convolution.

Dataset | Category   | Description    | Network Type   | Number of Tasks | Compounds
Tox21   | Physiology | toxicity       | Classification | 12              | 8,014
SIDER   | Physiology | side reactions | Classification | 27              | 1,427
MUV     | Biophysics | bioactivity    | Classification | 17              | 93,127

Table 1. DeepChem drug discovery data sets investigated with one-shot learning.

One-Shot Network Architectures Produce Most Accurate Results When Applied to Small Population Training Sets

The original motivation for investigating one-shot neural networks arose from the fact that humans can learn sufficient representations, given small amounts of data, and then later apply a learned representation to correctly distinguish between objects which have been observed only once. The one-shot network architecture has previously been developed and applied to image data, with this motivational context in mind [3, 5].

The question arose as to whether an artificial neural network, given a small data set, could similarly learn a sufficient number of features through training, and perform at a satisfactory level. After some period of development, one-shot learning methods have emerged and demonstrated good success [3,4,5,6].

The description provided here of the one-shot approach focuses mostly on methodology, and less on the theoretical and experimental results which support the method. The simplest one-shot method computes a distance-weighted combination of support set labels. The distance metric can be defined using a structure called a Siamese network, where two identical networks are used. The first twin produces a vector output for the molecule being queried, while the other twin produces a vector representing an element of the support set. Any difference between the outputs can be interpreted as a dissimilarity measure between the query structure and any particular structure in the support set. A separate dissimilarity measure can be computed for each element in the support set, and then a normalized, weighted representation of the query structure can be determined. For example, if the query structure is significantly less dissimilar to two structures, out of, say, twenty, in the support set, then the weighted representation will be nearly the average of the vectors which represent the two support structures which most resemble the queried structure.
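To make the distance-weighted combination concrete, here is a toy NumPy sketch. The embeddings are random stand-ins for the vectors a trained Siamese network would produce; only the weighting logic described above is illustrated.

```python
# Toy sketch of the simplest one-shot prediction: a distance-weighted
# combination of support-set labels. Embeddings are random placeholders
# for the outputs of a trained Siamese (twin) network.
import numpy as np

rng = np.random.default_rng(0)
support_emb = rng.normal(size=(20, 64))    # 20 support molecules, 64-dim embeddings
support_lbl = rng.integers(0, 2, size=20)  # binary assay labels for the support set
query_emb = rng.normal(size=64)            # embedding of the queried molecule

# Dissimilarity: Euclidean distance between the query and each support embedding
dist = np.linalg.norm(support_emb - query_emb, axis=1)

# Normalized weights: softmax of negative distance (closer support = larger weight)
weights = np.exp(-dist)
weights /= weights.sum()

# Prediction is the weighted combination of the support labels
p_active = float(np.dot(weights, support_lbl))
print(f"predicted probability of activity: {p_active:.3f}")
```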

There are two other one-shot methods which take more complex considerations into account. In the Siamese network one-shot approach, the vector embeddings of both the query structure and each individual support structure are computed independently of the support set. However, it has been shown empirically that by taking into account the context of all support set elements when computing the vector embeddings of the query and of each individual support structure, better one-shot network performance can be realized. This approach is called full context embedding, since the full context of the support set is taken into account when computing every vector embedding. In the full context embedding approach, the embeddings for every support structure are allowed to influence the embedding of the query structure.

The full context embedding approach uses Siamese, i.e. matching, networks like before, but once the embeddings are computed, they are then further processed by Long Short-Term Memory (LSTM) structures. The embeddings, before processing by LSTM structures, will be referred to here as pre-contextualized vectors. The full contextual embeddings for the support structures are produced using an LSTM structure called a bidirectional LSTM (biLSTM), while the full contextual embedding for the query structure is produced by an LSTM structure called an attentional LSTM (attLSTM). An LSTM is a type of recurrent neural network, which can process sequences of input. With the biLSTM, the support set is viewed as a sequence of vectors. A bidirectional LSTM is used, instead of just an LSTM, in order to reduce dependence on the sequence order. This improves model performance because the support set has no natural order. However, not all dependence on sequence order is removed with the biLSTM.

The attLSTM constructs an order-independent full contextual embedded vector of the query structure. The full details of the attLSTM will not be discussed here, beyond saying that both the biLSTM and attLSTM are network elements which interpret some set of structures as a sequence of pre-contextualized vectors, and convert the sequence into a single full context embedded vector. One full context embedded vector is produced for the support set of structures, and one is produced for the query structure.

A further improvement has been made to the one-shot model described here. As mentioned, the biLSTM does not produce an entirely order-independent full context embedding for each pre-contextualized vector corresponding to a support structure. The support set does not contain any natural order, so any sequence order dependence present in the full context embedded support vectors is an unwanted artifact, and will lead to reduced model performance. There is another problem, which is that, in the way they have been defined, the full context embedded vectors of the support structures depend only on the pre-contextualized support vectors, and not on the pre-contextualized query vector. On the other hand, the full context embedded vector of the query structure depends on both its own pre-contextualized vector, and the pre-contextualized vectors of the support set. This asymmetry indicates that some additional information is not accounted for in the model, and that performance could be improved if this asymmetry could be removed, and if the order dependence of the full context embedded support vectors could also be removed.

To address this problem, a new LSTM model was developed by Altae-Tran, et al., called the Iteratively Refined LSTM (IterRefLSTM). The full details of how the IterRefLSTM model operates are beyond the scope of this discussion; a full explanation can be found in Altae-Tran, et al. Put briefly, the full contextual embedded vectors of the support and query structures are co-evolved in an iterative process which uses an attLSTM element, and which results in removal of order dependence in the full contextual embedding for the support, as well as removal of the asymmetry in dependency between the full context embedded vectors of the support and query structures.

A brief summary of the one-shot network architectures discussed is presented in Table 2.

Architecture                             | Description
Siamese Networks                         | score comparison, dissimilarity measure
Attention LSTM (attLSTM)                 | better extraction of prior data; retains some order dependence on input data
Iterative Refinement LSTMs (IterRefLSTM) | similar to attLSTM, but removes all order dependence by iteratively co-evolving the query and support embeddings

Table 2. One-shot networks used for investigating low-population biological assay data sets.

Computed One-Shot Performance Metrics Compared to Published Values

A comparison of independently computed values is made here with published values from Altae-Tran, et al. [1]. Quantitative results for classification tasks associated with the Tox21, SIDER, and MUV datasets were obtained by evaluating the area under the receiver operating characteristic curve (read more on AUROC). For datasets having more than one task, the median of the performance metric over all tasks in the held-out data sets is reported. A k-fold cross-validation was then done, with k=4. The mean of performances across all cross-validations was then taken, and reported as the performance measure for the data set. A discussion of the standard deviation is given further below.

Since the tasks for Tox21, SIDER, and MUV are all classification tasks for binary assay data, with positive and negative results from a clinical trial, for example, the performance values, as mentioned, are reported with the AUROC metric. With AUROC, a value of 0.5 indicates no predictive power, while a result of 1.0 indicates that every outcome in the held-out data set has been predicted correctly [Kennis Research, 9]. A value less than 0.5 can be interpreted as 1.0 minus the metric value; this operation corresponds to inverting the model, where True is now False, and vice versa. This way, a metric value between 0.5 and 1.0 can always be realized. Each data set performance is reported with a standard deviation, which depends on the dispersion of metric values across classification tasks and then across the k cross-validations.
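This evaluation recipe can be sketched in a few lines of NumPy and scikit-learn. The label and score arrays below are hypothetical placeholders; the sketch only illustrates the per-task AUROC, the inversion rule for values below 0.5, and the median across tasks.

```python
# Sketch of the evaluation described above: per-task AUROC on held-out data,
# inversion of any score below 0.5, then the median (and dispersion) across tasks.
import numpy as np
from sklearn.metrics import roc_auc_score

def median_task_auroc(y_true, y_score):
    per_task = []
    for t in range(y_true.shape[1]):
        auc = roc_auc_score(y_true[:, t], y_score[:, t])
        per_task.append(max(auc, 1.0 - auc))   # invert the model if AUROC < 0.5
    return np.median(per_task), np.std(per_task)

# Random placeholder data (no real predictive power, so AUROC stays near 0.5)
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=(500, 12))    # 500 molecules, 12 tasks
y_score = rng.random(size=(500, 12))
print(median_task_auroc(y_true, y_score))
```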

Our computed values match well with those published by Altae-Tran, et al. [1], and essentially confirm the performance metric values from their ACS Central Science publication. The first and second rows in Table 3 show classification performance for RF and GC, respectively, as computed by Altae-Tran, et al. Single-task GC and RF results are presented as a baseline of comparison to one-shot methods.

The use of k-fold cross-validation improves the estimate of how the model would perform if trained on all of the data, rather than on just a training subset with a portion reserved for testing. Since we cannot directly measure the performance of a model trained on the full data set (no testing data would remain), k-fold cross-validation provides a best guess of a performance estimate we cannot see until final deployment, where a deployed network would be trained on all of the data.

                         | Tox21           | SIDER           | MUV
Random Forests‡,⁑        | 0.539 ± 0.049   | 0.557 ± 0.059   | 0.751 ± 0.062 Ω
Graphical Convolution‡,⁑ | 0.625 ± 0.036   | 0.482 ± 0.038   | 0.583 ± 0.061
Siamese Networks         | 0.783 ± 0.009   | 0.660 ± 0.088   | 0.500 ± 0.043
AttLSTM                  | 0.759 ± 0.007   | 0.607 ± 0.080   | 0.500 ± 0.058
IterRefLSTM              | 0.807 ± 0.003 Ω | 0.751 ± 0.002 Ω | 0.533 ± 0.051

Table 3. AUROC performance metric values for each one-shot method, plus the random forests (RF) and graphical convolution (GC) methods. Metric values were measured across the Tox21, SIDER, and MUV test data sets, using a trained modelΦ. Randomness arises from using a trained model to evaluate the AUROC metric on a test set. First, a support setΨ, S, of 20 data points is chosen from the set of data points for a test task. The metric is then evaluated over the remaining points in the test task data set. This process is repeated 20 times for every test task in the data set. The mean and standard deviation of all AUROC measures generated in this way are computed.

Finally, for each data set (Tox21, SIDER, and MUV), the reported performance result is actually the median performance value across all test tasks for the data set. This implicitly assumes that the metric performances on individual tasks are of secondary importance—that the tasks more or less tend to do well or poorly together, without too much variance across tasks. However, a median measure can mask outliers, where performance on one task might be very bad. If outliers can be removed for rational reasons, then using the median across task performance can be an effective way of removing the influence of outliers.


‡ The performance measures for RF and GC were computed with one-fold cross-validation (i.e., no cross-validation). This is because the RF and GC scripts available with our current version of DeepChem (July 2017) are written to perform only one-fold validation with these models.

⁑ The variances of k-fold cross-validation performance estimates were determined by pooling all performance values and then finding the median variance of the entire pool. More complex techniques exist for estimating the variance from a cross-validated set, and the reader is invited to investigate other methods [Nadeau, et al.].

Ω The performance of IterRefLSTM on the Tox21 data set is the only performance which rates as good. IterRefLSTM performance on the SIDER dataset rates as fair, as does RF on MUV.

Φ Since network inference (predicting outcomes) can be done much faster than network training, due to the computationally expensive backpropagation algorithm, only a batch, B, of data points—and not the entire training data, excluding support data—is selected for training. A support set, S, of 20 data points, along with a batch of queries, B, of 128 data points, is selected for each training set task, in each of the held-out training sets, for a given episode of training.

A number of training episodes equal to 2000 * ntrain is performed, with one step of minimization performed by the ADAM optimizer per episode [11]; ntrain is the number of training tasks in the training set. After the total number of training episodes has been completed, an intermediate information structure for the attLSTM and IterRefLSTM models, called the embedding vector set, described earlier, is produced. It should be noted that the same size of support set, S, is also used during model testing on the held-out testing tasks.

Ψ Every support set, S, whether selected during training or testing, was chosen so that it contained 10 positive and 10 negative samples for the task in question. In the full study done in [1], however, variations on the number of positive and negative samples in the support set, S, are explored. The investigators found that sampling more data points in S, rather than increasing the number of backpropagation iterations, resulted in better model performance.
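For illustration, the episodic sampling described in these footnotes might be implemented roughly as follows. The label array is a hypothetical stand-in; a real run would draw the indices from the DeepChem dataset objects for each task.

```python
# Illustrative sketch of drawing a balanced support set S (10 positive, 10 negative)
# and a query batch B of 128 points for one training episode.
import numpy as np

def sample_episode(y_task, n_pos=10, n_neg=10, batch_size=128, seed=None):
    """Return indices for a balanced support set S and a query batch B."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_task == 1)
    neg = np.flatnonzero(y_task == 0)
    support = np.concatenate([rng.choice(pos, n_pos, replace=False),
                              rng.choice(neg, n_neg, replace=False)])
    remaining = np.setdiff1d(np.arange(len(y_task)), support)
    batch = rng.choice(remaining, batch_size, replace=False)
    return support, batch

y_task = np.random.default_rng(2).integers(0, 2, size=8014)  # toy binary labels
S, B = sample_episode(y_task, seed=3)
print(len(S), len(B))   # 20, 128
```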


It should be noted that, for a support set of 10 positive and 10 negative assay results, our computed results for the Siamese method on MUV do not show any predictive performance. The results published by Altae-Tran, et al., however, indicate marginal, but poor, predictive power, with an AUROC metric value of 0.601 ± 0.041.

Our metric was computed several times on both a Tesla P100 16GB GPU and a Tesla M40 GPU, but we found that, with this particular support set, the Siamese model has no predictive power, with a metric value of 0.500 ± 0.043 (see Table 3). Our other computed results for the AttLSTM and IterRefLSTM concur with the published results, which show that neither one-shot learning method has predictive power on MUV data with a support set containing 10 positive and 10 negative assay results.

The Iterative Refinement LSTM shows a narrower dispersion of scores than the other one-shot learning models. This result agrees with the published standard deviation values for the IterRefLSTM in Altae-Tran, et al. [1].

Speedup factors are determined by comparing runtimes on the NVIDIA Tesla P100 GPU to runtimes on the Tesla M40 GPU, and are presented in Tables 4 and 5. Speedup factors are found to be less pronounced for the one-shot methods, and an explanation of the speedup results is presented. The approach for training and testing one-shot methods is also described, as it involves some extra considerations which do not apply to graphical convolution.

                      | Tesla P100 runtimes       | Tesla M40 runtimes
                      | Tox21 | SIDER | MUV       | Tox21 | SIDER | MUV
Random Forests†       |    25 |    37 |    84     |    24 |    37 |    83
Graphical Convolution |    38 |    79 |    64     |    41 |   100 |   720
Siamese               |   857 | 2,180 | 1,464     |   956 | 2,407 | 1,617
AttLSTM               |   933 | 2,405 | 1,591     | 1,041 | 2,581 | 1,725
IterRefLSTM           | 1,006 | 2,511 | 1,680     | 1,101 | 2,721 | 1,834

Table 4. Runtimes for each model on the NVIDIA Tesla M40 and Tesla P100 16GB PCIe GPUs. All runtimes are in seconds.

† RF is run entirely on the CPU, so its times reflect CPU runtimes. The RF row is marked with † to indicate that its values should not be considered when determining GPU speedup factors.

A quick inspection of the results in Table 3 shows that the one-shot methods perform well on the Tox21 and SIDER data sets, but not on the MUV data. A reason for the poor performance of one-shot methods on MUV data is proposed below.

Limitations of One-Shot Networks

Compared to previous methods, one-shot networks demonstrate extraction of more information from the prior (support) data than RF or GC, but with a limitation. One-shot methods are only successful when data in the held-out testing set is sufficiently similar to data seen during training. Networks trained using one-shot methods do not perform well when trying to classify data that is too dissimilar from the sample data used for training. In the context of drug discovery, this problem is encountered when trying to apply one-shot learning to the Maximum Unbiased Validation (MUV) dataset, for example [10]. Benchmark results show that all three one-shot learning methods explored here do little better than pure chance when making classification predictions with MUV data (see Table 3).

The MUV dataset contains around 93,000 compounds, and represents a diverse collection of molecular scaffolds, compared to Tox21 and SIDER. One-shot methods do not perform as well on this data set, probably because there is less structural similarity between the elements of the MUV dataset, compared to Tox21 and SIDER. One-shot networks require some amount of structural similarity within the data set in order to extrapolate from limited data and correctly classify new, but similar, compounds.

A metric of self-similarity within a data set could be computed as a measure independent of data set size, where every element is compared to every other element and some similarity measure, such as a cosine distance, is evaluated. The similarity measure can be summed over all unique comparisons and then normalized, by dividing by the number of unique pairwise comparisons among the N elements in the set.
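One possible implementation of such a measure is sketched below, using the mean cosine similarity over all unique pairs of fingerprint vectors. The fingerprint matrix is a hypothetical stand-in, and the dense pairwise matrix is only practical for modest data set sizes.

```python
# Sketch of a data set self-similarity measure: mean cosine similarity over all
# unique pairs of fingerprint vectors, normalized by the number of unique pairs.
import numpy as np

def mean_pairwise_cosine(X):
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)     # unit-normalize each fingerprint
    sim = unit @ unit.T                        # all pairwise cosine similarities
    iu = np.triu_indices(len(X), k=1)          # unique pairs only (upper triangle)
    return sim[iu].mean()

X = np.random.default_rng(3).random((1000, 1024))   # stand-in fingerprint matrix
print(mean_pairwise_cosine(X))
```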

            | Tox21   | SIDER | MUV
GC          | 1.079   | 1.266 | 11.25 α
Siamese     | 1.116 λ | 1.104 | 1.105
AttLSTM     | 1.116 λ | 1.116 | 1.084
IterRefLSTM | 1.094 λ | 1.084 | 1.092

Table 5. Speedup factors, comparing the Tesla P100 16GB GPU to the Tesla M40. All speedups are Tesla M40 runtimes divided by Tesla P100 runtimes.


α The greatest speedup is observed with GC on the MUV data set (Table 5).

GC also exhibits the most precipitous drop in speedup when transitioning to the one-shot models. Table 4 indicates that the GC model runs faster across all data sets than the one-shot methods. This is not surprising, because the purely graphical model is more amenable to GPU acceleration. However, it is crucial to note that GC models predict worse than one-shot models on Tox21 and SIDER, but not on MUV. On MUV, GC has nearly no predictive ability (Table 3), compared to the one-shot models, which have absolutely no predictive power on MUV.

λ The one-shot networks, while providing a substantial improvement in predictive performance, do not seem to show a significant speedup, judging from the values in the rows for Siamese, attLSTM, and IterRefLSTM. The nearly absent speedup could arise from high GPU-to-system-memory transfers. Note, however, that although small, there is a slight but consistent improvement in speedup for the one-shot networks on the Tox21 set. The Tox21 data set may therefore require fewer transfers to system memory. The generally flat speedup observed for the one-shot methods may come from their LSTM elements.


Generally, deep convolutional network models, such as GC, or models which benefit from having a large data set containing structurally diverse groups, such as RF and GC, perform better on the MUV data. RF, for example, shows the best performance, even if it is only fair. Deep networks have demonstrated that, provided enough layers, they have the information-holding capacity required to learn and retain representations for the MUV data. This information-holding capacity is what enables them to distinguish between the large number of structurally diverse classes in MUV. It may also be the case that the hyperparameters for the graphical convolutional network are simply not set such that the GC model can reach even a poor-to-fair level of performance on MUV. In their paper, Altae-Tran, et al. stated that the hyperparameters for the convolutional networks were not optimized, and that there may be an opportunity to improve performance there [1].

Remarks on Neural Network Information Structure, and How One-Shot Networks are Different

All neural networks require training data in order to develop structure under training pressure. Feature complexity, in image classification networks, becomes stratified through the network’s layers under training pressure. The lowest layers emerge as edge detectors, with successive layers building upon features from previous layers. The second layer, for example, can build corner detectors, or curved-edge detectors, by detecting combinations of simpler edges. Through a buildup of feature complexity, higher layers eventually emerge which can detect complex, high-level features such as faces. The combinatorial size of the detectable feature space grows with the number of viable filters (kernels) connecting each layer to the preceding layer. With Natural Language Processing (NLP) networks, layer complexity progresses from sentence features, to paragraphs, then chapters, and finally whole-book vector representations, which consist of succinct thematic summaries of written works.

To reiterate, all networks require information structure, acquired under training pressure, to develop some inner representation, or “belief”, about data. Deep networks allow for more diverse structures to be learned, compared to one-shot networks, which are limited in their ability to learn diverse representations. One-shot structural features are designed to improve the extraction of information from support data, in order to learn a representation which can be used to extrapolate within a smaller group of similar classes. One-shot methods do not perform as well on MUV as RF does because they are not designed to produce a useful network from a data set with as much dissimilarity between molecular scaffolds as MUV exhibits.

Transfer Learning with One-Shot Learning Network Architecture

A network trained on the Tox21 data set was evaluated on the SIDER data set. The results, given by the performance metric values shown in Table 6, indicate that the network trained on Tox21 has nearly no predictive power on the SIDER data. This indicates that the performance does not generalize well to new chemical scaffolds, which supports the explanation for why one-shot methods do poorly at predicting the results for the MUV dataset.

                    | Siamese | attLSTM | IterRefLSTM
To SIDER from Tox21 | 0.505   | 0.502   | 0.504

Table 6. Transfer Learning to SIDER from Tox21. These results agree with the performance metric values reported for transfer learning in [1], and support the conclusion that transfer learning between data sets will result in no predictive capability, unless the data sets are significantly similar.

Conclusion

For binary classification tasks associated with small-population data sources, one-shot learning methods may provide significantly better results than the baseline performances of graphical convolution and random forests. The results show that the performance of one-shot learning methods may depend on the diversity of molecular scaffolds in a data set. With MUV, for example, one-shot methods did not extrapolate well to unseen molecular scaffolds. The failure of transfer learning from the Tox21 network to correctly predict SIDER assay outcomes also indicates that training may not be easily generalized across data sets with one-shot networks.

The Iterative Refinement LSTM method developed in [1] demonstrates that LSTMs can generalize to similar experimental assays which are not identical to assays in the data set, but which have some common relation.

References

1.) Altae-Tran, Han, Ramsundar, Bharath, Pappu, Aneesh S., and Pande, Vijay. “Low Data Drug Discovery with One-Shot Learning.” ACS Central Science 3.4 (2017): 283-293.
2.) Wu, Zhenqin, et al. “MoleculeNet: A Benchmark for Molecular Machine Learning.” arXiv preprint arXiv:1703.00564 (2017).
3.) Hariharan, Bharath, and Ross Girshick. “Low-shot visual object recognition.” arXiv preprint arXiv:1606.02819 (2016).
4.) Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition.” ICML Deep Learning Workshop. Vol. 2. 2015.
5.) Vinyals, Oriol, et al. “Matching networks for one shot learning.” Advances in Neural Information Processing Systems. 2016.
6.) Wang, Peilu, et al. “A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding.” arXiv preprint arXiv:1511.00215 (2015).
7.) Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in Neural Information Processing Systems. 2015.
8.) Lusci, Alessandro, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules.” Journal of chemical information and modeling 53.7 (2013): 1563-1575.
9.) Receiver Operating Curves Applet, Kennis Research, 2016.
10.) Maximum Unbiased Validation Chemical Data Set
11.) Kingma, D. and Ba, J. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980.
12.) University of Nebraska Medical Center online information resource, AUROC
13.) Nadeau, Claude, and Yoshua Bengio. “Inference for the Generalization Error.”

DeepChem – a Deep Learning Framework for Drug Discovery

A powerful new open source deep learning framework for drug discovery is now available for public download on GitHub. This new framework, called DeepChem, is Python-based, and offers a feature-rich set of functionality for applying deep learning to problems in drug discovery and cheminformatics. Machine learning frameworks such as scikit-learn have previously been applied to cheminformatics, but DeepChem is the first to accelerate computation with NVIDIA GPUs.

The framework uses Google TensorFlow, along with scikit-learn, for expressing neural networks for deep learning. It also makes use of the RDKit Python framework, for performing more basic operations on molecular data, such as converting SMILES strings into molecular graphs. The framework is now in the alpha stage, at version 0.1. As the framework develops, it will move toward implementing more models in TensorFlow, which use GPUs for training and inference. This new open source framework is poised to become an accelerating factor for innovation in drug discovery across industry and academia.
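As a small illustration of the RDKit layer DeepChem builds on, the sketch below parses a SMILES string and walks the resulting molecular graph. These are standard RDKit calls; the example molecule (aspirin) is arbitrary.

```python
# Parse a SMILES string with RDKit and inspect the molecular graph.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
print(mol.GetNumAtoms())                             # number of heavy atoms
for bond in mol.GetBonds():                          # iterate over graph edges
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
```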

Another unique aspect of DeepChem is that it has incorporated a large amount of publicly-available chemical assay datasets, which are described in Table 1.

DeepChem Assay Datasets

Dataset  | Category           | Description                            | Task Type      | Compounds
QM7      | Quantum Mechanics  | orbital energies, atomization energies | Regression     | 7,165
QM7b     | Quantum Mechanics  | orbital energies                       | Regression     | 7,211
ESOL     | Physical Chemistry | solubility                             | Regression     | 1,128
FreeSolv | Physical Chemistry | solvation energy                       | Regression     | 643
PCBA     | Biophysics         | bioactivity                            | Classification | 439,863
MUV      | Biophysics         | bioactivity                            | Classification | 93,127
HIV      | Biophysics         | bioactivity                            | Classification | 41,913
PDBBind  | Biophysics         | binding activity                       | Regression     | 11,908
Tox21    | Physiology         | toxicity                               | Classification | 8,014
ToxCast  | Physiology         | toxicity                               | Classification | 8,615
SIDER    | Physiology         | side reactions                         | Classification | 1,427
ClinTox  | Physiology         | clinical toxicity                      | Classification | 1,491

Table 1: The current v0.1 DeepChem framework includes the data sets in this table, along with others which will be added in future versions.

Metrics

The squared Pearson Correlation Coefficient is used to quantify the quality of performance of a model trained on any of these regression datasets. Models trained on classification datasets have their predictive quality measured by the area under curve (AUC) for receiver operating characteristic (ROC) curves (AUC-ROC). Some datasets have more than one task, in which case the mean over all tasks is reported by the framework.
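For reference, both metrics can be computed directly with SciPy and scikit-learn; the arrays below are placeholder data used only to show the calls.

```python
# Sketch of the two metrics described above on placeholder predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Regression task: squared Pearson correlation coefficient
y_true_reg = rng.normal(size=200)
y_pred_reg = y_true_reg + rng.normal(scale=0.5, size=200)
r, _ = pearsonr(y_true_reg, y_pred_reg)
print("Pearson R^2:", r ** 2)

# Classification task: area under the ROC curve (AUC-ROC)
y_true_clf = rng.integers(0, 2, size=200)
y_score_clf = rng.random(size=200)
print("AUC-ROC:", roc_auc_score(y_true_clf, y_score_clf))
```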

Data Splitting

DeepChem uses a number of methods for randomizing or reordering datasets, so that, for example, the training and validation sets are each thoroughly randomized. These methods are summarized in Table 2.

DeepChem Dataset Splitting Methods

Split Type              | Use Cases
Index Split             | the default index is sufficient, as long as it contains no built-in bias
Random Split            | if there is some bias to the default index
Scaffold Split          | if chemical properties of the dataset will depend on the molecular scaffold
Stratified Random Split | where one needs to ensure that each dataset split contains a full range of some real-valued property

Table 2: Various methods are available for splitting the dataset in order to avoid sampling bias.
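A hedged sketch of how these splitters are typically invoked is shown below. Class names and signatures follow the DeepChem documentation and may differ between versions; the tiny in-memory dataset exists only so the snippet is self-contained.

```python
# Sketch: apply a scaffold-aware split to a small in-memory dataset.
import numpy as np
import deepchem as dc

# Placeholder dataset: X is a dummy feature matrix, ids hold the SMILES strings
smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCCC"]
X = np.zeros((len(smiles), 4))
y = np.array([0, 1, 0, 1, 0, 1]).reshape(-1, 1)
dataset = dc.data.NumpyDataset(X, y, ids=smiles)

splitter = dc.splits.ScaffoldSplitter()   # or IndexSplitter / RandomSplitter, etc.
train, valid, test = splitter.train_valid_test_split(
    dataset, frac_train=0.5, frac_valid=0.25, frac_test=0.25)
print(len(train), len(valid), len(test))
```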

Featurizations

DeepChem offers a number of featurization methods, summarized in Table 3. SMILES strings are unique representations of molecules, and can themselves be used as a molecular feature. The use of SMILES strings as features has been explored in recent work, and SMILES featurization will likely become a part of future versions of DeepChem.

Most machine learning methods, however, require more feature information than can be extracted from a SMILES string alone.

DeepChem Featurizers

Featurizeruse cases
Extended-Connectivity Fingerprints (ECFP)for molecular datasets not containing large numbers of non-bonded interactions
Graph ConvolutionsLike ECFP, graph convolution produces granular representations of molecular topology. Instead of applying fixed hash functions, as with ECFP, graph convolution uses a set of parameters which can learned by training a neural network associated with a molecular graph structure.
Coloumb MatrixColoumb matrix featurization captures information about the nuclear charge state, and internuclear electric repulsion. This featurization is less granular than ECFP, or graph convolutions, and may perform better where intramolecular electrical potential may play an important role in chemical activity
Grid Featurizationdatasets containing molecules interacting through non-bonded forces, such as docked protein-ligand complexes

Table 3: Featurization methods available in DeepChem v0.1.
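As a rough illustration, two of the featurizers from Table 3 might be applied as follows. Class names follow the DeepChem documentation and may vary between versions; the molecule is an arbitrary example.

```python
# Sketch: featurize one molecule with ECFP and with the graph-convolution featurizer.
import deepchem as dc

smiles = ["CC(=O)Oc1ccccc1C(=O)O"]              # aspirin, as an example

ecfp = dc.feat.CircularFingerprint(size=1024)   # Extended-Connectivity Fingerprints
print(ecfp.featurize(smiles).shape)             # -> (1, 1024) bit-vector features

graph = dc.feat.ConvMolFeaturizer()             # graph-convolution featurization
conv_mols = graph.featurize(smiles)             # one ConvMol object per molecule
print(conv_mols[0].get_atom_features().shape)   # per-atom feature matrix
```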

Supported Models

Supported Models as of v0.1

Model Type              | Possible Use Case
Logistic Regression     | binary classification (predicting class probabilities)
Random Forest           | classification or regression
Multitask Network       | if several prediction types are required, a multitask network is a good choice; for example, a continuous real-valued prediction along with one or more categorical predictions as predicted outcomes
Bypass Network          | classification and regression
Graph Convolution Model | same as Multitask Networks

Table 4: Model types supported by DeepChem 0.1

A Glimpse into the Tox21 Dataset and Deep Learning

The Toxicology in the 21st Century (Tox21) research initiative led to the creation of a public dataset which includes measurements of activation of stress response and nuclear receptor response pathways by 8,014 distinct molecules. Twelve response pathways were observed in total, with each having some association with toxicity. Table 5 summarizes the pathways investigated in the study.

Tox21 Assay Descriptions

Biological Assay | Description
NR-AR            | Nuclear Receptor Panel, Androgen Receptor
NR-AR-LBD        | Nuclear Receptor Panel, Androgen Receptor, luciferase
NR-AhR           | Nuclear Receptor Panel, aryl hydrocarbon receptor
NR-Aromatase     | Nuclear Receptor Panel, aromatase
NR-ER            | Nuclear Receptor Panel, Estrogen Receptor alpha
NR-ER-LBD        | Nuclear Receptor Panel, Estrogen Receptor alpha, luciferase
NR-PPAR-gamma    | Nuclear Receptor Panel, peroxisome proliferator-activated receptor gamma
SR-ARE           | Stress Response Panel, nuclear factor (erythroid-derived 2)-like 2 antioxidant responsive element
SR-ATAD5         | Stress Response Panel, genotoxicity indicated by ATAD5
SR-HSE           | Stress Response Panel, heat shock factor response element
SR-MMP           | Stress Response Panel, mitochondrial membrane potential
SR-p53           | Stress Response Panel, DNA damage p53 pathway

Table 5: Biological pathway responses investigated in the Tox21 Machine Learning Challenge.

We used the Tox21 dataset to make predictions on molecular toxicity in DeepChem using the variations shown in Table 6.

Model Construction Parameter Variations Used

Parameter         | Variations
Dataset Splitting | Index, Scaffold
Featurization     | ECFP, Molecular Graph Convolution

Table 6: Model construction parameter variations used in generating our predictions, as shown in Figure 1.

A .csv file containing SMILES strings for 8,014 molecules was used to first featurize each molecule, using either ECFP or molecular graph convolution. IUPAC names for each molecule were queried from NIH Cactus, and toxicity predictions were made, using a trained model, on a set of nine molecules randomly selected from the full Tox21 data set. Nine results showing molecular structure (rendered by RDKit), IUPAC names, and predicted toxicity scores, across all 12 biochemical response pathways described in Table 5, are shown in Figure 1.

Figure 1. Tox21 predictions for nine randomly selected molecules from the Tox21 dataset
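A hedged, end-to-end sketch of this kind of workflow is shown below: load Tox21 with ECFP featurization, fit a simple baseline model on one assay task, and score held-out molecules. The loader arguments follow the DeepChem documentation and may differ between versions; this is not the exact pipeline used to generate Figure 1.

```python
# Sketch: Tox21 with ECFP features, plus a random-forest baseline on one assay task.
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="ECFP")

task_idx = 0                                    # e.g. the NR-AR assay
mask = train.w[:, task_idx] > 0                 # keep molecules measured for this task
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(train.X[mask], train.y[mask, task_idx])

valid_mask = valid.w[:, task_idx] > 0
scores = clf.predict_proba(valid.X[valid_mask])[:, 1]
print("AUC-ROC:", roc_auc_score(valid.y[valid_mask, task_idx], scores))
```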

Expect more from DeepChem in the Future

The DeepChem framework is undergoing rapid development, and is currently at the 0.1 release version. New models and features will be added, along with more data sets, in future versions. You can download the DeepChem framework from GitHub. There is also a website for framework documentation at deepchem.io.

Microway offers DeepChem pre-installed on our line of WhisperStation products for Deep Learning. Researchers interested in exploring deep learning applications with chemistry and drug discovery can browse our line of WhisperStation products.


Deep Learning Applications in Science and Engineering

Over the past decade, and particularly over the past several years, deep learning applications have been developed for a wide range of scientific and engineering problems. For example, deep learning methods have recently increased the level of significance of the Higgs Boson detection at the LHC. Similar analysis is being used to explore possible decay modes of the Higgs. Deep Learning methods fall under the larger category of Machine Learning, which includes various methods, such as Support Vector Machines (SVMs), Kernel Methods, Hidden Markov Models (HMMs), Bayesian Methods, along with regression techniques, among others.

Deep learning is a methodology which involves computation using an artificial neural network (ANN). Training deep networks was not always within practical reach, however. The main difficulty arose from the vanishing/exploding gradient problem. See a previous blog on Theano and Keras for a discussion on this. Training deep networks has become feasible with the developments of GPU parallel computation, better error minimization algorithms, careful network weight initialization, the application of regularization or dropout methods, and the use of Rectified Linear Units (ReLUs) as artificial neuron activation functions. ReLUs can tune weights such that the backpropagation signal does not become too attenuated.
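A toy NumPy calculation illustrates the point about ReLUs. It ignores weights and treats each layer's derivative in isolation, so it is only a caricature of backpropagation, but it shows how a sigmoid-based gradient path shrinks geometrically with depth while a ReLU path does not.

```python
# Caricature of the vanishing gradient: the sigmoid derivative is at most 0.25,
# so a backpropagated signal shrinks geometrically with depth; the ReLU derivative
# is 1 for active units and passes the signal through unattenuated.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, x = 20, 0.5
sig_signal, relu_signal = 1.0, 1.0
for _ in range(depth):
    s = sigmoid(x)
    sig_signal *= s * (1.0 - s)              # sigmoid derivative at x
    relu_signal *= 1.0 if x > 0 else 0.0     # ReLU derivative for an active unit

print(f"after {depth} layers: sigmoid path ~ {sig_signal:.2e}, ReLU path ~ {relu_signal:.1f}")
```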

Application of Trained Deep Networks

Trained deep networks are now being applied to a wide range of problems. Some areas of application could instead be solved using a numerical model comprised of discrete differential equations. For instance, deep learning is being applied to the protein folding problem, which could be modeled as a physical system, using equations of motion (for very small proteins), or energy-based minimization methods (for larger systems). Used as an alternative approach, a deep network can be trained on correctly folded tertiary protein structures, given primary and secondary structure, as input data. The trained network could then predict a protein’s tertiary structure [Lena, P.D., et al.].

Neural networks offer an alternative, data-based method to solve problems which were previously approached using physical numerical models, or other machine learning methods. A distinction between data-based models and physical models is that data-based models can be applied to problems for which no well-accepted, or practical, predictive theoretical framework exists.

Deep Neural Networks as Biological Analogs

Aside from providing an alternative data-based approach to problems for which no discrete physical model may exist, deep learning applications can reproduce some function of a real-world biological neural network analog, such as vision, or hearing.

In both biological and artificial visual networks, the lower convolutional layers detect the most basic features, such as edges. Convolutional layers are separated by pooling layers, which add some robustness to feature detection, so that if a feature is translated slightly, or rotated a bit, it will still be detected. Successive convolution/feature layers build from edges to form features with multiple edges, or curves. The highest convolutional layers are built upon combinations of features from previous layers, and so form the most complex feature detectors. The weights in the highest layers become set, through training pressure, to act as detectors for complex shapes, such as faces, chairs, tires, houses, doors, etc. The layers in a deep visual classification network thereby separate out image features from lowest to highest complexity. If a network does not have sufficient depth, then there will not be good separation, and the classifications will be too blurred and unfocused.

Deep Learning Applications in Science and Engineering

Despite the advances of the past decade, deep learning cannot presently be applied to just any sort of research problem. Some problems still have either not been expressed in an information framework that is compatible with deep learning, or there are not yet deep network architectures that exist which can perform the kinds of functions needed. Deep learning has shown surprising progress, however. For example, a recent advance in deep learning surpassed nearly everyone’s expectation, when Google DeepMind’s “AlphaGo” AI player defeated the World Champion, Lee Sedol, in the game of Go. This milestone achievement was thought to be decades away, not mere months.

The following sections are not meant to be a complete description of deep learning applications, but are meant to demonstrate the wide range of scientific research problems to which deep learning can be applied. Recent major developments in algorithms, methods, and parallel computation with GPUs, have created the right conditions which precipitated the recent succession of major advances in the field of Artificial Intelligence.

Deep Learning in Image Classification

Image Classification uses a particular type of deep neural network, called a convolutional neural network (CNN). Figure 1 illustrates the basic organization of a CNN for visual classification. The actual network for this sort of task would have more neurons per layer.

Figure 1. Convolutional Neural Network for Facial Recognition
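A minimal Keras-style sketch of the layer organization shown in Figure 1 appears below. Layer counts and sizes are illustrative only, and the three softmax outputs mirror the figure rather than any real model.

```python
# Illustrative CNN: stacked convolution and pooling layers feeding a small classifier.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),  # edge detectors
    layers.MaxPooling2D((2, 2)),                   # adds robustness to small translations
    layers.Conv2D(32, (3, 3), activation="relu"),  # combinations of edges (corners, curves)
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),  # higher-level facial features
    layers.Flatten(),
    layers.Dense(3, activation="softmax"),         # three identity outputs, as in Figure 1
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```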

It is, however, possible to scale an image down to some level without losing the essential features for detection. If the features in an image are too large or too small for the filter sizes, however, then the features will not be detected. This is a subtle point which presents a problem for visual recognition. Recent approaches have addressed this problem by having the network construct various filters of the same feature but at different size scales. For a given trained neural network, images must be scaled such that their feature sizes match those in the highest layer convolutional filters. The pooling layers impart some robustness for feature detection, which allow for some small amount of feature rotation and translations. However, if the face is flipped upside down, nothing will work, and the network will not correctly identify the face. This is in fact a methodological difficulty, and in order to address it, the network must develop feature detectors for the same feature, but at different rotations. This can be done by including rotations of the image into the training set. A similar problem arises if a face is rotated not in the plane, but out of the plane. This introduces distortions of key facial features, which would once again foil a network trained only on forward facing faces. These problems of rotational and scale invariance are active areas of research in the area of object recognition.
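The augmentation idea mentioned above can be sketched with the Keras ImageDataGenerator; the parameter values and the random stand-in images are illustrative only.

```python
# Sketch: augment training images with random rotations, shifts, and flips so the
# network learns feature detectors that tolerate these transformations.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=30,        # random in-plane rotations up to +/- 30 degrees
    width_shift_range=0.1,    # small horizontal translations
    height_shift_range=0.1,   # small vertical translations
    horizontal_flip=True,
)

images = np.random.rand(8, 64, 64, 1)                 # stand-in grayscale batch
labels = np.eye(3)[np.random.randint(0, 3, size=8)]   # stand-in one-hot labels
aug_images, aug_labels = next(augmenter.flow(images, labels, batch_size=8))
print(aug_images.shape)
```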

Looking at the network in Figure 1, the three output neurons indicate, in coded form, the name of the person whose face is presented to the network. The grayscale values in the images indicate connection weight values.

Figure 2. NVIDIA DRIVE PX2 for Autonomous Cars (image re-used with permission by NVIDIA Automotive)

In one research development, de-noising autoencoders were used to remove fog from images taken live from autonomous land vehicles. A similar solution was used for enhancing low-light images [Lore, K., et al.]. In Figure 1, each square tile represents a different convolutional filter, which is formed under training pressure to extract certain features. The convolutional filters in the lowest layers pick out edges. Higher layers detect more complex features, which could consist of combinations of edges, to form a nose, or chin, for example.

Recent major advances in machine vision research include image content tagging (Regional Convolutional Neural Networks, or RCNNs), along with development of more robust recognition of objects, in the presence of noise, applied rotation, size variation, etc. Scene recognition deep networks are currently being used in self-driving cars (Figure 2).

Deep Learning in Natural Language Processing

Significant progress has also been made in Natural Language Processing (NLP). Using word and sentence vector representations along with syntactic tree parsing, NLP ANNs have been able to identify complex variations in written form, such as sarcasm, where a seemingly positive sentence takes on a sudden negative meaning [Socher, R., et al.]. The meaning of large groups or bodies of sentences, such as articles, or chapters from books, can be resolved to a group of vectors, summarizing the meaning of the text.

In a different NLP application, a collection of IMDB movie reviews were used to train a deep network to evaluate the sentiment of movie reviews. An approach for this was examined using Keras in a previous Microway Tech Tips blog post. Primary applications of NLP deep networks include language translation, and sentiment analysis.

Transcription of video to text, in the absence of audio, is an active area of research which involves both Image Classification and NLP. Deep networks for NLP are usually recurrent neural networks (RNNs), having the output fed back into the input layer. For NLP tasks, the previous context partly determines the best vector for representing the next sentence, or word. For a review of RNNs, see, for example, The Unreasonable Effectiveness of Recurrent Neural Networks, by Andrej Karpathy.

Language Translation

Encoder-Decoder frameworks have been developed for encoding English words, for example, into reduced vector representations, and then decoding the reduced representations of English words into French words, using a French decoder. The encoding/decoding can be done between any two languages. The reduced representation can be thought to be a universal encoding, which encodes the word into a distributed pattern in the network [Cho, K., et al.]. This sort of distributed encoded pattern has been referred to as constituting a “thought”, or an internal encoded representation of data.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is another research area where deep networks are producing results better than any method previously used, including Hidden Markov Models. Previously, ASR used Hidden Markov Models mixed with ANNs in a hybrid method. Because deep networks for ASR can now be trained within practical timescales, deep learning is now producing the best results. Long Short-Term Memory (LSTM) [Song, W., et al. & Sak, H., et al.] and Gated Recurrent Units (GRU) play an important role in improving ASR deep networks, by helping to retain information from more than several iterations ago. The TIMIT speech dataset is used as a primary data source for training ASR deep networks.

Deep Learning in Scientific Experiment Design

Using a deep learning approach, machines can now provide direction on the design of scientific experiments. Consider, for example, a recent deep learning approach taken by materials scientists, where new NiTi-based shape memory alloys were explored for lower thermal diffusivity [Xue, D., et al.]. From a dataset of 22 known NiTi-based alloys, a deep network was trained to predict their 22 measured thermal diffusivity values. Particular physical properties of the 22 alloys were used as input parameters for training the network.

With the deep network trained on the known alloys, it was used to determine the diffusivity values for a large number of theoretical alloys. Four alloys were selected from the predicted set which showed the lowest estimated thermal diffusivity values. Real experiments were then carried out on these four theoretical alloys, and their thermal diffusivity values were measured. The data for these four new alloys, with known thermal diffusivities were then added to the training set, and the network was re-trained in order to improve the accuracy of the deep network. After the experiment proceeded in iterations of four unexplored alloys, the final remarkable result was reached, where 14 of the 36 new alloys had a smaller thermal diffusivity than any of the 22 known alloys in the original data set.

High Throughput Screening Experimentation will be improved with Deep Learning

Research problems which have large combinatorics of possible experiments, such as the investigation of new NiTi shape memory alloys, are likely to be expressible into an information framework conducive for solving with deep learning. Once trained, the deep networks will help the investigator sort through the vast combinatoric landscape of experiment design possibilities. Trained deep networks will estimate which experiments will result in the best property being sought after. Once the best candidate experiments are performed, and the property of interest is measured, the deep network can be re-trained with the new data. Instead of starting with a total of twenty 384-well plates, for example, the researcher may only need one quarter of this amount, or may instead fill the twenty plates with more promising molecular candidates.

Deep Learning in High Energy Physics

The discovery of the Higgs Boson marked a major achievement for the Standard Model of high energy particle physics. First detected in 2011/2012 at the CERN LHC, the elusive particle was hypothesized to be responsible for imparting the property of mass onto other particles (except for massless particles). Detecting the Higgs Boson with a high enough level of certainty to declare it an actual discovery required examining its decay modes in millions of high energy particle collisions, where two protons collided at sufficiently high energy to create two heavy tau leptons, which then spontaneously decayed into lighter leptons, the muon and the electron. Through the course of these spontaneous decays, tell-tale signatures could be discerned in the data, indicating that the resulting particles and momenta were very likely to have come from the decay of a Higgs Boson.

Machine learning techniques have been used in particle physics data analysis since their development; the application of deep networks and deep learning is an extension of methods that have previously been widely used for this sort of analysis [Sadowski, P., et al. & Sadowski, P., et al.].

Deep Learning in Drug Discovery

Deep Learning Applications to Drug Discovery
Figure 3. A DNN is trained on gene expression levels and pathway activations scores to predict therapeutic use categories

Deep Learning is beginning to see applications in pharmacology, in processing large amounts of genomic, transcriptomic, proteomic, and other “-omic” data [Mamoshina, P., et al.]. Recently, a deep network was trained to categorize drugs according to therapeutic use by observing transcriptional levels present in cells after treating them with drugs for a period of time [Aliper, A., et al.] (Figure 3). Deep learning has also been used to identify biomarkers from blood which are strong indicators for age [Putin, E., et al.].

Deep Learning is Just Getting Started

DRIVE PX onboard embedded system
Figure 4. Object Recognition by NVIDIA DRIVE PX, an onboard scene processing neural network (image re-used with permission by NVIDIA Automotive)

In addition to the applications mentioned here, there are numerous others, including robotics, autonomous vehicles (see Figure 4), genomics, bioinformatics [Alipanahi, B., et al.], and cancer screening. The 21st International Conference on Pattern Recognition (ICPR2012), for example, hosted a challenge for detecting breast cancer cell mitosis in histological images. In April 2016, Massachusetts General Hospital (MGH) announced it would begin a major research effort to explore ways to improve health care and disease management by applying artificial intelligence and deep learning to a vast and growing volume of personal health data. MGH will be using the NVIDIA DGX-1 Deep Learning Appliance as the hardware platform for the research initiative.

Want to use Deep Learning?

Microway’s Sales Engineers are excited about deep learning, and we are happy to help you find the best solution for your research. Let us know what you’re working on and we’ll help you put together the right configuration.

References

1. Lena, Pietro D., Ken Nagata, and Pierre F. Baldi. “Deep spatio-temporal architectures and learning for protein structure prediction.” Advances in Neural Information Processing Systems. 2012.
2. Lee, Honglak, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.” In Proceedings of the 26th annual international conference on machine learning, pp. 609-616. ACM, 2009.
3. Lore, Kin Gwn, Adedotun Akintayo, and Soumik Sarkar. “LLNet: A Deep Autoencoder Approach to Natural Low-light Image Enhancement.” arXiv preprint arXiv:1511.03995 (2015).
4. Socher, Richard, et al. “Recursive deep models for semantic compositionality over a sentiment treebank.” Proceedings of the conference on empirical methods in natural language processing (EMNLP). Vol. 1631. 2013.
5. Cho, Kyunghyun, et al. “On the properties of neural machine translation: Encoder-decoder approaches.” arXiv preprint arXiv:1409.1259 (2014).
6. Song, William, and Jim Cai. “End-to-End Deep Neural Network for Automatic Speech Recognition.”
7. Sak, Haşim, Andrew Senior, and Françoise Beaufays. “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition.” arXiv preprint arXiv:1402.1128 (2014).
8. Xue, Dezhen, et al. “Accelerated search for materials with targeted properties by adaptive design.” Nature Communications (2016). DOI: 10.1038/ncomms11241
9. Sadowski, Peter J., Daniel Whiteson, and Pierre Baldi. “Searching for higgs boson decay modes with deep learning.” Advances in Neural Information Processing Systems. 2014.
10. Sadowski, P., J. Collado, D. Whiteson, and P. Baldi. “Deep Learning, Dark Knowledge, and Dark Matter.” JMLR: Workshop and Conference Proceedings 42:81-97, 2015.
11. Mamoshina, Polina, et al. “Applications of deep learning in biomedicine.” Molecular pharmaceutics 13.5 (2016): 1445-1454.
12. Aliper, Alexander, et al. “Deep learning applied to predicting pharmacological properties of drugs and drug repurposing using transcriptomic data.” Molecular pharmaceutics (2016).
13. Putin, Evgeny, et al. “Deep biomarkers of human aging: Application of deep neural networks to biomarker development.” Aging 8.5 (2016).

The post Deep Learning Applications in Science and Engineering appeared first on Microway.

Accelerating Code with OpenACC and the NVIDIA Visual Profiler https://www.microway.com/hpc-tech-tips/accelerating-code-with-openacc-and-nvidia-visual-profiler/ https://www.microway.com/hpc-tech-tips/accelerating-code-with-openacc-and-nvidia-visual-profiler/#respond Mon, 14 Mar 2016 15:00:48 +0000 http://https://www.microway.com/?p=6249 Comprised of a set of compiler directives, OpenACC was created to accelerate code using the many streaming multiprocessors (SM) present on a GPU. Similar to how OpenMP is used for accelerating code on multicore CPUs, OpenACC can accelerate code on GPUs. But OpenACC offers more, as it is compatible with multiple architectures and devices, including […]

The post Accelerating Code with OpenACC and the NVIDIA Visual Profiler appeared first on Microway.

Comprised of a set of compiler directives, OpenACC was created to accelerate code using the many streaming multiprocessors (SM) present on a GPU. Similar to how OpenMP is used for accelerating code on multicore CPUs, OpenACC can accelerate code on GPUs. But OpenACC offers more, as it is compatible with multiple architectures and devices, including multicore x86 CPUs and NVIDIA GPUs.

Here we will examine some fundamentals of OpenACC by accelerating a small program consisting of iterations of simple matrix multiplication. Along the way, we will see how to use the NVIDIA Visual Profiler to identify parts of the code which call OpenACC compiler directives. Graphical timelines displayed by the NVIDIA Visual Profiler visually indicate where greater speedups can be achieved. For example, applications which perform excessive host-to-device data transfer (and vice versa) can be significantly improved by eliminating the excess data transfer.

Industry Support for OpenACC

OpenACC is an open specification, developed through a collaboration between PGI, Cray, and CAPS, which sets out compiler directives (sometimes called pragmas) for accelerators. The major compilers supporting OpenACC at its inception came from those same three companies. The OpenACC Toolkit (which includes the PGI compilers) is available for download from NVIDIA.

The free and open source GNU GCC compiler also supports OpenACC, although this support may trail the commercial implementations.

Introduction to Accelerating Code with OpenACC

Logo of the OpenACC standard for Accelerator Directives
OpenACC facilitates the process of accelerating existing applications by requiring changes only to compute-intense sections of code, such as nested loops. A nested loop might go through many serial iterations on a CPU. By adding OpenACC directives, which look like specially-formatted comments, the loop can run in parallel to save significant amounts of runtime. Because OpenACC requires only the addition of compiler directives, usually along with small amounts of re-writing of code, it does not require extensive re-factoring of code. For many code bases, a few dozen effectively-placed compiler directives can achieve significant speedup (though it should be mentioned that most existing applications will likely require some amount of modification before they can be accelerated to near-maximum performance).

OpenACC is relatively new among the frameworks, software development kits, and programming interfaces available for accelerating code on GPUs. The 1.0 stable release of OpenACC was first made available in November 2011, the 2.0 stable release followed in June 2013, and OpenACC 3.0 is current as of November 2019.

Diagram of the Maxwell architecture's Streaming Multiprocessor (SMM)
Figure 1 The Maxwell Architecture Streaming Multiprocessor (SM)

By reading OpenACC directives, the compiler assembles CUDA kernels from each section of compute-intense code. Each CUDA kernel is a portion of code that will be sent to the many GPU Streaming Multiprocessor processing elements for parallel execution (see Figure 1).

The Compute Unified Device Architecture (CUDA) is an application programming interface (API), which was developed by NVIDIA for the C and Fortran languages. CUDA allows for parallelization of computationally-demanding applications. Those looking to use OpenACC do not need to know CUDA, but those looking for maximum performance usually need to use some direct CUDA calls. This is accomplished either by the programmer writing tasks as CUDA kernels, or by calling a CUDA ‘drop-in’ library. With these libraries, a developer invokes accelerated routines without having to write any CUDA kernels. Such CUDA ‘drop-in’ libraries include CUBLAS, CUFFT, CURAND, CUSPARSE, NPP, among others. The libraries mentioned here by name are included in the freely available CUDA toolkit.

While OpenACC makes it easier for scientists and engineers to accelerate large and widely-used code bases, it is sometimes only the first step. With CUDA, a more extensive process of code refactoring and acceleration can be undertaken, and greater speedups can be achieved. OpenACC is therefore a relatively easy first step toward GPU acceleration; the second, optional, and more challenging step requires code refactoring with CUDA.

OpenACC Parallelization Reports

There are several tools available for reporting information on the parallel execution of an OpenACC application. Some of these tools run within the terminal and are text-based. The text reports can be generated by setting particular environment variables (more on this below), or by invoking compiler options when compiling at the command line. Text reports will provide detail on which portions of the code can be accelerated with kernels.
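For example (an assumption on our part based on the PGI toolchain used in this post; the variable names differ between compilers and releases), setting an accelerator-notification environment variable in the shell before running the executable prints a line for each kernel launch:

export PGI_ACC_NOTIFY=1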

The NVIDIA Visual Profiler has a graphical interface which displays a timeline detailing when data transfers occur between the host and device. Kernel launches and runtimes are indicated with colored horizontal bars. The graphical timeline and the text reports in the terminal together provide important information which can indicate sections of code that are reducing performance. By locating inefficiencies in data transfers, for example, the runtime can be reduced by restructuring parallel regions. The example below illustrates a timeline report showing excessive data transfers between the system and the GPU (the host and the device).

Applying OpenACC to Accelerate Matrix Operations

Start with a Serial Code

To illustrate OpenACC usage, we will examine an application which performs common matrix operations. To begin, look at the serial version of the code (without OpenACC compiler directives) in Figure 2:

[sourcecode language=”C”]
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "math.h"

void fillMatrix(int size, float **restrict A) {
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      A[i][j] = ((float)i);
    }
  }
}

float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}

float** MakeMatrix(int size, float **restrict arr) {
  int i;
  arr = (float **)malloc( sizeof(float *) * size);
  arr[0] = (float *)malloc( sizeof(float) * size * size);
  for (i=1; i<size; i++){
    arr[i] = (float *)(arr[i-1] + size);
  }
  return arr;
}

void showMatrix(int size, float **restrict arr) {
  int i, j;
  for (i=0; i<size; i++){
    for (j=0; j<size; j++){
      printf("arr[%d][%d]=%f \n",i,j,arr[i][j]);
    }
  }
}

void copyMatrix(float **restrict A, float **restrict B, int size){
  for (int i=0; i<size; ++i){
    for (int j=0; j<size; ++j){
      A[i][j] = B[i][j];
    }
  }
}

int main (int argc, char **argv) {
  float **A, **B, **C;

  if (argc != 3) {
    fprintf(stderr,"Use: %s size nIter\n", argv[0]);
    return -1;
  }
  int size = atoi(argv[1]);
  int nIter = atoi(argv[2]);

  if (nIter <= 0) {
    fprintf(stderr,"%s: Invalid nIter (%d)\n", argv[0],nIter);
    return -1;
  }
  A = (float**)MakeMatrix(size, A);
  fillMatrix(size, A);
  B = (float**)MakeMatrix(size, B);
  fillMatrix(size, B);
  C = (float**)MakeMatrix(size, C);

  float startTime_tot = omp_get_wtime();
  for (int i=0; i<nIter; i++) {
    float startTime_iter = omp_get_wtime();
    C = MatrixMult(size, A, B, C);
    if (i%2==1) {
      // copy the current product C back into A (every second iteration)
      copyMatrix(A, C, size);
    }
    else {
      // copy the current product C back into B (the remaining iterations)
      copyMatrix(B, C, size);
    }
    float endTime_iter = omp_get_wtime();
  }
  float endTime_tot = omp_get_wtime();
  printf("%s total runtime %8.5g\n", argv[0], (endTime_tot-startTime_tot));
  free(A[0]); free(A); free(B[0]); free(B); free(C[0]); free(C);
  return 0;
}
[/sourcecode]

Figure 2 Be sure to include the stdio.h and stdlib.h header files. Without these includes, you may encounter segmentation faults during dynamic memory allocation for 2D arrays.

If the program is run in the NVIDIA Visual Profiler without any OpenACC directives, the console output will not include a timeline. Bear in mind that the runtime displayed in the console includes overhead from the profiler itself. To get a more accurate measurement of runtime, run the program at the command line without the profiler. To compile the serial executable with the PGI compiler, run:

pgcc -fast -o ./matrix_ex_float ./matrix_ex_float.c

The serial runtime for five iterations with 1000x1000 matrices is 7.57 seconds. Using larger 3000x3000 matrices with five iterations increases the serial runtime to 265.7 seconds.

Parallelizing Matrix Multiplication

The procedure-calling iterative loop within main() cannot, in this case, be parallelized, because the value of matrix A depends on a series of sequence-dependent multiplications. This is the case with all sequence-dependent evolution of data, such as time-stepped iterations in molecular dynamics (MD). Loops performing time evolution cannot be run in parallel, because the causality between discrete time steps would be lost. Another way of stating this is that loops with backward dependencies cannot be made parallel.
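To make the idea of a backward (loop-carried) dependency concrete, here is a small serial C illustration (our own sketch, not part of the matrix example): in the first loop each iteration reads a value written by the previous iteration, so the iterations cannot be reordered or run concurrently without changing the result, while the second loop has no such dependency and is safe to parallelize.

[sourcecode language=”C”]
void dependency_examples(int n, float alpha, float *restrict x,
                         float *restrict y, const float *restrict b) {
  // Loop-carried (backward) dependency: iteration i consumes the value
  // produced by iteration i-1, so the iterations must run in order.
  for (int i = 1; i < n; ++i) {
    x[i] = alpha * x[i-1] + b[i];
  }

  // No loop-carried dependency: each iteration touches only its own element,
  // so this loop is a good candidate for an OpenACC kernels/loop directive.
  for (int i = 0; i < n; ++i) {
    y[i] = alpha * x[i] + b[i];
  }
}
[/sourcecode]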

With the application presented here, the correct matrix product is dependent on the matrices being multiplied together in the correct order, since matrix multiplication does not, in general, commute. If the loop were run in parallel, the outcome would be unpredictable, and very likely not what the programmer intended. For example, the correct output for our application, after three iterations, takes on the form AxBxAxBxB. This accounts for the iterative reassignments of A and B to intermediate forms of the product matrix, C. After four iterations, the sequence becomes AxBxAxBxBxAxBxB. The main point: if this loop were run in parallel, this sequence would very likely be disrupted into some other sequence, because there would be no control over the order in which the threads representing loop iterations execute on the GPU.

[sourcecode language=”C”]
for (int i=0; i<nIter; i++) {
  float startTime_iter = omp_get_wtime();
  C = MatrixMult(size, A, B, C);
  if (i%2==1) {
    // copy the current product C back into A (every second iteration)
    copyMatrix(A, C, size);
  }
  else {
    // copy the current product C back into B (the remaining iterations)
    copyMatrix(B, C, size);
  }
  float endTime_iter = omp_get_wtime();
}
[/sourcecode]

We’ve established that the loop in main() is non-parallelizable, having an implicit dependence on the order of execution of loop iterations. To achieve a speedup, one must examine the routine within the loop: MatrixMult()

[sourcecode language=”C”]
float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
  {
    float tmp;
    for (int i=0; i<size; ++i) {
      for (int j=0; j<size; ++j) {
        tmp = 0.;
        for (int k=0; k<size; ++k) {
          tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
      }
    }
  }
  return C;
}
[/sourcecode]

Here, a kernels OpenACC directive has been placed around all three for loops. Three happens to be the maximum number of nested loops that can be parallelized within a single nested structure. Note that the syntax for an OpenACC compiler directive in C takes the following form:

#pragma acc kernels [clauses]

In the code above, the kernels directive tells the compiler that it should try to convert this section of code into a CUDA kernel for parallel execution on the device. Instead of describing a long list of OpenACC directives here, an abbreviated list of commonly used directives appears below in Table 1 (see the references for complete API documentation):

Commonly used OpenACC directives
  • #pragma acc parallel – Start parallel execution on the device. The compiler will generate parallel code whether the result is correct or not.
  • #pragma acc kernels – Hint to the compiler that kernels may be generated for the defined region. The compiler may generate parallel code for the region if it determines that the region can be accelerated safely. Otherwise, it will output warnings and compile the region to run in serial.
  • #pragma acc data – Define contiguous data to be allocated on the device; establish a data region, minimizing excessive data transfer to/from the GPU.
  • #pragma acc loop – Define the type of parallelism to apply to the loop that follows.
  • #pragma acc region – Define a parallel region where the compiler will search for code segments to accelerate. The compiler will attempt to automatically parallelize whatever it can, and report during compilation exactly what portions of the parallel region have been accelerated.

Table 1 OpenACC Compiler Directives

Along with directives, there can be modifying clauses. In the example above, we are using the kernels directive with the pcopyin(list) and pcopyout(list) clauses. These are abbreviations for present_or_copyin(list), and present_or_copyout(list).

  • pcopy(list) tells the compiler to copy the data to the device, but only if data is not already present. Upon exiting from the parallel region, any data which is present will be copied to the host.
  • pcopyin(list) tells the compiler to copy to the device if the data is not already there.
  • pcopyout(list) tells the compiler to allocate the data on the device if it is not already present, and to copy it back to the host at the end of the region. The variables and arrays in list are those which will be copied.
  • present_or_copy(list) clauses avoid the reduced performance of excessive data copies, since the data needed may already be present.

After adding the kernels directive to MatrixMult(), compile and run the executable in the profiler. To compile a GPU-accelerated OpenACC executable with PGI, run:

pgcc -fast -acc -ta=nvidia -Minfo -o ./matrix_ex_float ./matrix_ex_float.c

The -Minfo flag is used to enable informational messages from the compiler. These messages are crucial for determining whether the compiler is able to apply the directives successfully, or whether there is some problem which could possibly be solved. For an example of a compiler message reporting a warning, see the section ‘Using a Linearized Array Instead of a 2D Array’ in the next OpenACC blog, entitled ‘More Tips on OpenACC Code Acceleration‘.

To run the executable in the NVIDIA Visual Profiler, run:

nvvp ./matrix_ex_float 1000 5

During execution, the 1000x1000 matrices – A and B – are created and multiplied together into a product. The command line argument 1000 specifies the dimensions of the square matrix and the argument 5 sets the number of iterations for the loop to run through. The NVIDIA Visual Profiler will display the timeline below:

Screenshot of NVIDIA Visual Profiler Timeline showing the test case where pcopyin and pcopyout are used in MatrixMult().
Figure 3 (click for expanded view)

Note that there are two Host to Device transfers of matrices A and B at the start of every iteration. Data transfers to the device, occurring after the first transfer, are excessive. In other words, every data copy after the first one is wasted time and lost performance.

Using the OpenACC data Directive to Eliminate Excess Data Transfer

Because the parallel region consists of only the loops in the MatrixMult() routine, entire copies of matrices A and B are passed to the device every time this routine is called. Since the data only needs to be sent before the first iteration, it makes sense to expand the data region to encompass every call to MatrixMult(). The boundary of the data region must be pushed out to encompass the loop in main(). By placing a data directive just outside of this loop, as shown in Figure 4, the unnecessary copying of A and B to the device after the first iteration is eliminated:

[sourcecode language=”C”]
#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size],C[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
{
  float startTime_tot = omp_get_wtime();
  for (int i=0; i<nIter; i++) {
    float startTime_iter = omp_get_wtime();
    C = MatrixMult(size, A, B, C);
    if (i%2==1) {
      // copy the current product C back into A (every second iteration)
      copyMatrix(A, C, size);
    }
    else {
      // copy the current product C back into B (the remaining iterations)
      copyMatrix(B, C, size);
    }
    float endTime_iter = omp_get_wtime();
  }
  float endTime_tot = omp_get_wtime();
}
[/sourcecode]
Figure 4 A data region is established around the for loop in main()

After recompiling and re-running the executable in NVIDIA’s Visual Profiler nvvp, the timeline in Figure 5 shows that the unnecessary transfers are now gone:

Screenshot of NVIDIA Visual Profiler Timeline for test case where pcopyin and pcopyout are used in MatrixMult() and the data region is used in main().
Figure 5 (click for expanded view)

Now matrices A and B are copied to the device only once. Matrix C, the result, is copied to the Host at the end of the kernel region in MatrixMult() on every iteration. As shown in the table below, the runtime improvement is small but significant (1.9s vs. 1.5s). This reflects a 19.5% decrease in runtime; a speedup of 1.24.

Runtimes for Various OpenACC Methods (in seconds)
OpenACC method | Matrix size 1000x1000 | Matrix size 3000x3000
no acceleration | 7.569 | 265.69
#pragma acc kernels in MatrixMult() | 0.3540 | 1.917
#pragma acc kernels in MatrixMult() and #pragma acc data in main() | 0.0539 | 1.543

Table 2 Runtimes for five iterations of matrix multiplication (C=AxB).

As data sizes increase, the amount of work grows and the benefits of parallelization become incredibly clear. For the larger 3000x3000 matrices, a speedup factor of 172 is realized when both kernels and data directives are used.

Comparing Runtimes of OpenACC and OpenMP

Because OpenMP is also used as a method for parallelization of applications, it is useful to compare the two. To compare OpenACC with OpenMP, an OpenMP directive is added to the MatrixMult() routine:

[sourcecode language=”C”]
float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
  #pragma omp parallel for default(none) shared(A,B,C,size)
  for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
      float tmp = 0.;
      for (int k=0; k<size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}
[/sourcecode]

To compile the code with OpenMP parallelization, run:

pgcc -fast -mp ./matrix_ex_float.c -o ./matrix_ex_float_omp

The results were gathered on a Microway NumberSmasher server with dual 12-core Intel Xeon E5-2690v3 CPUs running at 2.6GHz. Runtimes were gathered when executing on 6, 12, and 24 CPU cores, which is achieved by setting the environment variable OMP_NUM_THREADS to 6, 12, and 24, respectively.
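For example, to repeat the 3000x3000 test on 12 cores:

export OMP_NUM_THREADS=12
./matrix_ex_float_omp 3000 5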

Number of Threads | Runtime (in seconds)
6 | 37.758
12 | 18.886
24 | 10.348

Table 3 Runtimes achieved with OpenMP using 3000x3000 matrices and 5 iterations

It is clear that OpenMP is able to provide parallelization and good speedups (nearly linear). However, the GPU accelerators are able to provide more compute power than the CPUs. The results in Table 4 demonstrate that OpenMP and OpenACC both substantially increase performance. By utilizing a single NVIDIA Tesla M40 GPU, OpenACC is able to run 6.71x faster than OpenMP.

Speedups Over Serial Runtime
serial | OpenMP speedup | OpenACC speedup
1 | 25.67x | 172x

Table 4 Relative Speedups of OpenACC and OpenMP for 3000x3000 matrices.

OpenACC Bears Similarity to OpenMP

As previously mentioned, OpenACC shares some commonality with OpenMP. Both are open standards, consisting of compiler directives for accelerating applications. Open Multi-Processing (OpenMP) was created for accelerating applications on multi-core CPUs, while OpenACC was primarily created for accelerating applications on GPUs (although OpenACC can also be used to accelerate code on other target devices, such as multi-core CPUs). Looking ahead, there is a growing consensus that the roles of OpenMP and OpenACC will become more and more alike.

OpenACC Acceleration for Specific GPU Devices

GPU Hardware Specifics

When a system has multiple GPU accelerators, a specific GPU can be selected either by using an OpenACC library procedure call, or by simply setting the environment variable CUDA_VISIBLE_DEVICES in the shell. For example, this would select GPUs #0 and #5:

export CUDA_VISIBLE_DEVICES=0,5
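Alternatively, the device can be selected from within the program through the OpenACC runtime API. A minimal sketch (our own example, using the acc_set_device_num routine and the openacc.h header from the OpenACC toolchain) looks like this:

[sourcecode language=”C”]
#include "openacc.h"

int main(void) {
  // Select the second NVIDIA device visible to the OpenACC runtime
  // before any compute region executes.
  acc_set_device_num(1, acc_device_nvidia);

  // ... set up data and launch OpenACC kernels regions as usual ...
  return 0;
}
[/sourcecode]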

On Microway’s GPU Test Drive Cluster, some of the compute nodes have a mix of GPUs, including two Tesla M40 GPUs labelled as devices 0 and 5. To see what devices are available on your machine, run the deviceQuery command (which is included with the CUDA Toolkit). pgaccelinfo, which comes with the OpenACC Toolkit, reports similar information.

When an accelerated application is running, you can view the resource allocation on the device by executing the nvidia-smi utility. Memory usage and GPU usage, listed by application, are reported for all GPU devices in the system.

Gang, Worker, and Vector Clauses

Although CUDA and OpenACC both use similar ideas, their terminology differs slightly. In CUDA, parallel execution is organized into grids, blocks (threadBlocks), and threads. In OpenACC, a gang is like a CUDA threadBlock, which executes on a processing element (PE). On a GPU device, the processing element (PE) is the streaming multiprocessor (SM). A number of OpenACC gangs (CUDA threadBlocks) map across the available PEs.

An OpenACC worker is a group of vectors. The worker dimension extends across the height of a gang (threadBlock), and each vector is a CUDA thread; the vector dimension extends across the width of the threadBlock. Each worker consists of vector threads. A worker therefore corresponds to one CUDA warp only if vector takes on the value of 32; a worker does not have to correspond to a warp. For example, a worker corresponds to two warps if vector is 64. The significance of a warp is that all threads in a warp run concurrently.
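As an illustration of how these clauses map onto a loop (a sketch of our own; the values 100, 4, and 32 are arbitrary tuning choices following the same pattern as Figure 8 below), a single loop can be distributed so that each gang uses 4 workers of 32 vector lanes, i.e. 128 CUDA threads per threadBlock:

[sourcecode language=”C”]
void scale_add(int size, float a, float *restrict x, float *restrict y) {
  #pragma acc kernels loop pcopyin(x[0:size]) pcopy(y[0:size]) \
    gang(100), worker(4), vector(32)
  for (int i = 0; i < size; ++i) {
    // 4 workers x 32 vector lanes = 128 threads per gang (threadBlock)
    y[i] = a * x[i] + y[i];
  }
}
[/sourcecode]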

Diagram of an NVIDIA CUDA Grid, which is made up of multiple Thread Blocks
Figure 6 A CUDA grid consists of blocks of threads (threadBlocks), which can be arranged in one or two dimensions.

Figure 6 illustrates a threadBlock, represented as part of a 2D grid containing multiple threadBlocks. In OpenACC, the grid consists of a number of gangs, which can extend into one or two dimensions. As depicted in Figure 7, the gangs extend into one dimension. It is possible, however, to arrange gangs into a two dimensional grid. Each gang, or threadBlock, in both figures 6 and 7 is comprised of a 2D block of threads. The number of vectors, workers, and gangs can be finely tuned for a parallel loop.

Sometimes it is faster to have each thread block process more than one chunk of the work, instead of launching enough blocks for every chunk to run exactly once. Discovering the optimal amount of this re-use can require some trial and error. In OpenACC, this corresponds to a case where the number of gangs is smaller than the number of iterations of the loop layer being run in parallel across gangs, so each gang handles several iterations.

In CUDA, threads execute in groups of 32 at a time. These groups of 32 threads, as mentioned, are called warps, and they execute concurrently. In Figure 8, the block width is set to 32 threads, so that more threads execute concurrently and the program runs faster.

Additional runtime output, with kernel runtimes, grid size, and block size:

Note: the kernel reports can only be generated by compiling with the time target, as shown below (read more about this in our next blog post). To compile with kernel reports, run:

pgcc -fast -acc -ta=nvidia,time -Minfo -o ./matrix_ex_float ./matrix_ex_float.c

Once the executable is compiled with the nvidia and time arguments, a kernel report will be generated during execution:

[john@node6 openacc_ex]$ ./matrix_ex_float 3000 5
./matrix_ex_float total runtime 1.3838

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float.c
MatrixMult NVIDIA devicenum=0
time(us): 1,344,646
19: compute region reached 5 times
26: kernel launched 5 times
grid: [100x100] block: [32x32]
device time(us): total=1,344,646 max=269,096 min=268,685 avg=268,929
elapsed time(us): total=1,344,846 max=269,144 min=268,705 avg=268,969
19: data region reached 5 times
35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float.c
main NVIDIA devicenum=0
time(us): 8,630
96: data region reached 1 time
31: data copyin transfers: 6
device time(us): total=5,842 max=1,355 min=204 avg=973
31: kernel launched 3 times
grid: [24] block: [128]
device time(us): total=19 max=7 min=6 avg=6
elapsed time(us): total=509 max=432 min=34 avg=169
128: data region reached 1 time
128: data copyout transfers: 3
device time(us): total=2,769 max=1,280 min=210 avg=923


Diagram of OpenACC gangs, workers and vectors
Figure 7 An OpenACC threadBlock has vertical dimension worker, and horizontal dimension vector. The grid consists of gang threadBlocks.

[sourcecode language=”C”]
float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels loop pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size]) gang(100), vector(32)
  for (int i = 0; i < size; ++i) {
    #pragma acc loop gang(100), vector(32)
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      #pragma acc loop reduction(+:tmp)
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}
[/sourcecode]
Figure 8 OpenACC code with gang and vector clauses. The fully accelerated OpenACC version of the C source code can be downloaded here.

The directive clause gang(100), vector(32), on the j loop, sets the block width to 32 threads (warp size), which makes parallel execution faster. Integer multiples of a warp size will also realize greater concurrency, but not usually beyond a width of 64. The same clause sets the grid width to 100. The directive clause on the outer i loop, gang(100), vector(32), sets the grid height to 100, and block height to 32. The block height specifies that the loop iterations are processed in SIMT groups of 32.

By adding the gang and vector clauses, as shown in Figure 8, the runtime is reduced to 1.3838 sec (a speedup of 1.12x over the best runtime in Table 2).

Targeting GPU Architectures with the Compiler

OpenACC is flexible in its support for GPUs, covering a variety of GPU types and compute capabilities. The target options in the table below illustrate how different compute capabilities, GPU architectures, and CUDA versions can be targeted.

compute capability: -ta=nvidia[,cc10|cc11|cc12|cc13|cc20], -ta=tesla:cc35, -ta=nvidia,cc35
GPU architecture: -ta=tesla, -ta=nvidia
CUDA version: -ta=cuda7.5, -ta=tesla:cuda6.0
CPU: -ta=multicore

Table 5 Various GPU target architecture options for the OpenACC compiler
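Sub-options can generally be combined. For example (an illustrative command following the flag spellings in Table 5; exact spellings vary between PGI releases), to target compute capability 3.5 devices against CUDA 7.5:

pgcc -fast -acc -ta=tesla:cc35,cuda7.5 -Minfo -o ./matrix_ex_float ./matrix_ex_float.c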

OpenACC for Fortran

Although we have focused here on using OpenACC in the C programming language, there is robust OpenACC support for the Fortran language, and the syntax for compiler directives is only slightly different. In C, with its dynamic memory allocation and pointers, pointers must be declared restricted inside of parallel regions. This means that pointers, if not declared as restricted in main(), or subsequently cast as restricted in main(), must be cast as restricted when passed as input arguments to routines containing a parallel region. Fortran handles memory differently, with less explicit pointer manipulation, so these pointer-related considerations do not arise with Fortran.

Summary

OpenACC is a relatively recent open standard for acceleration directives which is supported by several compilers, including, perhaps most notably, the PGI compilers.

Accelerating code with OpenACC is a fairly quick route to speedups on the GPU, without needing to write CUDA kernels in C or Fortran, thereby removing the need to refactor potentially numerous compute-intense regions of a large software application. By making an easy path to acceleration accessible, OpenACC adds tremendous value to the CUDA API. OpenACC is a relatively new development API for acceleration, with the stable 2.0 release appearing in June 2013.

If you have an application and would like to get started with accelerating it with OpenACC or CUDA, you may want to try a free test drive on Microway’s GPU Test Cluster. On our GPU servers, you can test your applications on the Tesla K40, K80, or the new M40 GPU specialized for Deep Learning applications. We offer a wide range of GPU solutions.


The post Accelerating Code with OpenACC and the NVIDIA Visual Profiler appeared first on Microway.

AVX2 Optimization and Haswell-EP (Xeon E5-2600v3) CPU Features https://www.microway.com/hpc-tech-tips/avx2-optimization-and-haswell-ep-cpu-features/ https://www.microway.com/hpc-tech-tips/avx2-optimization-and-haswell-ep-cpu-features/#respond Fri, 03 Oct 2014 22:14:29 +0000 http://https://www.microway.com/?p=4747 We’re very excited to be delivering systems with the new Xeon E5-2600v3 and E5-1600v3 CPUs. If you are the type who loves microarchitecture details and compiler optimization, there’s a lot to gain. If you haven’t explored the latest techniques and instructions for optimization, it’s never a bad time to start. Many end users don’t always […]

The post AVX2 Optimization and Haswell-EP (Xeon E5-2600v3) CPU Features appeared first on Microway.

We’re very excited to be delivering systems with the new Xeon E5-2600v3 and E5-1600v3 CPUs. If you are the type who loves microarchitecture details and compiler optimization, there’s a lot to gain. If you haven’t explored the latest techniques and instructions for optimization, it’s never a bad time to start.

Many end users don’t always see instruction changes as consequential. However, they can be absolutely critical to achieving optimal application performance. Here’s a comparison of Theoretical Peak Performance of the latest CPUs with and without FMA3:
Plot of Xeon E5-2600v3 Theoretical Peak Performance (GFLOPS)

Only a small set of codes will be capable of issuing almost exclusively FMA instructions (e.g., LINPACK). Achieved performance for well-parallelized & optimized applications is likely to fall between the grey and colored bars. Still, without employing a compiler optimized for FMA3 instructions, you are leaving significant potential performance of your Xeon E5-2600v3-based hardware purchase on the table.

Know your CPUs, know your instructions

With that in mind, we would like to summarize and link to these new resources from Intel:

Intel: Xeon E5-2600v3 Technical Overview

  • A brief summary of Haswell-NI (Haswell New Instructions) that add dedicated instructions for signal processing, encryption, and math functions
  • Summary of power improvements in the Haswell architecture
  • Detailed comparison of C600 and C610 series chipsets
  • Virtualization improvements and new security features

Intel: How AVX2 Improves Performance on Server Applications

  • Instructions on how to recompile your code for AVX2 instructions and supported compilers
  • Other methods of employing AVX2: Intel MKL, coding with intrinsic instructions, and assembly (a brief intrinsics sketch follows this list)
  • Summary of LINPACK performance gains delivered simply by using AVX2
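To make the intrinsics route concrete, here is a brief sketch in C (our own illustration, not taken from Intel's guide): it performs y[i] += a*x[i] on eight single-precision values at a time using the AVX2/FMA3 fused multiply-add intrinsic. It assumes a Haswell-class CPU, an FMA-enabled compiler flag such as gcc's -mavx2 -mfma (or -march=haswell), and that n is a multiple of 8.

[sourcecode language=”C”]
#include <immintrin.h>

/* y[i] += a * x[i], eight floats per step.
   _mm256_fmadd_ps(a, b, c) computes a*b + c as a single fused instruction. */
void saxpy_fma(int n, float a, const float *x, float *y)
{
  __m256 va = _mm256_set1_ps(a);
  for (int i = 0; i < n; i += 8) {
    __m256 vx = _mm256_loadu_ps(&x[i]);
    __m256 vy = _mm256_loadu_ps(&y[i]);
    vy = _mm256_fmadd_ps(va, vx, vy);
    _mm256_storeu_ps(&y[i], vy);
  }
}
[/sourcecode]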

Deliver the highest performance for your applications by taking advantage of the latest Intel architecture. For more information, contact a Microway HPC expert:

 

The post AVX2 Optimization and Haswell-EP (Xeon E5-2600v3) CPU Features appeared first on Microway.

CUB in Action – some simple examples using the CUB template library https://www.microway.com/hpc-tech-tips/cub-action-simple-examples-using-cub-template-library/ https://www.microway.com/hpc-tech-tips/cub-action-simple-examples-using-cub-template-library/#respond Wed, 18 Jun 2014 20:02:31 +0000 http://https://www.microway.com/?p=4047 In my previous post, I presented a brief introduction to the CUB library of CUDA primitives written by Duane Merrill of NVIDIA. CUB provides a set of highly-configurable software components, which include warp- and block-level kernel components as well as device-wide primitives. This time around, we will actually look at performance figures for codes that […]

The post CUB in Action – some simple examples using the CUB template library appeared first on Microway.

In my previous post, I presented a brief introduction to the CUB library of CUDA primitives written by Duane Merrill of NVIDIA. CUB provides a set of highly-configurable software components, which include warp- and block-level kernel components as well as device-wide primitives. This time around, we will actually look at performance figures for codes that utilize CUB primitives. We will also briefly compare the CUB-based codes to programs that use the analogous Thrust routines, both from a performance and programmability perspective. These comparisons utilize the CUB v1.3.1 and Thrust v1.7.0 releases and CUDA 6.0.

Before we proceed, I need to issue one disclaimer: the examples below were written after a limited amount of experimentation with the CUB library, and they do not necessarily represent the most optimized implementations. However, these examples do illustrate the flexibility of the API and they give an idea of the kind of performance that can be achieved using CUB with only modest programming effort.

Computing Euclidean vector norms – transformations and reductions

To begin with, let’s consider a variant of the reduction routine, which reads in a data array and computes the sum of the squares of the array elements. If we interpret the array as a vector, then the square root of the result of the reduction gives the Euclidean norm, or L2 norm, of the vector.

A kernel implementation that uses CUB’s block-level primitives is shown below:

template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD,
          BlockReduceAlgorithm ALGORITHM, typename U>
__global__ void TransformSumKernel(T* sum, T* input, U unaryOp)
{

  typedef BlockReduce<T, BLOCK_THREADS, ALGORITHM> BlockReduceT;

  __shared__ typename BlockReduceT::TempStorage temp_storage;

  T data[ITEMS_PER_THREAD];

  LoadDirectBlockedVectorized(threadIdx.x,
                    input+blockIdx.x*BLOCK_THREADS*ITEMS_PER_THREAD,
                    data);

  for(int item=0; item<ITEMS_PER_THREAD; ++item)
    data[item] = unaryOp(data[item]);

  T block_sum = BlockReduceT(temp_storage).Sum(data);
  if(threadIdx.x==0) atomicAdd(sum,block_sum);
  return;
}

 

In this example, each thread block processes a tile of BLOCK_THREADS*ITEMS_PER_THREAD input elements, and for simplicity the size of the input array is assumed to be a multiple of the tile size. As its name suggests, the function LoadDirectBlockedVectorized uses vectorized loads to increase memory bandwidth utilization. (We could also have used the higher-level BlockLoad class to perform the global-memory loads. The BlockLoad class is templated on the load algorithm, making it trivial to switch between vectorized loads and other approaches.) Global memory loads are coalesced provided that ITEMS_PER_THREAD does not exceed the maximum allowed vector-load width (for example, four ints or floats). unaryOp is a unary C++ function object. In order to compute the sum of the squares of the elements of the input array, we choose unaryOp to be an instance of the Square<T> class, which is defined as follows:

template<typename T>
struct Square
{
  __host__ __device__ __forceinline__
  T operator()(const T& a) const { return a*a; }
};

 

CUB Block-level primitives and fancy iterators

CUB’s iterator classes facilitate more elegant and generic kernel implementations. The library contains a number of iterator classes, including iterators that implement textures and iterators that support various cache modifiers for memory accesses. However, for the moment, let’s just consider the TransformInputIterator class, which has the following declaration:

template <typename ValueType, typename ConversionOp, typename InputIterator,
          typename Offset = ptrdiff_t>
class TransformInputIterator;

In this declaration, InputIterator is another iterator type, which could simply be a pointer to a native device type; ConversionOp is a unary operator class, such as the Square class defined above; and ValueType is the input value type. The precise meaning of these template arguments should become clear in the discussion below. Utilizing iterators, we can compute the Euclidean norm squared using the following reduction-kernel template:

template<typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD,
         BlockReduceAlgorithm ALGORITHM, typename I>
__global__ void SumKernel(T* sum, I input_iter)
{

  typedef BlockReduce<T, BLOCK_THREADS, ALGORITHM> BlockReduceT;

  __shared__ typename BlockReduceT::TempStorage temp_storage;

  T data[ITEMS_PER_THREAD];

  LoadDirectStriped<BLOCK_THREADS>(threadIdx.x,
          input_iter + blockIdx.x*blockDim.x*ITEMS_PER_THREAD,
          data);

  T block_sum = BlockReduceT(temp_storage).Sum(data);
  if(threadIdx.x == 0) atomicAdd(sum,block_sum);
  return;
}

where input_iter is an instance of the TransformInputIterator class:

TransformInputIterator<T, Square<T>, T*> input_iter(input, Square<T>());

At present, CUB only supports vectorized loads for native device-type pointers, and the above kernel template uses the LoadDirectStriped<BLOCK_THREADS> routine to load data from global memory. Each thread loads ITEMS_PER_THREAD array elements separated by stride BLOCK_THREADS from global memory. BLOCK_THREADS is, of course, a multiple of the warp size, and data loads are coalesced.
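For illustration only (a sketch of what the striped arrangement amounts to, not CUB's actual implementation), the load in the kernel above is roughly equivalent to the following hand-written loop:

// Striped load: thread t reads elements t, t+BLOCK_THREADS, t+2*BLOCK_THREADS, ...
// of the tile owned by its thread block.
for (int item = 0; item < ITEMS_PER_THREAD; ++item)
  data[item] = input_iter[blockIdx.x*BLOCK_THREADS*ITEMS_PER_THREAD
                          + item*BLOCK_THREADS + threadIdx.x];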

Device-wide primitives

CUB contains a range of device-level primitives, including a device-wide reduction optimized for different architectures. Therefore, the only reason for implementing custom kernels for this particular calculation is that they might provide some performance advantage. The calculation can be implemented with CUB’s device-level API by using the input iterator defined above and calling

DeviceReduce::Sum(temp_storage,temp_storage_bytes,input_iter,output,num_elements);

The squared norm is then returned in the variable output. The same calculation can be performed using the Thrust library by calling

thrust::transform_reduce(d_in.begin(), d_in.end(), Square<float>(), static_cast<float>(0), thrust::plus<float>())

in the client application.
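One practical detail worth noting (this is the standard convention of CUB's device-wide API, using the same variable names as above): DeviceReduce::Sum is called twice. The first call, with a NULL temporary-storage pointer, only reports how much device temporary storage the reduction needs; the caller then allocates that storage and repeats the call to run the reduction.

// Pass 1: query the required amount of device temporary storage.
void   *temp_storage = NULL;
size_t temp_storage_bytes = 0;
DeviceReduce::Sum(temp_storage, temp_storage_bytes, input_iter, output, num_elements);

// Allocate the temporary storage, then pass 2: perform the reduction.
cudaMalloc(&temp_storage, temp_storage_bytes);
DeviceReduce::Sum(temp_storage, temp_storage_bytes, input_iter, output, num_elements);
cudaFree(temp_storage);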

CUB Performance

Fig. 1 shows performance data for codes that use the reduction kernels defined above, together with performance results for code using the DeviceReduce::Sum primitive. The corresponding performance figures for thrust::transform_reduce are also shown. These results were obtained on tests involving 32-bit floating-point data running on a Tesla K40c. Limited experimentation indicated that setting the thread-block size (BLOCK_THREADS) to 128 gave the best performance for both SumKernel and TransformSumKernel, and the block-wide reduction algorithm was set to BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY in both cases. The number of data items processed per thread (ITEMS_PER_THREAD) was set to 4 in runs involving TransformSumKernel, which uses vectorized loads. Choosing ITEMS_PER_THREAD=8, gave better performance in SumKernel. On all problem sizes, the CUB-based codes significantly outperform the code that utilizes the transform_reduce Thrust routine. We also see that the code that utilizes the DeviceReduce::Sum primitive is always competitive with code that utilizes our hand-tuned kernels, and, in fact, gives the best performance on all but the largest input arrays. This result is very encouraging since the use of CUB’s device-wide primitives in a client application requires very little programming effort, comparable to the effort required to utilize the corresponding Thrust primitives.

Plot of NVIDIA CUB performance on the reduce operation
Fig. 1 – Performance measurements for the calculation of the square of the Euclidean norm of a float32 vector on a Tesla K40c

A second CUB example – Transformed prefix sum

Let’s consider the slightly more contrived example of the calculation of the inclusive prefix sum of the squares of the elements of an integer array. Given the input and output arrays int a_in[N] and int a_out[N], this calculation corresponds to the following serial C++ code:

a_out[0] = a_in[0]*a_in[0];
for(int i=1; i<N; ++i){
  a_out[i] = a_in[i]*a_in[i] + a_out[i-1];
}

 

In analogy to the reduction example above, the calculation can also be implemented using CUB’s DeviceScan::InclusiveSum function together with the TransformInputIterator class. We tested the CUB-based approach and an implementation using Thrust on 32-bit integer data on a Tesla K40c. However, in our initial tests the CUB-based implementation showed lower-than-expected performance on large input arrays. These performance figures correspond to the green points in Fig. 2 below, where the input iterator passed to DeviceScan::InclusiveSum was of type TransformInputIterator<int, Square<int>, int*>. In fact, on the largest input arrays, better performance was obtained from the Thrust-based code (red points). Furthermore, we found that the code obtained by omitting the square operation and simply passing a pointer to the data in global memory (in this case, an int* pointer) to DeviceScan::InclusiveSum achieved approximately twice the performance on large problem sizes.

Plot of NVIDIA CUB scan performance
Fig. 2 – Performance of code to compute the inclusive prefix sum of the squared elements of an int32 array measured on a Tesla K40c

A quick look at the CUB source code revealed the reason for the disparity in performance. It turns out that on compute-capability 3.5 devices, DeviceScan::InclusiveSum is configured to load data through texture cache automatically – provided the input iterator is a pointer to a native device type. However, for more general iterator types, the use of texture cache has to be specified explicitly in the client application. Fortunately, this is easy to implement using the CacheModifiedInputIterator class defined in CUB. To utilize texture cache, input_iter, the iterator passed to the prefix-sum routine, should be defined as follows in the client code:

// loads using cached_iter go through texture cache on sm_35 devices
CacheModifiedInputIterator<LOAD_LDG,int> cached_iter(d_in);
// Use the following iterator to return the squares of the data elements
// loaded from global memory using cached_iter
TransformInputIterator<int,Square<int>,CacheModifiedInputIterator<LOAD_LDG,int> > input_iter(cached_iter, Square<int>());

 

The LOAD_LDG template argument specifies the use of texture cache on sm_35 hardware. The blue points in Fig. 2 correspond to performance figures obtained using the LOAD_LDG option, which, on large data sets, gives a 2x speedup over the code that does not utilize texture cache. CUB’s iterators support a number of other access modes, and although a comprehensive discussion of iterators is beyond the scope of this post, we encourage the reader to explore the iterator classes.

Summary

In this post, we have looked at variants of the reduction and prefix-sum algorithms that involve a transformation of the input data and how they might be implemented using the CUB library. In particular, we’ve considered implementations involving CUB’s device-level functions and iterator classes. These implementations compare favorably with code based on custom kernels and offer superior performance to the corresponding Thrust-based implementations. The examples also illustrate the flexibility and composability provided by CUB’s iterator classes.

The post CUB in Action – some simple examples using the CUB template library appeared first on Microway.

Introducing CUDA UnBound (CUB) https://www.microway.com/hpc-tech-tips/introducing-cuda-unbound-cub/ https://www.microway.com/hpc-tech-tips/introducing-cuda-unbound-cub/#comments Mon, 14 Apr 2014 13:21:38 +0000 http://https://www.microway.com/?p=3780 CUB – a configurable C++ template library of high-performance CUDA primitives Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture anounced in 2010, offered global memory bandwidth […]

The post Introducing CUDA UnBound (CUB) appeared first on Microway.

CUB – a configurable C++ template library of high-performance CUDA primitives

Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power, and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture announced in 2010, offered global memory bandwidth of up to 177 Gigabytes per second and peak double-precision floating-point performance of 665 Gigaflops. By comparison, today’s Tesla K40 (Kepler GK110b architecture) has peak memory bandwidth of 288 Gigabytes per second and provides reported peak double-precision performance of over 1.4 Teraflops. However, the K40’s reign as the most advanced GPGPU hardware is coming to an end, and Kepler will shortly be superseded by Maxwell-class cards.

Actually achieving optimal performance on diverse GPU architectures can be challenging, since it relies on the implementation of carefully-crafted kernels that incorporate extensive knowledge of the underlying hardware and which take full advantage of relevant features of the CUDA programming model. This places a considerable burden on the CUDA developer seeking to port her application to a new generation of GPUs or looking to ensure performance across a range of architectures.

Fortunately, many CUDA applications are formulated in terms of a small set of primitives, such as parallel reduce, scan, or sort. Before attempting to handcraft these primitive operations ourselves, we should consider using one of the libraries of optimized primitives available to CUDA developers. Such libraries include Thrust and CUDPP, but in this post, we will focus on the CUB library developed by Duane Merrill of NVIDIA Research. CUB – the name derives from “CUDA Unbound” – provides generic high-performance primitives targeting multiple levels of application development. For example, CUB supports a set of device-wide primitives, which are called from the host, and in this regard, the functionality provided by CUB overlaps with Thrust to some degree. However, unlike Thrust, CUB also provides a set of kernel components that operate at the thread-block and thread-warp levels.

Thread-block reduction – a simple CUB example

A key feature of the CUB library, and one that makes CUB an attractive option for a wide range of performance-critical applications, is the fact that software components are not specialized for a particular GPU architecture or problem type. CUB is a C++ template library which utilizes policy-based design to provide highly-configurable kernel components that can be tuned for different GPU architectures and applications. To see what exactly this means, let’s consider how we might implement a reduction kernel using CUB. The library includes a templated BlockReduce class to perform reduction operations across a single thread block. It is declared as follows:

template <typename T, int BLOCK_SIZE, BlockReduceAlgorithm ALGORITHM> class BlockReduce;

T denotes the type of data on which the reduction operation is performed, BLOCK_SIZE is the number of threads in the thread block, and BlockReduceAlgorithm is an enumeration of different algorithms that can be used to perform the reduction. Note that the binary operation that specifies the type of reduction being performed (which, more often than not, involves computing the sum or the maximum or minimum of a data set) is not included in the class declaration. Using the BlockReduce class and atomic operations, a kernel to compute the maximum value in an array of integers can be implemented as follows:

1)  template<int BLOCK_SIZE, BlockReduceAlgorithm ALGORITHM>
2)  __global__ 
3)  void maxKernel(int* max, int* input)
4)  { 
5)   int id=blockIdx.x*blockDim.x + threadIdx.x; 
6)   typedef cub::BlockReduce<int,BLOCK_SIZE,ALGORITHM> BlockReduceT; 
7)
8)   // Allocate temporary storage in shared memory 
9)   __shared__ typename BlockReduceT::TempStorage temp_storage; 
10)
11)  int val=input[id]; 
12)  int block_max=BlockReduceT(temp_storage).Reduce(val,cub::Max());
13)
14)  // update global max value
15)  if(threadIdx.x == 0) atomicMax(max,block_max); 
16)
17)  return;  
18) }

Line 9 of the kernel above allocates temporary storage in device shared memory for use in the reduction algorithm. The quantity and layout of this storage depend on the choice of algorithm, the type of the data, the number of threads per block, and the target GPU architecture. The optimal shared-memory configuration, which provides sufficient temporary storage and avoids unnecessary bank conflicts, is determined at compile time using the template arguments selected in the client code. However, the shared-memory configuration details themselves are hidden from the client application.

On line 12 of our kernel, the BlockReduceT constructor (which takes as an argument the temporary storage allocated above) is called, generating a temporary object, which then invokes its Reduce method. The second argument in the Reduce method is an instance of CUB’s Max functor class. This class is defined such that if maxObject is an instance of the class Max, then maxObject(a,b) returns the maximum of a and b. Other binary operations supported in CUB include binary addition, the binary min operation, and variants of max and min that identify the position of the first occurrence of the maximum or minimum value in a data array. The result of a thread-block reduction is returned to the first thread of each block (which has threadIdx.x == 0). Finally, each thread block calls a single atomic operation to update the global maximum. Note that line 15 of the kernel assumes that the value pointed to by max is initialized to some minimum value before the kernel is launched.
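A minimal host-side sketch of that initialization and launch might look like the following; the names d_max and d_input, the block size of 256, and the assumption that num_elements is an exact multiple of the block size are illustrative choices, not part of the original example:

// Initialize the running maximum to the smallest representable int (INT_MIN from limits.h),
// then launch one thread per input element.
int h_init = INT_MIN;
cudaMemcpy(d_max, &h_init, sizeof(int), cudaMemcpyHostToDevice);

int numBlocks = num_elements / 256;   // assumes num_elements % 256 == 0
maxKernel<256, cub::BLOCK_REDUCE_WARP_REDUCTIONS>
    <<<numBlocks, 256>>>(d_max, d_input);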

Optimizing performance by limiting concurrency

Currently, CUB supports three different block-reduction algorithms, corresponding to the enumerators BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_REDUCE_RAKING, and BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. The latter algorithm is specialized for commutative binary operations (such as the Max operation in our example, where the relative ordering of inputs does not affect the output), while the other algorithms also support non-commutative binary operators. The background to these algorithms is described in detail in a series of publications by Merrill and collaborators. As described in those papers, a core feature of the algorithms used in CUB is that they balance concurrency with serial computation in order to maximize performance on GPU hardware. In contrast, earlier algorithms targeting GPU architectures tended to involve high levels of concurrency, where the number of logical threads assigned to a problem scales with the problem size. However, in reduction and scan calculations, logical threads have to share data and synchronize with each other, and the cost of this inter-thread cooperation scales with the amount of concurrency in the algorithm (i.e., it depends on the number of threads). Performance can be improved by choosing a level of concurrency that ensures that the GPU hardware is fully utilized while minimizing communication and synchronization overheads.

To understand how the CUB routines utilize serial processing, consider the raking block-reduction algorithms mentioned above. In these algorithms, after an initial step, which we discuss below, each thread in the block writes data to shared memory. At this point, a single warp of threads “rakes” the shared-memory array, with each thread in the warp performing a serial reduction on some segment of the data in shared memory. At the end of this step, a single warp-width of data remains to be reduced, and one warp-level reduction completes the calculation. Further serialization can be achieved by having each thread in the thread block perform a serial partial reduction in registers at the beginning of the block-level reduction routine. To do this, we modify our reduction kernel as follows:

template<int VALS_PER_THREAD, int BLOCK_SIZE,
         cub::BlockReduceAlgorithm ALGORITHM>
__global__ 
void maxKernel(int* max, int* input) 
{ 
  int id=blockIdx.x*blockDim.x + threadIdx.x; 
  typedef cub::BlockReduce<int,BLOCK_SIZE,ALGORITHM> BlockReduceT;

  // Allocate temporary storage in shared memory 
  __shared__ typename BlockReduceT::TempStorage temp_storage; 

  // Assign multiple values to each block thread 
  int val[VALS_PER_THREAD]; 

  // Code to initialize the val array has been omitted 
  int block_max=BlockReduceT(temp_storage).Reduce(val,cub::Max()); 

  // update global max value 
  if(threadIdx.x == 0) atomicMax(max,block_max); 

  return; 
}

Thus, each thread in the block reduces VALS_PER_THREAD items in registers as an initial step in the block reduction.
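
The launch changes accordingly: each block now consumes BLOCK_SIZE * VALS_PER_THREAD inputs, so the grid shrinks by a factor of VALS_PER_THREAD. As a rough sketch (names and sizes are illustrative assumptions, not from the original post), and noting that the omitted initialization of val could be any arrangement of the block's inputs because Max is commutative:

// Sketch of a launch for the multi-value kernel (illustrative names and sizes).
void launchMultiValueMax(int* d_max, int* d_input, int numElements)
{
    const int BS  = 256;   // threads per block
    const int VPT = 4;     // values reduced serially per thread in registers

    // Each block covers BS * VPT elements; assume numElements is an exact multiple.
    int gridSize = numElements / (BS * VPT);
    maxKernel<VPT, BS, cub::BLOCK_REDUCE_RAKING><<<gridSize, BS>>>(d_max, d_input);
}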

It’s worth noting that on devices of compute capability 3.0 and above, CUB uses CUDA’s shuffle feature to perform efficient warp-level reductions, but it reverts to a shared-memory implementation on older hardware (for a recent description of reductions using shuffle, see this Parallel ForAll blog post).
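
For readers unfamiliar with shuffle, here is a minimal sketch of the shuffle-based warp reduction idea that CUB builds on; this is not CUB’s internal implementation, and it uses __shfl_down_sync, the modern form of the Kepler-era __shfl_down intrinsic:

// Warp-level max reduction using shuffle: lane 0 ends up with the maximum.
__device__ int warpReduceMax(int val)
{
    // Each step folds in values from lanes "offset" positions higher; after
    // log2(32) = 5 steps, lane 0 holds the maximum for the whole warp.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val = max(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;
}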

CUDA Unbound

The block reduction example illustrates the extreme configurability of CUB. In essence, CUB provides an outline of the reduction algorithm but leaves performance-critical details, such as the exact choice of algorithm and the degree of concurrency, unbound and in the hands of the user. These parameters can be tuned in order to maximize performance for a particular architecture and application. Since the parameter values are specified in the client application at compile time, this flexibility incurs no runtime performance penalty. The CUB library provides the most benefit when integrated into a client-application auto-tuning procedure: on each new architecture and problem type, the client application launches a series of short jobs to explore the CUB tuning space and determine the choice of template arguments that optimizes performance.
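
A toy version of such a sweep might look like the following; the candidate set, the names, and the use of a single timed launch per variant are illustrative assumptions, not a prescribed procedure:

#include <cstdio>
#include <cub/cub.cuh>

// Time one compile-time variant of the multi-value kernel with CUDA events.
template<int VPT, int BS, cub::BlockReduceAlgorithm ALG>
float timeVariant(int* d_max, int* d_input, int numElements)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int gridSize = numElements / (BS * VPT);   // assume an exact multiple

    cudaEventRecord(start);
    maxKernel<VPT, BS, ALG><<<gridSize, BS>>>(d_max, d_input);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Explore a small slice of the tuning space and report the timings.
void exploreTuningSpace(int* d_max, int* d_input, int n)
{
    printf("raking, 1 val/thread : %6.3f ms\n",
           timeVariant<1, 256, cub::BLOCK_REDUCE_RAKING>(d_max, d_input, n));
    printf("raking, 4 vals/thread: %6.3f ms\n",
           timeVariant<4, 256, cub::BLOCK_REDUCE_RAKING>(d_max, d_input, n));
    printf("warp,   4 vals/thread: %6.3f ms\n",
           timeVariant<4, 128, cub::BLOCK_REDUCE_WARP_REDUCTIONS>(d_max, d_input, n));
}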

Although we have focused solely on CUB’s support for block-wide reductions in this post, the library also includes highly-configurable scan and sort implementations. Among the other primitives implemented in CUB are block-wide data-exchange operations and parallel histogram calculations, and all of these implementations are flexible enough to ensure high performance in diverse applications running on a range of NVIDIA architectures.

Well, that completes our brief introduction to the CUB library. We’ll revisit CUB in our next post, when we’ll look at concrete performance figures for kernels utilizing different CUB routines on a variety of problems and different GPU architectures.

The post Introducing CUDA UnBound (CUB) appeared first on Microway.

CUDA Code Migration (Fermi to Kepler Architecture) on Tesla GPUs
https://www.microway.com/hpc-tech-tips/cuda-code-migration-fermi-to-kepler-architecture-on-tesla-gpus/
Sun, 27 Oct 2013 19:14:11 +0000

The debut of NVIDIA’s Kepler architecture in 2012 marked a significant milestone in the evolution of general-purpose GPU computing. In particular, Kepler GK110 (compute capability 3.5) brought unrivaled compute power and introduced a number of new features to enhance GPU programmability. NVIDIA’s Tesla K20 and K20X accelerators are based on the Kepler GK110 architecture. The higher-end K20X, which is used in the Titan and Blue Waters supercomputers, contains a massive 2,688 CUDA cores and achieves peak single-precision floating-point performance of 3.95 Tflops. In contrast, the Fermi-architecture Tesla M2090 (compute capability 2.0) has peak single-precision performance of 1.3 Tflops.

In addition to the increase in raw power, GK110 includes a number of features designed to facilitate efficient GPU utilization. Of these, Hyper-Q technology and support for dynamic parallelism have been particularly well publicized. Hyper-Q facilitates the concurrent execution of multiple kernels on a single device and also enables multiple CPU processes to simultaneously launch work on a single GPU. Dynamic parallelism allows kernels to be launched from the device itself, which greatly simplifies the implementation and improves the performance of divide-and-conquer algorithms, for example. Other Kepler additions include support for bindless textures, which offer greater flexibility and performance than texture references. Shared-memory banks can also be configured with an 8-byte width (up from Fermi’s fixed 4 bytes), giving a corresponding increase in shared-memory bandwidth and a reduction in bank conflicts in many applications.
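
As a small illustration of that last point, the wider banks are opted into from host code. This is a minimal sketch with an assumed function name, using the Kepler-era runtime call; later architectures use a fixed bank width, so the call has no effect there:

#include <cuda_runtime.h>

void enableEightByteSharedMemoryBanks()
{
    // Applies to kernels subsequently launched on the current device; useful
    // when shared-memory data is double precision or otherwise 8 bytes wide.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // cudaFuncSetSharedMemConfig(someKernel, cudaSharedMemBankSizeEightByte)
    // applies the same preference to a single kernel instead.
}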

Read Whitepaper

Migrating Your Code from Nvidia Tesla Fermi to Tesla Kepler K20X, with Examples from QUDA Lattice QCD Library

In this white paper, we provide an overview of the new features of Kepler GK110 and highlight the differences in functionality and performance between the new architecture and Fermi. We cite a number of examples drawn from disparate sources on the web, but also draw on our own experiences involving QUDA, an open-source library for performing Lattice QCD calculations on GPUs. In general, codes developed on Fermi ought to see substantial performance gains on GK110 without any modification. In the case of data-dependent and recursive algorithms, however, far greater gains may be achieved by exploiting dynamic parallelism. More generally, relatively minor code modifications, such as switching to bindless textures, or changing shared-memory accesses to take advantage of increased bandwidth, can also result in significant improvements in performance.

The post CUDA Code Migration (Fermi to Kepler Architecture) on Tesla GPUs appeared first on Microway.

Avoiding GPU Memory Performance Bottlenecks
https://www.microway.com/hpc-tech-tips/avoiding-gpu-memory-performance-bottlenecks/
Tue, 01 Oct 2013 03:36:22 +0000

This post is Topic #3 (post 3) in our series Parallel Code: Maximizing your Performance Potential.

Many applications contain algorithms which make use of multi-dimensional arrays (or matrices). When threads need to index the higher dimensions of such an array, strided accesses can’t really be avoided. Where strided access is avoidable, however, every effort should be made to avoid accesses with a stride greater than one.

So all this advice is great, but I’m sure you’re wondering, “What actually is strided memory access?” The following example illustrates the phenomenon and its effect on effective bandwidth:

__global__ void strideExample(float *outputData, float *inputData, int stride)
{
    // Each thread copies one element located "stride" words away from its
    // neighbor's element (the discussion below assumes stride = 2).
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}

In the above code, threads within a warp access data words in memory with a stride of 2, so each warp loads two L1 cache lines instead of one. The resulting memory access pattern is shown below.

[Figure: Diagram of NVIDIA Kepler strided memory accesses]

Accesses with a stride of 2 result in 50% load/store efficiency (shown above), since half of the elements involved in each transaction go unused, wasting bandwidth. As the stride increases, the effective bandwidth keeps dropping until, in the worst case, a separate cache line is fetched for every thread in the warp (wow, that’s a lot of lost performance!).

Strided accesses can debilitate the performance of even the most optimized algorithms. For large strides, the effective bandwidth is poor regardless of architecture or compute-capability version. Intuitively, this makes sense: when concurrent threads simultaneously access memory addresses that lie far apart in physical memory, the accesses cannot be combined into a single transaction. For these situations, avoid global-memory accesses with a stride greater than 1 if you want to see any real performance gain from your application. When you are stuck with strided accesses, make sure that as much data as possible is actually used from each cache line that is fetched.
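
One common source of such strides is an array-of-structures layout. The brief sketch below (types and names are assumptions, not from this post) contrasts it with a structure-of-arrays layout that yields unit-stride, fully coalesced accesses:

// Array-of-structures: neighboring threads touch addresses 16 bytes apart.
struct ParticleAoS { float x, y, z, w; };

__global__ void scaleX_AoS(ParticleAoS* p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x *= s;      // stride-4 (in words) access pattern
}

// Structure-of-arrays: neighboring threads touch consecutive 4-byte words.
struct ParticleSoA { float *x, *y, *z, *w; };

__global__ void scaleX_SoA(ParticleSoA p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p.x[i] *= s;      // unit-stride, coalesced access pattern
}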

So, if I haven’t made it clear enough: if you can avoid global memory, you should. In my personal experience programming with CUDA, you really can’t go wrong if you make intelligent use of shared memory. With the exception of bank conflicts (discussed in Shared Memory Optimization), shared memory doesn’t suffer the painful penalties that global memory imposes when warps make non-sequential or misaligned accesses.
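
A sketch of that staging idea, using assumed names and a toy access pattern (reversing each block’s tile), looks like this: keep the global loads and stores coalesced and do the irregular indexing on-chip in shared memory. It assumes the input length is an exact multiple of TILE.

#define TILE 256

__global__ void reverseTiles(float* out, const float* in)
{
    __shared__ float tile[TILE];

    int gid = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[gid];                 // coalesced global load
    __syncthreads();                             // whole tile now on-chip

    out[gid] = tile[TILE - 1 - threadIdx.x];     // coalesced global store,
                                                 // reversed shared-memory read
}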


For those of us who are more advanced: if you can make use of registers without running into register pressure or read-after-write dependencies, you should. I briefly discussed register memory in previous posts, but I feel it warrants a bit more discussion here.

Shared memory allows communication between the threads of a block, which is very convenient. However, for those of us looking to squeeze out every last drop of performance, registers are the place to be when you can use them. Think of it this way: shared memory is the “jack of all trades” memory, suitable for most applications and operations, but for register operations (without read-after-write issues) there is no comparison. Typically, register access consumes zero extra clock cycles per instruction. While this lack of latency makes registers very appealing, a read-after-write dependency costs roughly 24 clock cycles, and when such a dependency appears inside a loop, that latency adds up very quickly.
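
As a rough illustration of why read-after-write chains hurt, compare the two device functions below (purely illustrative, not from the original post). The second breaks the single dependency chain into four independent accumulators, giving the hardware independent instructions to overlap:

// A single running sum: each add must wait for the previous one to finish.
__device__ float sumSerial(const float* data, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += data[i];
    return s;
}

// Four independent partial sums hide the read-after-write latency.
// Assumes n is a multiple of 4, purely to keep the sketch short.
__device__ float sumUnrolled(const float* data, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += data[i];        // four independent dependency chains
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}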

The only other downside of register memory is register pressure, which occurs when there simply are not enough registers for a given task. Although every multiprocessor in a GPU contains thousands of 32-bit registers, these are partitioned among the concurrently resident threads. You can cap the number of registers the compiler allocates per thread from the command line (for example, with nvcc’s -maxrregcount option).

To summarize, when you’re developing your algorithms and applications you really need to be aware of how you’re making use of memory:

  • Global memory is great for beginning programmers, since it drastically simplifies coding for those who aren’t yet skilled or experienced with CUDA, but performance will be lower.
  • If you don’t need to squeeze out every drop of performance, shared memory can take you where you need to be. The thread-to-thread communication it enables within a block makes many algorithms easier to code and implement, making shared memory a very attractive option.
  • Register memory is the fastest, but a little more tricky. There are hard limits to what you can do with registers, but if what your algorithm requires fits inside those confines, then definitely make use of them.
  • Very specific types of applications can really benefit from using texture and local memory, but if you’re in the market for those types of memory, you probably wouldn’t be reading this blog in the first place.

The next portion of this blog will step away from the memory aspect of performance optimization and into optimizing configurations and the art of keeping all the multiprocessors on your device busy throughout the execution of your kernel.

The post Avoiding GPU Memory Performance Bottlenecks appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/avoiding-gpu-memory-performance-bottlenecks/feed/ 0