Test Drive Archives - Microway

What Can You Do with a $15k NVIDIA Data Science Workstation? – Change Healthcare Data Science
March 4, 2020

NVIDIA’s Data Science Workstation Platform is designed to bring the power of accelerated computing to a broad set of data science workflows. Recently, we found out what happens when you lend a talented data scientist (with a serious appetite for after-hours projects + coffee) a $15k accelerated data science tool. The result: a massive PubMed literature search project that once took weeks can be recreated on a Data Science WhisperStation in hours.

Kyle Gallatin, an engineer at Pfizer, has deep data science credentials; he has been working on data science projects for over 10 years. At the end of 2019 we gave him special access to one of our Data Science WhisperStations in partnership with NVIDIA:

When NVIDIA asked if I wanted to try one of the latest data science workstations, I was stoked. However, a sobering thought followed the excitement: what in the world should I use this for?

I thought back to my first data science project: a massive, multilingual search engine for medical literature. If I had access to the compute and GPU libraries I have now in 2020 back in 2017, what might I have been able to accomplish? How much faster would I have accomplished it?

Experimentation, Performance, and GPU Accelerated Data Science Tooling

Gallatin used a Data Science WhisperStation to rapidly create an accelerated data science workflow for healthcare, and then told us about his experience. It was a remarkable one.

Not only was a previously impossible workflow made possible, but portions of the application were accelerated up to 39X!

The Data Science Workstation allowed him to design a PubMed healthcare article search engine where he:

  1. Ingested a larger database than ever imagined (30,000,000 research article abstracts!)
  2. Didn’t require massive code changes to GPU-accelerate the algorithm
  3. Used familiar-looking tools for his workflow
  4. Had unsurpassed agility—he could search large portions of the abstract database in 0.1 seconds!

This last point is really critical and shows why we believe the NVIDIA Data Science Workstation Platform and its RAPIDS tools are so special. As Kyle put it:

Data science is a field grounded in experimentation. With big data or large models, the number of times a scientist can try out new configurations or parameters is limited without massive resources. Everyone knows the pain of starting a computationally-intensive process, only to be blindsided by an unforeseen error literal hours into running it. Then you have to correct it and start all over again.

Walkthrough with Step-by-Step Instructions

The new article is available on Medium. It provides a complete step-by-step walkthrough of how NVIDIA RAPIDS tools and the NVIDIA Quadro RTX 6000 with NVLink were used to revolutionize this process.

A short set of Kyle’s key findings about the environment and the hardware is below. We’re excited about how this kind of rapid development could change healthcare:

Running workflows with GPU libraries can speed up code by orders of magnitude — which can mean hours instead of weeks with every experiment run

Additionally, if you’ve ever set up a data science environment from scratch you know it can really suck. Having Docker, RAPIDS, TensorFlow, PyTorch and everything else installed and configured out-of-the-box saved hours in setup time

…

With these general-purpose data science libraries offering massive computational enhancements for traditionally CPU-bound processes (data loading, cleansing, feature engineering, linear models, etc.), the path is paved to an entirely new frontier of data science.

Read on at Medium.com

One-shot Learning Methods Applied to Drug Discovery with DeepChem
July 26, 2017

Experimental data sets for drug discovery are sometimes limited in size, due to the difficulty of gathering this type of data. Drug discovery data sets are expensive to obtain, and some are the result of clinical trials, which might not be repeatable for ethical reasons. The ClinTox data set, for example, is comprised of data from FDA clinical trials of drug candidates, where some data sets are derived from failures due to toxic side effects [2]. For cases where training data is scarce, the application of one-shot learning methods has demonstrated significantly improved performance over methods consisting only of graphical convolution networks. The performance of one-shot network architectures will be discussed here for several drug discovery data sets, which are described in Table 1.

These data sets, along with one-shot learning methods, have been integrated into the DeepChem deep learning framework as a result of research published by Altae-Tran, et al. [1]. While data remains scarce for some problem domains, such as drug discovery, one-shot learning methods offer an important alternative network architecture which can far outperform methods that use only graphical convolution.

Dataset | Category   | Description    | Network Type   | Number of Tasks | Compounds
Tox21   | Physiology | toxicity       | Classification | 12              | 8,014
SIDER   | Physiology | side reactions | Classification | 27              | 1,427
MUV     | Biophysics | bioactivity    | Classification | 17              | 93,127

Table 1. DeepChem drug discovery data sets investigated with one-shot learning.

One-Shot Network Architectures Produce Most Accurate Results When Applied to Small Population Training Sets

The original motivation for investigating one-shot neural networks arose from the fact that humans can learn sufficient representations, given small amounts of data, and then later apply a learned representation to correctly distinguish between objects which have been observed only once. The one-shot network architecture has previously been developed and applied to image data, with this motivational context in mind [3, 5].

The question arose as to whether an artificial neural network, given a small data set, could similarly learn a sufficient number of features through training, and perform at a satisfactory level. After some period of development, one-shot learning methods have emerged and demonstrated good success [3,4,5,6].

The description provided here of the one-shot approach focuses mostly on methodology, and less on the theoretical and experimental results which support the method. The simplest one-shot method computes a distance weighted combination of support set labels. The distance metric can be defined using a structure called a Siamese network, where two identical networks are used. The first twin produces a vector output for the molecule being queried, while the other twin produces a vector representing an element of the support set. Any difference between the outputs can be interpreted as a dissimilarity measure between the query structure and any particular structure in the support set. A separate dissimilarity measure can be computed for each element in the support set, and then a normalized, weighted representation of the query structure can be determined. For example, if the query structure is significantly less dissimilar to two structures, out of, say, twenty, in the support set, then the weighted representation will be nearly the average of the vectors which represent the two support structures which most resemble the queried structure.
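As a concrete illustration of this weighting scheme, the following is a minimal NumPy sketch (ours, not DeepChem's implementation) that combines support set labels using softmax-normalized cosine similarities between a query embedding and the support embeddings. The embeddings themselves are assumed to have been produced by the shared Siamese network; random vectors stand in for them here.

[sourcecode language="python"]
import numpy as np

def one_shot_predict(query_emb, support_embs, support_labels):
    """Distance-weighted combination of support set labels.

    query_emb:      (d,) embedding of the query molecule
    support_embs:   (n, d) embeddings of the n support molecules
    support_labels: (n,) binary assay outcomes for the support set
    """
    # Cosine similarity between the query and each support embedding
    sims = support_embs @ query_emb / (
        np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    # Softmax turns the similarities into normalized attention weights
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    # The weighted combination of support labels is the predicted probability
    return float(weights @ support_labels)

# Toy usage: random embeddings stand in for Siamese network outputs
rng = np.random.default_rng(0)
support_embs = rng.normal(size=(20, 64))
support_labels = np.array([1.0] * 10 + [0.0] * 10)
query_emb = rng.normal(size=64)
print(one_shot_predict(query_emb, support_embs, support_labels))
[/sourcecode]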

There are two other one-shot methods which take more complex considerations into account. In the Siamese network one-shot approach, the vector embeddings of both the query structure and each individual support structure are computed independently of the support set. However, it has been shown empirically that, by taking into account the context of all support set elements when computing the vector embeddings of the query and each individual support structure, better one-shot network performance can be realized. This approach is called full context embedding, since the full context of the support set is taken into account when computing every vector embedding. In the full context embedding approach, the embeddings for every support structure are allowed to influence the embedding of the query structure.

The full context embedding approach uses Siamese, i.e. matching, networks as before, but once the embeddings are computed, they are then further processed by Long Short-Term Memory (LSTM) structures. The embeddings, before processing by LSTM structures, will be referred to here as pre-contextualized vectors. The full contextual embeddings for the support structures are produced using an LSTM structure called a bidirectional LSTM (biLSTM), while the full contextual embedding for the query structure is produced by an LSTM structure called an attentional LSTM (attLSTM). An LSTM is a type of recurrent neural network which can process sequences of input. With the biLSTM, the support set is viewed as a sequence of vectors. A bidirectional LSTM is used, instead of just an LSTM, in order to reduce dependence on the sequence order. This improves model performance because the support set has no natural order. However, not all dependence on sequence order is removed with the biLSTM.

The attLSTM constructs an order-independent full contextual embedded vector of the query structure. The full details of the attLSTM will not be discussed here, beyond saying that both the biLSTM and attLSTM are network elements which interpret some set of structures as a sequence of pre-contextualized vectors, and convert that sequence into a single full context embedded vector. One full context embedded vector is produced for the support set of structures, and one is produced for the query structure.

A further improvement has been made to the one-shot model described here. As mentioned, the biLSTM does not produce an entirely order-independent full context embedding for each pre-contextualized vector corresponding to a support structure. Since the support set has no natural order, any sequence-order dependence present in the full context embedded support vectors is an unwanted artifact, and will lead to reduced model performance. There is another problem: as defined, the full context embedded vectors of the support structures depend only on the pre-contextualized support vectors, and not on the pre-contextualized query vector. On the other hand, the full context embedded vector of the query structure depends on both its own pre-contextualized vector and the pre-contextualized vectors of the support set. This asymmetry indicates that some additional information is not accounted for in the model, and that performance could be improved if the asymmetry, and the order dependence of the full context embedded support vectors, could be removed.

To address these problems, a new LSTM model was developed by Altae-Tran, et al., called the Iteratively Refined LSTM (IterRefLSTM). The full details of how the IterRefLSTM model operates are beyond the scope of this discussion; a full explanation can be found in Altae-Tran, et al. [1]. Put briefly, the full contextual embedded vectors of the support and query structures are co-evolved in an iterative process which uses an attLSTM element. This results in removal of the order dependence in the full contextual embedding for the support set, as well as removal of the asymmetry in dependency between the full context embedded vectors of the support and query structures.

A brief summary of the one-shot network architectures discussed is presented in Table 2.

Architecture | Description
Siamese Networks | score comparison, dissimilarity measure
Attention LSTM (attLSTM) | better extraction of prior data; retains order dependence of input data
Iterative Refinement LSTMs (IterRefLSTM) | similar to attLSTM, but removes all order dependence of the data by iteratively evolving the query and support embeddings simultaneously in an iterative loop

Table 2. One-shot networks used for investigating low-population biological assay data sets.

Computed One-Shot Performance Metrics Compared to Published Values

A comparison of independently computed values is made here with published values from Altae-Tran, et al. [1]. Quantitative results for classification tasks associated with the Tox21, SIDER, and MUV datasets were obtained by evaluating the area under the receiver operating characteristic curve (read more on AUROC). For datasets having more than one task, the median of the performance metric over all tasks in the held-out data sets is reported. A k-fold cross-validation was then done, with k=4. The mean of performances across all cross-validations was then taken, and reported as the performance measure for the data set. A discussion of the standard deviation is given further below.

Since the tasks for Tox21, SIDER, and MUV are all classification tasks for binary assay data, with positive and negative results from a clinical trial, for example, the performance values, as mentioned, are reported with the AUROC metric. With AUROC, a value of 0.5 indicates no predictive power, while a result of 1.0 indicates that every outcome in the held-out data set has been predicted correctly [Kennis Research, 9]. A value less than 0.5 can be interpreted as 1.0 minus the metric value; this operation corresponds to inverting the model, where True is now False, and vice versa. This way, a metric value between 0.5 and 1.0 can always be realized. Each data set performance is reported with a standard deviation, which reflects the dispersion of metric values across classification tasks and across the k cross-validations.
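The scoring convention just described can be summarized in a short sketch (our illustration, not the DeepChem benchmark script itself): the AUROC is computed per task, values below 0.5 are inverted, and the median across tasks is reported for the data set.

[sourcecode language="python"]
import numpy as np
from sklearn.metrics import roc_auc_score

def task_auroc(y_true, y_score):
    """AUROC for a single binary task; values below 0.5 are inverted,
    which corresponds to flipping the model's predictions."""
    auc = roc_auc_score(y_true, y_score)
    return max(auc, 1.0 - auc)

def dataset_performance(task_results):
    """task_results: list of (y_true, y_score) pairs, one per held-out task.
    Returns the median AUROC across tasks, as reported for each data set."""
    per_task = [task_auroc(y_true, y_score) for y_true, y_score in task_results]
    return float(np.median(per_task))
[/sourcecode]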

Our computed values match well with those published by Altae-Tran, et al. [1], and essentially confirm the performance metric values from their ACS Central Science publication. The first and second rows in Table 3 show classification performance for RF and GC, respectively, as computed by Altae-Tran, et al. Single-task GC and RF results are presented as a baseline of comparison to the one-shot methods.

The use of k-fold cross-validation improves the estimate of how the model would perform if trained on all of the data, rather than on just a training subset with a portion reserved for testing. Since we cannot directly measure the performance of a model trained on the full data set (no testing data would remain), k-fold cross-validation provides a best guess of a performance we cannot see until final deployment, where a deployed network would be trained on all of the data.

Method | Tox21 | SIDER | MUV
Random Forests‡,⁑ | 0.539 ± 0.049 | 0.557 ± 0.059 | 0.751 ± 0.062 Ω
Graphical Convolution‡,⁑ | 0.625 ± 0.036 | 0.482 ± 0.038 | 0.583 ± 0.061
Siamese Networks | 0.783 ± 0.009 | 0.660 ± 0.088 | 0.500 ± 0.043
AttLSTM | 0.759 ± 0.007 | 0.607 ± 0.080 | 0.500 ± 0.058
IterRefLSTM | 0.807 ± 0.003 Ω | 0.751 ± 0.002 Ω | 0.533 ± 0.051

Table 3. AUROC performance metric values for each one-shot method, plus the random forests (RF) and graphical convolution (GC) methods. Metric values were measured across Tox21, SIDER, and MUV test data sets, using a trained modelΦ. Randomness arises from using a trained model to evaluate the AUROC metric on a test set. First, a support setΨ, S, of 20 data points is chosen from the set of data points for a test task. The metric is then evaluated over the remaining points in a test task data set. This process is repeated 20 times for every test task in the data set. The mean and standard deviation for all AUROC measures generated in this way are computed.

Finally, for each data set (Tox21, SIDER, and MUV), the reported performance result is actually the median performance value across all test tasks for a data set. This indirectly implies that the individual metric performances on individual tasks are unimportant, and that they more or less tend to all do well or poorly together, without too much variance across tasks. However, a median measure can mask outliers, where performance on one task might be very bad. If outliers can be removed for rational reasons, then using the median across task performance can be an effective way of removing the influence of outliers.


‡ The performance measures for RF and GC were computed with one-fold cross validation (i.e. no cross-validation). This is because the RF and GC scripts available with our current version of DeepChem (July 2017) are written to perform only one-fold validation with these models.

⁑ The variances of k-fold cross-validation performance estimates were determined by pooling all performance values and then finding the median variance of the entire pool. More complex techniques exist for estimating the variance from a cross-validated set, and the reader is invited to investigate other methods [Nadeau, et al., 13].

Ω The IterRefLSTM performance on the Tox21 data set is the only measure which rates as good. IterRefLSTM performance on the SIDER dataset rates as fair, as does RF performance on MUV.

Φ Since network inference (predicting outcomes) can be done much faster than network training, due to the computationally expensive backpropagation algorithm, only a batch, B, of data points (and not the entire training data, excluding support data) is selected for training. A support set, S, of 20 data points, along with a batch of queries, B, of 128 data points, is selected for each training set task, in each of the held-out training sets, for a given episode of training.

A number of training episodes equal to 2000 * ntrain is performed, with one step of minimization performed by the ADAM optimizer per episode [11]. Here, ntrain is the number of tasks in the training set. After the total number of training episodes has been run, an intermediate information structure for the attLSTM and IterRefLSTM models, called the embedding vector set (described earlier), is produced. It should be noted that the same size of support set, S, is also used during model testing on the held-out testing tasks.

Ψ Every support set, S, whether selected during training or testing, was chosen so that it contained 10 positive and 10 negative samples for the task in question. In the full study done in [1], however, variations on the number of positive and negative samples in the support set, S, are explored. The investigators found that sampling more data points in S, rather than increasing the number of backpropagation iterations, resulted in better model performance.


It should be noted that, for a support set of 10 positive and 10 negative assay results, our computed results for the Siamese method on MUV do not show any predictive performance. The results published by Altae-Tran, et al., however, indicate marginal but poor predictive ability, with an AUROC metric value of 0.601 ± 0.041.

Our metric was computed several times on both a Tesla P100 16GB GPU and a Tesla M40 GPU, but we found that, with this particular support set, the Siamese model has no predictive power, with a metric value of 0.500 ± 0.043 (see Table 3). Our other computed results for the AttLSTM and IterRefLSTM concur with the published results, which show that neither one-shot learning method has predictive power on MUV data with a support set containing 10 positive and 10 negative assay results.

The Iterative Refinement LSTM shows a narrower dispersion of scores than the other one-shot learning models. This result agrees with the published standard deviation values for the IterRefLSTM in Altae-Tran, et al. [1].

Speedup factors are determined by comparing runtimes on the NVIDIA Tesla P100 GPU to runtimes on the Tesla M40 GPU, and are presented in Tables 4 and 5. The speedup factors are found to be less pronounced for the one-shot methods, and an explanation of the speedup results is presented. The approach for training and testing one-shot methods is also described, as it involves some extra considerations which do not apply to graphical convolution.

Model | Tox21 (P100) | SIDER (P100) | MUV (P100) | Tox21 (M40) | SIDER (M40) | MUV (M40)
Random Forests | 25 | 37 | 84 | 24 | 37 | 83
Graphical Convolution | 38 | 79 | 64 | 41 | 100 | 720
Siamese | 857 | 2,180 | 1,464 | 956 | 2,407 | 1,617
AttLSTM | 933 | 2,405 | 1,591 | 1,041 | 2,581 | 1,725
IterRefLSTM | 1,006 | 2,511 | 1,680 | 1,101 | 2,721 | 1,834

Table 4. Runtimes for each model on the NVIDIA Tesla M40 and Tesla P100 16GB PCIe GPUs. All runtimes are in seconds.

RF runs entirely on the CPU, and its runtimes reflect CPU performance. They are listed for reference only and are not considered when determining GPU speedup factors.

A quick inspection of the results in Table 3 shows that the one-shot methods perform better on the Tox21 and SIDER data sets, but not on the MUV data. A reason for the poor performance of one-shot methods on the MUV data is proposed below.

Limitations of One-Shot Networks

Compared to previous methods, one-shot networks demonstrate extraction of more information from the prior (support) data than RF or GC, but with a limitation. One-shot methods are only successful when data in the held-out testing set is sufficiently similar to data seen during training. Networks trained using one-shot methods do not perform well when trying to classify data that is too dissimilar from the sample data used for training. In the context of drug discovery, this problem is encountered when trying to apply one-shot learning to the Maximum Unbiased Validation (MUV) dataset, for example [10]. Benchmark results show that all three one-shot learning methods explored here do little better than pure chance when making classification predictions with MUV data (see Table 3).

The MUV dataset contains around 93,000 compounds, and represents a more diverse collection of molecular scaffolds than Tox21 and SIDER. One-shot methods do not perform as well on this data set, probably because there is less structural similarity between the elements of the MUV dataset, compared to Tox21 and SIDER. One-shot networks require some amount of structural similarity within the data set in order to extrapolate from limited data and correctly classify new, but similar, compounds.

A metric of self-similarity within a data set could be computed as a size-independent measure: every element is compared to every other element, and some similarity measure, such as cosine similarity, is evaluated for each pair. The similarity measure can be summed over all unique comparisons and then normalized by dividing by the number of unique comparisons among the N elements in the set.
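A minimal sketch of such a measure (our illustration, assuming each compound has already been featurized as a fixed-length vector) is the mean cosine similarity over all unique pairs:

[sourcecode language="python"]
import numpy as np

def self_similarity(X):
    """Mean pairwise cosine similarity over all unique pairs of rows in X.

    X: (N, d) array of molecular fingerprints or embeddings
       (any fixed-length vector featurization will do).
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    sims = Xn @ Xn.T                        # (N, N) cosine similarity matrix
    iu = np.triu_indices(X.shape[0], k=1)   # indices of the unique pairs (i < j)
    return float(sims[iu].mean())           # normalized by the number of unique pairs
[/sourcecode]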

Model | Tox21 | SIDER | MUV
GC | 1.079 | 1.266 | 11.25 α
Siamese | 1.116 λ | 1.104 | 1.105
AttLSTM | 1.116 λ | 1.116 | 1.084
IterRefLSTM | 1.094 λ | 1.084 | 1.092

Table 5. Speedup factors, comparing the Tesla P100 16GB GPU to the Tesla M40. All speedups are Tesla M40 runtimes divided by Tesla P100 runtimes.


α The greatest speedup is observed with GC on the MUV data set (Table 5).

The most precipitous drop in runtime performance also occurs in the transition from GC to the one-shot models. Table 4 indicates that the GC model runs faster across all data sets, compared to the one-shot methods. This is not surprising, because the purely graphical model is more amenable to GPU acceleration. However, it is crucial to note that GC models predict worse than one-shot models on Tox21 and SIDER, but not on MUV. On MUV, GC has nearly no predictive ability (Table 3), compared to the one-shot models, which have no predictive ability at all on MUV.

λ The one-shot networks, while providing a substantial improvement in predictive performance, do not seem to show a significant speedup, as can be seen in the rows for Siamese, attLSTM, and IterRefLSTM. The nearly absent speedup could arise from high GPU-to-system-memory transfer overhead. Note, however, that there is a slight but consistent improvement in speedup for the one-shot networks on the Tox21 set. The Tox21 data set may therefore require fewer transfers to system memory. The generally flat speedup for one-shot methods may come from the LSTM elements.


Generally, deep convolutional network models such as GC, or models which benefit from having a large data set containing structurally diverse groups, such as RF and GC, perform better on the MUV data. RF, for example, shows the best performance of the methods on MUV, though it is still only fair. Deep networks have demonstrated that, provided enough layers, they have the information-holding capacity required to learn and retain representations for the MUV data. Their information-holding capacity is what enables them to classify between the large number of structurally diverse classes in MUV. It may be the case that the hyperparameters for the graphical convolutional network were simply not set such that the GC model could yield a poor-to-fair level of performance on MUV. In their paper, Altae-Tran, et al. state that the hyperparameters for the convolutional networks were not optimized, and that there may be an opportunity to improve performance there [1].

Remarks on Neural Network Information Structure, and How One-Shot Networks are Different

All neural networks require training data in order to develop structure under training pressure. Feature complexity, in image classification networks, becomes stratified under training pressure through the network’s layers. The lowest layers emerge as edge detectors, with successive layers building upon features from previous layers. The second layer, for example, can build corner detectors, or curved edge detectors, by detecting combinations of simpler edges. Through a buildup of feature complexity, higher layers eventually emerge which can detect complex, high-level features such as faces. The combinatorial size of the detectable feature space grows with the number of viable filters (kernels) connecting each layer to the preceding layer. With Natural Language Processing (NLP) networks, layer complexity progresses from sentence features, to paragraphs, then chapters, and finally whole-book vector representations, which consist of succinct thematic summaries of written works.

To reiterate, all networks require information structure, acquired under training pressure, to develop some inner representation, or “belief”, about data. Deep networks allow for more diverse structures to be learned, compared to one-shot networks, which are limited in their ability to learn diverse representations. One-shot structural features are designed to improve extraction of information from support data, in order to learn a representation which can be used to extrapolate from a smaller group of similar classes. One-shot methods do not perform as well as RF on MUV because they are not designed to produce a useful network from a data set with so much dissimilarity between the molecular scaffolds of its elements, as is the case with MUV.

Transfer Learning with One-Shot Learning Network Architecture

A network trained on the Tox21 data set was evaluated on the SIDER data set. The results, given by the performance metric values shown in Table 6, indicate that the network trained on Tox21 has nearly no predictive power on the SIDER data. This indicates that the performance does not generalize well to new chemical scaffolds, which supports the explanation for why one-shot methods do poorly at predicting the results for the MUV dataset.

Transfer | Siamese | attnLSTM | IterRefLSTM
To SIDER from Tox21 | 0.505 | 0.502 | 0.504

Table 6. Transfer Learning to SIDER from Tox21. These results agree with the performance metric values reported for transfer learning in [1], and support the conclusion that transfer learning between data sets will result in no predictive capability, unless the data sets are significantly similar.

Conclusion

For binary classification tasks associated with small-population data sources, one-shot learning methods may provide significantly better results compared to the baseline performances of graphical convolution and random forests. The results show that the performance of one-shot learning methods may depend on the diversity of molecular scaffolds in a data set. With MUV, for example, one-shot methods did not extrapolate well to unseen molecular scaffolds. The failure of transfer learning from the Tox21 network to correctly predict SIDER assay outcomes also indicates that training may not be easily generalized across data sets with one-shot networks.

The Iterative Refinement LSTM method developed in [1] demonstrates that LSTMs can generalize to similar experimental assays which are not identical to assays in the data set, but which have some common relation.

References

1.) Altae-Tran, Han, Ramsundar, Bharath, Pappu, Aneesh S., and Pande, Vijay. “Low Data Drug Discovery with One-Shot Learning.” ACS Central Science 3.4 (2017): 283-293.
2.) Wu, Zhenqin, et al. “MoleculeNet: A Benchmark for Molecular Machine Learning.” arXiv preprint arXiv:1703.00564 (2017).
3.) Hariharan, Bharath, and Ross Girshick. “Low-shot visual object recognition.” arXiv preprint arXiv:1606.02819 (2016).
4.) Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition.” ICML Deep Learning Workshop. Vol. 2. 2015.
5.) Vinyals, Oriol, et al. “Matching networks for one shot learning.” Advances in Neural Information Processing Systems. 2016.
6.) Wang, Peilu, et al. “A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding.” arXiv preprint arXiv:1511.00215 (2015).
7.) Duvenaud, David K., et al. “Convolutional networks on graphs for learning molecular fingerprints.” Advances in Neural Information Processing Systems. 2015.
8.) Lusci, Alessandro, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules.” Journal of Chemical Information and Modeling 53.7 (2013): 1563-1575.
9.) Receiver Operating Curves Applet, Kennis Research, 2016.
10.) Maximum Unbiased Validation (MUV) Chemical Data Set.
11.) Kingma, D. and Ba, J. “Adam: a Method for Stochastic Optimization.” arXiv preprint: arxiv.org/pdf/1412.6980v8.pdf.
12.) University of Nebraska Medical Center online information resource, AUROC.
13.) Nadeau, Claude, and Yoshua Bengio. “Inference for the Generalization Error.”

GPU-accelerated HPC Containers with Singularity
April 11, 2017

Fighting with application installations is frustrating and time consuming. It’s not what domain experts should be spending their time on. And yet, every time users move their project to a new system, they have to begin again with a re-assembly of their complex workflow.

This is a problem that containers can help to solve. HPC groups have had some success with more traditional containers (e.g., Docker), but there are security concerns that have made them difficult to use on HPC systems. Singularity, the new tool from the creator of CentOS and Warewulf, aims to resolve these issues.

Singularity helps you to step away from the complex dependencies of your software apps. It enables you to assemble these complex toolchains into a single unified tool that you can use just as simply as you’d use any built-in Linux command. A tool that can be moved from system to system without effort.

Surprising Simplicity

Of course, HPC tools are traditionally quite complex, so users seem to expect Singularity containers to also be complex. Just as virtualization is hard for novices to wrap their heads around, the operation of Singularity containers can be disorienting. For that reason, I encourage you to think of your Singularity containers as a single file; a single tool. It’s an executable that you can use just like any other program. It just happens to have all its dependencies built in.

This means it’s not doing anything tricky with your data files. It’s not doing anything tricky with the network. It’s just a program that you’ll be running like any other. Just like any other program, it can read data from any of your files; it can write data to any local directory you specify. It can download data from the network; it can accept connections from the network. InfiniBand, Omni-Path and/or MPI are fully supported. Once you’ve created it, you really don’t think of it as a container anymore.

GPU-accelerated HPC Containers

When it comes to utilizing the GPUs, Singularity will see the same GPU devices as the host system. It will respect any device selections or restrictions put in place by the workload manager (e.g., SLURM). You can package your applications into GPU-accelerated HPC containers and leverage the flexibilities provided by Singularity. For example, run Ubuntu containers on an HPC cluster that uses CentOS Linux; run binaries built for CentOS on your Ubuntu system.

As part of this effort, we have contributed a Singularity image for TensorFlow back to the Singularity community. This image is available pre-built for all users on our GPU Test Drive cluster. It’s a fantastically easy way to compare the performance of CPU-only and GPU-accelerated versions of TensorFlow. All one needs to do is switch between executables:

Executing the pre-built TensorFlow for CPUs

[eliot@node2 ~]$ tensorflow_cpu ./hello_world.py
Hello, TensorFlow!
42

Executing the pre-built TensorFlow with GPU acceleration

[eliot@node2 ~]$ tensorflow_gpu ./hello_world.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:06:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:07:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB

[...]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
Hello, TensorFlow!
42

As shown above, the tensorflow_cpu and tensorflow_gpu executables include everything that’s needed for TensorFlow. You can just think of them as ready-to-run applications that have all their dependencies built in. All you need to know is where the Singularity container image is stored on the filesystem.

Caveats of GPU-accelerated HPC containers with Singularity

In earlier versions of Singularity, the nature of NVIDIA GPU drivers required a couple extra steps during the configurations of GPU-accelerated containers. Although GPU support is still listed as experimental, Singularity now offers a --nv flag which passes through the appropriate driver/library files. In most cases, you will find that no additional steps are needed to access NVIDIA GPUs with a Singularity container. Give it a try!
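For example, assuming a local image file named tensorflow.img (the image name here is just a placeholder), a GPU-enabled run typically looks like:

[eliot@node2 ~]$ singularity exec --nv tensorflow.img python ./hello_world.py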

Taking the next step on GPU-accelerated HPC containers

There are still many use cases left to be discovered. Singularity containers open up a lot of exciting capabilities. As an example, we are leveraging Singularity on our OpenPower systems (which provide full NVLink connectivity between CPUs and GPUs). All the benefits of Singularity are just as relevant on these platforms. The Singularity images cannot be directly transferred between x86 and POWER8 CPUs, but the same style of Singularity recipes may be used. Users can run a pre-built TensorFlow image on x86 nodes and a complementary image on POWER8 nodes. They don’t have to keep all the internals and dependencies in mind as they build their workflows.

Generating reproducible results is another anticipated benefit of Singularity. Groups can publish complete and ready-to-run containers alongside their results. Singularity’s flexibility will allow those containers to continue operating flawlessly for years to come – even if they move to newer hardware or different operating system versions.

If you’d like to see Singularity in action for yourself, request an account on our GPU Test Drive cluster. For those looking to deploy systems and clusters leveraging Singularity, we provide fully-integrated HPC clusters with Singularity ready-to-run. We can also assist by building optimized libraries, applications, and containers. Contact an HPC expert.

This post was updated 2017-06-02 to reflect recent changes in GPU support.

Accelerating Code with OpenACC and the NVIDIA Visual Profiler
March 14, 2016

Comprised of a set of compiler directives, OpenACC was created to accelerate code using the many streaming multiprocessors (SM) present on a GPU. Similar to how OpenMP is used for accelerating code on multicore CPUs, OpenACC can accelerate code on GPUs. But OpenACC offers more, as it is compatible with multiple architectures and devices, including multicore x86 CPUs and NVIDIA GPUs.

Here we will examine some fundamentals of OpenACC by accelerating a small program consisting of iterations of simple matrix multiplication. Along the way, we will see how to use the NVIDIA Visual Profiler to identify parts of the code which call OpenACC compiler directives. Graphical timelines displayed by the NVIDIA Visual Profiler visually indicate where greater speedups can be achieved. For example, applications which perform excessive host to device data transfer (and vice versa), can be significantly improved by eliminating excess data transfer.

Industry Support for OpenACC

OpenACC is the result of a collaboration between PGI, Cray, and CAPS. It is an open specification which sets out compiler directives (sometimes called pragmas). The major compilers supporting OpenACC at inception came from PGI, Cray, and CAPS. The OpenACC Toolkit (which includes the PGI compilers) is available for download from NVIDIA.

The free and open source GNU GCC compiler also supports OpenACC. This support may trail the commercial implementations.

Introduction to Accelerating Code with OpenACC

OpenACC facilitates the process of accelerating existing applications by requiring changes only to compute-intense sections of code, such as nested loops. A nested loop might go through many serial iterations on a CPU. By adding OpenACC directives, which look like specially-formatted comments, the loop can run in parallel to save significant amounts of runtime. Because OpenACC requires only the addition of compiler directives, usually along with small amounts of re-writing of code, it does not require extensive re-factoring of code. For many code bases, a few dozen effectively-placed compiler directives can achieve significant speedup (though it should be mentioned that most existing applications will likely require some amount of modification before they can be accelerated to near-maximum performance).

OpenACC is relatively new to the set of frameworks, software development kits, and programming interfaces available for accelerating code on GPUs. The 1.0 stable release of OpenACC was first made available in November 2011, and the 2.0 stable release followed in June 2013. OpenACC 3.0 is current as of November 2019.

Figure 1 The Maxwell architecture Streaming Multiprocessor (SM)

By reading OpenACC directives, the compiler assembles CUDA kernels from each section of compute-intense code. Each CUDA kernel is a portion of code that will be sent to the many GPU Streaming Multiprocessor processing elements for parallel execution (see Figure 1).

The Compute Unified Device Architecture (CUDA) is an application programming interface (API), which was developed by NVIDIA for the C and Fortran languages. CUDA allows for parallelization of computationally-demanding applications. Those looking to use OpenACC do not need to know CUDA, but those looking for maximum performance usually need to use some direct CUDA calls. This is accomplished either by the programmer writing tasks as CUDA kernels, or by calling a CUDA ‘drop-in’ library. With these libraries, a developer invokes accelerated routines without having to write any CUDA kernels. Such CUDA ‘drop-in’ libraries include CUBLAS, CUFFT, CURAND, CUSPARSE, NPP, among others. The libraries mentioned here by name are included in the freely available CUDA toolkit.

While OpenACC makes it easier for scientists and engineers to accelerate large and widely-used code bases, it is sometimes only the first step. With CUDA, a more extensive process of code refactoring and acceleration can be undertaken. Greater speedups can be achieved using CUDA. OpenACC is therefore a relatively easy first step toward GPU acceleration. The second (optional), and more challenging step requires code refactoring with CUDA.

OpenACC Parallelization Reports

There are several tools available for reporting information on the parallel execution of an OpenACC application. Some of these tools run within the terminal and are text-based. The text reports can be generated by setting particular environment variables (more on this below), or by invoking compiler options when compiling at the command line. Text reports will provide detail on which portions of the code can be accelerated with kernels.
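With the PGI compilers, for example, these runtime reports can typically be enabled by setting environment variables before running the executable (exact variable names and output formats may vary between compiler versions):

export PGI_ACC_TIME=1
export PGI_ACC_NOTIFY=3

The first setting prints a per-region timing summary when the program exits; the second prints a line for each kernel launch and each data transfer.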

The NVIDIA Visual Profiler has a graphical interface which displays a timeline detailing when data transfers occur between the host and device. Kernel launches and runtimes are indicated with a colored horizontal bar. The graphical timeline and the text reports in the terminal together provide important information which can indicate sections of code that are reducing performance. By locating inefficiencies in data transfers, for example, the runtime can be reduced by restructuring parallel regions. The example below illustrates a timeline report showing excessive data transfers between the system and the GPU (the host and the device).

Applying OpenACC to Accelerate Matrix Operations

Start with a Serial Code

To illustrate OpenACC usage, we will examine an application which performs common matrix operations. To begin, look at the serial version of the code (without OpenACC compiler directives) in Figure 2:

[sourcecode language="C"]
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "math.h"

void fillMatrix(int size, float **restrict A) {
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            A[i][j] = ((float)i);
        }
    }
}

float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            float tmp = 0.;
            for (int k = 0; k < size; ++k) {
                tmp += A[i][k] * B[k][j];
            }
            C[i][j] = tmp;
        }
    }
    return C;
}

float** MakeMatrix(int size, float **restrict arr) {
    int i;
    arr = (float **)malloc( sizeof(float *) * size);
    arr[0] = (float *)malloc( sizeof(float) * size * size);
    for (i=1; i<size; i++){
        arr[i] = (float *)(arr[i-1] + size);
    }
    return arr;
}

void showMatrix(int size, float **restrict arr) {
    int i, j;
    for (i=0; i<size; i++){
        for (j=0; j<size; j++){
            printf("arr[%d][%d]=%f \n",i,j,arr[i][j]);
        }
    }
}

void copyMatrix(float **restrict A, float **restrict B, int size){
    for (int i=0; i<size; ++i){
        for (int j=0; j<size; ++j){
            A[i][j] = B[i][j];
        }
    }
}

int main (int argc, char **argv) {
    int i, j, k;
    float **A, **B, **C;

    if (argc != 3) {
        fprintf(stderr,"Use: %s size nIter\n", argv[0]);
        return -1;
    }
    int size = atoi(argv[1]);
    int nIter = atoi(argv[2]);

    if (nIter <= 0) {
        fprintf(stderr,"%s: Invalid nIter (%d)\n", argv[0],nIter);
        return -1;
    }
    A = (float**)MakeMatrix(size, A);
    fillMatrix(size, A);
    B = (float**)MakeMatrix(size, B);
    fillMatrix(size, B);
    C = (float**)MakeMatrix(size, C);

    float startTime_tot = omp_get_wtime();
    for (int i=0; i<nIter; i++) {
        float startTime_iter = omp_get_wtime();
        C = MatrixMult(size, A, B, C);
        if (i%2==1) {
            //multiply A by B and assign back to A on even iterations
            copyMatrix(A, C, size);
        }
        else {
            //multiply A by B and assign back to B on odd iterations
            copyMatrix(B, C, size);
        }
        float endTime_iter = omp_get_wtime();
    }
    float endTime_tot = omp_get_wtime();
    printf("%s total runtime %8.5g\n", argv[0], (endTime_tot-startTime_tot));
    free(A); free(B); free(C);
    return 0;
}
[/sourcecode]

Figure 2 Be sure to include the stdio.h and stdlib.h header files. Without these includes, you may encounter segmentation faults during dynamic memory allocation for 2D arrays.

If the program is run in the NVIDIA Visual Profiler without any OpenACC directives, the console output will not include a timeline. Bear in mind that the runtime displayed in the console includes overhead from the profiler itself. To get a more accurate measurement of runtime, run without the profiler at the command line. To compile the serial executable with the PGI compiler, run:

pgcc -fast -o ./matrix_ex_float ./matrix_ex_float.c

The serial runtime, for five iterations with 1000x1000 matrices, is 7.57 seconds. Using larger 3000x3000 matrices, with five iterations increases the serial runtime to 265.7 seconds.

Parallelizing Matrix Multiplication

The procedure-calling iterative loop within main() cannot, in this case, be parallelized because the value of matrix A depends on a series of sequence-dependent multiplications. This is the case with all sequence-dependent evolution of data, such as with time stepped iterations in molecular dynamics (MD). In an obvious sense, loops performing time evolution cannot be run in parallel, because the causality between discrete time steps would be lost. Another way of stating this is that loops with backward dependencies cannot be made parallel.

With the application presented here, the correct matrix product is dependent on the matrices being multiplied together in the correct order, since matrix multiplication does not commute, in general. If the loop was run in parallel, the outcome would be unpredictable, and very likely not what the programmer intended. For example, the correct output for our application, after three iterations, takes on the form AxBxAxBxB. This accounts for the iterative reassignments of A and B to intermediate forms of the product matrix, C. After four iterations, the sequence becomes AxBxAxBxBxAxBxB. The main point: if this loop were to run in parallel, this sequence would very likely be disrupted into some other sequence, through the uncontrolled process of which threads, representing loop iterations, execute before others on the GPU.

[sourcecode language="C"]
for (int i=0; i<nIter; i++) {
    float startTime_iter = omp_get_wtime();
    C = MatrixMult(size, A, B, C);
    if (i%2==1) {
        //multiply A by B and assign back to A on even iterations
        copyMatrix(A, C, size);
    }
    else {
        //multiply A by B and assign back to B on odd iterations
        copyMatrix(B, C, size);
    }
    float endTime_iter = omp_get_wtime();
}
[/sourcecode]

We’ve established that the loop in main() is non-parallelizable, having an implicit dependence on the order of execution of loop iterations. To achieve a speedup, one must examine the routine within the loop: MatrixMult()

[sourcecode language="C"]
float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
    {
        float tmp;
        for (int i=0; i<size; ++i) {
            for (int j=0; j<size; ++j) {
                tmp = 0.;
                for (int k=0; k<size; ++k) {
                    tmp += A[i][k] * B[k][j];
                }
                C[i][j] = tmp;
            }
        }
    }
    return C;
}
[/sourcecode]

Here, a kernels OpenACC directive has been placed around all three for loops. Three happens to be the maximum number of nested loops that can be parallelized within a single nested structure. Note that the syntax for an OpenACC compiler directive in C takes on the following form:

#pragma acc kernels [clauses]

In the code above, the kernels directive tells the compiler that it should try to convert this section of code into a CUDA kernel for parallel execution on the device. Instead of describing a long list of OpenACC directives here, an abbreviated list of commonly used directives appears below in Table 1 (see the references for complete API documentation):

Commonly used OpenACC directives

Directive | Description
#pragma acc parallel | Start parallel execution on the device. The compiler will generate parallel code whether the result is correct or not.
#pragma acc kernels | Hint to the compiler that kernels may be generated for the defined region. The compiler may generate parallel code for the region if it determines that the region can be accelerated safely. Otherwise, it will output warnings and compile the region to run in serial.
#pragma acc data | Define contiguous data to be allocated on the device; establish a data region, minimizing excessive data transfers to/from the GPU
#pragma acc loop | Define the type of parallelism to apply to the proceeding loop
#pragma acc region | Define a parallel region where the compiler will search for code segments to accelerate. The compiler will attempt to automatically parallelize whatever it can, and report during compilation exactly what portions of the parallel region have been accelerated.

Table 1 OpenACC Compiler Directives

Along with directives, there can be modifying clauses. In the example above, we are using the kernels directive with the pcopyin(list) and pcopyout(list) clauses. These are abbreviations for present_or_copyin(list), and present_or_copyout(list).

  • pcopy(list) tells the compiler to copy the data to the device, but only if the data is not already present. Upon exiting from the parallel region, any data which is present will be copied to the host.
  • pcopyin(list) tells the compiler to copy the data to the device if it is not already there.
  • pcopyout(list) directs the compiler to copy the data out to the host at the end of the region; if the data is not already on the device, device memory is allocated for it first. The variables and arrays in list are those which will be copied.
  • The present_or_copy(list) clauses avoid the reduced performance of excessive data copies, since the data needed may already be present.

After adding the kernels directive to MatrixMult(), compile and run the executable in the profiler. To compile a GPU-accelerated OpenACC executable with PGI, run:

pgcc -fast -acc -ta=nvidia -Minfo -o ./matrix_ex_float ./matrix_ex_float.c

The -Minfo flag is used to enable informational messages from the compiler. These messages are crucial for determining whether the compiler is able to apply the directives successfully, or whether there is some problem which could possibly be solved. For an example of a compiler message reporting a warning, see the section ‘Using a Linearized Array Instead of a 2D Array’ in the next OpenACC blog, entitled ‘More Tips on OpenACC Code Acceleration‘.

To run the executable in the NVIDIA Visual Profiler, run:

nvvp ./matrix_ex_float 1000 5

During execution, the 1000x1000 matrices – A and B – are created and multiplied together into a product. The command line argument 1000 specifies the dimensions of the square matrix and the argument 5 sets the number of iterations for the loop to run through. The NVIDIA Visual Profiler will display the timeline below:

Figure 3 NVIDIA Visual Profiler timeline for the test case where pcopyin and pcopyout are used in MatrixMult()

Note that there are two Host to Device transfers of matrices A and B at the start of every iteration. Data transfers to the device, occurring after the first transfer, are excessive. In other words, every data copy after the first one is wasted time and lost performance.

Using the OpenACC data Directive to Eliminate Excess Data Transfer

Because the parallel region consists only of the loops in the MatrixMult() routine, entire copies of matrices A & B are passed to the device every time this routine is called. Since the data only needs to be sent before the first iteration, it makes sense to expand the data region to encompass every call to MatrixMult(). The boundary of the data region must be pushed out to encompass the loop in main(). By placing a data directive just outside of this loop, as shown in Figure 4, the unnecessary copying of A and B to the device after the first iteration is eliminated:

[sourcecode language="C"]
#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size],C[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
{
    float startTime_tot = omp_get_wtime();
    for (int i=0; i<nIter; i++) {
        float startTime_iter = omp_get_wtime();
        C = MatrixMult(size, A, B, C);
        if (i%2==1) {
            //multiply A by B and assign back to A on even iterations
            copyMatrix(A, C, size);
        }
        else {
            //multiply A by B and assign back to B on odd iterations
            copyMatrix(B, C, size);
        }
        float endTime_iter = omp_get_wtime();
    }
    float endTime_tot = omp_get_wtime();
}
[/sourcecode]
Figure 4 A data region is established around the for loop in main()

After recompiling and re-running the executable in NVIDIA’s Visual Profiler nvvp, the timeline in Figure 5 shows that the unnecessary transfers are now gone:

Figure 5 NVIDIA Visual Profiler timeline for the test case where pcopyin and pcopyout are used in MatrixMult() and the data region is used in main()

Now matrices A and B are copied to the device only once. Matrix C, the result, is still copied back to the Host at the end of the kernels region in MatrixMult() on every iteration. As shown in the table below, the runtime improvement for the 3000x3000 case is small but significant (1.9s vs. 1.5s). This reflects a 19.5% decrease in runtime, or a speedup of 1.24.

Runtimes for Various OpenACC Methods (in seconds)

OpenACC method | Matrix size 1000x1000 | Matrix size 3000x3000
no acceleration | 7.569 | 265.69
#pragma acc kernels in MatrixMult() | 0.3540 | 1.917
#pragma acc kernels in MatrixMult() and #pragma acc data in main() | 0.0539 | 1.543

Table 2 Runtimes for five iterations of matrix multiplication (C=AxB).

As data sizes increase, the amount of work grows and the benefits of parallelization become increasingly clear. For the larger 3000x3000 matrices, a speedup factor of 172 (265.69s / 1.543s) is realized when both the kernels and data directives are used.

Comparing Runtimes of OpenACC and OpenMP

Because OpenMP is also used as a method for parallelization of applications, it is useful to compare the two. To compare OpenACC with OpenMP, an OpenMP directive is added to the MatrixMult() routine:

[sourcecode language=”C”]
void MatrixMult(int size, float **restrict A, float **restrict B,
                float **restrict C) {
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
  #pragma omp parallel for default(none) shared(A,B,C,size)
  for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
      float tmp = 0.;
      for (int k=0; k<size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
}
[/sourcecode]

To compile the code with OpenMP parallelization, run:

pgcc -fast -mp ./matrix_ex_float.c -o ./matrix_ex_float_omp

The results were gathered on a Microway NumberSmasher server with dual 12-core Intel Xeon E5-2690v3 CPUs running at 2.6GHz. Runtimes were gathered when executing on 6, 12, and 24 CPU cores. This is achieved by setting the environment variable OMP_NUM_THREADS to 6, 12, and 24, respectively.
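For example, to reproduce the 12-thread, 3000x3000 data point with the executable built above (the arguments follow the same convention as the earlier runs: matrix size, then iteration count):

export OMP_NUM_THREADS=12
./matrix_ex_float_omp 3000 5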

Number of Threads | Runtime (in seconds)
6 | 37.758
12 | 18.886
24 | 10.348

Table 3 Runtimes achieved with OpenMP using 3000x3000 matrices and 5 iterations

It is clear that OpenMP is able to provide parallelization and good speedups (nearly linear). However, the GPU accelerator is able to provide more compute power than the CPUs. The results in Table 4 demonstrate that OpenMP and OpenACC both substantially increase performance. By utilizing a single NVIDIA Tesla M40 GPU, the OpenACC version runs 6.71x faster than the 24-thread OpenMP version.

Speedups Over Serial Runtime
serial | OpenMP speedup | OpenACC speedup
1 | 25.67x | 172x

Table 4 Relative Speedups of OpenACC and OpenMP for 3000x3000 matrices.

OpenACC Bears Similarity to OpenMP

As previously mentioned, OpenACC shares some commonality with OpenMP. Both are open standards, consisting of compiler directives for accelerating applications. Open Multi-Processing (OpenMP) was created for accelerating applications on multi-core CPUs, while OpenACC was primarily created for accelerating applications on GPUs (although OpenACC can also be used to accelerate code on other target devices, such as multi-core CPUs). Looking ahead, there is a growing consensus that the roles of OpenMP and OpenACC will become more and more alike.

OpenACC Acceleration for Specific GPU Devices

GPU Hardware Specifics

When a system has multiple GPU accelerators, a specific GPU can be selected either by using an OpenACC library procedure call, or by simply setting the environment variable CUDA_VISIBLE_DEVICES in the shell. For example, this would select GPUs #0 and #5:

export CUDA_VISIBLE_DEVICES=0,5
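Alternatively, the device can be selected from within the program through the OpenACC runtime API. The following is a minimal sketch, assuming the PGI OpenACC toolchain (which provides openacc.h and the acc_device_nvidia device type); the choice of device 0 is just an example:

[sourcecode language="C"]
#include <stdio.h>
#include <openacc.h>

int main(void) {
  // Query how many NVIDIA devices the OpenACC runtime can see
  int ndev = acc_get_num_devices(acc_device_nvidia);
  printf("OpenACC sees %d NVIDIA device(s)\n", ndev);

  // Direct all subsequent accelerator regions to device 0
  acc_set_device_num(0, acc_device_nvidia);

  /* ... accelerated regions would follow here ... */
  return 0;
}
[/sourcecode]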

On Microway’s GPU Test Drive Cluster, some of the Compute Nodes have a mix of GPUs, including two Tesla M40 GPUs labelled as devices 0 and 5. To see what devices are available on your machine, run the deviceQuery command (included with the CUDA Toolkit); pgaccelinfo, which comes with the OpenACC Toolkit, reports similar information.

When an accelerated application is running, you can view the resource allocation on the device by executing the nvidia-smi utility. Memory usage and GPU usage, listed by application, are reported for all GPU devices in the system.

Gang, Worker, and Vector Clauses

Although CUDA and OpenACC both use similar ideas, their terminology differs slightly. In CUDA, parallel execution is organized into grids, blocks (threadBlocks), and threads. In OpenACC, a gang is like a CUDA threadBlock, which executes on a processing element (PE). On a GPU device, the processing element (PE) is the streaming multiprocessor (SM). The OpenACC gangs (CUDA blocks) are scheduled across the available PEs.

An OpenACC worker is a group of vectors. The worker dimension extends across the height of a gang (threadBlock), while each vector is a CUDA thread and the vector dimension extends across the width of the threadBlock. Each worker therefore consists of vector threads. A worker corresponds to exactly one CUDA warp only when vector is 32; it does not have to correspond to a warp, and a worker spans two warps if vector is 64, for example. The significance of a warp is that all threads in a warp run concurrently.
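As a concrete illustration of this mapping (the clause values here are only examples, not taken from the code in this post): a loop scheduled with worker(8) and vector(32) produces gangs of 8 x 32 = 256 threads each, i.e. eight full warps per threadBlock.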

Figure 6 A CUDA grid consists of blocks of threads (threadBlocks), which can be arranged in one or two dimensions.

Figure 6 illustrates a threadBlock, represented as part of a 2D grid containing multiple threadBlocks. In OpenACC, the grid consists of a number of gangs, which can extend into one or two dimensions. As depicted in Figure 7, the gangs extend into one dimension. It is possible, however, to arrange gangs into a two dimensional grid. Each gang, or threadBlock, in both figures 6 and 7 is comprised of a 2D block of threads. The number of vectors, workers, and gangs can be finely tuned for a parallel loop.

Sometimes it is faster to have each block execute its kernel more than once, rather than launching enough blocks for every piece of work to be handled exactly once. Discovering the optimal amount of kernel re-execution can require some trial and error. In OpenACC, this corresponds to specifying fewer gangs than there are iterations in the loop that is parallelized across gangs, so that each gang processes multiple iterations.

In CUDA, threads execute in groups of 32 at a time. These groups of 32 threads, as mentioned, are called warps, and they execute concurrently. In Figure 8, the block width (the vector length) is set to 32 threads, so each worker maps onto a full warp; this allows more threads to execute concurrently, and the program runs faster.

Additional runtime output, with kernel runtimes, grid size, and block size:

Note: the kernel reports can only be generated by compiling with the time target, as shown below (read more about this in our next blog post). To compile with kernel reports, run:

pgcc -fast -acc -ta=nvidia,time -Minfo -o ./matrix_ex_float ./matrix_ex_float.c

Once the executable is compiled with the nvidia and time arguments, a kernel report will be generated during execution:

[john@node6 openacc_ex]$ ./matrix_ex_float 3000 5
./matrix_ex_float total runtime 1.3838

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float.c
MatrixMult NVIDIA devicenum=0
time(us): 1,344,646
19: compute region reached 5 times
26: kernel launched 5 times
grid: [100x100] block: [32x32]
device time(us): total=1,344,646 max=269,096 min=268,685 avg=268,929
elapsed time(us): total=1,344,846 max=269,144 min=268,705 avg=268,969
19: data region reached 5 times
35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float.c
main NVIDIA devicenum=0
time(us): 8,630
96: data region reached 1 time
31: data copyin transfers: 6
device time(us): total=5,842 max=1,355 min=204 avg=973
31: kernel launched 3 times
grid: [24] block: [128]
device time(us): total=19 max=7 min=6 avg=6
elapsed time(us): total=509 max=432 min=34 avg=169
128: data region reached 1 time
128: data copyout transfers: 3
device time(us): total=2,769 max=1,280 min=210 avg=923


Figure 7 An OpenACC threadBlock has vertical dimension worker, and horizontal dimension vector. The grid consists of gang threadBlocks.

[sourcecode language=”C”]
float** MatrixMult(int size, int nr, int nc, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels loop pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size]) gang(100), vector(32)
  for (int i = 0; i < size; ++i) {
    #pragma acc loop gang(100), vector(32)
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      #pragma acc loop reduction(+:tmp)
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}
[/sourcecode]
Figure 8 OpenACC code with gang and vector clauses. The fully accelerated OpenACC version of the C source code can be downloaded here.

The directive clause gang(100), vector(32), on the j loop, sets the block width to 32 threads (warp size), which makes parallel execution faster. Integer multiples of a warp size will also realize greater concurrency, but not usually beyond a width of 64. The same clause sets the grid width to 100. The directive clause on the outer i loop, gang(100), vector(32), sets the grid height to 100, and block height to 32. The block height specifies that the loop iterations are processed in SIMT groups of 32.

By adding the gang and vector clauses, as shown in Figure 8, the runtime is reduced to 1.3838 sec (a speedup of 1.12x over the best runtime in Table 2).

Targeting GPU Architectures with the Compiler

OpenACC is flexible in its GPU support, targeting a variety of GPU generations and capabilities. The target options in the table below illustrate how different compute capabilities, GPU architectures, and CUDA versions can be selected.

compute capability | GPU architecture | CUDA version | CPU
-ta=nvidia[,cc10|cc11|cc12|cc13|cc20], -ta=tesla:cc35, -ta=nvidia,cc35 | -ta=tesla, -ta=nvidia | -ta=cuda7.5, -ta=tesla:cuda6.0 | -ta=multicore

Table 5 Various GPU target architecture options for the OpenACC compiler
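For example, to build the same matrix-multiplication executable for a device of compute capability 3.5, the compile line used earlier in this post can be combined with one of the target flags from Table 5:

pgcc -fast -acc -ta=tesla:cc35 -Minfo -o ./matrix_ex_float ./matrix_ex_float.c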

OpenACC for Fortran

Although we have focused here on using OpenACC in the C programming language, there is robust OpenACC support for the Fortran language, and the syntax for compiler directives is only slightly different. In C, where dynamic memory is accessed through pointers, pointers must be declared restricted inside of parallel regions. This means that pointers which are not declared or cast as restricted in main() must be cast as restricted when passed as input arguments to routines containing a parallel region. In Fortran, array arguments are assumed not to alias one another, so these restrict-related considerations do not arise.

Summary

OpenACC is a relatively recent open standard for acceleration directives which is supported by several compilers, including, perhaps most notably, the PGI compilers.

Accelerating code with OpenACC is a fairly quick route to speedups on the GPU, without needing to write CUDA kernels in C or Fortran, thereby removing the need to refactor potentially numerous compute-intensive portions of a large software application. By making an easy path to acceleration accessible, OpenACC adds tremendous value to the CUDA API. The standard is still relatively young, with the stable 2.0 release appearing in June 2013.

If you have an application and would like to get started with accelerating it with OpenACC or CUDA, you may want to try a free test drive on Microway’s GPU Test Cluster. On our GPU servers, you can test your applications on the Tesla K40, K80, or the new M40 GPU specialized for Deep Learning applications. We offer a wide range of GPU solutions.


Keras and Theano Deep Learning Frameworks
https://www.microway.com/hpc-tech-tips/keras-theano-deep-learning-frameworks/
Mon, 05 Oct 2015 15:15:52 +0000

Here we will explore how to use the Theano and Keras Python frameworks for designing neural networks in order to accomplish specific classification tasks. In the process, we will see how Keras offers a great amount of leverage and flexibility in designing neural nets. In particular, we will examine two active areas of research: classification of textual and image data.

The Theano Framework

Theano was originally developed as a symbolic math processor at the University of Montreal, for the purpose of performing symbolic differentiation or integration of complicated non-linear functions. The University of Montreal offers many learning resources for Theano and for related frameworks such as Keras, Blocks, and Fuel. The reader is encouraged to explore their resources and summer school 2015 course content.

The symbolic math processing feature of Theano gives it a similarity to Mathematica or Maple. However, the Theano interface is lower-level, since Python is a scripted programming language. In Theano, functions are represented symbolically, but they can also be evaluated at coordinates, producing numerical values. Theano can be used for solving problems outside the realm of neural networks, such as logistic regression. Because Theano computes the gradient symbolically, it removes the chance of errors arising from determining an analytic form of the gradient by hand and then translating it into code. Since Theano makes use of GPUs for computation in general, other types of machine learning approaches, outside the scope of neural networks, can also realize performance gains in Theano. Any CUDA-capable GPU will work with Theano.

Method of Backpropagation

Theano was not designed for the goal of building neural networks. Rather, it was widely adopted by neural network and machine learning researchers as a useful development environment for computing the gradients of an error function with respect to the weights of a network. The calculation of these gradients is important for the neural network training method called backpropagation. The method is so-called because of the way the chain rule of differentiation applies at each layer of the network, from the output layer, chaining backward (or "propagating" backward) toward the input layer. Strictly speaking, there are no signals propagating backward in this picture. The term propagation is applied here in the sense of how the chain rule of differentiation is performed in successive steps for each layer, progressing (with no actual movement) from the output to the input. I mention this here to clear up any confusion, since the network is an abstract mathematical model of a biological neural network, where electrical signals do actually propagate forward, along axons, to reach neurons positioned further forward.

Building Neural Networks with the Theano and Keras Frameworks: Exploring the MNIST and IMDB Datasets with Feedforward Networks
Keras and Theano Deep Learning Frameworks are first used to compute sentiment from a movie review data set and then classify digits from the MNIST dataset

Gradient Instability Problem

Neural network gradients can exhibit instability, which poses a challenge to network design. If a network is too deep and the weights are too small, signals will become attenuated as they move deeper into the network. The same is true of the gradient when it is calculated from the output layer and then chained into successive layers toward the input. This can make it difficult to effectively train neurons which are positioned more than several layers back toward the input. After the gradient is computed through several layers, the weights become multiplied into the gradient expression. The gradient can then become attenuated by the weights, if they are too small. This immediately leads to corrective terms becoming too small for neurons in the deeper layers, thereby making network training by backpropagation ineffective. In other words, the training will not be able to reach far enough backward into a network. This problem is especially pronounced in recurrent neural networks (RNNs), where a network's output feeds back to its input, increasing the network's depth.

The opposite of the attenuation scenario can also occur: if the weights are too large, a signal will grow exponentially and cause the network to be unstable, with corrective terms growing without bound. So how does Theano help to address these problems? If a network's weights are too small or too large, backpropagation will not work. There is a way to balance the weights connecting the network layers such that they do not grow too large or small. This method is called L2 regularization, and involves adding the 2-norms of the weight matrices to the cost function. L2 regularization also helps to reduce overfitting to data. This, however, makes the cost function more complicated.
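In symbols, the regularized cost takes the form C = C₀ + (λ/2)·Σ‖W‖², where C₀ is the unregularized cost, the sum runs over the network's weight matrices W, and λ is a tunable regularization strength (the notation here is illustrative; in Keras this term is added through the l2 regularizer used later in this post).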

Successive chaining of nonlinear activation functions, such as logistic functions, hyperbolic tangents (tanh), or rectified linear units (ReLUs), along with L2 regularization, leads to a fairly complicated cost function. Theano calculates the error gradient symbolically and precisely, despite the complexity, with respect to each weight value, yielding an analytic form which is then evaluated numerically. This precision helps to eliminate the numerical errors that arise when the gradient is instead approximated numerically. Numerical errors accumulate and worsen as the gradient is computed deeper into a network, and since each layer is connected by a non-linear function, the effect of these errors worsens. Theano produces an analytic function at every layer in a network, eliminating the accumulation of numerical errors.

The Advantage of using Theano for Developing Neural Networks

Theano makes the backpropagation training method more effective by producing more accurate corrections for each neuron, even if they are positioned far back toward the input layer. It cannot fix a badly balanced network, but it will make the training more effective for a correctly balanced network with self-balancing mechanisms. Modern networks, such as those trained to play computer Atari games, can only remember a short distance into the past [1].

Various approaches have been considered for the initial assignment of network weights. One method is the Xavier algorithm, which balances initial weight assignments such that attenuation or unstable signal growth does not occur initially in convolutional neural networks (CNNs) [2]. The weights in this method are assigned within a uniform distribution having bounding values determined by network properties. In recurrent networks, additional mechanisms must be introduced in order to prevent signal attenuation. Memory elements can be positioned in the network, where they effectively sustain signal strength at each stage of recurrence. Two memory models are the Long Short-Term Memory (LSTM) [3] and the Gated Recurrent Unit (GRU) [4]. The GRU is simpler in structure compared to the LSTM and has been demonstrated to perform better under certain circumstances. The LSTM model, however, has been shown to produce the best network performance given more training time and a certain constant initial bias parameter value.

The Keras Framework

Keras puts all of these neural network features and enhancements at the developer's fingertips. It is possible to define a training and testing set, and then train a multi-layered recurrent network with LSTM (or GRU), in roughly fifty to seventy lines of code. Keras is a high-level framework built on top of Theano. As a framework upon a framework, it provides a great amount of leverage. While Keras provides a high-level interface, it is still possible to program against the lower-level Theano framework within the same body of code.

Since we will be using an NVIDIA Tesla K80 GPU card, we want to examine a network which has sufficient complexity such that using a GPU provides some practical benefit. Simple models, or smaller components of larger networks, such as Perceptrons, Autoencoders, Restricted Boltzmann Machines, or Multi-layer Perceptrons (MLPs), do not contain enough neurons and connecting weights to require the use of GPUs. These smaller networks can instead be solved on a CPU within reasonable time. Larger networks, inspired by biological models, such as LeNet [5], AlexNet [6], GoogLeNet [7], and other deep network architectures, do require GPUs in order to decrease compute time to a practical range. Modern neural networks designed for image classification or Natural Language Processing (NLP) require a GPU [8].

A schematic of a perceptron having one hidden layer

Using Keras, doing 1D convolutions on text data, or 2D convolutions on image data, requires about the same amount of code. The framework makes these different sorts of network convolutions relatively easy. We will examine two different types of network models: one which uses 1D convolution on textual data, and another which uses 2D convolution on image data.

We will first train a neural network on textual data contained in an IMDB movie review data set[9]. The network will then be demonstrated on a test set in order to classify reviews and categorize each as positive or negative in sentiment. In our second use case, we will create a neural network image classifier, by defining a neural network similar to LeNet5. We will train this network on the MNIST data[10], and then use the trained network to classify a set of test images.

Building a Movie Review Sentiment Classifier using Keras and Theano Deep Learning Frameworks

This tutorial will assume that you have already set up a working Python environment and that you have installed CUDA, cuDNN, Theano, Keras, along with their associated Python dependencies. The process of setting up a development environment with Keras and Theano is beyond the scope of this tutorial, but very good resources exist which explain how to do this. Using python virtual environments can greatly facilitate the task of setting up a working environment on a multi-user cluster, and is recommended. Virtual environments also make this process easier for non-administrative users who require installation of their own python packages.

To begin, you should define some Python runtime parameters in your ~/.theanorc file.
[sourcecode language=”xml”]
[global]
floatX = float32
force_device = True
device = gpu0
mode = FAST_RUN
[nvcc]
fastmath = True
[cuda]
root = /path/to/cuda
[/sourcecode]
On the node where I will build the network, there are four GPUs, numbered 0 through 3, all on NVIDIA K80 cards. Here I have selected gpu0. If you would like to compare runtimes to the CPU, you can just set device = cpu. You will also need to set floatX to float32, along with your path to CUDA. Theano does not yet support float64 for GPU computation (it will soon), so float32 must, for now, be assigned to floatX. The environment variables PATH and LD_LIBRARY_PATH will need to be set to the correct respective directories.
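As a sketch of those environment settings (the CUDA install path below is a placeholder, matching the one used in the .theanorc above):

export PATH=/path/to/cuda/bin:$PATH
export LD_LIBRARY_PATH=/path/to/cuda/lib64:$LD_LIBRARY_PATH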

On computing clusters, you will first need to log onto a node having a GPU. Logging into GPU-enabled nodes can be done using an interactive session.

First, import NumPy and set the random number seed:
[sourcecode language=”python”]
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(2222) # use the same random number seed, if you want reproducibility
[/sourcecode]

Import other needed components, such as RMSprop, Sequential, layer types, etc.:
[sourcecode language=”python”]
from keras.preprocessing import sequence
from keras.optimizers import RMSprop
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.datasets import imdb
from keras.utils.dot_utils import Grapher
[/sourcecode]

Now some network parameters are defined:
[sourcecode language=”python”]
vocab_size = 5000      # number of features to use, i.e. the size of the vocabulary
paddedlength = 100     # the length to which each sentence is padded
batch_size = 32        # number of input batches processed per epoch
embedding_dims = 100   # number of dimensions into which each vocabulary index is embedded
num_filters = 250      # number of filters to apply/learn in the 1D convolutional layer
filter_length = 3      # linear length of each filter (this is 1D)
hidden_dims1 = 250     # number of output neurons for the first Dense layer
hidden_dims2 = 100     # number of output neurons for the second Dense layer
epochs = 5             # number of training epochs
[/sourcecode]

The IMDB data is loaded and stored into training, validation, and test sets:
[sourcecode language=”python”]
# split the X & Y data 80%/20% in terms of training and test data
#
(X_train, Y_train), (X_test, Y_test) = imdb.load_data(nb_words=vocab_size,test_split=0.2)
# each training and test sentence is padded to be paddedlength
#
X_train = sequence.pad_sequences(X_train, maxlen=paddedlength)
X_test = sequence.pad_sequences(X_test, maxlen=paddedlength)
[/sourcecode]

Next, the network is built by adding layers to the nn object, which is the data object for the neural network model:
[sourcecode language=”python”]
grapher = Grapher()
nn = Sequential()

# the vocab vectors for this dataset each begin as just an integer representing its frequency in the dataset as a whole
# the array of integers representing a sequence of words (or sentence) is transformed so that each word in the sequence
# is represented by fixed-size vectors, each having embedding_dims dimensions
#
nn.add(Embedding(vocab_size, embedding_dims))
nn.add(Dropout(0.5))

# we add a Convolution1D, which will learn num_filters (250) filters
# word group filters of size filter_length– here, it is 3
# the 1D convolution captures short multi-word features across the sentence
#
nn.add(Convolution1D(input_dim=embedding_dims, nb_filter=num_filters, filter_length=filter_length, border_mode="valid", activation="relu", subsample_length=1))
[/sourcecode]

A pooling layer is added to give the network some invariance to word vector position. Neural networks exhibit better performance by also training on the reverse of sentences, where the order of the words is reversed, but not the letters within the words [8]. Here, we train only on the original order of the sentence.

[sourcecode language=”python”]
# we use standard max pooling in 1D, halving the output of the previous layer
nn.add(MaxPooling1D(pool_length=2))

# We flatten the output of the convolutional layer
# this reduces each embedded word vector to one dimension
# this way, the output of this layer can be fully connected to a proceeding dense layer:
#
nn.add(Flatten())

# calculate the number of output neurons for this conv layer
# a "valid" type 1D convolution produces sequence_length - filter_length + 1 outputs
# then we divide by 2 for the maxpooling
# in the expression below, paddedlength - filter_length is divided by 1 first, to reflect the fact that
# the subsample_length was set to 1 as an input argument to Convolution1D()
#
output_size = num_filters * (((paddedlength - filter_length) / 1) + 1) / 2
[/sourcecode]
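With the parameter values defined earlier, this works out to output_size = 250 * (((100 - 3) / 1 + 1) / 2) = 250 * 49 = 12,250 outputs feeding the first fully connected layer.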

A fully connected layer is added with a ReLU activation function (rectified linear unit). ReLU has been demonstrated to improve the learning rate in convolutional neural networks [6]. The dropout method is used here in order to reduce overfitting to the data. When overfitting occurs, the network becomes too fit to particular details of the training samples, details which do not generalize well.

[sourcecode language=”python”]
# A Dense layer is added which is fully connected to the previous convolutional layer.
# The activation function on the output of this Dense layer is set to ReLU, in order to improve learning speed.
# The hidden_dims1 here is the number of output neurons for the Dense layer.
# output_size is the number of outputs from the previous convolutional layer, or inputs into each neuron of the Dense layer
#
nn.add(Dense(output_size, hidden_dims1))
nn.add(Dropout(0.25))
nn.add(Activation('relu'))

nn.add(Dense(hidden_dims1, hidden_dims2))
nn.add(Dropout(0.25))
nn.add(Activation('relu'))
[/sourcecode]

Finally, the network converges onto a single neuron, with a logistic activation. This neuron will, after training, provide the network's estimation of the sentiment of a text input. The network, after having been trained on a sentiment-labelled set of movie reviews, is now able to evaluate sentences not previously encountered as input, and then output its computational estimate of the sentiment. It is fascinating that sentences can be analyzed in this way using one-dimensional convolution, with fairly good results, as we will see below.

[sourcecode language=”python”]
# The output layer consists of a single neuron, which is fully connected from the previous layer.
# The output signal is then transformed with a sigmoid (logistic) function
nn.add(Dense(hidden_dims2, 1))
nn.add(Activation('sigmoid'))
# write the neural network model representation to a png image
grapher.plot(nn, 'nn_imdb.png')
[/sourcecode]

The cost function is defined here as binary cross entropy, and the Adam optimizer is used as the training method:
[sourcecode language=”python”]
nn.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
nn.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=epochs, show_accuracy=True, validation_data=(X_test, Y_test))
[/sourcecode]

Finally, the network will be tested on some test samples:
[sourcecode language=”python”]
# assess the network performance on unseen test data
results = nn.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)
print('score:', results[0])
print('accuracy:', results[1])
[/sourcecode]

Network Diagram for a 1D Convolutional Feedforward Network Designed for the Purpose of Evaluating IMDB Movie Review Sentiment as Positive or Negative

On a CPU, the calculation requires 7864 seconds (2 hr, 11 min, 4 sec), resulting in 92.1% accuracy on the training set and 84.2% accuracy on the validation set, with a validation loss of 38.5%, within four training epochs. Overfitting was observed to increase beyond the fourth epoch.

Using a Tesla K80 GPU, the calculation is completed in 112 seconds, yielding essentially the same accuracies. This reflects a speedup factor of 17.6 for the Tesla K80 GPU compared to the CPU, a huge performance boost. The speedup was observed to vary according to the network architecture. Overall, it was observed to be very good for this type of 1D convolutional network.

Neural Network Architecture | Hardware Configuration | Speedup Factor¹
1-D convolutional neural network, Adam optimizer | K80 GPU | 17.6
1-D convolutional neural network, Adam optimizer | CPU | 1

¹ Speedups are with respect to runtimes on a CPU for the respective neural network architecture.

Building an Image Classifier Using Keras and Theano Deep Learning Frameworks

Now we will turn to using Keras in order to define a neural network having an architecture similar to that of LeNet5, developed by Yann LeCun [11]. This network is a convolutional feedforward network, which was, like other convolutional neural networks, inspired by biological data taken from physiological experiments done on the cat visual cortex [12].

At the input of the network, there is a square 2-dimensional input layer, 28 pixels on a side. This input layer feeds into a convolutional layer having 32 filters, each of size 3x3, with ReLU activations on the outputs going to the next layer. The first convolutional layer then feeds into a second, identical convolutional layer. The second convolutional layer feeds into a Max Pooling layer, which has dropout and a flattened output. The flattened output is then fed into a fully connected (dense) layer, having ReLU activation, with dropout. The final layer is fully connected to the previous dense layer with a softmax activation.

In Keras, the code for this is:
[sourcecode language=”python”]
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(2222)# for reproducibility

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.regularizers import l2, activity_l2
from keras.utils.dot_utils import Grapher
[/sourcecode]

Defining network parameters:
[sourcecode language=”python”]
batch_size = 128
num_classes = 10
epochs = 12

# x and y dimensions of input images
shapex, shapey = 28, 28
# number of convolutional filters to use
num_filters = 32
# side length of maxpooling square
num_pool = 2
# side length of convolution square
num_conv = 3
[/sourcecode]

The MNIST data is loaded and stored into training, validation, and test sets:
[sourcecode language=”python”]
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 1, shapex, shapey)
X_test = X_test.reshape(X_test.shape[0], 1, shapex, shapey)
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, num_classes)
Y_test = np_utils.to_categorical(Y_test, num_classes)
[/sourcecode]

Now instantiate the grapher and nn objects. Successive layers are added to nn:
[sourcecode language=”python”]
grapher = Grapher()
nn = Sequential()

# declare the first layers of convolution and pooling
#
nn.add(Convolution2D( num_filters, 1, num_conv, num_conv, border_mode='full' ))
nn.add(Activation('relu'))
nn.add(Convolution2D(num_filters, num_filters, num_conv, num_conv))
nn.add(Activation('relu'))
nn.add(MaxPooling2D( poolsize = (num_pool,num_pool) ))
nn.add(Dropout(0.5))

nn.add(Convolution2D( num_filters, num_filters, num_conv, num_conv, border_mode='full' ))
nn.add(Activation('relu'))
nn.add(Convolution2D(num_filters, num_filters, num_conv, num_conv))
nn.add(Activation('relu'))
nn.add(MaxPooling2D( poolsize = (num_pool,num_pool) ))
nn.add(Dropout(0.5))

nn.add(Flatten())

# three convolutional layers chained together — might work for larger images
# full, valid, maxpool then full, valid, maxpool
n_neurons = num_filters * (shapex/num_pool/num_pool) * (shapey/num_pool/num_pool)

print(n_neurons)
nn.add(Dense(n_neurons, 128))   # n_neurons connections feed into each neuron of the fully connected layer
# here, the fully connected layer is defined to have 128 neurons
# therefore all n_neurons inputs from the previous layer connect to each
# of these fully connected neurons (FC layer), and each neuron reduces its inputs to a single
# output signal. Here the activation function is given by ReLU.
nn.add(Activation('relu'))
nn.add(Dropout(0.5))   # dropout is then applied

# finally the 128 outputs of the previous FC layer are fully connected to num_classes of neurons, which
# are activated by a softmax function
nn.add( Dense(128, num_classes, W_regularizer=l2(0.01) ))
nn.add( Activation('softmax') )
# write the neural network model representation to a png image
grapher.plot(nn, 'nn_mnist.png')
[/sourcecode]
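For the parameter values used here, n_neurons = 32 * (28/2/2) * (28/2/2) = 32 * 7 * 7 = 1568 inputs into the 128-neuron fully connected layer (the two divisions by num_pool reflect the two 2x2 max pooling stages).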

The neural network object, nn, is compiled in Theano, and then trained on the validation set:
[sourcecode language=”python”]
nn.compile(loss='categorical_crossentropy', optimizer='adam')
nn.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=epochs, show_accuracy=True, verbose=1, validation_data=(X_test, Y_test))
[/sourcecode]
Note here that we are using the categorical cross entropy as our cost function to be minimized. The optimizer chosen here is Adam. Adam was proposed as an optimization method by Kingma and Ba [13].

Finally, test the neural network against an unseen test data set:
[sourcecode language=”python”]
results = nn.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)
print('score:', results[0])
print('accuracy:', results[1])
[/sourcecode]


Network Diagram for a 2D Convolutional Feedforward Network Designed for the Purpose of Evaluating Images of Digits from the MNIST Data Set

Using RMSprop, the training on 60,000 samples, and validation on 10,000 samples, required 898 seconds on a CPU, using only one layer of convolution and subsampling. After 12 epochs, the optimization using RMSprop ended with 97.93% accuracy. There is only one convolutional layer in this network, and the number of filter features was set to 26.

The number of filters was then increased to 32 in order to draw a comparison. The calculation takes ~18% more time to complete. With the larger number of filters, more convolutions are performed. The accuracy, however, remains essentially the same at 97.94%. The Adam optimizer was used here. Compared to the RMSprop optimization method, the Adam optimizer progresses through each epoch much faster, with a speedup factor of ~28x.

Other published results reach accuracies of over 99%. Here we are simplifying the network by removing a second convolutional and max pooling layer. When the Adam optimization algorithm is used, the accuracies are well over 97%. The table below shows various runtime comparisons using one- and two-layer convolutional neural networks on the MNIST digits data. An L2 regularization was added to the cost function. The Adam optimizer was used in computing all results in the table.

Benchmarking Results for Modified LeNet

Neural Network Architecture | Hardware Configuration | Speedup Factor²
LeNet (modified), layers: (1) 2xconv3x3 + 1xsubsampling2x2, L2 regularization, Adam optimizer | K80 GPU | 2.9
LeNet (modified), layers: (1) 2xconv3x3 + 1xsubsampling2x2, L2 regularization, Adam optimizer | CPU | 1
LeNet (modified), layers: (2) 2xconv3x3 + 1xsubsampling2x2, L2 regularization, Adam optimizer | K80 GPU | 33.6
LeNet (modified), layers: (2) 2xconv3x3 + 1xsubsampling2x2, L2 regularization, Adam optimizer | CPU | 1

² Speedups are with respect to runtimes on a CPU for the respective neural network architecture.

The networks in this tutorial were run on a Tesla K80 GPU. The neural networks built in this tutorial, however, can also be built using other NVIDIA GPU accelerators, such as the NVIDIA GeForce GTX Titan X, or the NVIDIA Quadro line (K6000, for example). Both of these GPUs are available in Microway's Deep Learning WhisperStation™, a quiet, desktop-sized GPU supercomputer pre-configured for extensive Deep Learning computation.

The NVIDIA GPU hardware on Microway's HPC cluster is available for “Test Driving”. If you are interested in testing your own deep learning software, you can request a GPU Test Drive. If you are interested in neural networks for image classification, we have more information in our blog on using NVIDIA DIGITS.

Speeding up Theano

A Python wrapper for a re-implementation of convnet, written in C++, is now available. It runs faster on GPUs, so if you are looking for a boost in speedup beyond what Theano can provide, you might want to look into pylearn2, which was also developed at the University of Montreal.

There is a method for running Theano on more than one GPU. This requires a bit of extra coding, and is described in Theano's GitHub page on using multiple GPUs.

References

1. Playing Atari with Deep Reinforcement Learning. arXiv preprint: arxiv.org/abs/1312.5602.
2. Glorot, X., Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf.
3. Hochreiter, S., Schmidhuber, J. Long Short-Term Memory. Neural Computation 9(8):1735-1780, 1997.
4. Cho, K., et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint: arXiv:1406.1078, 2014.
5. Le Cun, Y. and Bengio, Y. Convolutional Networks for Images, Speech, and Time-Series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
6. Krizhevsky, A., Sutskever, I., Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NIPS 2012).
7. Szegedy, C., Liu, W., Jia, Y., et al. Going Deeper with Convolutions. arXiv preprint: https://arxiv.org/abs/1409.4842.
8. Nogueira dos Santos, C., Gatti, M. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. Proc. of COLING 2014, Aug. 2014, pp. 69-78.
9. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
10. LeCun et al. (1999): The MNIST Dataset of Handwritten Digits (Images).
11. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278-2324.
12. Hubel, D.H., Wiesel, T.N. Receptive Fields of Single Neurones in the Cat's Striate Cortex. J. Physiol. (1959) 148, 574-591.
13. Kingma, D. and Ba, J. Adam: a Method for Stochastic Optimization. arXiv preprint: arxiv.org/pdf/1412.6980v8.pdf.
14. Cowan, M. Neural Network LaTeX package, ctan.org/tex-archive/graphics/pgf/contrib/neuralnetwork.

Special Thanks

We wish to extend special thanks to Dr. Andrew Maas, of Stanford University, for allowing use of the IMDB text data set. The IMDB data set used here was originally published in [9].

Caffe Deep Learning Tutorial using NVIDIA DIGITS on Tesla K80 & K40 GPUs
https://www.microway.com/hpc-tech-tips/caffe-deep-learning-using-nvidia-digits-tesla-gpus/
Thu, 17 Sep 2015 14:14:25 +0000

In this Caffe deep learning tutorial, we will show how to use DIGITS in order to train a classifier on a small image set.  Along the way, we’ll see how to adjust certain run-time parameters, such as the learning rate, number of training epochs, and others, in order to tweak and optimize the network’s performance.  Other DIGITS features will be introduced, such as starting a training run using the network weights derived from a previous training run, and using a completed classifier from the command line.

Caffe Deep Learning Framework

The Caffe Deep Learning framework has gained great popularity. It originated in the Berkeley Vision and Learning Center (BVLC) and has since attracted a number of community contributors.

NVIDIA maintains their own branch of Caffe – the latest version (0.13 at the time of writing) can be downloaded from NVIDIA’s github.

NVIDIA Deep Learning GPU Training System (DIGITS)

NVIDIA DIGITS is a production quality, artificial neural network image classifier available for free from NVIDIA. DIGITS provides an easy-to-use web interface for training and testing your classifiers, while using the underlying Caffe Deep Learning framework.

The latest version of NVIDIA DIGITS (2.1 at the time of writing) can be downloaded here.

A neural network distinguishes Land Rover from Jeep Cherokee

Hardware for NVIDIA DIGITS and Caffe Deep Learning Neural Networks

The hardware we will be using are two Tesla K80 GPU cards, on a single compute node, as well as a set of two Tesla K40 GPUs on a separate compute node. Each Tesla K80 card contains two Kepler GK210 chips, 24 GB of total shared GDDR5 memory, and 2,496 CUDA cores on each chip, for a total of 4,992 CUDA cores. The Tesla K40 cards, by comparison, each contain one GK110B chip, 12 GB of GDDR5 memory, and 2,880 CUDA cores.

Since the data associated with a trained neural network classifier is not heavy in data weight, a classifier could be easily deployed onto a mobile embedded system, and run, for example, by an NVIDIA Tegra processor. In many cases, however, neural network image classifiers are run on GPU-accelerated servers at a fixed location.

Runtimes will be compared for various configurations of these Tesla GPUs (see gpu benchmarks below). The main objectives of this tutorial, however, can be achieved using other NVIDIA GPU accelerators, such as the NVIDIA GeForce GTX Titan X, or the NVIDIA Quadro line (K6000, for example).  Both of these GPUs are available in Microway’s Deep Learning WhisperStation™, a quiet, desktop-sized GPU supercomputer pre-configured for extensive Deep Learning computation.  The NVIDIA GPU hardware on Microway’s HPC cluster is available for “Test Driving”. Readers are encouraged to request a GPU Test Drive.

Introduction to Deep Learning with DIGITS

To begin, let’s examine the creation of a small image dataset. The images were downloaded using a simple in-house bash shell script. Images were chosen to consist of two categories: one of recent Land Rover SUV models, and the other of recent Jeep Cherokee models – both comprised mostly of the 2014 or 2015 model years.

The process of building a deep learning artificial neural network image classifier for these two types of SUVs using NVIDIA DIGITS is described in detail below in a video tutorial.  As a simple proof of concept, only these two SUV types were included in the data set.  A larger data set could be easily constructed including an arbitrary number of vehicle types.  Building a high quality data set is somewhat of an art, where consideration must be given to:

  • sizes of features in relation to convolution filter sizes
  • having somewhat uniform aspect ratios, so that potentially distinguishing features do not get distorted too differently from image to image during the squash transformation of DIGITS
  • ensuring that ample sampling of images taken from various angles are present in the data set (side view, front, back, close-ups, etc.) – this will train the network to recognize more facets of the objects to be classified

The laborious task of planning and creating a quality image data set is an investment into the final performance quality of the deep learning network, so care and attention at this stage will yield better performance during classifier testing and deployment.  The original SUV image data set was expanded by window sub-sampling the original set of images, and then by also applying horizontal, vertical, and combined flips of the sub-sampled, as well as of the original images.

Neural Network Image Classifier Performance Considerations

Beforehand, some performance-oriented questions we can pose are:

  • Can the classifier distinguish SUV type from front, back, side, and top viewpoints?
  • To what level of accuracy can the classifier distinguish image categories?
  • What sort of discernable, high-level object features will be learned by the network?

(We recommend viewing the NVIDIA DIGITS Deep Learning Tutorial video with 720p HD)

GPU Benchmarks for Caffe deep learning on Tesla K40 and K80

A GoogLeNet neural network model computation was benchmarked on the same learning parameters and dataset for the hardware configurations shown in the table below. All other aspects of hardware were the same across these configurations.

Hardware Configuration | Speedup Factor¹
2 NVIDIA K80 GPU cards (4 GK210 chips) | 2.55
2 NVIDIA K40 GPU cards (2 GK110B chips) | 1.56
1 NVIDIA K40 GPU card (1 GK110B chip) | 1

¹ Compared against the runtime on a single Tesla K40 GPU

The runtimes in this table reflect 30 epochs of training the GoogLeNet model with a learning rate of 0.005. The batch size was set to 120, compared to the default of 24. This was done in order to use a greater percentage of GPU memory.

In this tutorial, we specified a local directory for DIGITS to construct the image set. If you instead provide text files for the training and validation images, you may want to ensure that the default setting of Shuffle lines is set to “Yes”. This is important if you downloaded your images sequentially, by category. If the lines from such files are not shuffled, then your validation set may not guide the training as well as it would if the image URLs are random in order.

Although NVIDIA DIGITS already supports Caffe deep learning, it will soon support the Torch and Theano frameworks, so check back with Microway’s blog for more information on exciting new developments and tips on how you can quickly get started on using Deep Learning in your research.

Further Reading on NVIDIA DIGITS Deep Learning and Neural Networks

1. Srivastava, et al., Journal of Machine Learning Research, 15 (2014), 1929-1958
2. NVIDIA devblog: Easy Multi-GPU Deep Learning with DIGITS 2 https://devblogs.nvidia.com/parallelforall/easy-multi-gpu-deep-learning-digits-2/
3. Szegedy, et al., Going Deeper with Convolutions, 2014, https://arxiv.org/abs/1409.4842
4. Krizhevsky, et al., ImageNet Classification with Deep Convolutional Neural Networks, ILSVRC-2010
5. LeCun, et al., Proc. of the IEEE, Nov. 1998, pgs. 1-46
6. Fukushima, K., Biol. Cybernetics, 36, 1980, pgs. 193-202

Common PCI-Express Myths for GPU Computing Users
https://www.microway.com/hpc-tech-tips/common-pci-express-myths-gpu-computing/
Mon, 04 May 2015 16:22:19 +0000
At Microway we design a lot of GPU computing systems. One of the strengths of GPU compute is the flexibility of the PCI-Express bus. Assuming the server has appropriate power and thermals, it enables us to attach GPUs with no special interface modifications. We can even swap in new GPUs under many circumstances. However, we encounter a lot of misinformation about PCI-Express and GPUs. Here are a number of myths about PCI-E:

1. PCI-Express is controlled through the chipset

No longer in modern Intel CPU-based platforms. Beginning with the Sandy Bridge CPU architecture in 2012 (Xeon E5 series CPUs, Xeon E3 series CPUs, Core i7-2xxx and newer), Intel integrated the PCI-Express controller into the CPU die itself. Bringing PCI-Express onto the CPU die came with a substantial latency benefit. This was a major change in platform design, and Intel coupled it with the addition of PCI-Express Gen3 support.

AMD Opteron 6300/4300 CPUs are still the exception: PCI-Express is delivered only through the AMD SR56xx chipset (PCI-E Gen2 only) for these platforms. They will slightly underperform competing Intel Xeon platforms when paired with PCI-Express Gen2 GPUs (Tesla K20/K20X) due to the latency differential. Opteron 6300/4300 CPUs will substantially underperform competing Xeon platforms when PCI-Express Gen3 GPUs are installed.

2. A host system with the newest Intel CPU architecture always delivers optimal performance

Not always true. Intel tends to launch its newest CPU architectures on its lowest-end CPU products first. Once they are proven in lower-end applications, the architecture migrates up to higher-end segments months or even years later. The problem? The lowest-end, newest-architecture CPUs can feature the least number of PCI-Express lanes per socket:

CPU | Core i7-5xxx? | Xeon E3-1200v3 / Core i7-47xx/48xx | Xeon E5-1600v3 / Core i7 58xx/59xx | Xeon E5-2400v2 | Xeon E5-2600v3
CPU Socket | Likely Socket 1150 | Socket 1150 | Socket 2011-3/R3 | Socket 1356 | Socket 2011-3/R3
CPU Core Architecture | Broadwell | Haswell | Haswell | Ivy Bridge | Haswell
Launch Date | 2015 | Q2 2013 | Q3 2014 | Q1 2014 | Q3 2014
PCI-Express Lanes Per Motherboard | Likely 16 Gen3 | 16 Gen3 | 40 Gen3 (Xeon), 28-40 Gen3 (Core i7) | 48 Gen3 (both CPUs populated) | 80 Gen3 (both CPUs populated)

Socket 1150 CPUs debuted in mid-2013 and were the only offering with the latest and greatest Haswell architecture for over a year; however, the CPUs available only delivered 16 PCI-Express Gen3 lanes per socket. It was tempting for some users to outfit a system with a modestly priced (and “latest”) Core i7-4700 series “Haswell” CPU during this period. However, this choice could have fundamentally hindered application performance. We’ll see this again when Intel debuts Broadwell for the same socket in 2015.

 

3. The least expensive host system possible is best when paired with multiple GPUs

Not necessarily and in many cases certainly not. It all comes down to how your application works, how long execution time is, and whether PCI Express transfers are happening throughout. An attempt at small cost savings could have big consequences. Here’s a few examples that counter this myth:

a. Applications running entirely on GPUs with many device-to-device transfers

Your application may be performing almost all of its work on the GPUs and orchestrating constant CUDA device-to-device transfers throughout its run. But a host motherboard and CPU with insufficient PCI-Express lanes may not allow full bandwidth transfers between the GPUs, and that could cripple your job performance.

Many inexpensive Socket 1150 motherboards (max 16 PCI-E lanes) have this issue: install 2 GPUs into what appear as x16 slots, and both operate as x8 links electrically. The forced operation at x8 speeds means that a maximum of half the optimal bandwidth is available for your device-to-device transfers. A capable PCI-Express switch may change the game for your performance in this situation.
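For rough numbers (not from the original benchmark data): each PCI-Express Gen3 lane signals at 8 GT/s with 128b/130b encoding, so an x16 link offers roughly 15.8 GB/s of bandwidth per direction while an x8 link offers roughly 7.9 GB/s.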

b. Applications with extremely short execution time on each GPU

In this case, the data transfer may be the largest piece of total job execution time. If you purchase a low-end CPU without sufficient PCI-Express lanes (and bandwidth) to serve simultaneous transfers to/from all your GPUs, the contention will result in poor application performance.

c. Applications constantly streaming data into and out of the GPU

The classic example here is video/signals processing. If you have a constant stream of HD video or signals data being processed by the GPU in real-time, restricting the size of the pipe to your processing devices (GPUs) is a poor design decision.

I don’t know if any of the above fit me…

If you are unable to analyze your job, we do have some reluctant secondary recommendations. The least expensive CPU configuration providing enough lanes for PCI-Express x16 links to all your GPUs is in our experience the safest purchase. An inexpensive CPU SKU in a specific CPU/platform series (ex: no need to purchase an E5-2690v3 CPU vs. E5-2620v3) is fine if you don’t need fast CPU performance. There are very notable exceptions.

4. PCI-Express switches always mean poor application performance

This myth is very common. In reality, performance is highly application-dependent, and sometimes switches yield superior performance.

Where switching matters

There’s no question that PLX Switches have capacity constraints: 16 PCI-E lanes are nearly always driving 2-4 PCI-Express x16 devices. But PLX switching also has one critical advantage: it fools each device into believing it has a full x16 link present, and it will deliver all 16 lanes of bandwidth to a device if available upstream. 2-4 GPUs attached to a single PLX switch @ PCI-E x16 links will nearly always outperform 2-4 GPUs operating at PCI-E x8 speeds without one.

Furthermore, if you hide latency with staggered CUDA host-device transfers, the benefits of a denser compute platform (no MPI coding) could far outweigh the PCI-E bandwidth constraints. First, profile your code to learn more about it. Then optimize your transfers for the architecture.
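For the profiling step, nvprof (bundled with the CUDA toolkit) will show how much time is spent in host-device copies and whether those copies overlap with kernel execution. A minimal sketch, with your_app standing in for your own executable:

# Summary view: time spent in each kernel and in each class of CUDA memcpy
nvprof ./your_app

# Per-operation trace: start time, duration, size, and direction of every
# transfer, which makes it easy to see whether staggered copies truly overlap
nvprof --print-gpu-trace ./your_app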

Superior Performance in the Right Situation

In certain cases PLX switches deliver superior performance or additional features. A few examples:

a. In 2:1 configurations utilizing device-device transfers, full bandwidth is available between neighboring devices. AMBER is a great example of an application where this is of strong benefit.

[Figure: Microway Octoputer PCI-E block diagram]

b. Next, in applications leveraging GPU Direct RDMA, switches deliver superior performance. This feature enables a direct transfer between a GPU and another PCI-E device (typically an InfiniBand adapter).

[Figure: GPU Direct RDMA transfer path, courtesy of NVIDIA]

See this presentation for more information on this feature.

c. For multi-GPU configurations where maximum device-device bandwidth between pairs of GPUs at once is of paramount importance, 2 PCI-E switches off of a single socket are likely to offer higher performance vs. an unswitched dual socket configuration. This is due to the added latency and bandwidth constraint of a QPI-hop from CPU0 to CPU1 in a dual socket configuration. Our friends at Cirrascale have explored this bandwidth challenge skillfully.

d. For 4-GPU configurations where maximum device-device bandwidth between all GPUs at once is of paramount importance and host-device bandwidth is not, 1 switch off of a single socket may be even better.

[Figure: PCI-E configuration for 4 GPUs on a single switch]

4:1 designs with an appropriate switch offer full bandwidth device-device transfers for 48-96 total PCI-E lanes (including uplink lanes) on the same PCI-E tree. This is impossible with any switchless configuration, and it provides maximum bandwidth for P2P transfers.
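To verify these device-to-device numbers on a given platform, the CUDA samples also include a p2pBandwidthLatencyTest that reports bandwidth and latency for every GPU pair, with and without peer-to-peer enabled. A minimal sketch (again, the samples path is an assumption):

# Build and run the peer-to-peer bandwidth/latency matrix (path may differ)
cd ~/NVIDIA_CUDA-6.5_Samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest

If the measured pairwise bandwidth falls well short of the link's rating, the PCI-E topology (or a missing P2P path between sockets) is usually the reason.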

However, please don't assume you can find your ideal PCI-E switch configuration for sale today. Switches are embedded on motherboards, and new designs take years to reach the market. For example, we have yet to see many devices with the PEX 8796 switch come to market.

Switching is…complex

We're just starting to see data honestly assessing switched performance in the marketplace. What happens to a particular application's performance if you sacrifice host-device bandwidth for device-device bandwidth by pairing a CPU with weak PCI-E I/O capability with a healthy PCI-E switch? Is your total runtime faster? On which applications? How about a configuration that has no switch and simply restricts bandwidth? Does either save you much in total system cost?

Studies of ARM + GPU platform performance (switched and unswitched) and the availability of more platforms with single-socket x86 CPUs + PCI-E switches are starting to tell us more. We’re excited to see the data, but we treat these dilemmas very conservatively until someone can prove to us that restricted bandwidth to the host will result in superior performance for an application.

Concluding thoughts

No one said GPU computing was easy. Understanding your application's behavior during runs is critical to designing the proper system to run it. Use resource monitors, profilers, and any other tools you can to assist; we have a whole blog post series that may help you. Take a GPU Test Drive with Microway to verify your assumptions on real hardware.

Once you know how your application behaves, we encourage you to enlist an expert to help design your system. Complete information about your application's behavior ensures we can design the system that will perform best for you. As an end user, you receive a system that is ready to do useful work immediately after delivery, which ensures you get the most complete value out of your hardware purchase.

Finally, we have guidance for when you are in doubt or have no data: we recommend any Xeon E5-1600v3 or Xeon E5-2600v3 CPU, as they deliver the most PCI-E lanes per socket (40). It's one of the most robust configurations and keeps you out of trouble. Still, only comprehensive testing will determine the choice with the best price-performance for your workload. Understand these myths, test your code, let us guide you, and you will procure the best system for your needs!

The post Common PCI-Express Myths for GPU Computing Users appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/common-pci-express-myths-gpu-computing/feed/ 3
Introducing the NVIDIA Tesla K80 GPU Accelerator (Kepler GK210) https://www.microway.com/hpc-tech-tips/introducing-nvidia-tesla-k80-gpu-accelerator-kepler-gk210/ https://www.microway.com/hpc-tech-tips/introducing-nvidia-tesla-k80-gpu-accelerator-kepler-gk210/#comments Mon, 17 Nov 2014 14:00:46 +0000 http://https://www.microway.com/?p=4928 NVIDIA has once again raised the bar on GPU computing with the release of the new Tesla K80 GPU accelerator.  With up to 8.74 TFLOPS of single-precision performance with GPU Boost, the Tesla K80 has massive capability and leading density. Here are the important performance specifications: To achieve this performance, Tesla K80 is really two […]

The post Introducing the NVIDIA Tesla K80 GPU Accelerator (Kepler GK210) appeared first on Microway.

]]>
NVIDIA has once again raised the bar on GPU computing with the release of the new Tesla K80 GPU accelerator.  With up to 8.74 TFLOPS of single-precision performance with GPU Boost, the Tesla K80 has massive capability and leading density.

NVIDIA Tesla K80

Here are the important performance specifications:

  • Two GK210 chips on a single PCB
  • 4992 total SMX CUDA cores: 2496 on each chip!
  • Total of 24GB GDDR5 memory; aggregate memory bandwidth of 480GB/sec
  • 5.6 TFLOPS single precision, 1.87 TFLOPS double precision
  • 8.74 TFLOPS single precision, 2.91 TFLOPS double precision with GPU Boost
  • 300W TDP

To achieve this performance, Tesla K80 is really two GPUs in one. This Tesla K80 block diagram illustrates how each GK210 GPU has its own dedicated memory and how they communicate at x16 speeds with the PCIe bus using a PCIe switch:

Tesla K80 block diagram

In order to maintain a TDP rating of 300W, the default clock speed is 560MHz, rather than the 745MHz of Tesla K40. Both GPUs, however, have the same boost speed of 875MHz. NVIDIA has also evolved the GPU Boost feature substantially: GPU Boost is now applied dynamically. Rather than 3 manually selected levels, over 10 levels of boost are available for every application run, whenever thermals permit.
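If you would like to see how GPU Boost behaves on your own hardware, nvidia-smi can query the supported clocks and (with sufficient privileges) pin the application clocks. A minimal sketch is below; the 2505,875 values are only an example and must come from the SUPPORTED_CLOCKS query on your card:

# List the memory/graphics clock combinations the GPU supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Show the clocks currently in use while a job runs
nvidia-smi -q -d CLOCK

# Optionally pin application clocks (memory,graphics in MHz); values must
# come from the SUPPORTED_CLOCKS output above and root privileges are required
sudo nvidia-smi -ac 2505,875

# Return to the default dynamic boost behavior
sudo nvidia-smi -rac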

Tesla K80 is also remarkable for its density. Packaging two GPUs on a single PCB enables existing or slightly modified server designs to support more Tesla GPUs (2X the density). Balancing clock speeds and CUDA core count within the 300W TDP also delivers the best performance per watt of any Tesla GPU!

Pricing is approximately 30% more than the Tesla K40 GPU. With some applications achieving performance gains of up to 90% compared to K40, the K80 appears to be a decent price-performance bargain. Keep in mind, though, that your application performance will vary. As you can see from the graph below, Caffe and CHROMA show substantial improvements with K80, while other applications like GROMACS and CP2K will achieve more modest speedups over K40.

Tesla K80 vs. Tesla K40 application performance comparison chart

This GPU is available with passive cooling only, meaning that we can deliver the Tesla K80 in clusters and servers. A noisy tower/4U option is also available, should your environment tolerate it as a workstation.

As always the best way to determine what K80 can do for you is to try it yourself. We encourage you to sign up for a test drive and be one of the first to see how powerful this GPU can be with your code.

You can also contact us to discuss what platforms best suit your needs. From 1U servers with three Tesla K80s to 4U servers with a staggering eight Tesla K80s, we can help design the server or cluster that works best for you.

The post Introducing the NVIDIA Tesla K80 GPU Accelerator (Kepler GK210) appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/introducing-nvidia-tesla-k80-gpu-accelerator-kepler-gk210/feed/ 2
How to Benchmark GROMACS GPU Acceleration on HPC Clusters https://www.microway.com/hpc-tech-tips/benchmark-gromacs-gpu-acceleration-hpc-clusters/ https://www.microway.com/hpc-tech-tips/benchmark-gromacs-gpu-acceleration-hpc-clusters/#respond Tue, 21 Oct 2014 15:42:32 +0000 http://https://www.microway.com/?p=4676 We know that many of our readers are interested in seeing how molecular dynamics applications perform with GPUs, so we are continuing to highlight various packages. This time we will be looking at GROMACS, a well-established and free-to-use (under GNU GPL) application.  GROMACS is a popular choice for scientists interested in simulating molecular interaction. With NVIDIA […]

The post How to Benchmark GROMACS GPU Acceleration on HPC Clusters appeared first on Microway.

]]>
Cropped shot of a GROMACS adh simulation (visualized with VMD)

We know that many of our readers are interested in seeing how molecular dynamics applications perform with GPUs, so we are continuing to highlight various packages. This time we will be looking at GROMACS, a well-established and free-to-use (under GNU GPL) application.  GROMACS is a popular choice for scientists interested in simulating molecular interaction. With NVIDIA Tesla K40 GPUs, it’s common to see 2X and 3X speedups compared to the latest multi-core CPUs.

Logging on to the Test Drive Cluster

To obtain access, fill out this quick and easy form: sign up for a GPU Test Drive. Once you obtain approval, you’ll receive an email with a list of commands to help you get your benchmark running. For your convenience, you can also reference a more detailed step-by-step guide below.

To begin, log in to the Microway Test Drive cluster using SSH. Don't worry if you're unfamiliar with SSH – we include an instruction manual for logging in. SSH is built in on Linux and macOS; Windows users only need to install a single application (an SSH client).

Run GROMACS on CPUs and GPUs

This first step is very easy. Simply enter the GROMACS directory and run the default benchmark script which we have pre-written for you:

cd gromacs
sbatch run-gromacs-on-TeslaK40.sh

Remember that Linux is case sensitive!

Managing GROMACS Jobs on the Cluster

Our cluster uses SLURM for resource management. Keeping track of your job is easy using the squeue command. For real-time information on your job, run: watch squeue (hit CTRL+c to exit). Alternatively, you can tell the cluster to e-mail you when your job is finished by editing the GROMACS batch script file (although this must be done before submitting jobs with sbatch). Run:

nano run-gromacs-on-TeslaK40.sh

Within this file, add the following two lines to the #SBATCH section (specifying your own e-mail address):

#SBATCH --mail-user=yourname@example.com
#SBATCH --mail-type=END

If you would like to monitor the compute node which is running your job, examine the output of squeue and take note of which node your job is running on. Log into that node using SSH and then use the tools of your choice to monitor it. For example:

ssh node2
nvidia-smi
htop

(hit q to exit htop)
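Note that nvidia-smi prints a single snapshot. To refresh the view continuously while your job runs, you can wrap it in the standard watch utility:

# Refresh the GPU status display every second (hit CTRL+c to exit)
watch -n 1 nvidia-smi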

See the speedup of GPUs vs. CPUs

The results from our benchmark script will be placed in an output file called gromacs-K40.xxxx.output.log – below is a sample of the output running on CPUs:

=======================================================================
= Run CPU-only water scaling benchmark system (1536)
=======================================================================
               Core t (s)   Wall t (s)        (%)
       Time:     1434.957       71.763     1999.6
                 (ns/day)    (hour/ns)
Performance:        1.206       19.894

Just below it is the GPU-accelerated run (showing a ~2.8X speedup):

=======================================================================
= Run Tesla_K40m GPU-accelerated water scaling benchmark system (1536)
=======================================================================
               Core t (s)   Wall t (s)        (%)
       Time:      508.847       25.518     1994.0
                 (ns/day)    (hour/ns)
Performance:        3.393        7.074

Should you require more information on a particular run, it’s available in the benchmarks/water/ directory. If your job has any problems, the errors will be logged to the file gromacs-K40.xxxx.output.errors

The chart below demonstrates the performance improvements between a CPU-only GROMACS run (on two 10-core Ivy Bridge Intel Xeon CPUs) and a GPU-accelerated GROMACS run (on two NVIDIA Tesla K40 GPUs):

GROMACS Speedups on NVIDIA Tesla K40 GPUs

Benchmarking your GROMACS Inputs

If you're familiar with BASH, you can of course create your own batch script, but we recommend using the run-gromacs-your-files.sh file as a template when you want to run your own simulations. You can upload your input files or build them on the cluster. If you opt for the latter, you need to load the appropriate software packages by running:

module load cuda/6.5 gcc/4.8.3 openmpi-cuda/1.8.1 gromacs

Once your files are either created or uploaded, you’ll need to ensure that the batch script is referencing the correct input files. The relevant parts of the run-gromacs-your-files.sh file are:

echo  "=================================================================="
echo  "= Run CPU-only water scaling benchmark system (1536)"
echo  "=================================================================="

srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme 0 -resethway -noconfout -nb cpu -nsteps 10000 -pin on -v

and for execution on GPUs:

echo  "=================================================================="
echo  "= Run ${GPU_TYPE} GPU-accelerated benchmark"
echo  "=================================================================="

srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme  0 -resethway -noconfout -nsteps 1000 -pin on -v

Although you might not be familiar with all of the above GROMACS flags, you should hopefully recognize the .tpr file. This binary file contains the starting structure, topology, and simulation parameters (equilibration, temperature, pressure, and so on) that the grompp module has processed. The flags themselves are important for benchmarking and are explained below:

  • -npme 0: This flag sets the number of MPI ranks dedicated to the PME (long-range electrostatics) calculation. Setting it to 0 tells mdrun not to reserve separate PME ranks, so every rank handles both short-range and PME work, which is a simple and reproducible choice for benchmarking.
  • -resethway: As the name suggests, this flag acts as a timer reset. Halfway through the job, GROMACS will reset the counter so that any overhead from memory initialization or load balancing won't affect the benchmark score.
  • -noconfout: For when you want to once again reduce overhead, this flag tells GROMACS not to write the final confout.gro structure file.
  • -nsteps 1000: A flag that you're probably familiar with, this one lets you set the maximum number of integration steps. It's useful to change if you don't want to waste too much time waiting for your benchmark to finish.
  • -pin on: Finally, this flag sets core affinities, meaning that threads will remain locked to specific cores and won't jump around.
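Putting these pieces together, running your own input deck mostly means pointing the existing srun line at your own .tpr file. Below is a minimal sketch of a custom batch script, assuming you have already built topol.tpr with grompp and that the module names match those used on the Test Drive cluster; the job name, log file name, and rank/thread counts are illustrative placeholders you should adjust:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=gromacs-custom
#SBATCH --output=gromacs-custom.%j.output.log

# Load the same toolchain used by the pre-written benchmark scripts
module load cuda/6.5 gcc/4.8.3 openmpi-cuda/1.8.1 gromacs

# One MPI rank per GPU is a reasonable starting point; tune for your nodes
num_processes=2
num_threads_per_process=10

# Run your own simulation (topol.tpr built earlier with grompp)
srun --mpi=pmi2 -n $num_processes -c $num_threads_per_process mdrun_mpi -s topol.tpr -npme 0 -resethway -noconfout -nsteps 1000 -pin on -v

Submit it with sbatch and monitor it with squeue, exactly as with the pre-written benchmark scripts.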

If you’d like to visualize your results, you will need to initialize a graphical session on our cluster. You are welcome to contact us if you’re uncertain of this step. After you have access to an X-session, you can run VMD by typing the following:

module load vmd
vmd

Next Steps for GROMACS GPU Acceleration

As you can see, we’ve set up our Test Drive so that running GROMACS on a GPU cluster isn’t much more difficult than running it on your own workstation. Benchmarking CPU vs GPU performance is also very easy. If you’d like to learn more, contact one of our experts or sign up for a GPU Test Drive today!

Solvated alcohol dehydrogenase (ADH) protein in a rectangular box (134,000 atoms)

Citation for GROMACS:

https://www.gromacs.org/

Berendsen, H.J.C., van der Spoel, D. and van Drunen, R., GROMACS: A message-passing parallel molecular dynamics implementation, Comp. Phys. Comm. 91 (1995), 43-56.

Lindahl, E., Hess, B. and van der Spoel, D., GROMACS 3.0: A package for molecular simulation and trajectory analysis, J. Mol. Mod. 7 (2001) 306-317.

Featured Illustration:

Solvated alcohol dehydrogenase (ADH) protein in a rectangular box (134,000 atoms)
https://www.gromacs.org/topic/heterogeneous_parallelization.html

Citation for VMD:

Humphrey, W., Dalke, A. and Schulten, K., “VMD – Visual Molecular Dynamics” J. Molec. Graphics 1996, 14.1, 33-38
https://www.ks.uiuc.edu/Research/vmd/

The post How to Benchmark GROMACS GPU Acceleration on HPC Clusters appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/benchmark-gromacs-gpu-acceleration-hpc-clusters/feed/ 0
Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/ https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/#respond Fri, 17 Oct 2014 20:44:41 +0000 http://https://www.microway.com/?p=4891 MATLAB is a well-known and widely-used application – and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built-in for a couple years, with better support in each release. […]

The post Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs appeared first on Microway.

]]>
MATLAB solving a second order wave equation on Tesla GPUs

MATLAB is a well-known and widely-used application – and for good reason. It functions as a powerful, yet easy-to-use, platform for technical computing. With support for a variety of parallel execution methods, MATLAB also performs well. Support for running MATLAB on GPUs has been built-in for a couple years, with better support in each release. If you haven’t tried yet, take this opportunity to test MATLAB performance on GPUs. Microway’s GPU Test Drive makes the process quick and easy. As we’ll show in this post, you can expect to see 3X to 6X performance increases for many tasks (with 30X to 60X speedups on select workloads).

Access a Compute Node with GPU-accelerated MATLAB

Getting started with MATLAB on our GPU cluster is easy: complete this form to sign up for MATLAB GPU benchmarking. We will send you an e-mail with detailed instructions for logging in and starting up MATLAB. Once you’re in, all you need to do is click the MATLAB icon and the latest version of GPU-Accelerated MATLAB will pop up:
Mathworks MATLAB R2014b splashscreen

We use NoMachine to export the graphical sessions from our cluster to your local PC/laptop. This makes login extremely user-friendly, ensures your interactive session performs well and provides a built-in method for file transfers in and out of the GPU cluster. MATLAB is fairly well-known for performing sluggishly over standard Unix/Linux graphical sessions (e.g., X11 forwarding, VNC), but you’ll have no such issues here.

You’ll be dropped into a standard MATLAB workspace. A variety of parallelized demonstrations of GPU usage are included with MATLAB. Pick one and give it a try! You can type paralleldemo_gpu and then hit <TAB> to see the full list of options.

Main MATLAB R2014b window

Measure MATLAB GPU Speedups

Below we show the output from several of the built-in MATLAB parallel GPU demos. A few are text-only, but several include a graphical component or performance plot. The first example runs a quick test on memory transfer speeds and computational throughput. Results from both the GPU and the host (CPUs) are shown:

>> paralleldemo_gpu_benchmark
Using a Tesla K40m GPU.
Achieved peak send speed of 3.44069 GB/s
Achieved peak gather speed of 2.20036 GB/s
Achieved peak read+write speed on the GPU: 233.613 GB/s
Achieved peak read+write speed on the host: 12.9773 GB/s
Achieved peak calculation rates of 398.9 GFLOPS (host), 1345.8 GFLOPS (GPU)

Note that the host results will be impacted by the number of local workers available in the Parallel Computing Toolbox. Since version R2011b, the default has been limited to 12 threads/CPU cores. With the release of R2014a, Mathworks removed that limit. For these tests we changed the number of workers to 20 in the Parallel Preferences dialog box.

The next demo generates plots of the speedup of matrix multiplication on a single NVIDIA Tesla K40 GPU versus dual 10-core Xeon CPUs. Both single-precision and double-precision floating-point calculations were run.

GPU-Accelerated Stencil Operations

MATLAB also includes a couple of stencil operation demos that run on a GPU. These include both a "generic" implementation and an optimized implementation using GPU shared & texture memory. As shown below, with properly-optimized algorithms, MATLAB on GPUs can run more than 30 times faster than MATLAB on CPUs.

>> paralleldemo_gpu_mexstencil
Average time on the GPU: 1.119ms per generation
Average time of 0.038ms per generation (29.4x faster).
Average time of 0.019ms per generation (58.9x faster).
First version using gpuArray:  1.119ms per generation.
MEX with shared memory: 0.038ms per generation (29.4x faster).
MEX with texture memory: 0.019ms per generation (58.9x faster).

Running your own test of MATLAB GPU speedups

To see a list of other useful demos, take a look at the GPU-accelerated examples on Mathworks FileExchange. You’ll find a large number of useful demonstrations, including:

  • GPU acceleration for FFTs
  • Heat transfer equations
  • Navier-Stokes equations for incompressible fluids
  • Anisotropic Diffusion
  • Gradient Vector Flow (GVF) force field calculation
  • 3D linear and trilinear interpolation
  • more than 60 others

Also consider that hundreds of MATLAB's standard functions support GPU acceleration. Utilizing these capabilities is quite straightforward: your data must be loaded into a gpuArray. With this done, pass the gpuArray to any of MATLAB's standard functions and the operations will be carried out on the GPU!

MATLAB paramSweep demo

Will GPU acceleration speed up your research?

With our pre-configured GPU cluster, running MATLAB on high-performance GPUs is as easy as running it on your own workstation. Find out for yourself how much faster you’ll be able to work if you add GPUs to your toolbelt. Sign up for a GPU Test Drive today!


Featured Illustration:

“Solving 2nd Order Wave Equation on the GPU Using Spectral Methods” by Jiro Doke
Mathworks MATLAB Central

The post Benchmark MATLAB GPU Acceleration on NVIDIA Tesla K40 GPUs appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/benchmark-matlab-gpu-acceleration-nvidia-tesla-k40-gpus/feed/ 0