CUDA Archives - Microway
https://www.microway.com/tag/cuda/

NVIDIA Tesla M40 24GB GPU Accelerator (Maxwell GM200) Up Close
https://www.microway.com/hpc-tech-tips/nvidia-tesla-m40-24gb-gpu-accelerator-maxwell-gm200-close/
Fri, 01 Apr 2016 14:30:02 +0000

NVIDIA has announced a new version of their popular Tesla M40 GPU – one with 24GB of high-speed GDDR5 memory. The name hasn’t really changed – the new GPU is named NVIDIA Tesla M40 24GB. If you are curious about the original version with less memory, we have a detailed examination of the original M40 GPU.

As support for GPUs grows – particularly in the exploding fields of Machine Learning and Deep Learning – there has been an increasing need for large quantities of GPU memory. The Tesla M40 24GB provides the most memory available to date in a single-GPU Tesla card. The remaining specifications of the new M40 match those of the original, including 7 TFLOPS of single-precision floating point performance.

The Tesla M40 continues to be the only high-performance Tesla compute GPU based upon the “Maxwell” architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.

Maximum single-GPU memory and performance: Tesla M40 24GB GPU

Available in Microway NumberSmasher GPU Servers and GPU Clusters

Photo of the NVIDIA Tesla M40 24GB GPU Accelerator bottom edge

Specifications

  • 3072 CUDA GPU cores (GM200)
  • 7.0 TFLOPS single; 0.21 TFLOPS double-precision
  • 24GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • Dynamic GPU Boost for optimal clock speeds
  • Passive heatsink design for installation in qualified GPU servers

Technical Details

The nvidia-smi status report shown below reflects the capabilities of the new M40 24GB GPU:

[root@node4 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Fri May 20 15:35:26 2016
Driver Version                      : 361.28

Attached GPUs                       : 4
GPU 0000:84:00.0
    Product Name                    : Tesla M40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : xxxxxxxxxxxxx
    GPU UUID                        : GPU-dbacebc6-3878-d72d-ebe9-87fb50xxxxxx
    Minor Number                    : 3
    VBIOS Version                   : 84.00.56.00.03
    MultiGPU Board                  : No
    Board ID                        : 0xXXXX
    Inforom Version
        Image Version               : G600.xxxx.xx.xx
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x84
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0xXXXXXXXX
        Bus Id                      : 0000:84:00.0
        Sub System Id               : 0x117110DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 23039 MiB
        Used                        : 23009 MiB
        Free                        : 30 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 4 MiB
        Free                        : 32764 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 99 %
        Memory                      : 100 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 51 C
        GPU Shutdown Temp           : 92 C
        GPU Slowdown Temp           : 89 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 124.63 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 180.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1113 MHz
        SM                          : 1113 MHz
        Memory                      : 3004 MHz
        Video                       : 1025 MHz
    Applications Clocks
        Graphics                    : 1114 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 947 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 1114 MHz
        SM                          : 1114 MHz
        Memory                      : 3004 MHz
        Video                       : 1024 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

NVIDIA deviceQuery on Tesla M40 24GB

The output below, from the CUDA 7.5 SDK samples, shows the architecture and capabilities of the Tesla M40 24GB GPU accelerator.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla M40"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 23040 MBytes (24159059968 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla M40
Result = PASS
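
The same details can also be queried programmatically through the CUDA runtime API. The short sketch below is illustrative only (device 0 is assumed); it prints a few of the fields reported by deviceQuery above:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 is assumed */
    printf("%s: %d SMs, %.1f GB global memory, ECC %s\n",
           prop.name, prop.multiProcessorCount,
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
           prop.ECCEnabled ? "enabled" : "disabled");
    return 0;
}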

Additional Tesla M40 24GB Information

To learn more about the differences between the Tesla M40 24GB and other versions of the Tesla product line, please review our “Kepler” and “Maxwell” Tesla GPU knowledge center articles.

To learn more about GPU-accelerated servers and clusters which provide the Tesla M40 24GB, please see our NVIDIA GPU technology page. Although we are able to provide the M40 in tower workstation systems, the design of the heatsink does not allow for quiet workstations.

This post was last updated on 2016-06-23

Accelerating Code with OpenACC and the NVIDIA Visual Profiler
https://www.microway.com/hpc-tech-tips/accelerating-code-with-openacc-and-nvidia-visual-profiler/
Mon, 14 Mar 2016 15:00:48 +0000

Comprised of a set of compiler directives, OpenACC was created to accelerate code using the many streaming multiprocessors (SM) present on a GPU. Similar to how OpenMP is used for accelerating code on multicore CPUs, OpenACC can accelerate code on GPUs. But OpenACC offers more, as it is compatible with multiple architectures and devices, including multicore x86 CPUs and NVIDIA GPUs.

Here we will examine some fundamentals of OpenACC by accelerating a small program consisting of iterations of simple matrix multiplication. Along the way, we will see how to use the NVIDIA Visual Profiler to identify parts of the code which call OpenACC compiler directives. Graphical timelines displayed by the NVIDIA Visual Profiler visually indicate where greater speedups can be achieved. For example, applications which perform excessive host to device data transfer (and vice versa), can be significantly improved by eliminating excess data transfer.

Industry Support for OpenACC

OpenACC is the result of a collaboration between PGI, Cray, and CAPS. It is an open specification which sets out compiler directives (sometimes called pragmas). The major compilers supporting OpenACC at inception came from PGI, Cray, and CAPS. The OpenACC Toolkit (which includes the PGI compilers) is available for download from NVIDIA.

The free and open source GNU GCC compiler also supports OpenACC, although this support may trail the commercial implementations.

Introduction to Accelerating Code with OpenACC

OpenACC facilitates the process of accelerating existing applications by requiring changes only to compute-intense sections of code, such as nested loops. A nested loop might go through many serial iterations on a CPU. By adding OpenACC directives, which look like specially-formatted comments, the loop can run in parallel to save significant amounts of runtime. Because OpenACC requires only the addition of compiler directives, usually along with small amounts of re-writing of code, it does not require extensive re-factoring of code. For many code bases, a few dozen effectively-placed compiler directives can achieve significant speedup (though it should be mentioned that most existing applications will likely require some amount of modification before they can be accelerated to near-maximum performance).
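
As a minimal sketch of what such a directive looks like (an illustrative vector-add routine, not one of the examples discussed below), a single pragma is enough to ask the compiler to offload a simple loop:

[sourcecode language="C"]
void vec_add(int n, const float *restrict a, const float *restrict b,
             float *restrict c)
{
  /* Ask the compiler to generate a GPU kernel for this loop; data
     movement for a, b, and c is handled implicitly by the compiler. */
  #pragma acc kernels
  for (int i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}
[/sourcecode]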

OpenACC is relatively new among the frameworks, software development kits, and programming interfaces available for accelerating code on GPUs. The 1.0 stable release of OpenACC was first made available in November 2011, and the 2.0 stable release followed in June 2013. OpenACC 3.0 is current as of November 2019.

Diagram of the Maxwell architecture's Streaming Multiprocessor (SMM)
Figure 1 The Maxwell Architecture Streaming Multiprocessor (SM)

By reading OpenACC directives, the compiler assembles CUDA kernels from each section of compute-intense code. Each CUDA kernel is a portion of code that will be sent to the many GPU Streaming Multiprocessor processing elements for parallel execution (see Figure 1).

The Compute Unified Device Architecture (CUDA) is an application programming interface (API), which was developed by NVIDIA for the C and Fortran languages. CUDA allows for parallelization of computationally-demanding applications. Those looking to use OpenACC do not need to know CUDA, but those looking for maximum performance usually need to use some direct CUDA calls. This is accomplished either by the programmer writing tasks as CUDA kernels, or by calling a CUDA ‘drop-in’ library. With these libraries, a developer invokes accelerated routines without having to write any CUDA kernels. Such CUDA ‘drop-in’ libraries include CUBLAS, CUFFT, CURAND, CUSPARSE, NPP, among others. The libraries mentioned here by name are included in the freely available CUDA toolkit.
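
To give a sense of the ‘drop-in’ library approach, the sketch below multiplies two square matrices with cuBLAS instead of a hand-written kernel. It is illustrative only: the function name is hypothetical, error checking is omitted, and cuBLAS assumes column-major storage.

[sourcecode language="C"]
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Compute C = A x B for n x n single-precision matrices stored column-major. */
void cublas_matmul(int n, const float *h_A, const float *h_B, float *h_C)
{
  float *d_A, *d_B, *d_C;
  const float alpha = 1.0f, beta = 0.0f;
  size_t bytes = (size_t)n * n * sizeof(float);

  cudaMalloc((void**)&d_A, bytes);
  cudaMalloc((void**)&d_B, bytes);
  cudaMalloc((void**)&d_C, bytes);
  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  /* C = alpha*A*B + beta*C, performed entirely on the GPU */
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
  cublasDestroy(handle);

  cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
[/sourcecode]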

While OpenACC makes it easier for scientists and engineers to accelerate large and widely-used code bases, it is sometimes only the first step. With CUDA, a more extensive process of code refactoring and acceleration can be undertaken. Greater speedups can be achieved using CUDA. OpenACC is therefore a relatively easy first step toward GPU acceleration. The second (optional), and more challenging step requires code refactoring with CUDA.

OpenACC Parallelization Reports

There are several tools available for reporting information on the parallel execution of an OpenACC application. Some of these tools run within the terminal and are text-based. The text reports can be generated by setting particular environment variables (more on this below), or by invoking compiler options when compiling at the command line. Text reports will provide detail on which portions of the code can be accelerated with kernels.

The NVIDIA Visual Profiler has a graphical interface which displays a timeline detailing when data transfers occur between the host and device. Kernel launches and runtimes are indicated with a colored horizontal bar. The graphical timeline and text reports in the terminal together provide important information which could indicate sections of code that are reducing performance. By locating inefficiencies in data transfers, for example, the runtime can be reduced by restructuring parallel regions. The example below illustrates a timeline report showing excessive data transfers between the system and the GPU (the host and the device).

Applying OpenACC to Accelerate Matrix Operations

Start with a Serial Code

To illustrate OpenACC usage, we will examine an application which performs common matrix operations. To begin, look at the serial version of the code (without OpenACC compiler directives) in Figure 2:

[sourcecode language="C"]
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "math.h"

void fillMatrix(int size, float **restrict A) {
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      A[i][j] = ((float)i);
    }
  }
}

float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}

float** MakeMatrix(int size, float **restrict arr) {
  int i;
  arr = (float **)malloc( sizeof(float *) * size);
  arr[0] = (float *)malloc( sizeof(float) * size * size);
  for (i=1; i<size; i++){
    arr[i] = (float *)(arr[i-1] + size);
  }
  return arr;
}

void showMatrix(int size, float **restrict arr) {
  int i, j;
  for (i=0; i<size; i++){
    for (j=0; j<size; j++){
      printf("arr[%d][%d]=%f \n",i,j,arr[i][j]);
    }
  }
}

void copyMatrix(float **restrict A, float **restrict B, int size){
  for (int i=0; i<size; ++i){
    for (int j=0; j<size; ++j){
      A[i][j] = B[i][j];
    }
  }
}

int main (int argc, char **argv) {
  int i, j, k;
  float **A, **B, **C;

  if (argc != 3) {
    fprintf(stderr,"Use: %s size nIter\n", argv[0]);
    return -1;
  }
  int size = atoi(argv[1]);
  int nIter = atoi(argv[2]);

  if (nIter <= 0) {
    fprintf(stderr,"%s: Invalid nIter (%d)\n", argv[0],nIter);
    return -1;
  }
  A = (float**)MakeMatrix(size, A);
  fillMatrix(size, A);
  B = (float**)MakeMatrix(size, B);
  fillMatrix(size, B);
  C = (float**)MakeMatrix(size, C);

  float startTime_tot = omp_get_wtime();
  for (int i=0; i<nIter; i++) {
    float startTime_iter = omp_get_wtime();
    C = MatrixMult(size, A, B, C);
    if (i%2==1) {
      //multiply A by B and assign back to A on even iterations
      copyMatrix(A, C, size);
    }
    else {
      //multiply A by B and assign back to B on odd iterations
      copyMatrix(B, C, size);
    }
    float endTime_iter = omp_get_wtime();
  }
  float endTime_tot = omp_get_wtime();
  printf("%s total runtime %8.5g\n", argv[0], (endTime_tot-startTime_tot));
  free(A); free(B); free(C);
  return 0;
}
[/sourcecode]

Figure 2 Be sure to include the stdio.h and stdlib.h header files. Without these includes, you may encounter segmentation faults during dynamic memory allocation for 2D arrays.

If the program is run in the NVIDIA Visual Profiler without any OpenACC directives, the console output will not include a timeline. Bear in mind that the runtime displayed in the console includes runtime overhead from the profiler itself. To get a more accurate measurement of runtime, run without the profiler at the command line. To compile the serial executable with the PGI compiler, run:

pgcc -fast -o ./matrix_ex_float ./matrix_ex_float.c

The serial runtime, for five iterations with 1000x1000 matrices, is 7.57 seconds. Using larger 3000x3000 matrices, with five iterations increases the serial runtime to 265.7 seconds.

Parallelizing Matrix Multiplication

The procedure-calling iterative loop within main() cannot, in this case, be parallelized because the value of matrix A depends on a series of sequence-dependent multiplications. This is the case with all sequence-dependent evolution of data, such as with time stepped iterations in molecular dynamics (MD). In an obvious sense, loops performing time evolution cannot be run in parallel, because the causality between discrete time steps would be lost. Another way of stating this is that loops with backward dependencies cannot be made parallel.

With the application presented here, the correct matrix product is dependent on the matrices being multiplied together in the correct order, since matrix multiplication does not commute, in general. If the loop was run in parallel, the outcome would be unpredictable, and very likely not what the programmer intended. For example, the correct output for our application, after three iterations, takes on the form AxBxAxBxB. This accounts for the iterative reassignments of A and B to intermediate forms of the product matrix, C. After four iterations, the sequence becomes AxBxAxBxBxAxBxB. The main point: if this loop were to run in parallel, this sequence would very likely be disrupted into some other sequence, through the uncontrolled process of which threads, representing loop iterations, execute before others on the GPU.

[sourcecode language="C"]
for (int i=0; i<nIter; i++) {
  float startTime_iter = omp_get_wtime();
  C = MatrixMult(size, A, B, C);
  if (i%2==1) {
    //multiply A by B and assign back to A on even iterations
    copyMatrix(A, C, size);
  }
  else {
    //multiply A by B and assign back to B on odd iterations
    copyMatrix(B, C, size);
  }
  float endTime_iter = omp_get_wtime();
}
[/sourcecode]

We’ve established that the loop in main() is non-parallelizable, having an implicit dependence on the order of execution of loop iterations. To achieve a speedup, one must examine the routine within the loop: MatrixMult()

[sourcecode language="C"]
float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
  {
    float tmp;
    for (int i=0; i<size; ++i) {
      for (int j=0; j<size; ++j) {
        tmp = 0.;
        for (int k=0; k<size; ++k) {
          tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
      }
    }
  }
  return C;
}
[/sourcecode]

Here, a kernels OpenACC directive has been placed around all three for loops. Three happens to be the maximum number of nested loops that can be parallelized within a single nested structure. Note the syntax for an OpenACC compiler directive in C takes the following form:

#pragma acc kernels [clauses]

In the code above, the kernels directive tells the compiler that it should try to convert this section of code into a CUDA kernel for parallel execution on the device. Instead of describing a long list of OpenACC directives here, an abbreviated list of commonly used directives appears below in Table 1 (see the references for complete API documentation):

Commonly used OpenACC directives

  • #pragma acc parallel: Start parallel execution on the device. The compiler will generate parallel code whether the result is correct or not.
  • #pragma acc kernels: Hint to the compiler that kernels may be generated for the defined region. The compiler may generate parallel code for the region if it determines that the region can be accelerated safely. Otherwise, it will output warnings and compile the region to run in serial.
  • #pragma acc data: Define contiguous data to be allocated on the device; establish a data region, minimizing excessive data transfer to/from the GPU.
  • #pragma acc loop: Define the type of parallelism to apply to the loop which follows.
  • #pragma acc region: Define a parallel region where the compiler will search for code segments to accelerate. The compiler will attempt to automatically parallelize whatever it can, and report during compilation exactly what portions of the parallel region have been accelerated.

Table 1 OpenACC Compiler Directives

Along with directives, there can be modifying clauses. In the example above, we are using the kernels directive with the pcopyin(list) and pcopyout(list) clauses. These are abbreviations for present_or_copyin(list), and present_or_copyout(list).

  • pcopy(list) tells the compiler to copy the data to the device, but only if data is not already present. Upon exiting from the parallel region, any data which is present will be copied to the host.
  • pcopyin(list) tells the compiler to copy to the device if the data is not already there.
  • pcopyout(list) tells the compiler to copy the data in list from the device back to the host at the end of the region; if the data is not already present on the device, it is first allocated there. The variables and arrays in list are those which will be copied.
  • present_or_copy(list) clauses avoid the reduced performance of excessive data copies, since the data needed may already be present.

After adding the kernels directive to MatrixMult(), compile and run the executable in the profiler. To compile a GPU-accelerated OpenACC executable with PGI, run:

pgcc -fast -acc -ta=nvidia -Minfo -o ./matrix_ex_float ./matrix_ex_float.c

The -Minfo flag is used to enable informational messages from the compiler. These messages are crucial for determining whether the compiler is able to apply the directives successfully, or whether there is some problem which could possibly be solved. For an example of a compiler message reporting a warning, see the section ‘Using a Linearized Array Instead of a 2D Array’ in the next OpenACC blog, entitled ‘More Tips on OpenACC Code Acceleration‘.

To run the executable in the NVIDIA Visual Profiler, run:

nvvp ./matrix_ex_float 1000 5

During execution, the 1000x1000 matrices – A and B – are created and multiplied together into a product. The command line argument 1000 specifies the dimensions of the square matrix and the argument 5 sets the number of iterations for the loop to run through. The NVIDIA Visual Profiler will display the timeline below:

Screenshot of NVIDIA Visual Profiler Timeline showing the test case where pcopyin and pcopyout are used in MatrixMult().
Figure 3 (click for expanded view)

Note that there are two Host to Device transfers of matrices A and B at the start of every iteration. Data transfers to the device, occurring after the first transfer, are excessive. In other words, every data copy after the first one is wasted time and lost performance.

Using the OpenACC data Directive to Eliminate Excess Data Transfer

Because the parallel region consists only of the nested loops in the MatrixMult() routine, entire copies of matrices A and B are transferred to the device every time this routine is called. Since the data only needs to be sent before the first iteration, it makes sense to expand the data region to encompass every call to MatrixMult(). The boundary of the data region must be pushed out to encompass the loop in main(). By placing a data directive just outside of this loop, as shown in Figure 4, the unnecessary copying of A and B to the device after the first iteration is eliminated:

[sourcecode language="C"]
#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size],C[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
{
  float startTime_tot = omp_get_wtime();
  for (int i=0; i<nIter; i++) {
    float startTime_iter = omp_get_wtime();
    C = MatrixMult(size, A, B, C);
    if (i%2==1) {
      //multiply A by B and assign back to A on even iterations
      copyMatrix(A, C, size);
    }
    else {
      //multiply A by B and assign back to B on odd iterations
      copyMatrix(B, C, size);
    }
    float endTime_iter = omp_get_wtime();
  }
  float endTime_tot = omp_get_wtime();
}
[/sourcecode]
Figure 4 A data region is established around the for loop in main()

After recompiling and re-running the executable in NVIDIA’s Visual Profiler nvvp, the timeline in Figure 5 shows that the unnecessary transfers are now gone:

Screenshot of NVIDIA Visual Profiler Timeline for test case where pcopyin and pcopyout are used in MatrixMult() and the data region is used in main().
Figure 5 (click for expanded view)

Now matrices A and B are copied to the device only once. Matrix C, the result, is copied to the Host at the end of the kernel region in MatrixMult() on every iteration. As shown in the table below, the runtime improvement is small but significant (1.9s vs. 1.5s). This reflects a 19.5% decrease in runtime; a speedup of 1.24.

Runtimes for Various OpenACC Methods (in seconds)

OpenACC method | Matrix size 1000x1000 | Matrix size 3000x3000
no acceleration | 7.569 | 265.69
#pragma acc kernels in MatrixMult() | 0.3540 | 1.917
#pragma acc kernels in MatrixMult() and #pragma acc data in main() | 0.0539 | 1.543

Table 2 Runtimes for five iterations of matrix multiplication (C=AxB).

As data sizes increase, the amount of work grows and the benefits of parallelization become incredibly clear. For the larger 3000x3000 matrices, a speedup factor of 172 is realized when both kernels and data directives are used.

Comparing Runtimes of OpenACC and OpenMP

Because OpenMP is also used as a method for parallelization of applications, it is useful to compare the two. To compare OpenACC with OpenMP, an OpenMP directive is added to the MatrixMult() routine:

[sourcecode language="C"]
float** MatrixMult(int size, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size])
  #pragma omp parallel for default(none) shared(A,B,C,size)
  for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
      float tmp = 0.;
      for (int k=0; k<size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}
[/sourcecode]

To compile the code with OpenMP parallelization, run:

pgcc -fast -mp ./matrix_ex_float.c -o ./matrix_ex_float_omp

The results were gathered on a Microway NumberSmasher server with dual 12-core Intel Xeon E5-2690v3 CPUs running at 2.6GHz. Runtimes were gathered when executing on 6, 12, and 24 of the CPU cores. This is achieved by setting the environment variable OMP_NUM_THREADS to 6, 12, and 24 respectively.

Number of Threads | Runtime (in seconds)
6 | 37.758
12 | 18.886
24 | 10.348

Table 3 Runtimes achieved with OpenMP using 3000x3000 matrices and 5 iterations

It is clear that OpenMP is able to provide parallelization and good speedups (nearly linear). However, the GPU accelerators are able to provide more compute power than the CPUs. The results in Table 4 demonstrate that OpenMP and OpenACC both substantially increase performance. By utilizing a single NVIDIA Tesla M40 GPU, OpenACC is able to run 6.71x faster than OpenMP.

Speedups Over Serial Runtime
serial | OpenMP speedup | OpenACC speedup
1x | 25.67x | 172x

Table 4 Relative Speedups of OpenACC and OpenMP for 3000x3000 matrices.

OpenACC Bears Similarity to OpenMP

As previously mentioned, OpenACC shares some commonality with OpenMP. Both are open standards, consisting of compiler directives for accelerating applications. Open Multi-Processing (OpenMP) was created for accelerating applications on multi-core CPUs, while OpenACC was primarily created for accelerating applications on GPUs (although OpenACC can also be used to accelerate code on other target devices, such as multi-core CPUs). Looking ahead, there is a growing consensus that the roles of OpenMP and OpenACC will become more and more alike.

OpenACC Acceleration for Specific GPU Devices

GPU Hardware Specifics

When a system has multiple GPU accelerators, a specific GPU can be selected either by using an OpenACC library procedure call, or by simply setting the environment variable CUDA_VISIBLE_DEVICES in the shell. For example, this would select GPUs #0 and #5:

export CUDA_VISIBLE_DEVICES=0,5

On Microway’s GPU Test Drive Cluster, some of the Compute Nodes have a mix of GPUs, including two Tesla M40 GPUs labelled as devices 0 and 5. To see what devices are available on your machine, run the deviceQuery command (which is included with the CUDA Toolkit samples). pgaccelinfo, which comes with the OpenACC Toolkit, reports similar information.

When an accelerated application is running, you can view the resource allocation on the device by executing the nvidia-smi utility. Memory usage and GPU usage, listed by application, are reported for all GPU devices in the system.
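
The OpenACC library procedure call mentioned above looks roughly like the sketch below (a hedged illustration; selecting device number 0 is an arbitrary choice here):

[sourcecode language="C"]
#include <openacc.h>

/* Select the first NVIDIA GPU before any parallel region executes. */
void select_gpu(void)
{
  int ndev = acc_get_num_devices(acc_device_nvidia);
  if (ndev > 0) {
    acc_set_device_num(0, acc_device_nvidia);
  }
}
[/sourcecode]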

Gang, Worker, and Vector Clauses

Although CUDA and OpenACC both use similar ideas, their terminology differs slightly. In CUDA, parallel execution is organized into grids, blocks (threadBlocks), and threads. In OpenACC, a gang is like a CUDA threadBlock, which executes on a processing element (PE). On a GPU device, the processing element (PE) is the streaming multiprocessor (SM). A number of OpenACC gangs (CUDA threadBlocks) is distributed across the available PEs.

An OpenACC worker is a group of vectors. The worker dimension extends across the height of a gang (threadBlock). Each vector is a CUDA thread. The dimension of vector is across the width of the threadBlock. Each worker consists of vector number of threads. Therefore, a worker corresponds to one CUDA warp only if vector takes on the value of 32; a worker does not have to correspond to a warp. For example, a worker can correspond to two warps if vector is 64. The significance of a warp is that all threads in a warp run concurrently.

Diagram of an NVIDIA CUDA Grid, which is made up of multiple Thread Blocks
Figure 6 A CUDA grid consists of blocks of threads (threadBlocks), which can be arranged in one or two dimensions.

Figure 6 illustrates a threadBlock, represented as part of a 2D grid containing multiple threadBlocks. In OpenACC, the grid consists of a number of gangs, which can extend into one or two dimensions. As depicted in Figure 7, the gangs extend into one dimension. It is possible, however, to arrange gangs into a two dimensional grid. Each gang, or threadBlock, in both figures 6 and 7 is comprised of a 2D block of threads. The number of vectors, workers, and gangs can be finely tuned for a parallel loop.
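
As a small illustrative sketch of this tuning (the routine and the clause values are hypothetical, and the exact mapping is ultimately up to the compiler), the clauses below request 100 gangs, each shaped as a 32-wide, 4-high threadBlock of 128 threads:

[sourcecode language="C"]
void scale(int n, float *restrict x)
{
  /* vector(32) sets the threadBlock width to one warp, worker(4) stacks
     four such rows per gang, and gang(100) requests 100 threadBlocks. */
  #pragma acc kernels loop gang(100) worker(4) vector(32) pcopy(x[0:n])
  for (int i = 0; i < n; ++i) {
    x[i] = 2.0f * x[i];
  }
}
[/sourcecode]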

Sometimes it is faster to have each thread block process several tiles of work, instead of having every block execute the kernel body only once. Discovering the optimal amount of work per block can require some trial and error. In OpenACC, this corresponds to the case where the number of gangs is smaller than the number of iterations of the loop layer being run in parallel across gangs, so each gang handles more than one iteration.

In CUDA, threads execute in groups of 32 at a time. Groups of 32 threads, as mentioned, are called warps, and execute concurrently. In Figure 8, the block width is set to 32 threads. This makes more threads execute concurrently, so the program runs faster.

Additional runtime output, with kernel runtimes, grid size, and block size:

Note: the kernel reports can only be generated by compiling with the time target, as shown below (read more about this in our next blog post). To compile with kernel reports, run:

pgcc -fast -acc -ta=nvidia,time -Minfo -o ./matrix_ex_float ./matrix_ex_float.c

Once the executable is compiled with the nvidia and time arguments, a kernel report will be generated during execution:

[john@node6 openacc_ex]$ ./matrix_ex_float 3000 5
./matrix_ex_float total runtime 1.3838

Accelerator Kernel Timing data
/home/john/MD_openmp/./matrix_ex_float.c
MatrixMult NVIDIA devicenum=0
time(us): 1,344,646
19: compute region reached 5 times
26: kernel launched 5 times
grid: [100x100] block: [32x32]
device time(us): total=1,344,646 max=269,096 min=268,685 avg=268,929
elapsed time(us): total=1,344,846 max=269,144 min=268,705 avg=268,969
19: data region reached 5 times
35: data region reached 5 times
/home/john/MD_openmp/./matrix_ex_float.c
main NVIDIA devicenum=0
time(us): 8,630
96: data region reached 1 time
31: data copyin transfers: 6
device time(us): total=5,842 max=1,355 min=204 avg=973
31: kernel launched 3 times
grid: [24] block: [128]
device time(us): total=19 max=7 min=6 avg=6
elapsed time(us): total=509 max=432 min=34 avg=169
128: data region reached 1 time
128: data copyout transfers: 3
device time(us): total=2,769 max=1,280 min=210 avg=923


Diagram of OpenACC gangs, workers and vectors
Figure 7 An OpenACC threadBlock has vertical dimension worker, and horizontal dimension vector. The grid consists of gang threadBlocks.

[sourcecode language="C"]
float** MatrixMult(int size, int nr, int nc, float **restrict A, float **restrict B,
                   float **restrict C) {
  #pragma acc kernels loop pcopyin(A[0:size][0:size],B[0:size][0:size]) \
    pcopyout(C[0:size][0:size]) gang(100), vector(32)
  for (int i = 0; i < size; ++i) {
    #pragma acc loop gang(100), vector(32)
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      #pragma acc loop reduction(+:tmp)
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
  return C;
}
[/sourcecode]
Figure 8 OpenACC code with gang and vector clauses. The fully accelerated OpenACC version of the C source code can be downloaded here.

The directive clause gang(100), vector(32), on the j loop, sets the block width to 32 threads (warp size), which makes parallel execution faster. Integer multiples of a warp size will also realize greater concurrency, but not usually beyond a width of 64. The same clause sets the grid width to 100. The directive clause on the outer i loop, gang(100), vector(32), sets the grid height to 100, and block height to 32. The block height specifies that the loop iterations are processed in SIMT groups of 32.

By adding the gang and vector clauses, as shown in Figure 8, the runtime is reduced to 1.3838 sec (a speedup of 1.12x over the best runtime in Table 2).

Targeting GPU Architectures with the Compiler

OpenACC is flexible in its support for GPUs, which means support for a variety of GPU types and capabilities. The target options in the table below illustrate how different compute capabilities, GPU architectures, and CUDA versions can be targeted.

Target | Compiler options
compute capability | -ta=nvidia[,cc10|cc11|cc12|cc13|cc20], -ta=tesla:cc35, -ta=nvidia,cc35
GPU architecture | -ta=tesla, -ta=nvidia
CUDA version | -ta=cuda7.5, -ta=tesla:cuda6.0
CPU | -ta=multicore

Table 5 Various GPU target architecture options for the OpenACC compiler
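
For example, combining the flags used earlier in this post with an explicit compute capability target (an illustrative command line, not one shown above):

pgcc -fast -acc -ta=tesla:cc35 -Minfo -o ./matrix_ex_float ./matrix_ex_float.c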

OpenACC for Fortran

Although we have focused here on using OpenACC in the C programming language, there is robust OpenACC support for the Fortran language. The syntax for compiler directives is only slightly different. In the C language, with dynamic memory allocation and pointers, pointers must be restricted inside of parallel regions. This means that pointers, if not declared as restricted in main(), or subsequently cast as restricted in main(), must be cast as restricted when passed as input arguments to routines containing a parallel region. Fortran handles memory differently and does not expose raw pointers in the same way, so these pointer-related considerations do not arise in Fortran code.

Summary

OpenACC is a relatively recent open standard for acceleration directives which is supported by several compilers, including, perhaps most notably, the PGI compilers.

Accelerating code with OpenACC is a fairly quick route to speedups on the GPU, without needing to write CUDA kernels in C or Fortran, thereby removing the need to refactor potentially numerous regions of compute-intense portions of a large software application. By making an easy path to acceleration accessible, OpenACC adds tremendous value to the CUDA API. OpenACC is a relatively new development API for acceleration, with the stable 2.0 release appearing in June 2013.

If you have an application and would like to get started with accelerating it with OpenACC or CUDA, you may want to try a free test drive on Microway’s GPU Test Cluster. On our GPU servers, you can test your applications on the Tesla K40, K80, or the new M40 GPU specialized for Deep Learning applications. We also offer a wide range of GPU solutions.


NVIDIA Tesla M40 12GB GPU Accelerator (Maxwell GM200) Up Close
https://www.microway.com/hpc-tech-tips/nvidia-tesla-m40-12gb-gpu-accelerator-maxwell-gm200-close/
Wed, 10 Feb 2016 21:23:47 +0000

With the release of Tesla M40, NVIDIA continues to diversify its professional compute GPU lineup. Designed specifically for Deep Learning applications, the M40 provides 7 TFLOPS of single-precision floating point performance and 12GB of high-speed GDDR5 memory. It works extremely well with the popular Deep Learning software frameworks and may also find its way into other industries that need single-precision accuracy.

The Tesla M40 is also notable for being the first Tesla GPU to be based upon NVIDIA’s “Maxwell” GPU architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.

Maximum single-GPU performance: Tesla M40 12GB GPU

Available in Microway NumberSmasher GPU Servers and GPU Clusters

Photo of the NVIDIA Tesla M40 12GB GPU Accelerator

Specifications

  • 3072 CUDA GPU cores (GM200)
  • 7.0 TFLOPS single; 0.21 TFLOPS double-precision
  • 12GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • Dynamic GPU Boost for optimal clock speeds
  • Passive heatsink design for installation in qualified GPU servers

As with all other modern Tesla GPUs, you should expect it to be able to max out the PCI-E 3.0 bus to achieve ~12GB/sec of data transfers between the system and each GPU:

[root@node6 ~]# gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla M40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12108.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12870.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			210331.7

Result = PASS
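
The pinned (page-locked) host buffers used above are what allow transfers to approach the PCI-E 3.0 limit; pageable memory typically tops out considerably lower. A rough sketch of how such a host-to-device measurement can be made with CUDA events is shown below (illustrative only, with a 32 MB transfer assumed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;   /* 32 MB, as in the test above */
    float *h_buf, *d_buf;
    float ms;
    cudaEvent_t start, stop;

    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  /* pinned */
    cudaMalloc((void**)&d_buf, bytes);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    printf("Host to Device: %.1f MB/s\n",
           (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf); cudaFreeHost(h_buf);
    return 0;
}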

Technical Details

Below is the full status reported by NVIDIA’s SMI tool. Memory error detection and correction (ECC) is supported on all components of the Tesla GPU. Notice that the M40 supports a wide range of operating frequencies:

[root@node6 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Feb 10 10:30:31 2016
Driver Version                      : 352.79

Attached GPUs                       : 4
GPU 0000:84:00.0
    Product Name                    : Tesla M40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0320116xxxxxx
    GPU UUID                        : GPU-dbacebc6-3878-d72d-ebe9-87fb50xxxxxx
    Minor Number                    : 3
    VBIOS Version                   : 84.00.48.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x8400
    Inforom Version
        Image Version               : G600.0202.02.01
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x84
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x17FD10DE
        Bus Id                      : 0000:84:00.0
        Sub System Id               : 0x117110DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 55 MiB
        Free                        : 11464 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 25 C
        GPU Shutdown Temp           : 92 C
        GPU Slowdown Temp           : 89 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 17.24 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 180.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 405 MHz
    Applications Clocks
        Graphics                    : 1114 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 947 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 1113 MHz
        SM                          : 1113 MHz
        Memory                      : 3004 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None
[root@node6 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Feb 10 10:31:16 2016
Driver Version                      : 352.79

Attached GPUs                       : 4
GPU 0000:84:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 1114 MHz
            Graphics                : 1088 MHz
            Graphics                : 1063 MHz
            Graphics                : 1038 MHz
            Graphics                : 1013 MHz
            Graphics                : 987 MHz
            Graphics                : 962 MHz
            Graphics                : 949 MHz
            Graphics                : 924 MHz
            Graphics                : 899 MHz
            Graphics                : 873 MHz
            Graphics                : 848 MHz
            Graphics                : 823 MHz
            Graphics                : 797 MHz
            Graphics                : 772 MHz
            Graphics                : 747 MHz
            Graphics                : 721 MHz
            Graphics                : 696 MHz
            Graphics                : 671 MHz
            Graphics                : 645 MHz
            Graphics                : 620 MHz
            Graphics                : 595 MHz
            Graphics                : 557 MHz
            Graphics                : 532 MHz
        Memory                      : 405 MHz
            Graphics                : 324 MHz

NVIDIA deviceQuery on Tesla M40

The output below, from the CUDA 7.5 SDK samples, shows additional details of the architecture and capabilities of the Tesla M40 GPU accelerators.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla M40"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 11520 MBytes (12079464448 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 132 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla M40
Result = PASS

Additional Tesla M40 12GB Information

To learn more about the differences between the Tesla M40 12GB and other versions of the Tesla product line, please review our “Kepler” and “Maxwell” Tesla GPU knowledge center articles.

To learn more about GPU-accelerated servers and clusters which provide the Tesla M40, please see our NVIDIA GPU technology page. Although we are able to provide the M40 in tower workstation systems, the design of the heatsink does not allow for quiet workstations.

Photo of the NVIDIA Tesla M40 12GB GPU Accelerator showing the PCI-Express connector

CUB in Action – some simple examples using the CUB template library
https://www.microway.com/hpc-tech-tips/cub-action-simple-examples-using-cub-template-library/
Wed, 18 Jun 2014 20:02:31 +0000

In my previous post, I presented a brief introduction to the CUB library of CUDA primitives written by Duane Merrill of NVIDIA. CUB provides a set of highly-configurable software components, which include warp- and block-level kernel components as well as device-wide primitives. This time around, we will actually look at performance figures for codes that utilize CUB primitives. We will also briefly compare the CUB-based codes to programs that use the analogous Thrust routines, both from a performance and programmability perspective. These comparisons utilize the CUB v1.3.1 and Thrust v1.7.0 releases and CUDA 6.0.

Before we proceed, I need to issue one disclaimer: the examples below were written after a limited amount of experimentation with the CUB library, and they do not necessarily represent the most optimized implementations. However, these examples do illustrate the flexibility of the API and they give an idea of the kind of performance that can be achieved using CUB with only modest programming effort.

Computing Euclidean vector norms – transformations and reductions

To begin with, let’s consider a variant of the reduction routine, which reads in a data array and computes the sum of the squares of the array elements. If we interpret the array as a vector, then the square root of the result of the reduction gives the Euclidean norm, or L2 norm, of the vector.
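
For reference, a plain host-side implementation of this computation (a simple baseline, not part of the CUB example that follows) is just a loop over the squared elements:

#include <math.h>

/* Euclidean (L2) norm of an n-element vector: the square root of the
   sum of the squares of its elements. */
float l2_norm(const float *x, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; ++i)
    sum += x[i] * x[i];
  return sqrtf(sum);
}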

A kernel implementation that uses CUB’s block-level primitives is shown below:

template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD,
        BlockReduceAlgorithm ALGORITHM, typename U> 
__global__ void TransformSumKernel(T* sum, T* input, U unaryOp)
{

  typedef BlockReduce<T, BLOCK_THREADS, ALGORITHM> BlockReduceT;

  __shared__ typename BlockReduceT::TempStorage  temp_storage;

  T data[ITEMS_PER_THREAD];

  LoadDirectBlockedVectorized(threadIdx.x,
                    input+blockIdx.x*BLOCK_THREADS*ITEMS_PER_THREAD,
                    data);

  for(int item=0; item<ITEMS_PER_THREAD; ++item)
    data[item] = unaryOp(data[item]);

  T block_sum = BlockReduceT(temp_storage).Sum(data);
  if(threadIdx.x==0) atomicAdd(sum,block_sum);
  return;
}

 

In this example, each thread block processes a tile of BLOCK_THREADS*ITEMS_PER_THREAD input elements, and for simplicity the size of the input array is assumed to be a multiple of the tile size. As its name suggests, the function LoadDirectBlockedVectorized uses vectorized loads to increase memory bandwidth utilization. (We could also have used the higher-level BlockLoad class to perform the global-memory loads. The BlockLoad class is templated on the load algorithm, making it trivial to switch between vectorized loads and other approaches.) Global memory loads are coalesced provided that ITEMS_PER_THREAD does not exceed the maximum allowed vector-load width (for example, four ints or floats). unaryOp is a unary C++ function object. In order to compute the sum of the squares of the elements of the input array, we choose unaryOp to be an instance of the Square<T> class, which is defined as follows:

template<typename T>
struct Square
{__host__ __device__ __forceinline__ T operator()(const T& a) const {return a*a;}
};
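
To make the kernel's usage concrete, here is a minimal, hypothetical host-side launch (this is not code from the original post). The block size, items-per-thread count, reduction algorithm, and zero-filled input are illustrative choices only; the requirement that the array size be a multiple of the tile size matches the assumption stated above.

#include <cub/cub.cuh>
using namespace cub;

// TransformSumKernel and Square<T> are the templates defined above
int main()
{
    const int BLOCK_THREADS    = 128;
    const int ITEMS_PER_THREAD = 4;
    const int TILE_SIZE        = BLOCK_THREADS * ITEMS_PER_THREAD;
    const int num_items        = 1 << 20;   // assumed to be a multiple of TILE_SIZE

    float *d_in, *d_sum;
    cudaMalloc(&d_in,  num_items * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemset(d_in,  0, num_items * sizeof(float));   // placeholder input data
    cudaMemset(d_sum, 0, sizeof(float));               // kernel accumulates into *d_sum via atomicAdd

    TransformSumKernel<float, BLOCK_THREADS, ITEMS_PER_THREAD,
                       BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>
        <<<num_items / TILE_SIZE, BLOCK_THREADS>>>(d_sum, d_in, Square<float>());

    float h_sum;
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    // sqrtf(h_sum) would then give the Euclidean norm of the input vector

    cudaFree(d_in);
    cudaFree(d_sum);
    return 0;
}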

 

CUB Block-level primitives and fancy iterators

CUB’s iterator classes facilitate more elegant and generic kernel implementations. The library contains a number of iterator classes, including iterators that implement textures and iterators that support various cache modifiers for memory accesses. However, for the moment, let’s just consider the TransformInputIterator class, which has the following declaration:

template < typename ValueType, typename ConversionOp, typename InputIterator,
typename Offset=ptrdiff_t >
class TransformInputIterator;

In this declaration, InputIterator is another iterator type, which could simply be a pointer to a native device type; ConversionOp is a unary operator class, such as the Square class defined above; and ValueType is the input value type. The precise meaning of these template arguments should become clear in the discussion below. Utilizing iterators, we can compute the Euclidean norm squared using the following reduction-kernel template:

template<typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD, 
         BlockReduceAlgorithm ALGORITHM, typename I>
__global__ void SumKernel(T* sum, I input_iter)
{

 typedef BlockReduce<T, BLOCK_THREADS, ALGORITHM> BlockReduceT;

 __shared__ typename BlockReduceT::TempStorage temp_storage;

  T data[ITEMS_PER_THREAD];

  LoadDirectStriped<BLOCK_THREADS>(threadIdx.x,
          input_iter + blockIdx.x*blockDim.x*ITEMS_PER_THREAD,
          data);

  T block_sum = BlockReduceT(temp_storage).Sum(data);
  if(threadIdx.x == 0) atomicAdd(sum,block_sum);
  return;
}

where input_iter is an instance of the TransformInputIterator class:

TransformInputIterator<T, Square<T>, T*> input_iter(input, Square<T>());

At present, CUB only supports vectorized loads for native device-type pointers, and the above kernel template uses the LoadDirectStriped<BLOCK_THREADS> routine to load data from global memory. Each thread loads ITEMS_PER_THREAD array elements separated by stride BLOCK_THREADS from global memory. BLOCK_THREADS is, of course, a multiple of the warp size, and data loads are coalesced.

Device-wide primitives

CUB contains a range of device-level primitives, including a device-wide reduction optimized for different architectures. Therefore, the only reason for implementing custom kernels for this particular calculation is that they might provide some performance advantage. The calculation can be implemented with CUB’s device-level API by using the input iterator defined above and calling

DeviceReduce::Sum(temp_storage,temp_storage_bytes,input_iter,output,num_elements);

The squared norm is then returned in the variable output. The same calculation can be performed using the Thrust library by calling

thrust::transform_reduce(d_in.begin(), d_in.end(), Square<float>(), static_cast<float>(0), thrust::plus<float>())

in the client application.
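
For reference, the complete CUB device-wide call sequence might look like the hypothetical sketch below (again, not code from the original benchmarks). It shows the two-call pattern used by CUB's device-level functions: the first call with a NULL workspace pointer only reports how much temporary storage is needed, and the second call performs the reduction.

#include <cub/cub.cuh>
using namespace cub;

// Square<T> is the functor defined earlier in this post
float euclideanNormSquared(float *d_in, int num_items)
{
    TransformInputIterator<float, Square<float>, float*> input_iter(d_in, Square<float>());

    float *d_out;
    cudaMalloc(&d_out, sizeof(float));

    void   *d_temp_storage    = NULL;
    size_t temp_storage_bytes = 0;

    // First call: no work is done, only temp_storage_bytes is computed
    DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, input_iter, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: the reduction itself
    DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, input_iter, d_out, num_items);

    float result;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp_storage);
    cudaFree(d_out);
    return result;
}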

CUB Performance

Fig. 1 shows performance data for codes that use the reduction kernels defined above, together with performance results for code using the DeviceReduce::Sum primitive. The corresponding performance figures for thrust::transform_reduce are also shown. These results were obtained on tests involving 32-bit floating-point data running on a Tesla K40c. Limited experimentation indicated that setting the thread-block size (BLOCK_THREADS) to 128 gave the best performance for both SumKernel and TransformSumKernel, and the block-wide reduction algorithm was set to BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY in both cases. The number of data items processed per thread (ITEMS_PER_THREAD) was set to 4 in runs involving TransformSumKernel, which uses vectorized loads. Choosing ITEMS_PER_THREAD=8 gave better performance in SumKernel. On all problem sizes, the CUB-based codes significantly outperform the code that utilizes the transform_reduce Thrust routine. We also see that the code that utilizes the DeviceReduce::Sum primitive is always competitive with code that utilizes our hand-tuned kernels, and, in fact, gives the best performance on all but the largest input arrays. This result is very encouraging since the use of CUB’s device-wide primitives in a client application requires very little programming effort, comparable to the effort required to utilize the corresponding Thrust primitives.

Plot of NVIDIA CUB performance on the reduce operation
Fig. 1 – Performance measurements for the calculation of the square of the Euclidean norm of a float32 vector on a Tesla K40c

A second CUB example – Transformed prefix sum

Let’s consider the slightly more contrived example of the calculation of the inclusive prefix sum of the squares of the elements of an integer array. Given the input and output arrays int a_in[N] and int a_out[N], this calculation corresponds to the following serial C++ code:

a_out[0] = a_in[0]*a_in[0];
for(int i=1; i<N; ++i){
    a_out[i] = a_in[i]*a_in[i] + a_out[i-1];
}

 

In analogy to the reduction example above, the calculation can also be implemented using CUB’s DeviceScan::InclusiveSum function together with the TransformInputIterator class. We tested the CUB-based approach and an implementation using Thrust on 32-bit integer data on a Tesla K40c. However, in our initial tests the CUB-based implementation showed lower-than-expected performance on large input arrays. These performance figures correspond to the green points in Fig. 2 below, where the input iterator passed to DeviceScan::InclusiveSum was of type TransformInputIterator<int, Square<int>, int*>. In fact, on the largest input arrays, better performance was obtained from the Thrust-based code (red points). Furthermore, we found that the code obtained by omitting the square operation and simply passing a pointer to the data in global memory (in this case, an int* pointer) to DeviceScan::InclusiveSum achieved approximately twice the performance on large problem sizes.

Plot of NVIDIA CUB scan performance
Fig. 2 – Performance of code to compute the inclusive prefix sum of the squared elements of an int32 array measured on a Tesla K40c

A quick look at the CUB source code revealed the reason for the disparity in performance. It turns out that on compute-capability 3.5 devices, DeviceScan::InclusiveSum is configured to load data through texture cache automatically – provided the input iterator is a pointer to a native device type. However, for more general iterator types, the use of texture cache has to be specified explicitly in the client application. Fortunately, this is easy to implement using the CacheModifiedInputIterator class defined in CUB. To utilize texture cache, input_iter, the iterator passed to the prefix-sum routine, should be defined as follows in the client code:

// loads using cached_iter go through texture cache on sm_35 devices
CacheModifiedInputIterator<LOAD_LDG,int> cached_iter(d_in);
// Use the following iterator to return the squares of the data elements
// loaded from global memory using cached_iter
TransformInputIterator<int,Square<int>,CacheModifiedInputIterator<LOAD_LDG,int> > input_iter(cached_iter, Square<int>());

 

The LOAD_LDG template argument specifies the use of texture cache on sm_35 hardware. The blue points in Fig. 2 correspond to performance figures obtained using the LOAD_LDG option, which, on large data sets, gives a 2x speedup over the code that does not utilize texture cache. CUB’s iterators support a number of other access modes, and although a comprehensive discussion of iterators is beyond the scope of this post, we encourage the reader to explore the iterator classes.
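
Continuing from the iterator definitions above, the device-wide scan call itself might look like the hypothetical sketch below (this is not the post's benchmark source). It assumes d_in, the int array wrapped by cached_iter, and an output array d_out of num_items elements have already been allocated on the device; the temporary-storage query-and-allocate step is the same two-call pattern shown in the reduction example.

// input_iter is the texture-cached transform iterator defined just above
int num_items = 1 << 24;   // illustrative problem size

void   *d_temp_storage    = NULL;
size_t temp_storage_bytes = 0;

// First call only computes temp_storage_bytes; the second call runs the scan
DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, input_iter, d_out, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, input_iter, d_out, num_items);
cudaFree(d_temp_storage);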

Summary

In this post, we have looked at variants of the reduction and prefix-sum algorithms that involve a transformation of the input data and how they might be implemented using the CUB library. In particular, we’ve considered implementations involving CUB’s device-level functions and iterator classes. These implementations compare favorably with code based on custom kernels and offer superior performance to the corresponding Thrust-based implementations. The examples also illustrate the flexibility and composability provided by CUB’s iterator classes.

The post CUB in Action – some simple examples using the CUB template library appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/cub-action-simple-examples-using-cub-template-library/feed/ 0
Introducing CUDA UnBound (CUB) https://www.microway.com/hpc-tech-tips/introducing-cuda-unbound-cub/ https://www.microway.com/hpc-tech-tips/introducing-cuda-unbound-cub/#comments Mon, 14 Apr 2014 13:21:38 +0000 http://https://www.microway.com/?p=3780 CUB – a configurable C++ template library of high-performance CUDA primitives Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture anounced in 2010, offered global memory bandwidth […]

The post Introducing CUDA UnBound (CUB) appeared first on Microway.

]]>
CUB – a configurable C++ template library of high-performance CUDA primitives

Each new generation of NVIDIA GPUs brings with it a dramatic increase in compute power, and the pace of development over the past several years has been rapid. The Tesla M2090, based on the Fermi GF110 architecture announced in 2010, offered global memory bandwidth of up to 177 Gigabytes per second and peak double-precision floating-point performance of 665 Gigaflops. By comparison, today’s Tesla K40 (Kepler GK110b architecture) has peak memory bandwidth of 288 Gigabytes per second and provides reported peak double-precision performance of over 1.4 Teraflops. However, the K40’s reign as the most advanced GPGPU hardware is coming to an end, and Kepler will shortly be superseded by Maxwell-class cards.

Actually achieving optimal performance on diverse GPU architectures can be challenging, since it relies on the implementation of carefully-crafted kernels that incorporate extensive knowledge of the underlying hardware and which take full advantage of relevant features of the CUDA programming model. This places a considerable burden on the CUDA developer seeking to port her application to a new generation of GPUs or looking to ensure performance across a range of architectures.

Fortunately, many CUDA applications are formulated in terms of a small set of primitives, such as parallel reduce, scan, or sort. Before attempting to handcraft these primitive operations ourselves, we should consider using one of the libraries of optimized primitives available to CUDA developers. Such libraries include Thrust and CUDPP, but in this post, we will focus on the CUB library developed by Duane Merrill of NVIDIA Research. CUB – the name derives from “CUDA Unbound” – provides generic high-performance primitives targeting multiple levels of application development. For example, CUB supports a set of device-wide primitives, which are called from the host, and in this regard, the functionality provided by CUB overlaps with Thrust to some degree. However, unlike Thrust, CUB also provides a set of kernel components that operate at the thread-block and thread-warp levels.

Thread-block reduction – a simple CUB example

A key feature of the CUB library, and one that makes CUB an attractive option for a wide range of performance-critical applications, is the fact that software components are not specialized for a particular GPU architecture or problem type. CUB is a C++ template library which utilizes policy-based design to provide highly-configurable kernel components that can be tuned for different GPU architectures and applications. To see what exactly this means, let’s consider how we might implement a reduction kernel using CUB. The library includes a templated BlockReduce class to perform reduction operations across a single thread block. It is declared as follows:

template <typename T, int BLOCK_SIZE, BlockReduceAlgorithm ALGORITHM> class BlockReduce;

T denotes the type of data on which the reduction operation is performed, BLOCK_SIZE is the number of threads in the thread block, and BlockReduceAlgorithm is an enumeration of different algorithms that can be used to perform the reduction. Note that the binary operation that specifies the type of reduction being performed (which, more often than not, involves computing the sum or the maximum or minimum of a data set) is not included in the class declaration. Using the BlockReduce class and atomic operations, a kernel to compute the maximum value in an array of integers can be implemented as follows:

1)  template<int BLOCK_SIZE, BlockReduceAlgorithm ALGORITHM>
2)  __global__ 
3)  void maxKernel(int* max, int* input)
4)  { 
5)   int id=blockIdx.x*blockDim.x + threadIdx.x; 
6)   typedef cub::BlockReduce<int,BLOCK_SIZE,ALGORITHM> BlockReduceT; 
7)
8)   // Allocate temporary storage in shared memory 
9)   __shared__ typename BlockReduceT::TempStorage temp_storage; 
10)
11)  int val=input[id]; 
12)  int block_max=BlockReduceT(temp_storage).Reduce(val,cub::Max());
13)
14)  // update global max value
15)  if(threadIdx.x == 0) atomicMax(max,block_max); 
16)
17)  return;  
18) }

Line 9 of the kernel above allocates temporary storage in device shared memory for use in the reduction algorithm. The quantity and layout of this storage depend on the choice of algorithm, the type of the data, the number of threads per block, and the target GPU architecture. The optimal shared-memory configuration, which provides sufficient temporary storage and avoids unnecessary bank conflicts, is determined at compile time using the template arguments selected in the client code. However, the shared-memory configuration details themselves are hidden from the client application.

On line 12 of our kernel, the BlockReduceT constructor (which takes as an argument the temporary storage allocated above) is called, generating a temporary object, which then invokes its Reduce method. The second argument in the Reduce method is an instance of CUB’s Max functor class. This class is defined such that if maxObject is an instance of the class Max, then maxObject(a,b) returns the maximum of a and b. Other binary operations supported in CUB include binary addition, the binary min operation, and variants of max and min that identify the position of the first occurrence of the maximum or minimum value in a data array. The result of a thread-block reduction is returned to the first thread of each block (which has threadIdx.x == 0). Finally, each thread block calls a single atomic operation to update the global maximum. Note that line 15 of the kernel assumes that the value pointed to by max is initialized to some minimum value before the kernel is launched.
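
As a quick illustration, host code for launching maxKernel might look like the hypothetical sketch below (not from the original post). The block size, algorithm choice, and zero-filled input are placeholders; the important detail is initializing *max to INT_MIN before the launch, as noted above.

#include <cub/cub.cuh>
#include <climits>

// maxKernel is the kernel defined above
int main()
{
    const int BLOCK_SIZE = 256;
    const int N          = 1 << 20;        // assumed to be a multiple of BLOCK_SIZE

    int *d_in, *d_max;
    cudaMalloc(&d_in,  N * sizeof(int));
    cudaMalloc(&d_max, sizeof(int));
    cudaMemset(d_in, 0, N * sizeof(int));  // placeholder input data

    int h_min = INT_MIN;                   // seed the global maximum with a minimum value
    cudaMemcpy(d_max, &h_min, sizeof(int), cudaMemcpyHostToDevice);

    maxKernel<BLOCK_SIZE, cub::BLOCK_REDUCE_WARP_REDUCTIONS>
        <<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_max, d_in);

    int h_max;
    cudaMemcpy(&h_max, d_max, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_max);
    return 0;
}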

Optimizing performance by limiting concurrency

Currently, CUB supports three different block-reduction algorithms, corresponding to the enumerators BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_REDUCE_RAKING, and BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY. The latter algorithm is specialized for commutative binary operations (such as the Max operation in our example, where the relative ordering of inputs does not affect the output), while the other algorithms also support non-commutative binary operators. The background to these algorithms is described in detail in a series of publications by Merrill and collaborators. As described in those papers, a core feature of the algorithms used in CUB is that they balance concurrency with serial computation in order to maximize performance on GPU hardware. In contrast, earlier algorithms targeting GPU architectures tended to involve high levels of concurrency, where the number of logical threads assigned to a problem scales with the problem size. However, in reduction and scan calculations, logical threads have to share data and synchronize with each other, and the cost of this inter-thread cooperation scales with the amount of concurrency in the algorithm (i.e., it depends on the number of threads). Performance can be improved by choosing a level of concurrency that ensures that the GPU hardware is fully utilized while minimizing communication and synchronization overheads.

To understand how the CUB routines utilize serial processing, consider the raking block-reduction algorithms mentioned above. In these algorithms, after an initial step, which we discuss below, each thread in the block writes data to shared memory. At this point, a single warp of threads “rakes” the shared-memory array, with each thread in the warp performing a serial reduction on some segment of the data in shared memory. At the end of this step, a single warp-width of data remains to be reduced, and one warp-level reduction completes the calculation. Further serialization can be achieved by having each thread in the thread block perform a serial partial reduction in registers at the beginning of the block-level reduction routine. To do this, we modify our reduction kernel as follows:

template<int VALS_PER_THREAD, int BLOCK_SIZE, 
         BlockReduceAlgorithm ALGORITHM> 
__global__ 
void maxKernel(int* max, int* input) 
{ 
  int id=blockIdx.x*blockDim.x + threadIdx.x; 
  typedef cub::BlockReduce<int,BLOCK_SIZE,ALGORITHM> BlockReduceT;

  // Allocate temporary storage in shared memory 
  __shared__ typename BlockReduceT::TempStorage temp_storage; 

  // Assign multiple values to each block thread 
  int val[VALS_PER_THREAD]; 

  // Code to initialize the val array has been omitted 
  int block_max=BlockReduceT(temp_storage).Reduce(val,cub::Max()); 

  // update global max value 
  if(threadIdx.x == 0) atomicMax(max,block_max); 

  return; 
}

Thus, each thread in the block reduces VALS_PER_THREAD items in registers as an initial step in the block reduction.

It’s worth noting that on devices of compute capability 3.0 and above CUB will utilize CUDA’s shuffle feature to perform efficient warp-level reductions, but it reverts to a shared-memory implementation on older hardware (for a recent description of reductions using shuffle, see this Parallel ForAll blog post).
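
For readers who haven't seen the shuffle intrinsics, the sketch below shows the general idiom (it is not CUB's internal implementation). __shfl_down is the Kepler-era intrinsic available in the CUDA 6.0 toolkit used here; on CUDA 9 and later it was replaced by __shfl_down_sync.

// Warp-level max reduction using the Kepler shuffle instruction: each step
// halves the number of values in flight without touching shared memory.
__device__ int warpReduceMax(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val = max(val, __shfl_down(val, offset));
    return val;   // lane 0 of the warp ends up holding the maximum
}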

CUDA Unbound

The block reduction example illustrates the extreme configurability of CUB. In essence, CUB provides an outline of the reduction algorithm, but leaves performance-critical details, such as the exact choice of algorithm and the degree of concurrency, unbound and in the hands of the user. These parameters can be tuned in order to maximize performance for a particular architecture and application. Since the parameter values are specified in the client application at compile time, this flexibility incurs no runtime performance penalty. The CUB library provides most benefit if integrated into a client-application auto-tuning procedure. In this case, on each new architecture and problem type, the client application would launch a series of short jobs to explore the CUB tuning space and determine the choice of template arguments that optimize performance.

Although we have focused solely on CUB’s support for block-wide reductions in this post, the library also includes highly-configurable scan and sort implementations. Among the other primitives implemented in CUB are block-wide data-exchange operations and parallel histogram calculations, and all of these implementations are flexible enough to ensure high performance in diverse applications running on a range of NVIDIA architectures.

Well, that completes our brief introduction to the CUB library. We’ll revisit CUB in our next post, when we’ll look at concrete performance figures for kernels utilizing different CUB routines on a variety of problems and different GPU architectures.

The post Introducing CUDA UnBound (CUB) appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/introducing-cuda-unbound-cub/feed/ 1
NVIDIA Tesla K40 “Atlas” GPU Accelerator (Kepler GK110b) Up Close https://www.microway.com/hpc-tech-tips/nvidia-tesla-k40-atlas-gpu-accelerator-kepler-gk110b-up-close/ https://www.microway.com/hpc-tech-tips/nvidia-tesla-k40-atlas-gpu-accelerator-kepler-gk110b-up-close/#comments Mon, 18 Nov 2013 14:01:59 +0000 http://https://www.microway.com/?p=3238 NVIDIA’s latest Tesla accelerator is without a doubt the most powerful GPU available. With almost 3,000 CUDA cores and 12GB GDDR5 memory, it wins in practically every* performance test you’ll see. As with the “Kepler” K20 GPUs, the Tesla K40 supports NVIDIA’s latest SMX, Dynamic Parallelism and Hyper-Q capabilities (CUDA compute capability 3.5). It also […]

The post NVIDIA Tesla K40 “Atlas” GPU Accelerator (Kepler GK110b) Up Close appeared first on Microway.

]]>
NVIDIA’s latest Tesla accelerator is without a doubt the most powerful GPU available. With almost 3,000 CUDA cores and 12GB GDDR5 memory, it wins in practically every* performance test you’ll see. As with the “Kepler” K20 GPUs, the Tesla K40 supports NVIDIA’s latest SMX, Dynamic Parallelism and Hyper-Q capabilities (CUDA compute capability 3.5). It also introduces professional-level GPU Boost capability to squeeze every bit of performance your code can pull from the GPU’s 235W power envelope.

Maximum GPU Memory and Compute Performance: Tesla K40 GPU Accelerator

Integrated in Microway NumberSmasher GPU Servers and GPU Clusters

Photograph of the new NVIDIA Tesla "Atlas" K40 "Kepler" GPU Accelerator

Specifications

  • 2880 CUDA GPU cores (GK110b)
  • 4.2 TFLOPS single; 1.4 TFLOPS double-precision
  • 12GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • GPU Boost increased clock speeds
  • Supports Dynamic Parallelism and HyperQ features
  • Active and Passive heatsinks available for installation in workstations and specially-designed GPU servers

The new GPU also leverages PCI-E 3.0 to achieve over 10 gigabytes per second transfers between the host (CPUs) and the devices (GPUs):

[root@node3 tests]# ./gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K40m
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10038.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10046.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     202665.0

Result = PASS
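
The bandwidthTest utility ships with the CUDA SDK, but the essence of a pinned transfer is easy to reproduce. The sketch below is a hypothetical stand-alone example (not the SDK source): it allocates page-locked host memory with cudaMallocHost and times a 32MB host-to-device copy with CUDA events.

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;      // 32MB, matching the transfer size above
    float *h_pinned, *d_data;
    cudaMallocHost(&h_pinned, bytes);           // page-locked (pinned) host allocation
    cudaMalloc(&d_data, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to Device: %.1f MB/s\n", (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f));

    cudaFreeHost(h_pinned);
    cudaFree(d_data);
    return 0;
}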

Technical Details

Here is the full list of capabilities reported by NVIDIA’s SMI tool. Memory error detection and correction (ECC) is supported on all components of the Tesla GPU. Notice that GPU Boost allows the top CUDA core clock frequency to be set to 745 MHz, 810 MHz or 875 MHz:

[root@node3 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Nov 11 21:42:13 2013
Driver Version                      : 325.15

Attached GPUs                       : 3
GPU 0000:02:00.0
    Product Name                    : Tesla K40m
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 128
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032391304xxxx
    GPU UUID                        : GPU-3964f3ae-5ee0-2afc-5d93-9f1edd2axxxx
    VBIOS Version                   : 80.80.24.00.06
    Inforom Version
        Image Version               : 2081.0202.01.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102310DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x097E10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    Memory Usage
        Total                       : 11519 MB
        Used                        : 69 MB
        Free                        : 11450 MB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        Gpu                         : 26 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 19.49 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 235.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 3004 MHz
    Compute Processes               : None
[root@node3 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Nov 11 21:42:45 2013
Driver Version                      : 325.15

Attached GPUs                       : 3
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 875 MHz
            Graphics                : 810 MHz
            Graphics                : 745 MHz
            Graphics                : 666 MHz
        Memory                      : 324 MHz
            Graphics                : 324 MHz

NVIDIA deviceQuery on Tesla K40

The output below, from the CUDA 5.5 SDK samples, shows additional details of the architecture and capabilities of the Tesla K40 GPU accelerators.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 12288 MBytes (12884705280 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Clock rate:                                876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Tesla K40m
Result = PASS

*Caveat on Tesla K40 performance boost: users with very specific, memory-intensive, single-precision floating point and/or integer math may be better served by the NVIDIA Tesla K10 GPU Accelerator with 8GB GDDR5 memory. Please speak with one of our GPU experts.

Additional Tesla K40 Information

To learn more about the differences between the Tesla K40 and other versions of the Tesla product line, please review our In-Depth Comparison of NVIDIA Tesla “Kepler” GPU Accelerators.

The post NVIDIA Tesla K40 “Atlas” GPU Accelerator (Kepler GK110b) Up Close appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/nvidia-tesla-k40-atlas-gpu-accelerator-kepler-gk110b-up-close/feed/ 1
CUDA Code Migration (Fermi to Kepler Architecture) on Tesla GPUs https://www.microway.com/hpc-tech-tips/cuda-code-migration-fermi-to-kepler-architecture-on-tesla-gpus/ https://www.microway.com/hpc-tech-tips/cuda-code-migration-fermi-to-kepler-architecture-on-tesla-gpus/#respond Sun, 27 Oct 2013 19:14:11 +0000 http://https://www.microway.com/?p=3266 The debut of NVIDIA’s Kepler architecture in 2012 marked a significant milestone in the evolution of general-purpose GPU computing. In particular, Kepler GK110 (compute capability 3.5) brought unrivaled compute power and introduced a number of new features to enhance GPU programmability. NVIDIA’s Tesla K20 and K20X accelerators are based on the Kepler GK110 architecture. The […]

The post CUDA Code Migration (Fermi to Kepler Architecture) on Tesla GPUs appeared first on Microway.

]]>
The debut of NVIDIA’s Kepler architecture in 2012 marked a significant milestone in the evolution of general-purpose GPU computing. In particular, Kepler GK110 (compute capability 3.5) brought unrivaled compute power and introduced a number of new features to enhance GPU programmability. NVIDIA’s Tesla K20 and K20X accelerators are based on the Kepler GK110 architecture. The higher-end K20X, which is used in the Titan and Blue Waters supercomputers, contains a massive 2,688 CUDA cores and achieves peak single-precision floating-point performance of 3.95 Tflops. In contrast, the Fermi-architecture Tesla M2090 (compute capability 2.0) has peak single-precision performance of 1.3 Tflops.

In addition to the increase in raw power, GK110 includes a number of features designed to facilitate efficient GPU utilization. Of these, Hyper-Q technology and support for dynamic parallelism have been particularly well publicized. Hyper-Q facilitates the concurrent execution of multiple kernels on a single device and also enables multiple CPU processes to simultaneously launch work on a single GPU. The dynamic parallelism feature means that kernels can be launched from the device, which greatly simplifies the implementation and improves the performance of divide-and-conquer algorithms, for example. Other Kepler additions include support for bindless textures, which offer greater flexibility and performance than texture references. Shared-memory bank widths have also increased from 4 to 8 bytes, with a corresponding increase in shared-memory bandwidth and a reduction in bank conflicts in many applications.
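
Opting into the wider banks, for example, is a one-line runtime call. The sketch below is illustrative rather than taken from the whitepaper; cudaDeviceSetSharedMemConfig and cudaDeviceGetSharedMemConfig are the relevant CUDA runtime functions.

#include <cuda_runtime.h>

int main()
{
    // Request Kepler's 8-byte shared-memory banks (the default is 4-byte banks);
    // this mainly benefits kernels that keep double-precision data in shared memory.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);      // confirm the setting took effect
    return (config == cudaSharedMemBankSizeEightByte) ? 0 : 1;
}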

Read Whitepaper

Migrating Your Code from Nvidia Tesla Fermi to Tesla Kepler K20X, with Examples from QUDA Lattice QCD Library

In this white paper, we provide an overview of the new features of Kepler GK110 and highlight the differences in functionality and performance between the new architecture and Fermi. We cite a number of examples drawn from disparate sources on the web, but also draw on our own experiences involving QUDA, an open-source library for performing Lattice QCD calculations on GPUs. In general, codes developed on Fermi ought to see substantial performance gains on GK110 without any modification. In the case of data-dependent and recursive algorithms, however, far greater gains may be achieved by exploiting dynamic parallelism. More generally, relatively minor code modifications, such as switching to bindless textures, or changing shared-memory accesses to take advantage of increased bandwidth, can also result in significant improvements in performance.

The post CUDA Code Migration (Fermi to Kepler Architecture) on Tesla GPUs appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/cuda-code-migration-fermi-to-kepler-architecture-on-tesla-gpus/feed/ 0
Avoiding GPU Memory Performance Bottlenecks https://www.microway.com/hpc-tech-tips/avoiding-gpu-memory-performance-bottlenecks/ https://www.microway.com/hpc-tech-tips/avoiding-gpu-memory-performance-bottlenecks/#respond Tue, 01 Oct 2013 03:36:22 +0000 http://https://www.microway.com/?p=2903 This post is Topic #3 (post 3) in our series Parallel Code: Maximizing your Performance Potential. Many applications contain algorithms which make use of multi-dimensional arrays (or matrices). For cases where threads need to index the higher dimensions of the array, strided accesses can’t really be avoided. In cases where strided access is actually avoidable, […]

The post Avoiding GPU Memory Performance Bottlenecks appeared first on Microway.

]]>
This post is Topic #3 (post 3) in our series Parallel Code: Maximizing your Performance Potential.

Many applications contain algorithms which make use of multi-dimensional arrays (or matrices). For cases where threads need to index the higher dimensions of the array, strided accesses can’t really be avoided. In cases where strided access is actually avoidable, every effort to avoid accesses with a stride greater than one should be taken.

So all this advice is great and all, but I’m sure you’re wondering “What actually is strided memory access?” The following example will illustrate this phenomenon and outline its effect on the effective bandwidth:

__global__ void strideExample (float *outputData, float *inputData, int stride=2)
{
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}

In the above code, threads within a warp access data words in memory with a stride of 2. This leads to a load of two L1 cache lines per warp. The actual accessing of the memory is shown below.

Diagram of NVIDIA Kepler Strided Memory Accesses

Accesses with a stride of 2 result in a 50% load/store efficiency (shown above), since half of the elements involved in the transaction are not used (becoming wasted bandwidth). As the stride increases, the effective bandwidth decreases until there is a single cache line for each of the threads in a warp (wow, that’s a lot of lost performance!).

Strided accesses can debilitate performance of even the most optimized algorithms. For large strides, the effective bandwidth is poor regardless of architecture or compute capability version. Intuitively, this makes sense: when concurrent threads simultaneously access data located in memory addresses that are far apart in physical memory, the accesses cannot be combined. For accesses with a stride greater than 1, you should avoid global memory entirely if you wish to realize any sort of performance gain from your application. In cases where you are stuck with strided memory accesses, you must ensure that as much data as possible is used from each cache line fetching operation.

So, if I haven’t made it clear enough: if you can avoid global memory, you should. In my personal experiences programming with CUDA, you really can’t go wrong if you intelligently make use of shared memory. With the exception of bank conflicts (discussed in Shared Memory Optimization), you don’t suffer the painful penalties that accompany global memory usage when you have non-sequential memory accesses, or misaligned accesses by warps in shared memory.


For those of us who are more advanced, if you can make use of registers without register pressure or read-after-write dependencies, you should. I briefly discussed register memory in previous posts, but feel that it warrants a bit more discussion here.

Shared memory allows communications between threads, which is very convenient. However, for those of us looking to squeeze out every last drop of performance from our applications, you really need to make use of registers when you can. Think of it this way – shared memory is kind of the “jack of all trades” memory. It’s suitable for “most” applications and operations, but for register operations (without read-after-write issues) there is no comparison. Typically, register access consumes zero extra clock cycles per instruction. While this lack of processing latency makes register memory very appealing, read-after-write dependencies have a latency of roughly 24 clock cycles. When such a dependency appears in a loop of code, this latency will add up very quickly.

The only other downside of register memory is called register pressure. Register pressure occurs when there are simply not enough registers for a given task. Although every multiprocessor in a GPU contains literally thousands of 32-bit registers, these get partitioned amongst concurrent threads. You can set the maximum number of registers the compiler may allocate per thread via the command line (for example, with nvcc’s --maxrregcount flag).

To summarize, when you’re developing your algorithms and applications you really need to be aware of how you’re making use of memory:

  • Global memory is great for beginner programmers, as it drastically simplifies coding for those who aren’t skilled or experienced in regards to CUDA programming. Performance will be lower.
  • If you don’t need to squeeze out every drop of performance, shared memory can take you to where you need to be. The benefits of thread-to-thread communication within a block make many algorithms easier to code and implement, making shared memory a very attractive option.
  • Register memory is the fastest, but a little more tricky. There are hard limits to what you can do with register memory, but if what your algorithm requires fits inside those confines, then definitely make use of registers.
  • Very specific types of applications can really benefit from using texture and local memory, but if you’re in the market for those types of memory, you probably wouldn’t be reading this blog in the first place.

The next portion of this blog will step away from the memory aspect of performance optimization and into optimizing configurations and the art of keeping all the multiprocessors on your device busy throughout the execution of your kernel.

The post Avoiding GPU Memory Performance Bottlenecks appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/avoiding-gpu-memory-performance-bottlenecks/feed/ 0
GPU Shared Memory Performance Optimization https://www.microway.com/hpc-tech-tips/gpu-shared-memory-performance-optimization/ https://www.microway.com/hpc-tech-tips/gpu-shared-memory-performance-optimization/#comments Fri, 27 Sep 2013 02:48:10 +0000 http://https://www.microway.com/hpc-tech-tips/?p=357 This post is Topic #3 (post 2) in our series Parallel Code: Maximizing your Performance Potential. In my previous post, I provided an introduction to the various types of memory available for use in a CUDA application. Now that you’re familiar with these types of memory, the more important topic can be addressed – accessing […]

The post GPU Shared Memory Performance Optimization appeared first on Microway.

]]>
This post is Topic #3 (post 2) in our series Parallel Code: Maximizing your Performance Potential.

In my previous post, I provided an introduction to the various types of memory available for use in a CUDA application. Now that you’re familiar with these types of memory, the more important topic can be addressed – accessing the memory.

Think for a moment: global memory is up to 150x slower than some of the other types of device memory available. If you could reduce the number of global memory accesses needed by your application, then you’d realize a significant performance increase (especially if your application performs the same operations in a loop or things of that nature). The easiest way to obtain this performance gain is to coalesce your memory accesses to global memory. The number of concurrent global memory accesses of the threads in a given warp is equal to the number of cache lines needed to service all of the threads of the warp. So how do you coalesce your accesses you ask? There are many ways.

The simplest way to coalesce your memory accesses is to have the N-th thread in a warp access the N-th word in a cache line. If the threads in a warp are accessing adjacent 4-byte words (float, for example), a single cache line (and therefore, a single coalesced transaction) will service that memory access. Even if some words of the cache line are not requested by any thread in the warp (e.g., several of the threads access the same word, or some of the threads don’t participate in the access), all data in the cache line is fetched anyways. This results in a single global memory access (see Figure 1).

Diagram of NVIDIA Kepler Aligned Memory Accesses
Figure 1: Aligned Memory Accesses

If sequential threads in a warp access sequential memory locations, but the memory locations are not aligned with the cache lines (overlapping), there will be two 128-byte (L1) cache lines requested. This results in 128 bytes of additional memory being fetched even though it is not needed (see the red blocks in Figure 2). Fortunately, memory allocated via cudaMalloc() is guaranteed to be aligned to at least 256 bytes. Choosing intelligent thread block sizes (typically multiples of the warp size) facilitates memory accesses by the warps that are aligned to cache lines. This means fewer memory accesses are needed. Let your mind wander for a moment as to what would happen to the memory locations that are accessed by the 2nd, 3rd, 4th, etc. thread blocks if the thread block size was not a multiple of warp size. Not good.

Diagram of NVIDIA Kepler Mis-Aligned Memory Accesses
Figure 2: Mis-Aligned Memory Accesses

So what happens if your memory accesses are misaligned? Let’s take a look. Below is a simple kernel that demonstrates aligned and misaligned accesses.

__global__ void misalignedCopy(float *outputData, float *inputData, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    outputData[xid] = inputData[xid];
}

In the code example above, data is copied from the array inputData to the array outputData. Both of these arrays exist in global memory. The kernel here is executed within a loop in host code that varies the offset between 0 and 32. Here, global memory accesses with 0 offset, or with offsets that are multiples of 32 words, result in a single cache line transaction. When the offset is not a multiple of 32 words, two L1 cache lines are loaded per warp. This results in roughly 80% of the memory throughput achieved compared to the case with no offsets.
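
The post does not show that host loop, but a hypothetical version might look like the sketch below; the array size, block size, and extra allocation slack are illustrative choices.

#include <cuda_runtime.h>
#include <cstdio>

// misalignedCopy is the kernel defined above
int main()
{
    const int N       = 1 << 24;
    const int threads = 256;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  (N + 64) * sizeof(float));    // slack keeps offset accesses in bounds
    cudaMalloc(&d_out, (N + 64) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int offset = 0; offset <= 32; ++offset) {
        cudaEventRecord(start);
        misalignedCopy<<<N / threads, threads>>>(d_out, d_in, offset);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("offset %2d: %.3f ms\n", offset, ms);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}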

Another technique, similar to coalescing, is known as striding. Strided memory accesses will be discussed in the next post.

Shared Memory Bank Conflicts

If your application is making use of shared memory, you’d expect to see increased performance compared to an implementation using only global memory. Because it is on-chip, shared memory has a much higher bandwidth and lower latency than global memory. But this speed increase requires that your application have no bank conflicts between threads.

In order to actually achieve the high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (also known as banks) that can be accessed simultaneously. This means any memory load/store of N memory addresses that spans N distinct memory banks can be serviced simultaneously (see Figure 3). In performance gain terms, this means that the memory exhibits an effective bandwidth that is N times as high as that of a single memory module.

Diagram of NVIDIA Kepler Shared Memory Banks Parallel Accesses
Figure 3: Simultaneous Accesses of Shared Memory

The problem, however, lies in situations where multiple addresses of a memory request map to the same memory bank. When this occurs (a bank conflict), the accesses are serialized, reducing the effective bandwidth. A memory request that has bank conflicts is split into as many separate conflict-free requests as necessary, which greatly reduces the performance of the application (by a factor that’s equal to the number of separate memory requests). As shown in Figure 4, serialized shared memory accesses can take much longer.

Diagram of NVIDIA Kepler Shared Memory Banks Serialized Accesses
Figure 4: Serialized Accesses of Shared Memory

The only exception is the case of shared memory broadcasts. These occur when all threads in a warp access the same location in shared memory. In this case, a bank conflict does not occur.
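
To make the bank-conflict discussion concrete, the sketch below shows the classic padding trick in a shared-memory tile (an illustrative example, not code from this post). It assumes a square matrix whose dimension is a multiple of 32 and a 32x32 thread block: without the +1 padding, the column read at the end would map every thread of a warp to the same bank.

#define TILE 32

__global__ void transposeTile(float *out, const float *in)
{
    // Pad each row by one element so that a column of the tile
    // spans 32 different banks instead of landing in a single bank.
    __shared__ float tile[TILE][TILE + 1];

    int x     = blockIdx.x * TILE + threadIdx.x;
    int y     = blockIdx.y * TILE + threadIdx.y;
    int width = gridDim.x * TILE;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // row-wise write: conflict-free

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // write the transposed block
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read: conflict-free thanks to padding
}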

Summary

It really cannot be stressed enough to make as much use of shared memory as possible in your application. In my next post I will provide an example that illustrates just how much faster shared memory is compared to global memory, as well as the impacts with regards to performance that result when reads to global memory are coalesced and bank conflicts are removed. In addition, I will discuss strided memory accesses, and provide some additional insight into the optimization techniques for the other types of available memory.

The post GPU Shared Memory Performance Optimization appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/gpu-shared-memory-performance-optimization/feed/ 1
GPU Memory Types – Performance Comparison https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/ https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/#comments Tue, 06 Aug 2013 18:45:33 +0000 http://https://www.microway.com/hpc-tech-tips/?p=347 This post is Topic #3 (part 1) in our series Parallel Code: Maximizing your Performance Potential. CUDA devices have several different memory spaces: Global, local, texture, constant, shared and register memory. Each type of memory on the device has its advantages and disadvantages. Incorrectly making use of the available memory in your application can can […]

The post GPU Memory Types – Performance Comparison appeared first on Microway.

]]>
This post is Topic #3 (part 1) in our series Parallel Code: Maximizing your Performance Potential.

CUDA devices have several different memory spaces: Global, local, texture, constant, shared and register memory. Each type of memory on the device has its advantages and disadvantages. Incorrectly making use of the available memory in your application can rob you of the performance you desire. With so many different types of memory, how can you be certain you’re using the correct type? Well, it is no easy task.

In terms of speed, if all the various types of device memory were to race here’s how the race would turn out:

  • 1st place: Register file
  • 2nd place: Shared Memory
  • 3rd place: Constant Memory
  • 4th place: Texture Memory
  • Tie for last place: Local Memory and Global Memory

Looking at the above list, it would seem that to have the best performance we’d only want to use register file, shared memory, and constant memory. In a simple world I’d agree with that statement. However, there are many more factors associated with choosing the best form of memory for various portions of your application.

Memory Features

The only two types of memory that actually reside on the GPU chip are register and shared memory. Local, Global, Constant, and Texture memory all reside off chip. Local, Constant, and Texture are all cached.

While it would seem that the fastest memory is the best, the other two characteristics of the memory that dictate how that type of memory should be utilized are the scope and lifetime of the memory:

  • Data stored in register memory is visible only to the thread that wrote it and lasts only for the lifetime of that thread.
  • Local memory has the same scope rules as register memory, but performs slower.
  • Data stored in shared memory is visible to all threads within that block and lasts for the duration of the block. This is invaluable because this type of memory allows for threads to communicate and share data between one another.
  • Data stored in global memory is visible to all threads within the application (including the host), and lasts for the duration of the host allocation.
  • Constant and texture memory won’t be used here because they are beneficial for only very specific types of applications. Constant memory is used for data that will not change over the course of a kernel execution and is read only. Using constant rather than global memory can reduce the required memory bandwidth; however, this performance gain can only be realized when a warp of threads reads the same location. Similar to constant memory, texture memory is another variety of read-only memory on the device. When all reads in a warp are physically adjacent, using texture memory can reduce memory traffic and increase performance compared to global memory. (A brief constant-memory sketch follows this list.)
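
For reference, here is a small, hypothetical sketch of constant memory in use (it is not from this series). All threads of a warp read the same coefficient on each loop iteration, which is exactly the broadcast pattern where the constant cache pays off; the coefficient count and the polynomial evaluation are arbitrary illustrative choices.

#include <cuda_runtime.h>

__constant__ float d_coeffs[8];              // read-only table in constant memory

__global__ void polyEval(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // Every thread in the warp reads the same d_coeffs[k] in a given iteration,
    // so the constant cache broadcasts the value to the whole warp.
    for (int k = 7; k >= 0; --k)
        acc = acc * x[i] + d_coeffs[k];
    out[i] = acc;
}

// Host side: the table is filled once with cudaMemcpyToSymbol, e.g.
//   float h_coeffs[8] = {1, 2, 3, 4, 5, 6, 7, 8};
//   cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));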

How to Choose Memory Type

Knowing how and when to use each type of memory goes a long way towards optimizing the performance of your application. More often than not, it is best to make use of shared memory due to the fact that threads within the same block utilizing shared memory can communicate. Combined with its excellent performance, this makes shared memory a good ‘all around’ choice when used properly. In some cases however, it may be better to make use of the other types of available memory.

Shared Memory

A common problem arises when memory is shared: with all memory available to all threads, there will be many threads accessing the data simultaneously. To alleviate this potential bottleneck, shared memory is divided into 32 logical banks. Successive sections of memory are assigned to successive banks (see Figure 1).

Diagram of NVIDIA Kepler GPU architecture Shared Memory and L1 Cache Memory
Figure 1: Shared Memory and L1 Cache

Some facts about shared memory:

  • The total size of shared memory may be set to 16KB, 32KB or 48KB (with the remaining amount automatically used for L1 Cache) as shown in Figure 1. Shared memory defaults to 48KB (with 16KB remaining for L1 Cache).
  • With the Kepler architecture, each bank has a bandwidth of 64 bits per clock cycle. The older Fermi architecture was clocked differently, but effectively offered half this bandwidth.
  • There are 32 threads in a warp and exactly 32 shared memory banks. Because each bank services only one request per cycle, multiple simultaneous accesses to the same bank will result in what is known as a bank conflict. This will be discussed further in the next post.
  • GPUs section memory banks into 32-bit words (4 bytes). Kepler architecture introduced the option to increase banks to 8 bytes using cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte). This can help avoid bank conflicts when accessing double precision data.

When there are no bank conflicts present, shared memory performance is comparable to register memory. Use it properly and shared memory will be lightning fast.

Register Memory

In most cases, accessing a register consumes zero clock cycles per instruction. However, delays can occur due to read after write dependencies and bank conflicts. The latency of read after write dependencies is roughly 24 clock cycles. For newer CUDA devices that have 32 cores per multiprocessor, it may take up to 768 threads to completely hide latency.

In addition to the read after write latency, register pressure can severely detract from the performance of the application. Register pressure occurs when there are not enough registers available for a given task. When this occurs, the data is “spilled over” using local memory. See the following posts for further details.

Local Memory

Local memory is not a physical type of memory, but an abstraction of global memory. Its scope is local to the thread and it resides off-chip, which makes it as expensive to access as global memory. Local memory is used only to hold automatic variables. The compiler makes use of local memory when it determines that there is not enough register space to hold the variable. Automatic variables that are large structures or arrays are also typically placed in local memory.

Recommendation

All in all, for most applications my recommendation is definitely to try to make use of shared memory wherever possible. It is the most versatile and easy-to-use type of memory. Shared memory allows communication between threads within a block, which can make optimizing code much easier for beginner to intermediate programmers. The other types of memory all have their place in CUDA applications, but for the general case, shared memory is the way to go.

Conclusion

So now that you know a little bit about each of the various types of memory available to you in your GPU applications, you’re ready to learn how to efficiently use them. The next post will discuss how you can optimize the use of the various types of memory throughout your application.

The post GPU Memory Types – Performance Comparison appeared first on Microway.

]]>
https://www.microway.com/hpc-tech-tips/gpu-memory-types-performance-comparison/feed/ 2