Administration Archives - Microway
https://www.microway.com/category/administration/

NVIDIA Datacenter Manager (DCGM) for More Effective GPU Management
https://www.microway.com/hpc-tech-tips/nvidia-dcgm-for-gpu-management/
Mon, 02 Apr 2018 03:58:47 +0000

Managing an HPC server can be a tricky job, and managing multiple servers is even more complex. Adding GPUs brings even more capability, but also new levels of granularity to monitor and control. Luckily, there is a powerful and effective tool available for managing multiple servers or a whole cluster of GPUs: NVIDIA Datacenter GPU Manager (DCGM).

Executing hardware or health checks

DCGM's power comes from its ability to access all kinds of low-level data from the GPUs in your system. Much of this data is reported by NVML (the NVIDIA Management Library), and some of it may be accessible via IPMI on your system. But DCGM makes it far easier to access and use. With DCGM you can:

Report which GPUs are installed, in which slots and on which PCI-E trees, and build groups

Build a group of GPUs once you know which slots your GPUs are installed in and which PCI-E trees and NUMA nodes they sit on. This is great for binding jobs to the right resources and for tracking the capabilities available on each node.

Determine GPU link states and bandwidths

Provide a report of the PCI-Express link speed each GPU is running at. You may also perform device-to-device (D2D) and host-to-device (H2D) bandwidth tests inside your system, and then take action on the results.

Read temperatures, boost states, power consumption, and utilization

Deliver data on the energy usage and utilization of your GPUs. This data can be used to help control the cluster.

Driver versions and CUDA versions

Report on the versions of CUDA, NVML, and the NVIDIA GPU driver installed on your system.

Run sample jobs and integrated validation

Run basic diagnostics and sample jobs that are built into the DCGM package.

Set policies

DCGM provides a mechanism for applying policies to a group of GPUs.

Policy-driven management: elevating from “what’s happening” to “what can I do”

Simply accessing data about your GPUs is only of modest use. The power of DCGM is in how it arms you to act upon that data. DCGM allows administrators to take programmatic or preventative action when something isn’t right.

Here are a few scenarios where the data provided by DCGM enables both powerful control of your hardware and decisive action:

Scenario 1: Health checks – periodic or before the job

Run a check before each job, after a job, or daily/hourly to ensure a cluster is performing optimally.

This allows you to preemptively stop a run if diagnostics fail or move GPUs/nodes out of the scheduling queue for the next job.
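
As a minimal sketch of what such a prologue check might look like (the group ID, the "Fail" string match, and the Slurm drain command are assumptions to adapt to your own DCGM version and scheduler):

#!/bin/bash
# Hypothetical job-prolog check: flag the node if the quick DCGM diagnostic reports a failure
if dcgmi diag -g 1 -r 1 | grep -qi "fail"; then
    echo "DCGM diagnostic failed on $(hostname)" >&2
    # e.g. scontrol update nodename=$(hostname) state=drain reason="DCGM diag failed"
    exit 1
fi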

Scenario 2: Resource Allocation

Jobs often need a certain class of node (ex: with >4 GPUs or with IB & GPUs on the same PCI-E tree). DCGM can be used to report on the capabilities of a node and help identify appropriate resources.

Users and schedulers can subsequently send jobs only to nodes capable of executing them.

Scenario 3: “Personalities”

Some codes request specific CUDA or NVIDIA driver versions. DCGM can be used to probe the CUDA version/NVIDIA GPU driver version on a compute node.

Users can then script the deployment of alternate versions or launch containerized apps to support non-standard versions.

Scenario 4: Stress tests

Periodically stress test GPUs in a cluster with DCGM's integrated functions.

Stress tests like Microway GPU Checker can tease out failing GPUs, and reading data via DCGM during or after the run can identify bad GPUs or nodes to be sidelined.
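
DCGM's own diagnostic levels can also serve as a simple built-in stress test; higher levels run longer and exercise the GPUs more heavily (a sketch, assuming group 1 has already been created):

dcgmi diag -g 1 -r 3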

Scenario 5: Power Management

Programmatically set GPU Boost or max TDP levels for an application or run. This allows you to eke out extra performance.

Alternatively, set your GPUs to stay within a certain power band to reduce electricity costs when rates are high, or to lower total cluster consumption when there is insufficient generation capacity.
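
A minimal sketch of the power-band case, using nvidia-smi rather than the DCGM APIs and assuming a 200 W cap is within the supported limits of your GPUs:

nvidia-smi -pm 1     # persistence mode, so the limit survives idle periods
nvidia-smi -pl 200   # cap all GPUs at 200 W (restore later with the default limit, e.g. nvidia-smi -pl 250)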

Scenario 6: Logging for Validation

Script the retrieval of error logs and take action on that data.

You can accumulate error logs over time and identify trends in your cluster. For example, a group of GPUs with consistently high temperatures may indicate a hotspot in your datacenter.
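
As one simple example of such logging, a cron entry could append per-GPU temperature and aggregate ECC counts to a CSV file every hour (the field names are standard nvidia-smi query fields; the log path is arbitrary):

0 * * * * nvidia-smi --query-gpu=timestamp,index,temperature.gpu,ecc.errors.corrected.aggregate.total --format=csv,noheader >> /var/log/gpu-health.csv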

Getting Started with DCGM: Starting a Health Check

DCGM can be used in many ways. We won’t explore them all here, but it’s important to understand the ease of use of these capabilities.

Here’s the code for a simple health check and also for a basic diagnostic:

dcgmi health --check -g 1
dcgmi diag -g 1 -r 1

The syntax is consistent: dcgmi, the subcommand, and the group of GPUs to operate on (you must create a group first). For the diagnostic, you also include the level of diagnostics requested (-r 1, the lowest level, is shown here).
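
If you have not yet defined a group, creating one is quick (the group name and IDs below are illustrative; take the GPU IDs from the discovery listing and the group ID from the output of the create command):

dcgmi discovery -l            # list the GPUs DCGM can see, with their IDs
dcgmi group -c compute_gpus   # create a named group; DCGM returns its group ID (e.g. 2)
dcgmi group -g 2 -a 0,1       # add GPUs 0 and 1 to group 2
dcgmi group -l                # confirm the group and its members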

DCGM and Cluster Management

While the Microway team loves advanced scripting, you may prefer integrating DCGM or its capabilities with your existing schedulers or cluster managers. A number of popular schedulers and cluster management tools already support or leverage DCGM today.

What’s Next

What will you do with DCGM or DCGM-enabled tools? We've only scratched the surface. There are extensive resources on how to use DCGM and how it integrates with other tools; NVIDIA's DCGM documentation and blog posts are a good place to start.

nvidia-smi: Control Your GPUs
https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/
Tue, 06 Dec 2011 03:44:14 +0000

This post was last updated on 2018-11-05

Most users know how to check the status of their CPUs, see how much system memory is free, or find out how much disk space is free. In contrast, keeping tabs on the health and status of GPUs has historically been more difficult. If you don’t know where to look, it can even be difficult to determine the type and capabilities of the GPUs in a system. Thankfully, NVIDIA’s latest hardware and software tools have made good improvements in this respect.

The tool is NVIDIA’s System Management Interface (nvidia-smi). Depending on the generation of your card, various levels of information can be gathered. Additionally, GPU configuration options (such as ECC memory capability) may be enabled and disabled.

As an aside, if you find that you’re having trouble getting your NVIDIA GPUs to run GPGPU code, nvidia-smi can be handy. For example, on some systems the proper NVIDIA devices in /dev are not created at boot. Running a simple nvidia-smi query as root will initialize all the cards and create the proper devices in /dev. Other times, it’s just useful to make sure all the GPU cards are visible and communicating properly. Here’s the default output from a recent version with four Tesla V100 GPU cards:

nvidia-smi
 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   40C    P0    55W / 250W |  31194MiB / 32480MiB |     44%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   40C    P0    36W / 250W |  30884MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   41C    P0    39W / 250W |  30884MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   39C    P0    37W / 250W |  30884MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    305892      C   /usr/bin/python                            31181MiB |
+-----------------------------------------------------------------------------+

Persistence Mode

On Linux, you can set GPUs to persistence mode to keep the NVIDIA driver loaded even when no applications are accessing the cards. This is particularly useful when you have a series of short jobs running. Persistence mode uses a few more watts per idle GPU, but prevents the fairly long delays that occur each time a GPU application is started. It is also necessary if you've assigned specific clock speeds or power limits to the GPUs (as those changes are lost when the NVIDIA driver is unloaded). Enable persistence mode on all GPUs by running:
nvidia-smi -pm 1
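
To target a single GPU rather than all of them, add the -i flag with the GPU index (index 0 is just an example):
nvidia-smi -i 0 -pm 1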

On Windows, nvidia-smi is not able to set persistence mode. Instead, you need to set your computational GPUs to TCC mode. This should be done through NVIDIA’s graphical GPU device management panel.

GPUs supported by nvidia-smi

NVIDIA’s SMI tool supports essentially any NVIDIA GPU released since the year 2011. These include the Tesla, Quadro, and GeForce devices from Fermi and higher architecture families (Kepler, Maxwell, Pascal, Volta, etc).

Supported products include the Tesla, GRID, Quadro, and Titan lines, with more limited support for GeForce cards.

Querying GPU Status

Microway’s GPU Test Drive cluster, which we provide as a benchmarking service to our customers, contains a group of NVIDIA’s latest Tesla GPUs. These are NVIDIA’s high-performance compute GPUs and provide a good deal of health and status information. The examples below are taken from this internal cluster.

To list all available NVIDIA devices, run:

nvidia-smi -L

GPU 0: Tesla K40m (UUID: GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx)
GPU 1: Tesla K40m (UUID: GPU-d105b085-7239-3871-43ef-975ecaxxxxxx)

To list certain details about each GPU, try:

nvidia-smi --query-gpu=index,name,uuid,serial --format=csv

0, Tesla K40m, GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx, 0323913xxxxxx
1, Tesla K40m, GPU-d105b085-7239-3871-43ef-975ecaxxxxxx, 0324214xxxxxx
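
The same query interface can also log continuously. For example, the following samples every 5 seconds and writes to a file (the interval, field list, and path are arbitrary choices):

nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu,power.draw --format=csv -l 5 -f /tmp/gpu-usage.csv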

To monitor overall GPU usage with 1-second update intervals:

nvidia-smi dmon

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    43    35     -     0     0     0     0  2505  1075
    1    42    31     -    97     9     0     0  2505  1075
(in this example, one GPU is idle and one GPU has 97% of the CUDA sm "cores" in use)

To monitor per-process GPU usage with 1-second update intervals:

nvidia-smi pmon

# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      14835     C    45    15     0     0   python         
    1      14945     C    64    50     0     0   python
(in this case, two different python processes are running; one on each GPU)

Monitoring and Managing GPU Boost

The GPU Boost feature which NVIDIA has included with more recent GPUs allows the GPU clocks to vary depending upon load (achieving maximum performance so long as power and thermal headroom are available). However, the amount of available headroom will vary by application (and even by input file!) so users and administrators should keep their eyes on the status of the GPUs.

A listing of available clock speeds can be shown for each GPU (in this case, the Tesla V100):

nvidia-smi -q -d SUPPORTED_CLOCKS

GPU 00000000:18:00.0
    Supported Clocks
        Memory                      : 877 MHz
            Graphics                : 1380 MHz
            Graphics                : 1372 MHz
            Graphics                : 1365 MHz
            Graphics                : 1357 MHz
            [...159 additional clock speeds omitted...]
            Graphics                : 157 MHz
            Graphics                : 150 MHz
            Graphics                : 142 MHz
            Graphics                : 135 MHz

As shown, the Tesla V100 GPU supports 167 different clock speeds (from 135 MHz to 1380 MHz). However, only one memory clock speed is supported (877 MHz). Some GPUs support two different memory clock speeds (one high speed and one power-saving speed). Typically, such GPUs only support a single GPU clock speed when the memory is in the power-saving speed (which is the idle GPU state). On all recent Tesla and Quadro GPUs, GPU Boost automatically manages these speeds and runs the clocks as fast as possible (within the thermal/power limits and any limits set by the administrator).
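
If you prefer to pin the application clocks rather than leave them entirely to GPU Boost, nvidia-smi can set them explicitly. The values below are the Tesla V100 speeds from the listing above; substitute a supported memory,graphics pair for your own GPUs:

nvidia-smi -ac 877,1380   # set applications clocks (memory 877 MHz, graphics 1380 MHz)
nvidia-smi -rac           # reset applications clocks to the defaults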

To review the current GPU clock speed, default clock speed, and maximum possible clock speed, run:

nvidia-smi -q -d CLOCK

GPU 00000000:18:00.0
    Clocks
        Graphics                    : 1230 MHz
        SM                          : 1230 MHz
        Memory                      : 877 MHz
        Video                       : 1110 MHz
    Applications Clocks
        Graphics                    : 1230 MHz
        Memory                      : 877 MHz
    Default Applications Clocks
        Graphics                    : 1230 MHz
        Memory                      : 877 MHz
    Max Clocks
        Graphics                    : 1380 MHz
        SM                          : 1380 MHz
        Memory                      : 877 MHz
        Video                       : 1237 MHz
    Max Customer Boost Clocks
        Graphics                    : 1380 MHz
    SM Clock Samples
        Duration                    : 0.01 sec
        Number of Samples           : 4
        Max                         : 1230 MHz
        Min                         : 135 MHz
        Avg                         : 944 MHz
    Memory Clock Samples
        Duration                    : 0.01 sec
        Number of Samples           : 4
        Max                         : 877 MHz
        Min                         : 877 MHz
        Avg                         : 877 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A

Ideally, you’d like all clocks to be running at the highest speed all the time. However, this will not be possible for all applications. To review the current state of each GPU and any reasons for clock slowdowns, use the PERFORMANCE flag:

nvidia-smi -q -d PERFORMANCE

GPU 00000000:18:00.0
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active

If any of the GPU clocks is running at a slower speed, one or more of the above Clocks Throttle Reasons will be marked as active. The most concerning condition would be if HW Slowdown was active, as this would most likely indicate a power or cooling issue. The remaining conditions typically indicate that the card is idle or has been manually set into a slower mode by a system administrator.
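
To spot-check the most important throttle conditions across all GPUs from a script, the query interface exposes them as fields (shown here for the hardware slowdown and software power cap reasons; see nvidia-smi --help-query-gpu for the full list):

nvidia-smi --query-gpu=index,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.sw_power_cap --format=csv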

Reviewing System/GPU Topology and NVLink with nvidia-smi

To properly take advantage of more advanced NVIDIA GPU features (such as GPU Direct), it is vital that the system topology be properly configured. The topology refers to how the various system devices (GPUs, InfiniBand HCAs, storage controllers, etc.) connect to each other and to the system’s CPUs. Certain topology types will reduce performance or even cause certain features to be unavailable. To help tackle such questions, nvidia-smi supports system topology and connectivity queries:

nvidia-smi topo --matrix

        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     PHB     PHB     PHB     0-11
GPU1    PIX      X      PHB     PHB     PHB     0-11
GPU2    PHB     PHB      X      PIX     PHB     0-11
GPU3    PHB     PHB     PIX      X      PHB     0-11
mlx4_0  PHB     PHB     PHB     PHB      X 

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Reviewing this output will take some getting used to, but it can be very valuable. The above configuration shows two Tesla K80 GPUs and one Mellanox FDR InfiniBand HCA (mlx4_0) all connected to the first CPU of a server. Because the CPUs are 12-core Xeons, the topology tool recommends that jobs be assigned to the first 12 CPU cores (although this will vary by application).
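
To act on that recommendation, you might bind a GPU job to the local CPU cores and memory with numactl (a sketch only; the core range and node number must match your own topology output):

numactl --physcpubind=0-11 --membind=0 ./my_gpu_application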

Higher-complexity systems require additional care in examining their configuration and capabilities. Below is the output of nvidia-smi topology for the NVIDIA DGX-1 system, which includes two 20-core CPUs, eight NVLink-connected GPUs, and four Mellanox InfiniBand adapters:

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_2	mlx5_1	mlx5_3	CPU Affinity
GPU0	 X 	NV1	NV1	NV2	NV2	SYS	SYS	SYS	PIX	SYS	PHB	SYS	0-19,40-59
GPU1	NV1	 X 	NV2	NV1	SYS	NV2	SYS	SYS	PIX	SYS	PHB	SYS	0-19,40-59
GPU2	NV1	NV2	 X 	NV2	SYS	SYS	NV1	SYS	PHB	SYS	PIX	SYS	0-19,40-59
GPU3	NV2	NV1	NV2	 X 	SYS	SYS	SYS	NV1	PHB	SYS	PIX	SYS	0-19,40-59
GPU4	NV2	SYS	SYS	SYS	 X 	NV1	NV1	NV2	SYS	PIX	SYS	PHB	20-39,60-79
GPU5	SYS	NV2	SYS	SYS	NV1	 X 	NV2	NV1	SYS	PIX	SYS	PHB	20-39,60-79
GPU6	SYS	SYS	NV1	SYS	NV1	NV2	 X 	NV2	SYS	PHB	SYS	PIX	20-39,60-79
GPU7	SYS	SYS	SYS	NV1	NV2	NV1	NV2	 X 	SYS	PHB	SYS	PIX	20-39,60-79
mlx5_0	PIX	PIX	PHB	PHB	SYS	SYS	SYS	SYS	 X 	SYS	PHB	SYS	
mlx5_2	SYS	SYS	SYS	SYS	PIX	PIX	PHB	PHB	SYS	 X 	SYS	PHB	
mlx5_1	PHB	PHB	PIX	PIX	SYS	SYS	SYS	SYS	PHB	SYS	 X 	SYS	
mlx5_3	SYS	SYS	SYS	SYS	PHB	PHB	PIX	PIX	SYS	PHB	SYS	 X 	

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

The NVLink connections themselves can also be queried to ensure status, capability, and health. Readers are encouraged to consult NVIDIA documentation to better understand the specifics. Short summaries from nvidia-smi on DGX-1 are shown below.

nvidia-smi nvlink --status

GPU 0: Tesla V100-SXM2-32GB
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s

         [snip]

GPU 7: Tesla V100-SXM2-32GB
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s

nvidia-smi nvlink --capabilities

GPU 0: Tesla V100-SXM2-32GB
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: false
	 Link 0, Link is supported: false
	 
         [snip]

	 Link 5, P2P is supported: true
	 Link 5, Access to system memory supported: true
	 Link 5, P2P atomics supported: true
	 Link 5, System memory atomics supported: true
	 Link 5, SLI is supported: false
	 Link 5, Link is supported: false

Get in touch with one of our HPC GPU experts if you have questions on these topics.

Printing all GPU Details

To list all available data on a particular GPU, specify the ID of the card with -i. Here’s the output from an older Tesla GPU card:

nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp                       : Mon Dec  5 22:05:49 2011

Driver Version                  : 270.41.19

Attached GPUs                   : 2

GPU 0:2:0
    Product Name                : Tesla M2090
    Display Mode                : Disabled
    Persistence Mode            : Disabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 032251100xxxx
    GPU UUID                    : GPU-2b1486407f70xxxx-98bdxxxx-660cxxxx-1d6cxxxx-9fbd7e7cd9bf55a7cfb2xxxx
    Inforom Version
        OEM Object              : 1.1
        ECC Object              : 2.0
        Power Management Object : 4.0
    PCI
        Bus                     : 2
        Device                  : 0
        Domain                  : 0
        Device Id               : 109110DE
        Bus Id                  : 0:2:0
    Fan Speed                   : N/A
    Memory Usage
        Total                   : 5375 Mb
        Used                    : 9 Mb
        Free                    : 5365 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Total           : 0
    Temperature
        Gpu                     : N/A
    Power Readings
        Power State             : P12
        Power Management        : Supported
        Power Draw              : 31.57 W
        Power Limit             : 225 W
    Clocks
        Graphics                : 50 MHz
        SM                      : 100 MHz
        Memory                  : 135 MHz

The above example shows an idle card. Here is an excerpt for a card running GPU-accelerated AMBER:

nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE

==============NVSMI LOG==============

Timestamp                       : Mon Dec  5 22:32:00 2011

Driver Version                  : 270.41.19

Attached GPUs                   : 2

GPU 0:2:0
    Memory Usage
        Total                   : 5375 Mb
        Used                    : 1904 Mb
        Free                    : 3470 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : 67 %
        Memory                  : 42 %
    Power Readings
        Power State             : P0
        Power Management        : Supported
        Power Draw              : 109.83 W
        Power Limit             : 225 W
    Clocks
        Graphics                : 650 MHz
        SM                      : 1301 MHz
        Memory                  : 1848 MHz

You'll notice that, unfortunately, the earlier passively-cooled M-series Tesla GPUs do not report temperatures to nvidia-smi. More recent Quadro and Tesla GPUs report a much greater quantity of metrics data:

==============NVSMI LOG==============
Timestamp                           : Mon Nov  5 14:50:59 2018
Driver Version                      : 410.48

Attached GPUs                       : 4
GPU 00000000:18:00.0
    Product Name                    : Tesla V100-PCIE-32GB
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032161808xxxx
    GPU UUID                        : GPU-4965xxxx-79e3-7941-12cb-1dfe9c53xxxx
    Minor Number                    : 0
    VBIOS Version                   : 88.00.48.00.02
    MultiGPU Board                  : No
    Board ID                        : 0x1800
    GPU Part Number                 : 900-2G500-0010-000
    Inforom Version
        Image Version               : G500.0202.00.02
        OEM Object                  : 1.1
        ECC Object                  : 5.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x18
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1DB610DE
        Bus Id                      : 00000000:18:00.0
        Sub System Id               : 0x124A10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 31000 KB/s
        Rx Throughput               : 155000 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 32480 MiB
        Used                        : 31194 MiB
        Free                        : 1286 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 8 MiB
        Free                        : 32760 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 44 %
        Memory                      : 4 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 40 C
        GPU Shutdown Temp           : 90 C
        GPU Slowdown Temp           : 87 C
        GPU Max Operating Temp      : 83 C
        Memory Current Temp         : 39 C
        Memory Max Operating Temp   : 85 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 58.81 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1380 MHz
        SM                          : 1380 MHz
        Memory                      : 877 MHz
        Video                       : 1237 MHz
    Applications Clocks
        Graphics                    : 1230 MHz
        Memory                      : 877 MHz
    Default Applications Clocks
        Graphics                    : 1230 MHz
        Memory                      : 877 MHz
    Max Clocks
        Graphics                    : 1380 MHz
        SM                          : 1380 MHz
        Memory                      : 877 MHz
        Video                       : 1237 MHz
    Max Customer Boost Clocks
        Graphics                    : 1380 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 315406
            Type                    : C
            Name                    : /usr/bin/python
            Used GPU Memory         : 31181 MiB

Additional nvidia-smi options

Of course, we haven’t covered all the possible uses of the nvidia-smi tool. To read the full list of options, run nvidia-smi -h (it’s fairly lengthy). Some of the sub-commands have their own help section. If you need to change settings on your cards, you’ll want to look at the device modification section:

    -pm,  --persistence-mode=   Set persistence mode: 0/DISABLED, 1/ENABLED
    -e,   --ecc-config=         Toggle ECC support: 0/DISABLED, 1/ENABLED
    -p,   --reset-ecc-errors=   Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
    -c,   --compute-mode=       Set MODE for compute applications:
                                0/DEFAULT, 1/EXCLUSIVE_PROCESS,
                                2/PROHIBITED
          --gom=                Set GPU Operation Mode:
                                    0/ALL_ON, 1/COMPUTE, 2/LOW_DP
    -r    --gpu-reset           Trigger reset of the GPU.
                                Can be used to reset the GPU HW state in situations
                                that would otherwise require a machine reboot.
                                Typically useful if a double bit ECC error has
                                occurred.
                                Reset operations are not guarenteed to work in
                                all cases and should be used with caution.
    -vm   --virt-mode=          Switch GPU Virtualization Mode:
                                Sets GPU virtualization mode to 3/VGPU or 4/VSGA
                                Virtualization mode of a GPU can only be set when
                                it is running on a hypervisor.
    -lgc  --lock-gpu-clocks=    Specifies <minGpuClock,maxGpuClock> clocks as a
                                    pair (e.g. 1500,1500) that defines the range 
                                    of desired locked GPU clock speed in MHz.
                                    Setting this will supercede application clocks
                                    and take effect regardless if an app is running.
                                    Input can also be a singular desired clock value
                                    (e.g. <GpuClockValue>).
    -rgc  --reset-gpu-clocks
                                Resets the Gpu clocks to the default values.
    -ac   --applications-clocks= Specifies <memory,graphics> clocks as a
                                    pair (e.g. 2000,800) that defines GPU's
                                    speed in MHz while running applications on a GPU.
    -rac  --reset-applications-clocks
                                Resets the applications clocks to the default values.
    -acp  --applications-clocks-permission=
                                Toggles permission requirements for -ac and -rac commands:
                                0/UNRESTRICTED, 1/RESTRICTED
    -pl   --power-limit=        Specifies maximum power management limit in watts.
    -am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
    -caa  --clear-accounted-apps
                                Clears all the accounted PIDs in the buffer.
          --auto-boost-default= Set the default auto boost policy to 0/DISABLED
                                or 1/ENABLED, enforcing the change only after the
                                last boost client has exited.
          --auto-boost-permission=
                                Allow non-admin/root control over auto boost mode:
                                0/UNRESTRICTED, 1/RESTRICTED

nvidia-smi dmon -h

    GPU statistics are displayed in scrolling format with one line
    per sampling interval. Metrics to be monitored can be adjusted
    based on the width of terminal window. Monitoring is limited to
    a maximum of 4 devices. If no devices are specified, then up to
    first 4 supported devices under natural enumeration (starting
    with GPU index 0) are used for monitoring purpose.
    It is supported on Tesla, GRID, Quadro and limited GeForce products
    for Kepler or newer GPUs under x64 and ppc64 bare metal Linux.

    Usage: nvidia-smi dmon [options]

    Options include:
    [-i | --id]:          Comma separated Enumeration index, PCI bus ID or UUID
    [-d | --delay]:       Collection delay/interval in seconds [default=1sec]
    [-c | --count]:       Collect specified number of samples and exit
    [-s | --select]:      One or more metrics [default=puc]
                          Can be any of the following:
                              p - Power Usage and Temperature
                              u - Utilization
                              c - Proc and Mem Clocks
                              v - Power and Thermal Violations
                              m - FB and Bar1 Memory
                              e - ECC Errors and PCIe Replay errors
                              t - PCIe Rx and Tx Throughput
    [-o | --options]:     One or more from the following:
                              D - Include Date (YYYYMMDD) in scrolling output
                              T - Include Time (HH:MM:SS) in scrolling output
    [-f | --filename]:    Log to a specified file, rather than to stdout

nvidia-smi topo -h

    topo -- Display topological information about the system.

    Usage: nvidia-smi topo [options]

    Options include:
    [-m | --matrix]: Display the GPUDirect communication matrix for the system.
    [-mp | --matrix_pci]: Display the GPUDirect communication matrix for the system (PCI Only).
    [-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
                 separated values for more than one device
                 Must be used in conjuction with -n or -p.
    [-c | --cpu]: CPU number for which to display all GPUs with an affinity.
    [-n | --nearest_gpus]: Display the nearest GPUs for a given traversal path.
                   0 = a single PCIe switch on a dual GPU board
                   1 = a single PCIe switch
                   2 = multiple PCIe switches
                   3 = a PCIe host bridge
                   4 = an on-CPU interconnect link between PCIe host bridges
                   5 = an SMP interconnect link between NUMA nodes
                 Used in conjunction with -i which must be a single device ID.
    [-p | --gpu_path]: Display the most direct path traversal for a pair of GPUs.
                 Used in conjunction with -i which must be a pair of device IDs.
    [-p2p | --p2pstatus]:      Displays the p2p status between the GPUs of a given p2p capability
                   r - p2p read capabiity
                   w - p2p write capability
                   n - p2p nvlink capability
                   a - p2p atomics capability
                   p - p2p prop capability

Information on additional tools

With this tool, checking the status and health of NVIDIA GPUs is simple. If you’re looking to monitor the cards over time, then nvidia-smi might be more resource-intensive than you’d like. For that, have a look at the API available from NVIDIA’s GPU Management Library (NVML), which offers C, Perl and Python bindings.

There are also tools purpose-built for larger-scale health monitoring and validation. When managing a group or cluster of GPU-accelerated systems, administrators should consider NVIDIA Datacenter GPU Manager (DCGM) and/or Bright Cluster Manager.

Given the popularity of GPU computing, most widely-used open-source monitoring tools also include GPU support. Examples include Ganglia, Telegraf, collectd, and Diamond.

Monitoring Hard Drive and RAID Health
https://www.microway.com/hpc-tech-tips/monitoring-hard-drive-and-raid-health/
Fri, 16 Sep 2011 21:49:29 +0000

By default, you won’t find out that one of your hard drives has failed until the data is gone. Even if you are using a software or hardware RAID, it will only continue to function if you replace failed drives. I have seen RAIDs run in degraded mode for months or years until additional drive failures ruined any chance of data recovery.

Drives and operating systems are designed to work around issues as best they can until absolute failure. However, that doesn’t mean that you can’t monitor the situation and receive an alert as soon as the first problem develops.

If you do not have a dedicated hardware RAID controller, there are two utilities to be configured and started: smartd and mdadm. The smartd daemon reads hard drive S.M.A.R.T. health data directly off the drives and sends alerts of any changes. Similarly, mdadm watches the health of your Linux software RAIDs for any problems.

If you are using a hardware RAID controller, then it manages some of these tasks. However, you must be sure to properly configure the automated alerts within the controller’s management interface – check the manual for full instructions. Additionally, you may be able to monitor hard drive health data if the controller supports it (3ware and ARECA cards are known to work) – see the smartd man page.

smartd

Here are the standard entries I use in /etc/smartd.conf:

/dev/sda -a -d ata -m eliot@example.com -H -l error -l selftest -M test -o on -S on -s (S/../../3/03|L/../15/./04)
/dev/sdb -a -d ata -m eliot@example.com -H -l error -l selftest -o on -S on -s (S/../../3/04|L/../15/./05)

These lines are fairly convoluted. In this example, they monitor drives /dev/sda and /dev/sdb by performing the following tasks:

  • E-mailing all alerts to eliot@example.com
  • Sending one test e-mail upon startup
  • Watching for any critical failure warnings in the SMART data
  • Monitoring the results of hard drive self-tests
  • Enabling Automatic Offline Testing of the drives
  • Running a short self-test on each drive once a week (3am for sda; 4am for sdb)
  • Running a long self-test on each drive once a month (4am for sda; 5am for sdb)
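
Before relying on the daemon, it is worth confirming by hand that each drive actually reports SMART data. A quick sketch using smartctl (from the same smartmontools package that provides smartd):

smartctl -H /dev/sda        # overall health self-assessment
smartctl -a /dev/sda        # full SMART attributes and self-test log
smartctl -t short /dev/sda  # start a short self-test manually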

Once you have written the configuration file, you need to start the service:

/etc/init.d/smartd start

To ensure the service starts at boot, you’ll need to add it to the boot sequence. The exact command depends upon your Linux distribution:

chkconfig --add smartd        (Red Hat, Fedora and SUSE)
rc-update add smartd default  (Gentoo)
update-rc.d smartd defaults   (Debian)

mdadm

To monitor Linux software RAIDs, you’ll need at least the following lines in /etc/mdadm.conf:

DEVICE /dev/sd[ab]1 /dev/sd[ab]5 /dev/sd[ab]6 /dev/sd[ab]7 /dev/sd[ab]8 /dev/sd[ab]10

ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1
ARRAY /dev/md5 devices=/dev/sda5,/dev/sdb5
ARRAY /dev/md6 devices=/dev/sda6,/dev/sdb6
ARRAY /dev/md7 devices=/dev/sda7,/dev/sdb7
ARRAY /dev/md8 devices=/dev/sda8,/dev/sdb8
ARRAY /dev/md10 devices=/dev/sda10,/dev/sdb10

MAILADDR eliot@example.com

Using this example, any changes to the listed md devices will be immediately e-mailed to eliot@example.com.

Note that some newer versions of mdadm require that arrays be identified by UUID (e.g. f4849d33:f8c1ce1c:ac28ac18:9d4741e7) rather than by raw device name (e.g. /dev/md1). If this is the case, run mdadm --detail /dev/md1 for each RAID to find its UUID, and use that in the corresponding ARRAY line.
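
One way to generate correct ARRAY lines, including the UUIDs, is to let mdadm scan the running arrays and append the result to the configuration file (review the output before trusting it):

mdadm --detail --scan >> /etc/mdadm.conf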

Once you have written the configuration file, you need to start the service. Some distributions use the name mdadm and others use mdmonitor:

/etc/init.d/mdadm start

To ensure the service starts at boot, you’ll need to add it to the boot sequence. The exact command depends upon your Linux distribution:

chkconfig --add mdadm        (Red Hat, Fedora and SUSE)
rc-update add mdadm default  (Gentoo)
update-rc.d mdadm defaults   (Debian)

As always, be certain that you use your own e-mail address and the names of the actual hard drives and arrays in your system.

Managing a Linux Software RAID with MDADM
https://www.microway.com/hpc-tech-tips/managing-a-linux-software-raid-with-mdadm/
Tue, 30 Aug 2011 16:24:09 +0000

There are several advantages to assembling hard drives into a RAID: performance, redundancy and capacity. Microway workstations and servers are most commonly outfitted with software RAID to prevent a single drive failure from destroying your operating system installation. In most cases, the RAID is built from two hard drives, but you may also find software RAID on systems with up to six drives. If you have a larger storage server, a hardware RAID manages the hard drives.

Linux provides a robust software RAID implementation which costs nothing and offers great performance for lower array levels (e.g. 0, 1, 10). It is flexible and powerful, but array monitoring and management can be opaque if you’ve not previously worked with a Linux software RAID.

Software RAID Introduction

Linux software RAID depends on two components:

  1. The Linux kernel, which operates the arrays
  2. The mdadm utility, which creates and manages the arrays

As a user, you need not worry much about #1. The only fact you need to know is that the kernel keeps a live printout of array status in the dynamic text file /proc/mdstat. You may check the status of all arrays by checking the contents of that file – either with your favorite text editor or a simple cat /proc/mdstat.

To properly maintain your arrays, you’ll need to learn some basics of the mdadm RAID management utility. Most commands should be fairly straightforward, but check the mdadm man page for full details. Microway customers are welcome to contact technical support for assistance at any point.

Traditional hardware RAIDs reserve the full capacity of each hard drive in the array. However, Linux software RAIDs may be built using either an entire drive or individual partitions of a hard drive. This allows us more flexibility, such as creating a redundant RAID1 mirror for the /home partition while using a faster RAID0 stripe for /tmp. You will typically see up to 10 partitions on each drive, such as sda1/sdb1, sda2/sdb2, ..., sda9/sdb9, sda10/sdb10. These are used to build the corresponding RAID devices md1, md2, ..., md9, md10. By default, Microway installations use partitions 1, 5, 6, 7, 8, 9, 10.

The following examples assume a software RAID1 mirror of two hard drives, which is the most common configuration. Only minor changes should be needed to perform maintenance on other arrays, but take care: several of the commands below are dangerous and could cause data loss if pointed at the wrong drive.

Checking Array Health

To be certain you are alerted to drive failures, set up automated alerts for hard drive and array failures. As mentioned above, manually checking the status of a software array is easy. Simply check the contents of /proc/mdstat:

eliot@penguin:~$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
 194496 blocks [2/2] [UU]
unused devices: <none>

Breaking down the contents by line, we see:

  1. Personalities: reports which RAID levels are running (e.g. raid0, raid1, raid10, etc). It’s safe to ignore this line.
  2. md1: the status of the first array, the type of the array, and each member of the array.
  3. md1: the size of the first array, # of members active/# of members total, and the status of each member. If a drive has failed, it will be shown on this line. You will see one of the following listed: [1/2] [_U] [U_]. These indicate that one of the two members is no longer operating.
  4. List of unused devices (drives or partitions). There’s usually nothing interesting here.

If your system has experienced a drive failure, Linux kernel error messages will be logged. You will see them in the /var/log/messages file or by running dmesg. Be certain you know which drive has failed before taking further steps.

Replacing a Failed Hard Drive

Because RAID offers redundancy, it is not necessary to take the storage offline to repair the RAID. The commands below may be issued while the system is operating normally. However, a heavily-loaded system will take much more time to complete the repair.

Before installing the new drive, you will need to remove the failed drive. You can see which drive failed by looking at the contents of /proc/mdstat or consulting Linux kernel message logs. If you are working on a live system, be absolutely certain you remove the correct failed drive. Microway customers should contact tech support with any questions.

I’m assuming that /dev/sda and /dev/sdb are the two drives currently running. If this system does not use Microway’s default partitioning (partitions 1, 5, 6, 7, 8, 9, 10) you will need to adjust the commands.
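
If the failed drive's partitions are still listed as members of the arrays, mark them failed and remove them before physically swapping the drive. A sketch for a failed /dev/sdb, repeated for each md device in use:

mdadm /dev/md1 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md5 --fail /dev/sdb5 --remove /dev/sdb5
(repeat for md6, md7, md8 and md10)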

The software RAID operates at a level below the filesystem, so you do not need to re-create any filesystems. However, you do have to get the partitioning right. Once the replacement drive is installed, the partitioning can be copied from the working drive. This example assumes that sda is the operating drive and sdb is a replacement for the drive that failed:

sfdisk -d /dev/sda | sfdisk /dev/sdb
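
Note that sfdisk works with MBR partition tables. On drives using GPT (typically anything over 2TB), the equivalent copy can be done with sgdisk; a sketch, and double-check the argument order, since -R writes onto the drive named immediately after it:

sgdisk -R /dev/sdb /dev/sda   # replicate the partition table from sda onto the new sdb
sgdisk -G /dev/sdb            # randomize the new drive's GUIDs so they do not collide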

Once the partitions of the two drives match, you can add the new drive into the mirror. This has to be done partition by partition. Microway’s defaults are below (assuming sdb was the failed drive):

mdadm /dev/md1 --add /dev/sdb1
mdadm /dev/md5 --add /dev/sdb5
mdadm /dev/md6 --add /dev/sdb6
mdadm /dev/md7 --add /dev/sdb7
mdadm /dev/md8 --add /dev/sdb8
mdadm /dev/md10 --add /dev/sdb10

(you can check the status of the sync in /proc/mdstat)

One partition is used for Linux swap:

mkswap /dev/sdb9
swapon /dev/sdb9

To make the replacement drive bootable, the GRUB bootloader installer will need to be run:

root@penguin:~# grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
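
The interactive session above is for legacy GRUB. On distributions that ship GRUB 2, a single command is typically sufficient (assuming /dev/sdb is the replacement drive):

grub-install /dev/sdb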

Continue to check /proc/mdstat as the arrays sync. The time required to sync is usually several hours per terabyte, although a heavily-loaded system will take longer.

To make your job easier, set up automated alerts for hard drive and array failures.

Take Care When Updating Your Cluster
https://www.microway.com/hpc-tech-tips/take-care-when-updating-your-cluster/
Wed, 06 Jul 2011 20:29:36 +0000

Although modern Linux distributions have made it very easy to keep your software packages up-to-date, there are some pitfalls you might encounter when managing your compute cluster.

Cluster software packages are usually not managed from the same software repository as the standard Linux packages, so the updater can unknowingly break compatibility. In particular, upgrading or changing the Linux kernel on your cluster may require manual re-configuration, especially for systems with large storage, InfiniBand, and/or GPU compute processor components. These types of systems usually require that kernel modules or other packages be recompiled against the new kernel.
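
One common precaution is to exclude kernel packages from routine updates until you are ready to rebuild the dependent modules. On Red Hat-style systems, for example (a sketch; the same effect can be made permanent with an exclude=kernel* line in /etc/yum.conf):

yum update --exclude=kernel*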

Please keep in mind that updating the software on your cluster may break existing functionality, so don’t update just for the sake of updating! Plan an update schedule and notify users in case there is downtime from unexpected snags.

You may always contact Microway technical support before you update to find out what problems you should expect from running a software update.
