Check for memory errors on NVIDIA GPUs

Professional NVIDIA GPUs (the Tesla and Quadro products) are equipped with error-correcting code (ECC) memory, which allows the system to detect when memory errors occur. Smaller “single-bit” errors are transparently corrected. Larger “double-bit” memory errors will cause applications to crash, but are at least detected (GPUs without ECC memory would continue operating on the corrupted data).

Under some conditions, GPU events are reported to the Linux kernel, in which case you will see such errors in the system logs. In addition, the GPUs themselves store the type and date of each event.
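
For example, one quick way to check the kernel log for GPU-related messages (the exact wording varies by driver version, and the -T flag for human-readable timestamps requires a reasonably recent util-linux):

dmesg -T | grep -i -E 'NVRM|Xid|ECC'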

It’s important to note that not all ECC errors are due to hardware failures. Stray cosmic rays are known to cause bit flips. For this reason, memory is not considered “bad” when a single error occurs (or even when a number of errors occur). If you have a device reporting tens or hundreds of Double Bit errors, please contact Microway tech support for review. You may also wish to review the NVIDIA documentation on this topic.

To review the current health of the GPUs in a system, use the nvidia-smi utility:

[root@node7 ~]# nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                           : Thu Feb 14 10:58:34 2019
Driver Version                      : 410.48

Attached GPUs                       : 4
GPU 00000000:18:00.0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

GPU 00000000:3B:00.0
    Retired Pages
        Single Bit ECC              : 15
        Double Bit ECC              : 0
        Pending                     : No

The output above shows one card with no issues and one card with a small number of single-bit errors (that card is still functional and in operation).

If the above report indicates that memory pages have been retired, then you may wish to see additional details (including when the pages were retired). If nvidia-smi reports Pending: Yes, then memory errors have occurred since the last time the system rebooted, and the affected pages will be retired the next time the GPU is reset or the system reboots. In either case, there may be older page retirements that took place.
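
Separately from the retired-page summary, nvidia-smi can also report the raw ECC error counters (volatile counts since the last reboot and aggregate counts over the life of the board); the exact fields depend on the driver version:

nvidia-smi -q -d ECC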

To review a complete listing of the GPU memory pages which have been retired (including the unique ID of each GPU), run:

[root@node7 ~]# nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv

gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c72e, Single Bit ECC
...
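
Because this listing identifies each board by UUID rather than by PCI address, it can be convenient to map the UUIDs back to bus IDs. One way to do so (the available query fields are listed by nvidia-smi --help-query-gpu):

nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv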

A different output type must be selected in order to read the timestamps of the page retirements. The output is in XML format and may require a bit more effort to parse. In short, try running a report such as the one shown below:

[root@node7 ~]# nvidia-smi -i 1 -q -x | grep -i -A1 retired_page_addr

<retired_page_address>0x000000000005c05e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005ca0d</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005c72e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:31 2017</retired_page_timestamp>
...
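
The -i 1 flag above limits the report to a single GPU. To run the same report across every GPU in the system, a small shell loop works; a minimal sketch:

for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    echo "== GPU $i =="
    nvidia-smi -i "$i" -q -x | grep -i -A1 retired_page_addr
done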

High-Level Linux Troubleshooting

Whether you’re working on a cluster, a server or a workstation, most installations of Linux are similar. When something goes wrong, you need to determine the exact issue before you can get it resolved. This article provides a top-level overview of Linux troubleshooting.

Linux Kernel Messages

The Linux kernel is often aware of issues as they occur. If you suspect you’re facing a hardware issue or serious software issue (crashes/segfaults), the kernel can probably provide more information.

To see the most recent messages, run:
dmesg | tail -n50

To find older messages, read through the log file /var/log/messages (on some systems /var/log/kern.log). The Linux kernel prints many messages during normal operation (especially during the boot process), so don’t assume everything you see is a serious error.
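
On distributions that use systemd, the journal offers another way to search kernel messages, including those from previous boots; for example:

journalctl -k -p err -b      # kernel messages at priority "err" or above from the current boot
journalctl -k -b -1          # kernel messages from the previous boot (requires persistent journaling)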

Memory Errors

If your dmesg output contains messages similar to the examples below, your system is encountering errors when accessing memory. Because modern system components are closely integrated, such an error may be caused by several different types of hardware failure. Please send the dmesg output to our support team.

sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010091
TSC 0 ADDR 10877e640 MISC 21420c8c86 PROCESSOR 0:206d6 TIME 1369016551 SOCKET 0 APIC 0
EDAC MC0: CE row 0, channel 0, label "CPU_SrcID#0_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0091 (ch=1), addr = 0x108778e40 => socket=0, Channel=0(mask=1), rank=0
kernel:[Hardware Error]: CPU:56 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c55c00080080a13
kernel:[Hardware Error]:     MC4_ADDR: 0x000000720157c6f0
kernel:[Hardware Error]: Northbridge Error (node 7): DRAM ECC error detected on the NB.
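
If the EDAC driver for your platform is loaded, the kernel also exposes running corrected/uncorrected error counters in sysfs; a quick check (these paths exist only on systems with EDAC support):

grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null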

NVIDIA GPU Errors

Kernel messages which contain the terms NVRM or Xid indicate that some type of event occurred on an NVIDIA GPU. Such messages are not always fatal, so please contact Microway support for additional review before assuming the GPU has failed. Consult the NVIDIA documentation for the full list of Xid errors. Some examples of higher-priority issues are shown below.

NVRM: GPU at 0000:83:00: GPU-722f9c93-9a7f-08e3-6cc2-a5d8e3331e7f
NVRM: Xid (PCI:0000:83:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU
NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
NVRM: GPU at 0000:83:00.0 has fallen off the bus.
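
If a GPU is reported as having fallen off the bus, one quick check is whether the device is still visible to the operating system at all (lspci is part of the pciutils package):

lspci | grep -i nvidia
nvidia-smi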

Software RAID Errors

The dmesg output below shows an example message for a system with a degraded software RAID. This occurs when one of the hard drives fails; the failed drive will need to be swapped out. Please send a copy of the file /proc/mdstat to our support team.

[2010086.462608] md/raid1:md1: Disk failure on sdb1, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[2010086.474910] RAID1 conf printout:
[2010086.474914]  --- wd:1 rd:2
[2010086.474917]  disk 0, wo:1, o:0, dev:sdb1
[2010086.474919]  disk 1, wo:0, o:1, dev:sda1
[2010086.480441] RAID1 conf printout:
[2010086.480444]  --- wd:1 rd:2
[2010086.480447]  disk 1, wo:0, o:1, dev:sda1
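
To gather the details our support team will need, capture the current RAID status. The device name /dev/md1 below matches the log excerpt above and may differ on your system:

cat /proc/mdstat
mdadm --detail /dev/md1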

Application Errors

If your scientific code is not working properly but you can find no system errors or messages, this is an indication that Linux and the hardware are working fine. It is likely that the bug lies in your code, in your compiler, or in one of the scientific/math libraries. There are also cases where it is simply a compatibility issue: recompiling with a different compiler or library may fix the issue (e.g., OpenMPI instead of MVAPICH2, the Intel compiler instead of the GNU compiler).
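
When chasing a suspected library or compiler compatibility problem, it can help to confirm exactly which libraries a binary was linked against; for example (the binary name and library patterns here are only placeholders):

ldd ./my_application | grep -i -E 'mpi|mkl|blas'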

System Hangs/Crashes

Many different conditions can be described as a “system hang”, and there are a variety of possible causes for such behavior. Please see our article on what to do when your system hangs.

No Linux Kernel Messages; System Reboots/Powers Off

If your system reboots or powers off with no warning, Linux will not be able to log the cause. You should verify that both your power and cooling are sufficient. The room should be roughly 74°F (23°C); systems that overheat will automatically power themselves off.
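
One way to spot marginal cooling before it causes a shutdown is to watch the hardware temperature sensors; for example (assuming lm-sensors is configured and/or the server has a BMC reachable via ipmitool):

sensors
ipmitool sdr type Temperature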

If power and cooling are reliable, then the most likely explanation is a hardware issue. Our support team can help you track down the issue.
