Monitoring Hard Drive and RAID Health

By default, you won’t find out that one of your hard drives has failed until the data is gone. Even if you are using a software or hardware RAID, it will only continue to function if you replace failed drives. I have seen RAIDs run in degraded mode for months or years until additional drive failures ruined any chance of data recovery.

Drives and operating systems are designed to work around issues as best they can until absolute failure. However, that doesn’t mean that you can’t monitor the situation and receive an alert as soon as the first problem develops.

If you do not have a dedicated hardware RAID controller, there are two utilities to be configured and started: smartd and mdadm. The smartd daemon reads hard drive S.M.A.R.T. health data directly off the drives and sends alerts of any changes. Similarly, mdadm watches the health of your Linux software RAIDs for any problems.
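
If you want to spot-check a drive by hand, the smartctl utility (part of the same smartmontools package that provides smartd) reads the same data on demand. A minimal check, assuming the drive is /dev/sda:

smartctl -H /dev/sda    (overall health assessment)
smartctl -a /dev/sda    (full SMART attributes, error log and self-test results)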

If you are using a hardware RAID controller, then it manages some of these tasks. However, you must be sure to properly configure the automated alerts within the controller’s management interface – check the manual for full instructions. Additionally, you may be able to monitor hard drive health data if the controller supports it (3ware and ARECA cards are known to work) – see the smartd man page.
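
For illustration, smartd.conf entries for drives behind a 3ware or ARECA controller might look like the lines below. These are adapted from the smartd man page examples; the device nodes and port/slot numbers are assumptions and will differ on your system:

/dev/twa0 -d 3ware,0 -a -m eliot@example.com    (drive on port 0 of a 3ware controller)
/dev/sg2 -d areca,1 -a -m eliot@example.com     (drive in slot 1 of an ARECA controller)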

smartd

Here are the standard entries I use in /etc/smartd.conf:

/dev/sda -a -d ata -m eliot@example.com -H -l error -l selftest -M test -o on -S on -s (S/../../3/03|L/../15/./04)
/dev/sdb -a -d ata -m eliot@example.com -H -l error -l selftest -o on -S on -s (S/../../3/04|L/../15/./05)

These lines are fairly convoluted. In this example, they monitor drives /dev/sda and /dev/sdb by performing the following tasks (a flag-by-flag breakdown follows the list):

  • E-mail all alerts to eliot@example.com
  • Send one test e-mail upon startup
  • Watch for any critical failure warnings in the SMART data
  • Monitor the results of hard drive self tests
  • Enable Automatic Offline Testing of the drives
  • Run a short self test on each drive once a week (3am for sda; 4am for sdb)
  • Run a long self test on each drive once a month (4am for sda; 5am for sdb)
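
For reference, this is roughly how those tasks map onto the individual flags (a sketch; the smartd.conf man page has the authoritative descriptions):

-a                              (enable the default set of monitoring checks)
-d ata                          (treat the device as a plain ATA drive)
-m eliot@example.com            (e-mail alerts to this address)
-H                              (check the SMART overall health status)
-l error -l selftest            (watch the SMART error log and self-test log)
-M test                         (send one test e-mail when smartd starts; present only on the sda line, so only one test message goes out)
-o on                           (enable Automatic Offline Testing)
-S on                           (enable SMART attribute autosave)
-s (S/../../3/03|L/../15/./04)  (self-test schedule: S = weekly short test on weekday 3, L = monthly long test on the 15th; the last field is the hour)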

Once you have written the configuration file, you need to start the service:

/etc/init.d/smartd start

To ensure the service starts at boot, you’ll need to add it to the boot sequence. The exact command depends upon your Linux distribution:

chkconfig --add smartd        (Red Hat, Fedora and SUSE)
rc-update add smartd default  (Gentoo)
update-rc.d smartd defaults   (Debian)
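
On newer distributions that have switched to systemd, the init-script and chkconfig steps are replaced by a single systemctl call. The unit name varies (smartd on Red Hat style systems, smartmontools on Debian/Ubuntu), so check what your distribution ships:

systemctl enable --now smartd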

mdadm

To monitor Linux software RAIDs, you’ll need at least the following lines in /etc/mdadm.conf:

DEVICE /dev/sd[ab]1 /dev/sd[ab]5 /dev/sd[ab]6 /dev/sd[ab]7 /dev/sd[ab]8 /dev/sd[ab]10

ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1
ARRAY /dev/md5 devices=/dev/sda5,/dev/sdb5
ARRAY /dev/md6 devices=/dev/sda6,/dev/sdb6
ARRAY /dev/md7 devices=/dev/sda7,/dev/sdb7
ARRAY /dev/md8 devices=/dev/sda8,/dev/sdb8
ARRAY /dev/md10 devices=/dev/sda10,/dev/sdb10

MAILADDR eliot@example.com

Using this example, any changes to the listed md devices will be immediately e-mailed to eliot@example.com.

Note that some newer versions of mdadm expect arrays to be identified by UUID (e.g. f4849d33:f8c1ce1c:ac28ac18:9d4741e7) rather than by a devices= list of member partitions. If this is the case, run mdadm --detail /dev/md1 for each RAID to find its UUID.
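
In that case, the easiest approach is usually to let mdadm generate the ARRAY lines itself and paste the output into /etc/mdadm.conf (review it before saving; this assumes the arrays are already assembled and running):

mdadm --detail --scan    (prints one ARRAY line per array, including its UUID)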

Once you have written the configuration file, you need to start the service. Some distributions use the name mdadm and others use mdmonitor:

/etc/init.d/mdadm start
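
To confirm that alert e-mails actually reach you, mdadm can send a test message for every array listed in the configuration file:

mdadm --monitor --scan --oneshot --test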

To ensure the service starts at boot, you’ll need to add it to the boot sequence. The exact command depends upon your Linux distribution:

chkconfig --add mdadm        (Red Hat, Fedora and SUSE)
rc-update add mdadm default  (Gentoo)
update-rc.d mdadm defaults   (Debian)

As always, be certain that you use your own e-mail address and the names of the actual hard drives and arrays in your system.

Managing a Linux Software RAID with MDADM

There are several advantages to assembling hard drives into a RAID: performance, redundancy and capacity. Microway workstations and servers are most commonly outfitted with software RAID to prevent a single drive failure from destroying your operating system installation. In most cases, the RAID is built from two hard drives, but you may also find software RAID on systems with up to six drives. If you have a larger storage server, a hardware RAID manages the hard drives.

Linux provides a robust software RAID implementation which costs nothing and offers great performance for the simpler RAID levels (e.g. 0, 1, 10). It is flexible and powerful, but array monitoring and management can be opaque if you’ve not previously worked with a Linux software RAID.

Software RAID Introduction

Linux software RAID depends on two components:

  1. The Linux kernel, which operates the arrays
  2. The mdadm utility, which creates and manages the arrays

As a user, you need not worry much about #1. The only fact you need to know is that the kernel keeps a live printout of array status in the dynamic text file /proc/mdstat. You may check the status of all arrays by checking the contents of that file – either with your favorite text editor or a simple cat /proc/mdstat.

To properly maintain your arrays, you’ll need to learn some basics of the mdadm RAID management utility. Most commands should be fairly straightforward, but check the mdadm man page for full details. Microway customers are welcome to contact technical support for assistance at any point.

Traditional hardware RAIDs reserve the full capacity of each hard drive in the array. However, Linux software RAIDs may be built using either an entire drive or individual partitions of a hard drive. This allows us more flexibility, such as creating a redundant RAID1 mirror for the /home partition while using a faster RAID0 stripe for /tmp. You will typically see up to 10 partitions on each drive, such as sda1/sdb1, sda2/sdb2, ..., sda9/sdb9, sda10/sdb10. These are used to build the corresponding RAID devices md1, md2, ..., md9, md10. By default, Microway installations use partitions 1, 5, 6, 7, 8, 9, 10.

The following examples assume a software RAID1 mirror of two hard drives, which is the most common configuration. Only minor changes should be needed to perform maintenance on other arrays, but take care. Dangerous commands (which could cause data loss) are marked in red.

Checking Array Health

To be certain you are alerted to drive failures, set up automated alerts for hard drive and array failures. As mentioned above, manually checking the status of a software array is easy. Simply check the contents of /proc/mdstat:

eliot@penguin:~$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
 194496 blocks [2/2] [UU]
unused devices: <none>

Breaking down the contents by line, we see:

  1. Personalities: reports which RAID levels are running (e.g. raid0, raid1, raid10, etc). It’s safe to ignore this line.
  2. md1: the status of the first array, the type of the array, and each member of the array.
  3. md1: the size of the first array, # of members active/# of members total, and the status of each member. If a drive has failed, it will be shown on this line: instead of [2/2] [UU] you will see something like [1/2] [_U] or [1/2] [U_], indicating that one of the two members is no longer operating.
  4. List of unused devices (drives or partitions). There’s usually nothing interesting here.

If your system has experienced a drive failure, Linux kernel error messages will be logged. You will see them in the /var/log/messages file or by running dmesg. Be certain you know which drive has failed before taking further steps.
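
For example, assuming the suspect drive is /dev/sdb, the following will usually surface the relevant kernel messages:

dmesg | grep -i sdb
grep -i sdb /var/log/messages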

Replacing a Failed Hard Drive

Because RAID offers redundancy, it is not necessary to take the storage offline to repair the RAID. The commands below may be issued while the system is operating normally. However, a heavily-loaded system will take much more time to complete the repair.

Before installing the new drive, you will need to remove the failed drive. You can see which drive failed by looking at the contents of /proc/mdstat or consulting Linux kernel message logs. If you are working on a live system, be absolutely certain you remove the correct failed drive. Microway customers should contact tech support with any questions.
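
If the kernel has not already dropped the failed drive’s partitions from every array, mark them as failed and remove them before pulling the drive. The commands below are a sketch for Microway’s default layout, assuming /dev/sdb is the failed drive; pointing them at the wrong drive will degrade a healthy array, so double-check the device name first:

mdadm /dev/md1 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md5 --fail /dev/sdb5 --remove /dev/sdb5
mdadm /dev/md6 --fail /dev/sdb6 --remove /dev/sdb6
mdadm /dev/md7 --fail /dev/sdb7 --remove /dev/sdb7
mdadm /dev/md8 --fail /dev/sdb8 --remove /dev/sdb8
mdadm /dev/md10 --fail /dev/sdb10 --remove /dev/sdb10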

I’m assuming that /dev/sda and /dev/sdb are the two drives currently running. If this system does not use Microway’s default partitioning (partitions 1, 5, 6, 7, 8, 9, 10) you will need to adjust the commands.

The software RAID operates at a level below the filesystem, so you do not need to re-create any filesystems. However, you do have to get the partitioning right. Once the replacement drive is installed, the partitioning can be copied from the working drive. This example assumes that sda is the operating drive and sdb is a replacement for the drive that failed:

sfdisk -d /dev/sda | sfdisk /dev/sdb
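
The sfdisk command above covers the MBR partition tables these installations typically use. If the drives are partitioned with GPT instead, the equivalent copy can be done with sgdisk from the gdisk package (a sketch, again assuming sda is the good drive and sdb is the replacement):

sgdisk -R=/dev/sdb /dev/sda    (replicate sda's partition table onto sdb)
sgdisk -G /dev/sdb             (give sdb fresh random disk and partition GUIDs)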

Once the partitions of the two drives match, you can add the new drive into the mirror. This has to be done partition by partition. Microway’s defaults are below (assuming sdb was the failed drive):

mdadm /dev/md1 --add /dev/sdb1
mdadm /dev/md5 --add /dev/sdb5
mdadm /dev/md6 --add /dev/sdb6
mdadm /dev/md7 --add /dev/sdb7
mdadm /dev/md8 --add /dev/sdb8
mdadm /dev/md10 --add /dev/sdb10

(you can check the status of the sync in /proc/mdstat)
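
If you would rather watch the rebuild progress update itself, the standard watch utility works well:

watch -n 5 cat /proc/mdstat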

One partition (number 9 in Microway’s default layout) is not part of any array; it is used for Linux swap and simply needs to be re-initialized:

mkswap /dev/sdb9
swapon /dev/sdb9

To make the replacement drive bootable, the GRUB bootloader installer will need to be run:

root@penguin:~# grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
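
That session uses the legacy GRUB shell these installations shipped with. On a system that boots with GRUB 2, the bootloader is normally reinstalled onto the replacement drive with a single command (assuming BIOS/MBR booting and /dev/sdb as the new drive):

grub-install /dev/sdb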

Continue to check /proc/mdstat as the arrays sync. The time required to sync is usually several hours per terabyte, although a heavily-loaded system will take longer.
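
If the resync is unreasonably slow (or is starving the rest of the system), the kernel’s rebuild speed limits can be inspected and adjusted. The values are in KB/s; the numbers below are only an example:

sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
sysctl -w dev.raid.speed_limit_min=50000    (raise the minimum rebuild rate to about 50 MB/s)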

To make your job easier, set up automated alerts for hard drive and array failures.
