1. Summary
This document tells you how to add crontab entries to have your RAID controllers report X times a day on how they’re feeling. If they’re feeling well, the Subject lines of your email will look something like:
somehost RAIDSTATUS: OK
otherhost RAIDSTATUS: OK OK OK
anotherhost RAIDSTATUS: VERIFYING
Subject lines that contain "OK OK OK" indicate servers that have multiple RAIDs; "VERIFYING" indicates the instantaneous state of the RAID. The different commandline utilities have different vocabularies to indicate cleanliness or state. If you need a consistent vocabulary across them, you can spiff up the test commands with some regex matching to provide it, as in the sketch below.
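For example, a tiny shell wrapper along these lines (a sketch only - the status words are illustrative, taken from the outputs shown below, and the function name is my own invention) would fold the various vocabularies into a single OK / NOT-OK pair:

# sketch: map each vendor's status word onto a common vocabulary
# (the status words here are examples; check your controller's actual output)
normalize_status () {
  case "$1" in
    Optimal|OK|Normal|clean) echo "OK" ;;
    VERIFYING*|Verifying*)   echo "OK (verifying)" ;;
    *)                       echo "NOT-OK: $1" ;;
  esac
}
# e.g.: SUB=`normalize_status "$RAW_STATUS"`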
2. Introduction
This is a common problem - you’ve got a zillion Linux servers, each with its own RAID controller, and your skin is on the line if any of them loses data.
Each of the %$% controllers has a different Web interface, with different scheduling options and each of the %$#% things has a different commandline tool for probing the controller/raid guts - some of them are truly, phenomenally, execrably, vomitaciously awful.
The usual failure case is that:
- you haven’t configured the email to work, because all of the %$#%^ proprietary systems have their own mailing systems / setup and it’s a PITA to set them up and verify them (or you set them up and one of them silently loses the email service),
- and then the disks fail silently and the RAID degrades disk by disk until one morning you get the dreaded WTFIMD? email from a client and you have to explain that …. it’s complicated…
In an unusual attempt to be more efficient, I’ve put together a bunch of crontab scripts that will issue a report X times a day for these controllers:
- LSI, using the execrable MegaRAID MegaCli (I have recently heard from LSI that this was never meant to be seen by humans (only engineers) but was released for contractual reasons; since they bought 3ware, they will soon use 3ware’s much better tw_cli (see below)).
- 3ware, now part of LSI, but existing 3ware controllers still have their own (better) commandline utility - tw_cli.
- Adaptec, using the StorMan arcconf utility.
- Areca, using the cli64 utility.
- the special case of mdadm, which can run software RAID over most of the above controllers.
So here are my crontab entries for all of the RAID systems mentioned. If you’re at all interested in this, you’ll probably know what to modify.
3. The crontab entries
For the following entries, some things obviously have to be changed: the controller value in many cases, the subject line if you want to include more information, and of course the email recipient. Many of the vendors supply different versions of their software with different hardware; the version of the software that I reference is the one that I used.
Also, for email to work, you’ll have to have a mail agent such as exim4 or postfix running on each of the servers, generally pointing to an institutional SMTP server, as well as mutt, which the entries below use to send the reports.
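Before trusting the cron reports, it’s worth confirming that the whole mail path actually delivers; something like this (the Subject format just mimics the entries below) should land in your inbox:

# one-off test of the MTA + mutt path; <email_recipient> as in the entries below
echo "mail path test" | mutt -s "`hostname` RAIDSTATUS: test" <email_recipient>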
3.1. LSI
You’ll need the MegaRAID MegaCli utility - choose one or search from this page.
3.1.1. crontab entry
# m h dom mon dow   command
7 6,20 * * * HOST=`hostname`;SUB=`/opt/MegaRAID/MegaCli/MegaCli64 \
-CfgDsply -a0 |grep '^State'|tr -d ' '|cut -f2 -d:`; \
/opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 \
| mutt -s "$HOST RAIDSTATUS: $SUB" <email_recipient>
3.1.2. Output
The output for the LSI email is long; view it here.
3.2. 3ware
You’ll need the tw_cli utility, usually bundled with the rest of the 3ware 3DM2 web interface.
3.2.1. crontab entry
# m h dom mon dow   command
07 6,20 * * * HOST=`hostname`;SUB=`/usr/bin/tw_cli '/c2 show' \
| grep RAID| cut -c17-25`; /usr/bin/tw_cli '/c2 show'| mutt -s "$HOST \
RAIDSTATUS: $SUB" <email_recipient>
3.2.2. Output
Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK      -       -       256K    3259.56   ON     ON

Port   Status   Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK       u0     465.76 GB   976773168     3PM0A1BY
p1     OK       u0     465.76 GB   976773168     3PM0GH46
p2     OK       u0     465.76 GB   976773168     9QG06MT6
p3     OK       u0     465.76 GB   976773168     9QG06RDL
p4     OK       u0     465.76 GB   976773168     3PM0G6W6
p5     OK       u0     465.76 GB   976773168     3PM0G747
p6     OK       u0     465.76 GB   976773168     3PM0BVG2
p7     OK       u0     465.76 GB   976773168     3PM0CD2N

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    0      xx-xxx-xxxx
Although the tw_cli software can detect it, the 3DM2 monitoring software does not warn on reallocated sectors, so these may go unreported until the disk fails. An increasing reallocated-sector count is an indication that the disk is failing, so it’s good to have some warning of this. This sectorchek.pl script will detect and report reallocated sectors as well as any disk that is in any non-OK state - I had one instance where a disk went into an ECC ERROR state but the 3DM2 report didn’t mention it.
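If you’d rather roll your own, a rough sketch along these lines (assuming smartmontools is installed; the /dev/twa0 device name, the 8-port range, and the zero threshold are assumptions - adjust for your controller) will flag any disk whose reallocated-sector count is non-zero:

#!/bin/bash
# sketch: mail a warning if any disk behind a 3ware controller
# reports reallocated sectors; adjust device/ports for your setup
for PORT in 0 1 2 3 4 5 6 7; do
  RS=`smartctl -a -d 3ware,$PORT /dev/twa0 | awk '/Reallocated_Sector_Ct/ {print $NF}'`
  if [ -n "$RS" ] && [ "$RS" -gt 0 ]; then
    echo "port $PORT: $RS reallocated sectors" | \
      mutt -s "`hostname` 3ware DISK WARNING" <email_recipient>
  fi
done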
3.3. Adaptec Integrated controllers
You’ll need the Adaptec arcconf utility - choose the correct version from this page.
3.3.1. crontab entry
# m h dom mon dow   command
7 6,20 * * * HOST=`hostname`;SUB=`/usr/StorMan/arcconf GETCONFIG 1 \
|grep ' Status ' | cut -f2 -d: |tr -d '\n'`; /usr/StorMan/arcconf GETCONFIG 1 \
| mutt -s "$HOST RAIDSTATUS: $SUB" <email_recipient>
3.3.2. Output
The output of the Adaptec email is also fairly long; view it here.
3.4. Areca
You’ll need the Areca commandline interface, cli64.
3.4.1. crontab entry
# m h dom mon dow   command
7 6,20 * * * HOST=`hostname`;SUB=`/root/areca/cli64 rsf info raid=1 \
|grep "Raid Set State" | cut -f2 -d:`; /root/areca/cli64 rsf info raid=1 | \
mutt -s "$HOST Areca RAIDSTATUS: $SUB" <email_recipient>
3.4.2. Output
Raid Set Information
===========================================
Raid Set Name        : BDUC home
Member Disks         : 15
Total Raw Capacity   : 15000.0GB
Free Raw Capacity    : 0.0GB
Min Member Disk Size : 1000.0GB
Raid Set State       : Normal
===========================================
GuiErrMsg<0x00>: Success.
3.5. mdadm
mdadm deserves a special mention - it’s very simple to set up and administer; it was designed for sysadmins and commandline use. It has a simple, fairly understandable configuration file and a very good man page. Finally, the author, Neil Brown, is almost always available on the Linux RAID listserv and is willing to answer reasonable questions (and if he isn’t, many other mdadm users are). mdadm is a gem. You’ll need to change the md device (/dev/md0 below) and you’ll obviously need the mdadm software; it’s called mdadm in both Debian and RedHat repos.
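As a supplement to the crontab entry below, mdadm can also mail you on events all by itself via its monitor mode; a minimal sketch of the config (the file path varies by distro, and the ARRAY line just reuses the UUID from the example output below) looks like:

# /etc/mdadm/mdadm.conf (Debian) or /etc/mdadm.conf (RedHat)
MAILADDR <email_recipient>
ARRAY /dev/md0 UUID=164af41a:07a8c1c4:70b12226:52f0beed
# then run the monitor, e.g.: mdadm --monitor --scan --daemonise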
3.5.1. crontab entry
# m h dom mon dow   command
07 6,20 * * * HOST=`hostname`;SUB=`/sbin/mdadm -Q --detail /dev/md0 \
|grep 'State :' |cut -f2 -d:`; /sbin/mdadm -Q --detail /dev/md0 | \
mutt -s "$HOST RAIDSTATUS: $SUB" <email_recipient>
3.5.2. Output
/dev/md0:
        Version : 00.90
  Creation Time : Wed Nov 19 15:04:51 2008
     Raid Level : raid1
     Array Size : 192640 (188.16 MiB 197.26 MB)
  Used Dev Size : 192640 (188.16 MiB 197.26 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct  3 20:08:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 164af41a:07a8c1c4:70b12226:52f0beed
         Events : 0.446

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       17        1      active sync   /dev/sdb1
In my setups, the crontabs all fire at the same time so that all of the host reports come in (or don’t) simultaneously - it’s usually obvious which ones don’t. If you need better attendance counting, you can filter the reports into separate email folders so that it’s more obvious if you have crontab or email failures.
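For example, if procmail is your delivery agent, a recipe like this (the folder name is arbitrary) files all the reports into one place:

# ~/.procmailrc - collect all RAID status reports in one folder
:0:
* ^Subject:.*RAIDSTATUS
raid-reports/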
These scripts not only send you an OK / not-OK in the Subject header but also a roughly page-long summary of the RAID/controller status.
Corrections, additions, suggestions back to me.