Have your RAIDs tell you how they really feel
=============================================
by Harry Mangalam <harry.mangalam@uci.edu>
v1.01 - Jan 31, 2013
:icons:


//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// this file is converted to the HTML via the command:

// fileroot="/home/hjm/nacs/HOWTO_verify_RAID_by_crontab"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt ${fileroot}_1.png moo:~/public_html/bduc

// update svn from BDUC
// scp ${fileroot}.txt  hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to HOWTO_verify_RAID_by_crontab"'

// and push it to Wordpress:
// blogpost.py  update  -c HowTos ${fileroot}.txt

// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html

// run linkchecker on this file to verify links.

Summary
-------
This document tells you how to add http://www.adminschoice.com/crontab-quick-reference[crontab entries] to have your RAID controllers report X times a day how they're feeling. If they're feeling well, your email will look like this.
image:raidstatus_email_list.png[kmail view of the RAIDSTATUS folder from ~20 servers].

The Subject lines that contain "OK OK OK" indicate servers that have multiple RAIDs; 'VERIFYING" indicates the instantaneous state of the RAID. The different commandline utilities have different vocabularies to indicate cleanliness or state.  If you need absolute identity in the vocabulary, you can spiff up the test script with some regex matching to provide it.


Introduction
------------
This is a common problem - you've got a zillion Linux servers, each with its 
own RAID controller and your skin is on the line if any of them loses data.  

Each of the %$#%# controllers has a different Web interface, with different 
scheduling options and each of the %$#% things has a different commandline 
tool for probing the controller/raid guts - some of them are truly, 
phenomenally, execrably, vomitaciously awful.  

The usual failure case is that:

* you haven't configured the email to work because all of the %$#%^ proprietary systems have their own mailing systems / setup and its a PITA to set them up and verify them (or you set them up and one of them silently loses the email service).
* and then the disks fail silently and the RAID degrades disk by disk until one morning you get the dreaded *WTFIMD?* email from a client and you have to explain that .... it's complicated... 

In an unusual attempt to be more efficient, I've put together a bunch of crontab scripts that will issue a report X times a day for these controllers:

- *LSI*, using the execreble MegaRaid 'MegaCli' (I have recently heard from LSI that this was never meant to be seen by humans (only engineers) but was released contractual reasons.  Since they bought 3ware, they will soon use 3ware's much better 'tw_cli' (see below).
- *3ware*, now part of LSI, but existing 3ware controllers still have their own, (better) commandline utility - 'tw_cli'
- *Adaptec*, using the StorMan 'arcconf' utility
- *Areca*, using the 'cli64' 
- the special case of *mdadm*, which can run software RAID over most of the above controllers.

So here are my crontab entries for all of the RAID systems mentioned.  If 
you're at all interested in this, you'll probably know what to modify

The crontab entries
-------------------
For the following entries, some things obviously have to be changed: the 'controller value' in many cases, the 'subject line' if you want to include more example, and of course the email recipient
Many of the vendors supply different versions of their software with different hardware; the version of the software that I reference is the one that I used.

Also, for email to work, you'll have to have a mail agent such as http://www.exim.org/[exim4] or http://www.postfix.org/[postfix] running each of the servers, generally pointing to an institutional SMTP server as well as mutt.

LSI
~~~
You'll need the MegaRaid utility - http://goo.gl/trxWX[choose one or search from this page].

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h  dom mon dow   command
7  6,20   *   *   *  HOST=`hostname`;SUB=`/opt/MegaRAID/MegaCli/MegaCli64 \
-CfgDsply -a0 |grep '^State'|tr -d ' '|cut -f2 -d:`; \
/opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 \
| mutt -s "$HOST RAIDSTATUS: $SUB" <email_recipient>
--------------------------------------------------------------------------

Output
^^^^^^
The output for the LSI email is long;  http://moo.nac.uci.edu/~hjm/bduc/LSI_MegaRAID_SAS_9261-8i_output.txt[view it here].


3ware
~~~~~
You'll need the http://www.lsi.com/downloads/Public/SATA/SATA%20Common%20Files/3DM2_CLI-Linux_10.2.1_9.5.4.zip[tw_cli utility], usually bundled with the rest of the 3ware 3DM2 web interface.

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h     dom mon dow  command
07  6,20   *   *   *   HOST=`hostname`;SUB=`/usr/bin/tw_cli '/c2 show' \
| grep RAID| cut -c17-25`; /usr/bin/tw_cli '/c2 show'| mutt -s "$HOST \
RAIDSTATUS: $SUB" <email_recipient>
--------------------------------------------------------------------------

Output
^^^^^^
--------------------------------------------------------------------------
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       256K    3259.56   ON     ON     

Port   Status           Unit   Size        Blocks        Serial
 ---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     3PM0A1BY            
p1     OK               u0     465.76 GB   976773168     3PM0GH46            
p2     OK               u0     465.76 GB   976773168     9QG06MT6            
p3     OK               u0     465.76 GB   976773168     9QG06RDL            
p4     OK               u0     465.76 GB   976773168     3PM0G6W6            
p5     OK               u0     465.76 GB   976773168     3PM0G747            
p6     OK               u0     465.76 GB   976773168     3PM0BVG2            
p7     OK               u0     465.76 GB   976773168     3PM0CD2N            

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
 ---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx  
---------------------------------------------------------------------------

Altho the tw_cli software can detect it, the 3Dm2 monitoring software does not warn on 'reallocated sectors' so these may go unreported until the disk fails.  Increasing reallocated sectors are http://serverfault.com/questions/387765/how-to-use-a-disk-with-high-reallocated-sector-count[an indication that the disk is failing], so it's good to have some warning of this.  http://moo.nac.uci.edu/~hjm/sectorchek.pl[This 'sectorchek.pl' script] will detect and report reallocated sectors as well as any disk that is in any 'non-OK' state - I had one instance where a disk went into an 'ECC ERROR' state but the 3DM2 report didn't mention it.


Adaptec Integrated controllers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You'll need the 'Adaptec arcconf' utility - choose the correct version from http://goo.gl/ln9ji[this page].

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m   h   dom mon dow   command
7    6,20  *   *   *    HOST=`hostname`;SUB=`/usr/StorMan/arcconf GETCONFIG 1 \
|grep ' Status  ' | cut -f2 -d: |tr -d '\n'`; /usr/StorMan/arcconf GETCONFIG 1 \
| mutt -s "$HOST RAIDSTATUS: $SUB" <email_recipient>
--------------------------------------------------------------------------

Output
^^^^^^
The output of the Adaptec email is also fairly long; http://moo.nac.uci.edu/~hjm/bduc/Adaptec_StorMan_arcconf_output.txt[view it here].


Areca
~~~~~

You'll need the http://www.areca.us/support/s_linux/cli/linuxcli_V1.10.0_120815.zip[Areca commandline interface]

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m  h   dom mon dow  command
7   6,20  *   *   *   HOST=`hostname`;SUB=`/root/areca/cli64 rsf info raid=1 \
|grep "Raid Set State" | cut -f2 -d:`; /root/areca/cli64 rsf info raid=1 | \
mutt -s "$HOST Areca RAIDSTATUS: $SUB" <email_recipient>
--------------------------------------------------------------------------

Output
^^^^^^
--------------------------------------------------------------------------
Raid Set Information 
 ===========================================
Raid Set Name        : BDUC home       
Member Disks         : 15
Total Raw Capacity   : 15000.0GB
Free Raw Capacity    : 0.0GB
Min Member Disk Size : 1000.0GB
Raid Set State       : Normal
 ===========================================
GuiErrMsg<0x00>: Success.
--------------------------------------------------------------------------


mdadm
~~~~~
mdadm deserves a special mention - it's very simple to set up and administer; it was designed for sysadmins and commandline use.  It has a simple, fairly understandable configuration file and http://linux.die.net/man/8/mdadm[very good man page].  Finally, the author, Neil Brown is almost always available on the https://raid.wiki.kernel.org/index.php/Linux_Raid[Linux RAID listserv] and is willing to answer reasonable questions (and if he isn't, many other mdadm users are.  mdadm is a gem.
You'll need to change the md device ('/dev/md0' below) and you'll obviously need the mdadm software; it's called 'mdadm' in both Debian and RedHat repo's.

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m  h    dom mon dow  command
07  6,20   *   *   *   HOST=`hostname`;SUB=`/sbin/mdadm -Q --detail /dev/md0 \
|grep 'State :' |cut -f2 -d:`; /sbin/mdadm -Q --detail /dev/md0 | \
mutt -s "$HOST RAIDSTATUS: $SUB" <email_recipient>
--------------------------------------------------------------------------

Output
^^^^^^
--------------------------------------------------------------------------
/dev/md0:
        Version : 00.90
  Creation Time : Wed Nov 19 15:04:51 2008
     Raid Level : raid1
     Array Size : 192640 (188.16 MiB 197.26 MB)
  Used Dev Size : 192640 (188.16 MiB 197.26 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct  3 20:08:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 164af41a:07a8c1c4:70b12226:52f0beed
         Events : 0.446

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       17        1      active sync   /dev/sdb1
--------------------------------------------------------------------------


In my setups, the crontabs all fire at the same time so that all of the hosts come in (or don't) simultaneously - it's usually obvious which ones don't.  If you need better attendance counting, you can filter them into separate email folders so that it's more obvious if you have crontab or email failures.

These scripts not only send you an 'OK' / not 'OK' in the subject header but ~ 
a page-ish summary of the RAID/controller status.

Corrections, additions, suggestions back to me.