Have your RAIDs tell you how they really feel
=============================================
by Harry Mangalam v1.01 - Jan 31, 2013
:icons:

//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// this file is converted to the HTML via the command:
// fileroot="/home/hjm/nacs/HOWTO_verify_RAID_by_crontab"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.html ${fileroot}.txt ${fileroot}_1.png moo:~/public_html/bduc
// update svn from BDUC
// scp ${fileroot}.txt hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to HOWTO_verify_RAID_by_crontab"'
// and push it to Wordpress:
// blogpost.py update -c HowTos ${fileroot}.txt
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html
// run linkchecker on this file to verify links.

Summary
-------
This document tells you how to add http://www.adminschoice.com/crontab-quick-reference[crontab entries] to have your RAID controllers report X times a day how they're feeling. If they're feeling well, your email will look like this.

image:raidstatus_email_list.png[kmail view of the RAIDSTATUS folder from ~20 servers]

The Subject lines that contain "OK OK OK" indicate servers that have multiple RAIDs; 'VERIFYING' indicates the instantaneous state of the RAID. The different commandline utilities have different vocabularies to indicate cleanliness or state. If you need the vocabulary to be identical across controllers, you can spiff up the test script with some regex matching to provide it.

Introduction
------------
This is a common problem - you've got a zillion Linux servers, each with its own RAID controller, and your skin is on the line if any of them loses data.
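The vocabulary normalization suggested in the Summary can be sketched as a small shell function. This is a hedged sketch, not part of the crontab entries below: the status words matched here ('Optimal', 'OK', 'Normal', 'clean', 'VERIFYING', ...) are examples drawn from the controllers covered in this document, and your own controllers may emit others.

```shell
#!/bin/sh
# Map the different controllers' status vocabularies onto a single
# OK / VERIFYING / PROBLEM vocabulary for the Subject line.
# The words matched below are examples; check your own controllers' output.
normalize() {
    case "$1" in
        Optimal|OK|Normal|clean)      echo "OK" ;;
        Verifying|VERIFYING|Checking) echo "VERIFYING" ;;
        *)                            echo "PROBLEM: $1" ;;
    esac
}

normalize "Optimal"    # -> OK
normalize "Degraded"   # -> PROBLEM: Degraded
```

You'd then use `SUB=\`... | xargs -I{} sh -c 'normalize {}'\`` (or simply call the function from a wrapper script) instead of feeding the raw controller string to mutt's `-s`.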
Each of the %$#%# controllers has a different Web interface, with different scheduling options, and each of the %$#% things has a different commandline tool for probing the controller/RAID guts - some of them are truly, phenomenally, execrably, vomitaciously awful.

The usual failure case is that:

* you haven't configured the email to work, because all of the %$#%^ proprietary systems have their own mailing systems/setup and it's a PITA to set them up and verify them (or you set them up and one of them silently loses its email service);
* and then the disks fail silently and the RAID degrades disk by disk until one morning you get the dreaded *WTFIMD?* email from a client and you have to explain that ... it's complicated ...

In an unusual attempt to be more efficient, I've put together a bunch of crontab scripts that will issue a report X times a day for these controllers:

- *LSI*, using the execrable MegaRAID 'MegaCli' (I have recently heard from LSI that this was never meant to be seen by humans (only engineers) but was released for contractual reasons; since they bought 3ware, they will soon use 3ware's much better 'tw_cli' - see below)
- *3ware*, now part of LSI, but existing 3ware controllers still have their own (better) commandline utility, 'tw_cli'
- *Adaptec*, using the StorMan 'arcconf' utility
- *Areca*, using the 'cli64' utility
- the special case of *mdadm*, which can run software RAID over most of the above controllers

So here are my crontab entries for all of the RAID systems mentioned. If you're at all interested in this, you'll probably know what to modify.

The crontab entries
-------------------
For the following entries, some things obviously have to be changed: the 'controller value' in many cases, the 'subject line' if you want to include more information, and of course the email recipient. Many of the vendors supply different versions of their software with different hardware; the version of the software that I reference is the one that I used.
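All of the entries follow the same pattern: capture a one-line status for the Subject, then mail the controller's full report as the body. A generic sketch of that pattern - '/path/to/raidtool', its query arguments, and the recipient are placeholders, not a real tool:

--------------------------------------------------------------------------
# m h dom mon dow command
7 6,20 * * * HOST=`hostname`;SUB=`/path/to/raidtool <status-query> \
|grep '<state-line>' | cut -f2 -d:`; /path/to/raidtool <full-report> \
| mutt -s "$HOST RAIDSTATUS: $SUB" <recipient>
--------------------------------------------------------------------------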
Also, for email to work, you'll have to have a mail agent such as http://www.exim.org/[exim4] or http://www.postfix.org/[postfix] running on each of the servers, generally pointing to an institutional SMTP server, as well as the 'mutt' mail client.

LSI
~~~
You'll need the MegaRAID utility - http://goo.gl/trxWX[choose one or search from this page].

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h dom mon dow command
7 6,20 * * * HOST=`hostname`;SUB=`/opt/MegaRAID/MegaCli/MegaCli64 \
-CfgDsply -a0 |grep '^State'|tr -d ' '|cut -f2 -d:`; \
/opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 \
| mutt -s "$HOST RAIDSTATUS: $SUB"
--------------------------------------------------------------------------

Output
^^^^^^
The output for the LSI email is long; http://moo.nac.uci.edu/~hjm/bduc/LSI_MegaRAID_SAS_9261-8i_output.txt[view it here].

3ware
~~~~~
You'll need the http://www.lsi.com/downloads/Public/SATA/SATA%20Common%20Files/3DM2_CLI-Linux_10.2.1_9.5.4.zip[tw_cli utility], usually bundled with the rest of the 3ware 3DM2 web interface.
crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h dom mon dow command
07 6,20 * * * HOST=`hostname`;SUB=`/usr/bin/tw_cli '/c2 show' \
| grep RAID| cut -c17-25`; /usr/bin/tw_cli '/c2 show'| mutt -s "$HOST \
RAIDSTATUS: $SUB"
--------------------------------------------------------------------------

Output
^^^^^^
--------------------------------------------------------------------------
Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK      -       -       256K    3259.56   ON     ON

Port   Status   Unit   Size        Blocks      Serial
---------------------------------------------------------------
p0     OK       u0     465.76 GB   976773168   3PM0A1BY
p1     OK       u0     465.76 GB   976773168   3PM0GH46
p2     OK       u0     465.76 GB   976773168   9QG06MT6
p3     OK       u0     465.76 GB   976773168   9QG06RDL
p4     OK       u0     465.76 GB   976773168   3PM0G6W6
p5     OK       u0     465.76 GB   976773168   3PM0G747
p6     OK       u0     465.76 GB   976773168   3PM0BVG2
p7     OK       u0     465.76 GB   976773168   3PM0CD2N

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    0      xx-xxx-xxxx
--------------------------------------------------------------------------

Although the tw_cli software can detect it, the 3DM2 monitoring software does not warn on 'reallocated sectors', so these may go unreported until the disk fails. Increasing reallocated sectors are http://serverfault.com/questions/387765/how-to-use-a-disk-with-high-reallocated-sector-count[an indication that the disk is failing], so it's good to have some warning of this. http://moo.nac.uci.edu/~hjm/sectorchek.pl[This 'sectorchek.pl' script] will detect and report reallocated sectors as well as any disk that is in any non-OK state - I had one instance where a disk went into an 'ECC ERROR' state but the 3DM2 report didn't mention it.
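If you don't want the full 'sectorchek.pl', the core of the check can be sketched with smartmontools' 'smartctl'. This is only a sketch: the field position assumes the usual 'smartctl -A' attribute-table layout, and the sweep at the bottom (device glob, recipient) is an example - verify against your own drives before trusting it.

```shell
#!/bin/sh
# Extract the raw Reallocated_Sector_Ct value (the last field of the
# attribute line) from `smartctl -A` output; 0 is healthy, and a growing
# count is a warning sign that the disk is on its way out.
realloc_count() {
    grep Reallocated_Sector_Ct | awk '{print $10}'
}

# Example sweep (commented out - needs root and smartmontools installed):
# for d in /dev/sd?; do
#     n=`smartctl -A $d | realloc_count`
#     [ "${n:-0}" -gt 0 ] && echo "$d: $n reallocated sectors" \
#         | mutt -s "`hostname` DISK WARNING: $d"
# done
```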
Adaptec Integrated controllers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You'll need the 'Adaptec arcconf' utility - choose the correct version from http://goo.gl/ln9ji[this page].

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h dom mon dow command
7 6,20 * * * HOST=`hostname`;SUB=`/usr/StorMan/arcconf GETCONFIG 1 \
|grep ' Status ' | cut -f2 -d: |tr -d '\n'`; /usr/StorMan/arcconf GETCONFIG 1 \
| mutt -s "$HOST RAIDSTATUS: $SUB"
--------------------------------------------------------------------------

Output
^^^^^^
The output of the Adaptec email is also fairly long; http://moo.nac.uci.edu/~hjm/bduc/Adaptec_StorMan_arcconf_output.txt[view it here].

Areca
~~~~~
You'll need the http://www.areca.us/support/s_linux/cli/linuxcli_V1.10.0_120815.zip[Areca commandline interface].

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h dom mon dow command
7 6,20 * * * HOST=`hostname`;SUB=`/root/areca/cli64 rsf info raid=1 \
|grep "Raid Set State" | cut -f2 -d:`; /root/areca/cli64 rsf info raid=1 | \
mutt -s "$HOST Areca RAIDSTATUS: $SUB"
--------------------------------------------------------------------------

Output
^^^^^^
--------------------------------------------------------------------------
Raid Set Information
===========================================
Raid Set Name        : BDUC home
Member Disks         : 15
Total Raw Capacity   : 15000.0GB
Free Raw Capacity    : 0.0GB
Min Member Disk Size : 1000.0GB
Raid Set State       : Normal
===========================================
GuiErrMsg<0x00>: Success.
--------------------------------------------------------------------------

mdadm
~~~~~
mdadm deserves a special mention - it's very simple to set up and administer; it was designed for sysadmins and commandline use. It has a simple, fairly understandable configuration file and a http://linux.die.net/man/8/mdadm[very good man page].
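As an aside, mdadm also has a built-in monitor mode that can mail you on array events, independent of any crontab. A minimal sketch - the address is an example, and the config file path differs between distros:

--------------------------------------------------------------------------
# in /etc/mdadm/mdadm.conf (Debian) or /etc/mdadm.conf (RedHat):
MAILADDR you@example.com

# then run the monitor as a daemon, scanning all arrays it knows about:
/sbin/mdadm --monitor --scan --daemonise
--------------------------------------------------------------------------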
Finally, the author, Neil Brown, is almost always available on the https://raid.wiki.kernel.org/index.php/Linux_Raid[Linux RAID listserv] and is willing to answer reasonable questions (and if he isn't, many other mdadm users are). mdadm is a gem.

You'll need to change the md device ('/dev/md0' below) and you'll obviously need the mdadm software; the package is called 'mdadm' in both the Debian and RedHat repos.

crontab entry
^^^^^^^^^^^^^
--------------------------------------------------------------------------
# m h dom mon dow command
07 6,20 * * * HOST=`hostname`;SUB=`/sbin/mdadm -Q --detail /dev/md0 \
|grep 'State :' |cut -f2 -d:`; /sbin/mdadm -Q --detail /dev/md0 | \
mutt -s "$HOST RAIDSTATUS: $SUB"
--------------------------------------------------------------------------

Output
^^^^^^
--------------------------------------------------------------------------
/dev/md0:
        Version : 00.90
  Creation Time : Wed Nov 19 15:04:51 2008
     Raid Level : raid1
     Array Size : 192640 (188.16 MiB 197.26 MB)
  Used Dev Size : 192640 (188.16 MiB 197.26 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct  3 20:08:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 164af41a:07a8c1c4:70b12226:52f0beed
         Events : 0.446

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       17        1      active sync   /dev/sdb1
--------------------------------------------------------------------------

In my setups, the crontabs all fire at the same time, so all of the host reports come in (or don't) simultaneously - it's usually obvious which ones don't. If you need better attendance counting, you can filter them into separate email folders so that crontab or email failures are more obvious. These scripts not only send you an 'OK' / not-'OK' in the subject header but also about a page of summary of the RAID/controller status.

Corrections, additions, suggestions back to me.
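PS: for the folder filtering mentioned above, a procmail recipe is one option - a sketch, assuming procmail delivers your mail and a Maildir-format folder named 'RAIDSTATUS' (the trailing slash requests Maildir delivery):

--------------------------------------------------------------------------
# ~/.procmailrc fragment: file the reports into their own folder
:0
* ^Subject:.*RAIDSTATUS
RAIDSTATUS/
--------------------------------------------------------------------------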