1. Summary

Like those of most institutions, university storage requirements are always increasing. Fortunately, storage hardware prices are dropping rapidly, currently about 25-30 cents/GB for raw SATA disk storage. Motherboards with multiple CPU cores and multiple ethernet interfaces are now commonplace, and full high-performance (albeit generic) storage systems can be had for less than $1000/TB. We refer to these devices as Storage Bricks.

In contrast, prices for proprietary, integrated, highly reliable storage systems are still quite high, approximately 4-8x the cost of the generic Storage Brick. The differences between such generic Storage Bricks and the products offered by high-end commercial storage vendors such as Network Appliance, EMC, or BlueArc lie in both proprietary hardware (such as bus structure and hardware accelerators) and software (such as proprietary, advanced Logical Volume Managers). Such software adds specialized features like snapshotting, volume migration, on-the-fly data rearrangement, RAID idling, special-purpose error checking, and more. There is also something of a lifestyle choice in choosing one of the proprietary vendors: you will have so much invested in that particular proprietary path that it will be difficult to change. To do so will be not unlike a divorce.

However, much of the storage that a research university requires maps well to lower-end hardware, especially if, administratively, smaller departments or even labs are responsible for their own storage. As such, there is little advantage in investing in the expensive base costs of the proprietary servers mentioned above. Therefore, I set out to see how much performance and reliability could be bought for the fewest dollars.

An 8-port and a 16-port hardware PCI-e RAID controller from Areca and 3ware were tested under a variety of configurations and applications in order to evaluate what a relatively cheap platform could provide in terms of performance and reliability for the money. The controllers from Areca were the 8-port ARC-1220 and the 16-port ARC-1261ML. The controllers from 3ware were the 8-port and 16-port models based on the 9650SE.

Besides the disk controllers, variables tested included: RAID type, number of disks in the RAID, 4 popular Linux filesystems (Ext3, Reiserfs, JFS, and XFS), filesystem initialization, RAM included on the card, effect of system RAM, and readahead size. These were tested using 4 application benchmarks: the Bonnie++ and IOZONE disk benchmarking suites, the netCDF operator utility ncecat, and the Linux 2.6.21.3 kernel compile.

All the cards performed very well - bandwidth on large writes to a 16-disk RAID6 array was measured at >2GB/s on a large-memory system and up to ~800MB/s on a RAM-constrained system. Large reads were slower, reflecting the ability of writes to be partially cached, but were still measured at up to 570MB/s on a 16-disk RAID6. Small reads and writes were significantly slower (~60MB/s), reflecting the seek overhead in complicated read/write patterns. Not surprisingly, the 16-port cards were both more cost-effective and higher-performing. There was no clear winner between 3ware and Areca - each had areas of slightly better performance.

The filesystems all performed well, but the JFS and XFS filesystems performed best on large disk IO and did very well in the other tests as well. JFS and XFS both initialized even 7.5TB filesystems essentially instantaneously. Reiserfs was best under a variety of conditions on small-file, I/O intensive operations as represented by the kernel compile test. The Ext3 filesystem uses the same on-disk structure as the non-journaling ext2 filesystem, and thus requires initialization time proportional to the RAID size, ~17m for a 3.5TB RAID5 filesystem. It also performed worst on large data reads and writes and so would be a poor choice for a data volume.

The variable that contributed most to better performance was the amount of system RAM. The more motherboard RAM your storage device has, the faster it will perform in almost all circumstances.

2. Introduction & Rationale

Digital information storage is increasingly important at all institutions; universities are no exception. Experimental equipment generates increasing amounts of digital data, the social sciences increasingly rely on digital archives, and many researchers in all fields are analyzing large digital archives as primary data sources. Academic research is therefore at the forefront of requiring more storage and better ways of dealing with it. Remote sensing streams, gene expression data, medical imaging, and simulation intermediates now easily range into the 10s, 100s, and often the 1000s of GBs. As well, class work, lab notes, administrative documents, and generic digital multimedia contribute to the digital flood. Email is a particular concern, as many people use it as their primary work log; keeping it available over long periods is therefore essential for tracking research development, primacy, and intellectual property.

Some of this data is reproducible at low cost; some is "once in a lifetime". Other data is extremely valuable either because of the cost of (re)producing it or because it deals with sensitive financial or medical records. This report does not address the storage of legally binding documents of the highest sensitivity and security. There are commercial vendors who supply such technologies, and they are typically 4-8X more expensive than the storage that we address. For example, the Network Appliance FAS270 is a comparably sized storage device that costs in the low $40K range, compared to just under $10K for this device. We address the storage in the pretty spot of this terrain: pretty cheap, pretty secure, pretty available, pretty fast, pretty accessible, pretty flexible. The plan is to use these devices as building blocks of a larger infrastructure, and because it is also a fairly accurate industry term, we are calling the device described here a BRICK.

Not only is data size increasing, but people are communicating this data to their colleagues at increasing rates. Typically this is done via email attachments, although there is some evidence that researchers are starting to pass URLs that point to the data rather than the data itself. NACS is charged with seeing that this is done securely, easily, and quickly, with generous bandwidth and storage allocations. One way to do this is to match storage demands from schools with local bricks that are still maintained by NACS. They could be co-located in remote server closets and managed by NACS, the local administrators, or a combination of the two.

2.1. The Storage Brick

The basic unit of this test is the Storage Brick (images below), described here. The purchased test version is smaller than those described: a rack-mountable 3U chassis containing a motherboard, redundant power supplies, multiple ethernet interfaces, 8GB RAM, 4 Opteron cores, and 16 hot-swap SATAII slots. These slots allow disks to be pulled from a running system for replacement; the system does not need to be brought down, and the data remains accessible during the replacement due to the parity-striped data on the other disks (in RAID5 or RAID6). While these slots could be populated with disks of any capacity up to 1TB each (the largest disk commercially available now), in this configuration we used Seagate 500GB disks, the most cost-effective disks when I specified the system. This configuration provides 8TB of raw disk or RAID0 (striped data), 7.5TB of RAID5 (1 parity disk), 6TB of RAID6 (2 parity disks), or 4TB of RAID10 (striped and mirrored data).

The three redundant power supplies allow 1 of the 3 to fail without the system becoming inoperable. Power supplies are the second most likely point of failure in a system.

Table: The Storage Brick Device (images)
    • front view of Storage Brick, sled out
    • top view of hot-swap disk sled
    • top view of Storage Brick
    • back view of Storage Brick

3. Primary Variables Tested

3.1. The Disk Controllers

For data to be written to and retrieved from a storage device, the disks need to be coordinated by a disk controller, which presents the storage available on the disks to the operating system. There is a huge variety of controllers available, but for this test I chose true hardware RAID controllers rather than the dumb, cheap controllers (aka fake RAID controllers) that are typically used in desktop machines. Such controllers use the main CPU to do the computations for placing the data on the disks, and for most situations this is fine, as the CPU of a desktop machine is usually idle. On a server, this is often not the case, so efficiency is more crucial, especially when the disk controller is responsible for many disks and when it also has to perform the parity calculations to spread the data across the disk array, as is the case with RAID5 and RAID6.
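
One practical difference, visible from the OS (a minimal illustration, not something the report itself demonstrates): a hardware RAID controller exports the whole array to Linux as a single block device, whereas host-based (md) RAID, which is what the cheap "fake RAID" controllers ultimately fall back on, exposes the member disks and does its parity work on the main CPU:

# A hardware RAID controller presents the array as one block device (e.g., sda)
cat /proc/partitions
# Linux software (md) RAID instead shows the member disks plus an md device, and its
# status (including resync/parity work driven by the host CPU) appears in:
cat /proc/mdstat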

Table: Hardware RAID Controllers Used (images)
    • 3ware 9650SE 8-port multilane PCI-e controller & battery
    • 3ware 9650SE 16-port multilane PCI-e controller w/o battery
    • Areca 8-port PCI-e controller & battery
    • Areca 16-port multilane PCI-e controller & battery

An 8-port and a 16-port hardware PCI-e RAID controller from both Areca (a recent Taiwanese manufacturer of high-performance controllers) and 3ware (an older American manufacturer of RAID controllers, which has contributed patches and code to the Linux kernel for several years) were tested under the 64-bit Linux kernel 2.6.20-15 (Ubuntu Feisty). The 3ware controllers and the 16-port Areca controller were multilane, using Infiniband-like connectors that aggregate 4 SATAII cables into 1 for easier-to-cable, more secure connections.

Both Areca and 3ware provide for battery backup of the RAM cache, so that if power fails and the data cannot be sync'ed to the disks, it will remain in the card cache until power is restored. Note that the battery for the 3ware controllers is integrated onto the card, while the battery for the Areca requires a separate slot. Only the Areca 16-port card allowed an upgrade of the on-card RAM, to a maximum of 2GB; the rest supplied 256MB of soldered cache RAM.

The Areca device drivers are just now being incorporated into the mainline kernel (2.6.19 and later) and are being patched into previous kernels by almost all Linux distribution vendors. 3ware has some advantage because, unlike Areca's, its controllers are supported by the SMART monitoring daemon, which can peek into the SMART data of the individual disks behind the controller to check temperature, recorded errors, etc. The ability to detect SMART errors from the individual disks is an advantage when trying to see if a disk is about to fail before the fact. Two recently published analyses of large numbers of disk failures (one from Google, the other from CMU) have tried to evaluate predictive failure of disks; they found that detecting SMART errors is useful in predicting disk failure, although far from absolute. A good review of both studies is Rik Farrow's editorial in Usenix.
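
As an illustration of that peek-thru, the smartmontools commands below show how per-disk SMART data can be read through a 3ware controller (a minimal sketch; the /dev/twa0 device node and port number 0 are assumptions that depend on how the 3w-9xxx driver enumerates the card on a given system):

# Full SMART report for the disk attached to port 0 of the first 3ware controller
smartctl -a -d 3ware,0 /dev/twa0
# Quick health summary for the same disk
smartctl -H -d 3ware,0 /dev/twa0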

3.2. The RAID type

I tested RAID0 (striping data over multiple disks for increased performance without redundancy), RAID10 (RAID0 with mirroring), RAID5 (striping data over multiple disks with 1 parity disk), and RAID6 (like RAID5 but with 2 parity disks).

3.3. Number of disks in the RAID

I tested how much performance was gained by increasing the number of disks in an array for RAIDs 0, 5, and 6.

3.4. Initialization performance

I tested how much performance was affected by running the benchmarks while the RAID5 and RAID6 arrays were still being initialized by the controller, as well as after initialization had finished. Initializing a large RAID can take hours, if not days.

3.5. The amount of Controller memory

Areca provided a 2GB DIMM to replace the 256MB DIMM that the controller uses as a cache.

3.6. The amount of System memory

I tested under conditions of 0.5 GB of RAM and the full 8GB of RAM. Since Linux intelligently caches file input and output, increasing the amount of RAM can have a dramatic effect on disk IO.
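
The report does not state how system RAM was restricted to 0.5GB; a common way to do it on a kernel of this vintage is the mem= boot parameter. A sketch under that assumption, using the GRUB-legacy setup that Ubuntu Feisty shipped with (the kernel image name and root device are placeholders):

# In /boot/grub/menu.lst, append mem=512M to the kernel line to cap usable RAM
kernel /boot/vmlinuz-2.6.20-15-generic root=/dev/sda1 ro mem=512M
# After rebooting, confirm how much memory the kernel actually sees
free -m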

3.7. File systems

While the controller is responsible for making the raw disk storage available to the Operating System, there is another level of organization required before the data can be used: the raw disk storage must be organized into a file system. This is the structure that allows directories to be created and data to be stored as files, which often have to cross raw device sectors. This structure allows the dates of file creation, modification, access, etc. to be associated with a file, and allows files to be renamed, copied, moved, and so on. Depending on what the storage is going to be used for, the type of file system can have a large effect on the overall performance of the entire system.

I tested 4 of the most popular Linux filesystems (typical mkfs invocations are sketched after this list):

  1. Ext3 - the journaling version of the long-in-the-tooth but very reliable ext2 filesystem. Because it is the ext2 filesystem on disk with journaling added on top (unlike the others, which are native journaling filesystems), it takes a very long time to initialize - about 3 orders of magnitude slower than XFS on the 3.5TB filesystem.

  2. JFS - the 64-bit, IBM-contributed, journaling filesystem originally developed for AIX, which supports extents.

  3. Reiserfs - Hans Reiser's 3rd version of his file system which has some unusual features like tail-packing (storing the ends of multiple files into the same sector to increase storage efficiency with small files)

  4. XFS - SGI's journaling filesystem derived from their long experience with high performance, large-data supercomputing applications. XFS was explicitly designed to support extents (eXtents File System).
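
For reference, each filesystem is created with its standard mkfs front-end. A minimal sketch of the creation (and timing) commands, with /dev/sda1 standing in for the RAID volume exported by the controller and default mkfs options assumed:

# Time filesystem creation on the RAID volume exported by the controller
time mkfs.ext3     /dev/sda1
time mkfs.reiserfs /dev/sda1
time mkfs.jfs      /dev/sda1
time mkfs.xfs      /dev/sda1
# Note: mkfs.reiserfs and mkfs.jfs may ask for confirmation before writing, and
# mkfs.xfs needs -f to overwrite an existing filesystem signature.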

3.8. Readahead size

Readahead is the amount of data pre-read into memory beyond what was requested, on the theory that if an application wants some data from a file, it will probably want more of it. A useful document from 3ware has implicated readahead in increasing disk performance, so it was tested in doubling strides from 256B to 32KB.
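
A sketch of such a sweep, assuming the per-device readahead was adjusted with the blockdev utility (note that blockdev --setra counts the value in 512-byte sectors, and /dev/sda is a placeholder for the RAID device):

# Show the current readahead setting (the value is a count of 512-byte sectors)
blockdev --getra /dev/sda
# Sweep readahead in doubling strides, re-running the benchmark at each setting
for ra in 256 512 1024 2048 4096 8192 16384 32768; do
    blockdev --setra $ra /dev/sda
    # ... run the benchmark of interest here ...
done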

3.9. Application testing

I used some well-known benchmarks as well as real-life applications (representative invocations are sketched after this list):

  1. http://www.coker.com.au/bonnie++/[bonnie++] is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance. Very simple to use.

  2. IOZONE is a fairly comprehensive disk benchmark suite that generates and measures a wide array of IO tests including: read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read, and aio_write.

  3. Linux kernel compile with parallel make to generate the 2.6.21.3 kernel and most of the modules.

  4. ncecat, a utility from the NCO suite that reads and rearranges netCDF files using GB-sized reads and writes.
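
For orientation, typical invocations of these four workloads might look like the sketch below; the paths, sizes, and thread counts are illustrative assumptions rather than the exact parameters used in the runs (those are embedded in the fs_bm.pl script described in the Appendix):

# bonnie++: per-character and block IO plus seek tests on the mounted RAID
# (-s is the test file size in MB; -u is required when running as root)
bonnie++ -d /mnt/raid -s 16384 -u nobody
# iozone: automatic mode over a range of record sizes, up to a 16GB file
iozone -a -g 16g -f /mnt/raid/iozone.tmp
# kernel compile: parallel build of the 2.6.21.3 tree (after configuring it)
cd /mnt/raid/linux-2.6.21.3 && make -j4
# ncecat: concatenate several netCDF files into one output file
ncecat in1.nc in2.nc in3.nc out.nc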

4. Results

4.1. Individual Tests

4.1.1. KERNEL

The kernel compile test showed relatively small differences in time depending on system RAM, although the 8GB case was notably faster. Also notable was that with 8GB RAM, ext3 did considerably better than in a restricted-RAM environment, and that with more RAM, the RAIDs with more spindles did better. Oddly enough, with restricted RAM, the lower-spindle-count RAIDs did better.

While XFS also recorded the fastest time in the restricted-RAM kernel compile test (572s), ReiserFS took 14 of the top 20 times recorded (under different conditions), while JFS and Ext3 were represented only 2 times. The top 20 times averaged 661+/-9.7s, with a fairly wide range of 572-696s.

Values from this query here

4.1.2. NCECAT

This program, a part of the NCO suite, is specialized to read and concatenate large swaths of binary data from netCDF files efficiently.

As such, it is no surprise that a large readahead would increase performance. It was something of a surprise that increasing the readahead beyond 8K in many cases did not lead to further increases.

Figure: Parallel Coordinates fs.NCECAT.sysram

With 0.5GB system RAM, JFS accounted for 11 of the fastest 20 times, followed by 6 for ReiserFS, 3 for XFS, and 1 for ext3. While the top 2 times were recorded with the Areca 16-port card on RAID6, the remaining 18 were on RAID0, with 15 of those using 3ware cards. With restricted RAM, the average was 70.7+/-1s with a range of 58-76s.

In the same test with 8GB system RAM, the best time was 50s, with an average of 62.3+/-1.6s and a range of 50-70s. The distribution of filesystems represented changed as well, with XFS taking 6 places including 3 of the top 5. JFS took 10 of the top 20 and ReiserFS took the last 4 places of the top 20. Also notable was that all of the top 20 places were achieved with readahead values of 8K or larger, whereas with restricted system RAM there was a wider spread of values from very large to very small, although the data skewed large. Areca took the 4 fastest places, but 3ware took 14 of the top places.

It was surprising that the fastest performance for this test was under RAID6 (on the Areca 16-port card) rather than RAID0, showing that you can have data redundancy as well as high performance. In both high and low system memory tests, an Areca card with RAID6 provided the fastest times with both XFS and JFS.

Contrary to a common belief, increasing the number of spindles from 8 to 16 had little effect on the performance under RAID5/6.

Values from this query here

4.1.3. IOZONE

Figure: Parallel Coordinates fs.IOZONE.sysram

The IOZONE test does a mixed test of many kinds of disk access, with variable numbers of threads accessing the disk at once. The time to complete it can therefore be considered a proxy for mixed disk use, although not a good proxy for a particular storage requirement. Because of the complexity of the individual tests, it is difficult to draw a single conclusion from each run, so I have used the run time of the entire test as a crude summary. Once you have settled on a particular set of parameters, it would help to use IOzone's supplied gnuplot dataviewer Generate_Graphs to examine all the parameters of interest in interactive 3D. See below for obtaining the raw data.

In the top 20 results for this test, the JFS filesystem appeared 9 times, including all top 5 times. XFS was represented 6 times and ReiserFS 5 times. 19 of the 20 best times were obtained using the full 8GB of system RAM, and the one obtained with the lower system RAM was attained with the 2GB-DIMM card option. Half of the top times were attained using 8K readahead values. RAID0 was used in 15 of the top times, with RAID6 making up 4 of the rest. In 17 of the top 20 times, the card used was the Areca 16-port card; in the 3 cases where a 3ware card was used, it was with 8 disks instead of 16.

4.1.4. MKFS

Figure: mount & mkfs times

While not a critical component for most storage applications, the making of the filesystem on the disk device is where a modern journaling filesystem distinguishes itself from an older one that depends on writing the block and inode information to disk up front. Of the 4 filesystems tested here, XFS initialized almost instantaneously (see above), regardless of the size of the RAID, followed closely by JFS. On the 3.5TB filesystem, Reiserfs took about 10x as long as JFS, and Ext3 took about 10x longer than Reiserfs.

For XFS and JFS, there was no effective difference across RAID types or numbers of disks. Reiserfs and Ext3 took time proportional to the RAID size (so that RAID0 took ~1/2 the time that RAID5 did for the same number of disks). This mkfs speed would be very useful if you had to bring up a large filesystem quickly in order to provide emergency storage. It also is reassuring to know that among the tested filesystems, the two that initialized most quickly were the ones which were usually the highest performance.

4.1.5. MOUNT

Mount times are also a vanishingly tiny part of the lifetime of a storage system, but since it was a part of the test procedure, I'll mention that XFS and JFS mounted instantly, regardless of the size or type of the RAID. Reiserfs and Ext3 took longer, and also took highly variable amounts of time to mount the array, in the worst case ranging from 5s to 140s to mount the largest 7.5TB array (see above).

4.2. Parameters

4.2.1. Readahead

Setting readahead to even the lowest levels above zero increased performance very slightly and increased read repeatability considerably. Since it's a freebie, it makes sense to increase it from the default 256B on most Linux systems to at least 4096B for just about any application. However, setting it higher than 8K rarely had positive effects on large file reads such as the NCO processing, and had a negative impact on both character and block writes, regardless of other parameters.

Figure: Parallel Coordinate plot of the 4 tests

Figure: 4-test matrix with readahead

Figure: variable matrix

The NCECAT test was the only one in which readahead had a significant effect on execution speed: setting it to 8K led to a 30% increase in speed (see below). This implies that large streaming reads could benefit significantly from setting readahead to 8K. However, increasing readahead beyond 8K led to a DECREASE in performance in most cases, contrary to the results cited in the 3ware whitepaper.

Depending on the filesystem type and the type of read, in the bonnie++ tests increasing readahead either had no effect or a very slight positive effect. The only case in which there was a noticeable positive trend was with Reiserfs in RAID5 with 16 disks, but that case had fairly poor performance to begin with for some reason. There was a broad peak in speed when readahead was about 1K-8K; however, increasing readahead over 1024B also led to decreases in write speeds. As such, there seems to be no reason to increase readahead above 8K.

4.2.2. FileSystems

Note
There are a number of Parallel Coordinate graphs of the type below. These graphs allow the simultaneous visualization of multiple variables. A single line in the graph below corresponds to a record or row number in a table. A decent introduction can be found here, although this graph was generated using the R statistical language and ggobi.

Figure: Filesystem Results

Considering ALL combinations of parameters, it is revealing that of the top 20 values recorded in the Bonnie++ CHARACTER reads, XFS was fastest 17 times, with JFS accounting for the other 3 times (including the fastest at 61MB/s), but the top 20 times were grouped very tightly (60.4+/-0.053 MB/s). Values from this query here

For the 20 fastest BLOCK reads, XFS was the fastest in all of them, topping out at 688MB/s. Unless otherwise specified, there was an approximately equal distribution of RAID versions, cards, and readahead values. Values from this query here

For disk writes, Reiserfs contributed 7 of the top 20 fastest CHARACTER writes, equalled by 7 for ext3, 5 for JFS, and only 1 for XFS. This was a rare case where XFS was not among the fastest, but the speeds were so closely grouped that it was largely irrelevant. Values from this query here

For BLOCK writes, there were 2 cases, one with the full 8G of system memory enabled and the other with only 0.5GB enabled. In the first case, ext3 was the fastest at 2.12 GB/s, with the rest of the top 20 represented as JFS: 11, ext3: 5, and ReiserFS: 4, and no representation by XFS. The variation was very low. Values from this query here

In the case where the system RAM was restricted to 0.5GB, the maximum speed corresponded to XFS at 810MB/s, and it recorded 8 of the top 20 times, with JFS taking 9 and ext3 taking 3. Reiserfs was not represented. The average top speed was 785+/-3.8 MB/s. All top 20 times were recorded on the Areca 16-port card running RAID5; in fact, in the top 50 times recorded, the Areca 16-port card was represented 47 times, all with RAID5 or 6. The other difference noted between the 8G and 0.5G cases was that the top 20 values in the 8G case had readahead values averaging 6016+/-769, whereas the corresponding 0.5G values were recorded with an average readahead of 2073+/-552. Values from this query here

Overall, ReiserFS tended to do almost as well as XFS in block writes and reads, although its performance in character operations was almost as bad as ext3's.

JFS also performed well overall, especially with readaheads above 4K, but it was very poor in block reads.

Ext3 was the oldest filesystem represented and was in general the slowest, although its write performance was not far below the average.

  1. XFS - In the large file reads and writes, XFS performed about as well as JFS. With the full 8GB RAM enabled, the system was able to write 2GB files at more than 2GB/s, a truly phenomenal rate. Reading was slower at ~600MB/s, but still impressive. Obviously, the write speed was a function of the file cache (since the theoretical maximum would be about 1GB/s, based on aggregate single-disk write speeds), but even with only 512MB RAM enabled, it was still able to write at about 800MB/s and read at 570MB/s on block operations using the full 16 disks in RAID6.

All filesystems were much less impressive when dealing with smaller files, reading and writing at close to single-device speeds of ~60MB/s. However, overall XFS tied with JFS for the best performance on the bonnie++ tests in providing consistent small reads.

Because it is not bound by an underlying non-journaling architecture like Ext3, initializing XFS was almost instantaneous (<3s) on every RAID size I tried (up to 8TB). This is not quite as dramatic an advantage as the number would suggest (being 1000x faster at initialization is a vanishingly small advantage over the lifetime of a storage server), but it certainly made things nicer for me. This would be useful in an emergency if a large storage server had to be provisioned and brought online quickly. There was never a failure that I could pin on XFS during my benchmarks.

  2. Reiserfs - also initializes quickly, though more slowly than XFS and JFS, and works very well. It is a good choice for a journaling file system that will be used on small files, such as some mail systems or home directories, due to its ability to squeeze out more space with its tail-packing. It excelled at small, numerous writes and reads, as can be seen in its superior performance in the kernel compile. A sad sidebar is that the follow-on Reiser4, which has some strong technical advantages, will probably not be widely adopted: although it has been released, it has not been accepted into the mainline kernel, and the technical lead is in unrelated, deep legal trouble.

  3. Ext3 - the old reliable. There are more utilities to assist with ext2/Ext3 filesystems than for the others combined. For small-file IO, it works very well. For larger-file IO, its performance degrades somewhat, but not tremendously so. Large-file performance is dependent on more system RAM for caching, but it is quite remarkable that such old technology stands up so well. The next version, the ext4 filesystem, is in testing now and performs better than Ext3 in most regards. It would be hard to recommend a filesystem other than Ext3 for a boot disk or a small system that needs to be as reliable as possible. That said, I did record 1 failure with ext3 under the kernel compile test with the 3ware card.

  4. JFS - also very fast to initialize (slightly slower than XFS) and works very well. There were 2 instances when benchmarks failed due to JFS failing while running under the Areca controller (the OS continued to run, but the benchmarks failed and the JFS partition was unresponsive; I eventually had to reboot to bring back the device). This only happened under heavy load, and only with the JFS testing. It is certainly not enough for a statistical evaluation, but it made me uneasy. Otherwise it ran extremely well. During the ncecat test, JFS managed the fastest times except when the readahead was set to 8K; then it was narrowly beaten by the XFS time. It is also notable that JFS seems to be more consistent in its good performance: while XFS and JFS are comparable at the top end, there are more situations where XFS records much slower times as well as faster ones.

4.2.3. Effect of number of disks in RAID

There was a noticeable increase in block IO speeds as more disks were used, up to about 8 disks, but no real difference when doing random, small reads and writes. There was a small (~15%) but significant increase in speed in the kernel compile when the number of spindles was increased from 8 to 12, although raising it to 16 did not increase the speed further, and increasing the readahead did not increase speed either. For Ext3, more spindles and smaller readaheads yielded the fastest kernel compiles.

4.2.4. More RAM on card

Figure: optional 2GB RAM effects

When the 2GB DIMM was used on the Areca card instead of the initial 256MB DIMM, there was no effect when the kernel had the full 8G of RAM to use as a file cache. When the system RAM was held at 512MB, it had an overall positive effect, including a slight positive effect in the IOzone and NCECAT tests. However, in 4 cases in the Bonnie++ tests, the extra RAM was associated with a very large performance hit, under both high and low system RAM conditions and with all four filesystems (see above).

In any case, it's hard to justify putting that RAM on the controller card when it could be used more effectively by the operating system. The only reasons to do so would be to reserve it for controller use when the rest of the system RAM was being saturated, or to improve performance on an otherwise un-upgradeable system. In those situations it would provide a fairly cheap upgrade.

4.2.5. Performance during RAID initialization

Figure: Performance during Initialization

The initialization phase of a large RAID5 or RAID6 is a process that can take many hours. The tests completed normally, albeit somewhat more slowly than usual, although the Bonnie++ tests completed in what could be considered a normal time. This implies that you can start using an initializing array for data storage, which can be of some use if there is a time crunch, especially if there is prep work to be done before the data needs to be protected by the fully initialized RAID. I would not expect an initializing RAID to protect against data loss if a disk failed.

4.2.6. Support

Although I did not have reason to request it this time (all 4 cards worked nearly flawlessly), in previous years I have had reason to request technical assistance from both Areca and 3ware. In retrospect, the 3ware human support was very good; the Areca support, less so. Both companies would benefit from putting their entire support email archives online. Areca seems to have done this better than 3ware, so the lack of human support from Areca was less of a problem, as more of its support documentation was available via Google.

Overall, I marginally preferred using the 3ware controller, due to its better support in Linux (especially the SMART data access that can be obtained via the smartmontools). All the cards are astonishingly fast, and the results from testing with large data IO are particularly striking.

5. Conclusions

5.1. RAID size

Because of the loss of disk space to parity (one disk for RAID5; two for RAID6), the more disks in a RAID5 or RAID6 array, the smaller the fraction of overall storage lost to parity. Further, because of the enormous bandwidth of a PCIe backplane, there is no loss of throughput in large RAIDs due to the number of disk channels, at least up to the 16 ports I tested. Therefore, unless there are specific reasons to use multiple smaller controllers, I recommend using the largest controller possible and the largest possible RAID configuration, as more spindles correlate with increased performance.
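
A quick back-of-the-envelope illustration of the parity overhead (a sketch; shell arithmetic is integer, so the percentages are approximate):

# Fraction of raw capacity lost to RAID6 parity (2 disks) as the array grows
for n in 4 8 12 16; do
    echo "$n disks: ~$(( 200 / n ))% of raw space goes to parity"
done
# 4 disks: ~50%   8 disks: ~25%   12 disks: ~16%   16 disks: ~12%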

5.2. File System

The choice of file system depends on the primary use. To my surprise, the actual performance on identical hardware was not as different as I thought it would be. I would not hesitate to use Ext3 for a small boot/root file system. As mentioned, it is the default for most Linux systems and for good reasons - it is extremely well-characterized. I would NOT use it for a large data volume, as the underlying file utilities would require regular, agonizingly long fscks. XFS or JFS would be a better choice for a data volume and even for general-purpose storage. JFS is notable in that it matched the XFS performance and never degraded as XFS occasionally did. Reiserfs is a good choice for a /home or mailspool partition that requires lots of head movement, but it's hard to recommend it since JFS and XFS are such good general-purpose filesystems. I did not measure the space advantage that its tail-packing feature might have conferred, so on a space-constrained system it might have an edge. For all file servers, having lots of extra RAM for file caching is a no-brainer plus, even more so than higher-speed or multiple CPUs.

5.3. Preferred Controller

As to whether the Areca or the 3ware controller is better, there is no clear winner. Both do most things very well. Both provide administrative access via reasonably well-designed web servers. The 16-port Areca controller gives you a direct ethernet port into a dedicated web server on the card itself, which can be quite useful. The 3ware web server is an additional daemon that needs to be set up to run on bootup. 3ware also gives you a commandline utility that allows terminal control of all 3ware controllers in the server - useful for remote monitoring over slow networks. The 2 failures with JFS occurred when using the Areca controller, but it otherwise behaved admirably over the course of multiple hardware changes and tests. The 3ware controller manages to package the battery in the same slot as the controller itself, which may be important in a space-restricted server.
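
For example, 3ware's tw_cli utility makes it easy to check controllers and units from a text terminal (a minimal sketch; the controller and unit numbers are placeholders for whatever tw_cli enumerates on a given system):

# List all 3ware controllers seen by the driver
tw_cli show
# Show the units, ports, and drive status on controller 0
tw_cli /c0 show
# Detailed status of unit 0 (e.g., RAID level, rebuild/initialization progress)
tw_cli /c0/u0 show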

Neither the 3ware nor the Areca gives useful interpretation of SMART data from its drives, a strange failing for such high-end hardware, but the 3ware card is supported by smartctl peek-thru to extract this data from the controller. Both the 8-port and the 16-port Areca cards support RAID6, but only the 16-port 3ware card does. Both Areca and 3ware also sell very high density controllers - up to 24 ports on a card, which could support 48 drives on a relatively cheap 2-slot motherboard. Because of the theoretical bandwidth limit of 250MB/s per lane, 2 PCI-e 8-lane slots could support bandwidth up to 4GB/s, sufficient to support 24-disk arrays, but probably pushing it for 48-disk arrays.

When I split the 16 disks into multiple RAIDs and then tried to access them, the Areca card refused to allow access to them until I rebooted. The 3ware card worked every time, showing me new /dev/sdX devices corresponding to the new RAID configurations. I suspect that this is a function of how long the companies have been supporting Linux. In real life, this is not as annoying as it was during this testing. Most people will initialize a particular RAID configuration and run it until the machine dies; they will not be changing RAID configurations with the manic zeal I did.

6. Other Uses of the Storage Brick

Besides providing cheap storage capacity, the Storage Brick can also be configured for significant computational ability. The current Storage Brick motherboard has two 1207 CPU sockets populated with dual-core CPUs. The recent release of Quad-Core Opterons would theoretically allow this motherboard to host 8 Opteron cores (as well as 32GB of RAM). This provides a fairly capable compute node as well as large storage capacity. Fully populating the system with RAM and CPUs is overkill simply for a storage server; a low-end 32 bit system with 1GB of RAM would suffice to support the pure storage operations. However, fully populating such a system provides high-speed processing in addition to access to very large amounts of data. This maps well to a number of institutional requirements such as mail service and database servers, as well as research applications.

Since it can be equipped with dual- and even quad-port Gigabit ethernet cards to supplement the 2 native Gb ethernet ports, the Storage Brick can simultaneously service multiple networks to act as a backup or NAS file server. It is especially appropriate for a backup service that does significant compression or de-duplication.

It's also worth mentioning that the Opteron, which uses the AMD64 architecture, can run 64-bit and 32-bit applications simultaneously (including mixed Symmetric MultiProcessing applications). Intel has also embraced this architecture (calling it x86_64) and recent Intel CPUs have the same capability.

7. Further Testing

There's always something left out.

  1. a relational database benchmark as a test case. I was going to use SuperSmack, but ran out of time. There are others that should be tried as well, especially the Database Test Suite.

  2. a Mail Server. I should have done a Postmark test as well (available as an Ubuntu package).

  3. Testing Internal vs external journal.

  4. other filesystems. I'd like to test Sun's apparently very well-designed ZFS, as well as the new Linux ext4 filesystem, in the same framework. ZFS has not yet been ported to the Linux kernel, although the process is continuing using the FUSE userland approach, thanks to Sun having open-sourced the code. Recently, NetApp has sued Sun over some aspects of ZFS, claiming patent violations. The ext4 filesystem is available in the newest Linux kernels as a development option, though not yet recommended for real use.

  5. Logical Volume Managers. I would also like to test the Linux Logical Volume Managers LVM2 and EVMS against the ZFS volume manager as well. Snapshotting is a very attractive feature that all of them claim to have implemented. Perhaps next time..

Please let me know your priorities.

8. Appendix

8.1. Raw Data

The raw data from this analysis are available in 2 forms:

  1. the original files which can be browsed here singly or downloaded in bulk as a 5MB tarball here.

  2. the SQLite database into which I've parsed most of the data from the above files (a sample extraction query is sketched after the list of tables below). The reason for the mismatch is that some of the data was gathered in the process of creating the script and so was not in a form amenable to being parsed easily. Those few files have been omitted. The SQLite database has 3 tables:

    • iozone - a large corpus (~6MB) of data from the iozone tests. Each iteration of the iozone test generates about 2400 values, which are indexed by all the parameters of the test, so the data grew quickly.

    • bonnie++ - much smaller than the iozone data - only contains ~ 14 values per run

    • other - the overall run times of the 5 different tests; extremely compact, at the cost of detail.

    • the overall SQLite schema can be seen here in plain text.
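
As an example of pulling a pipe-delimited extract from the database for the R/ggobi workflow in the next section, a query against the 'other' table might look like the sketch below; the database filename and the WHERE clause are illustrative assumptions, while the column names come from the sample table shown in section 8.2:

# Dump 16-disk RAID5/RAID6 run times from the 'other' table as a '|'-delimited file
sqlite3 -header -separator '|' results.db \
  "SELECT * FROM other WHERE disks = 16 AND rtype IN (5,6);" > other.16d.r5+6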

8.2. Using R & ggobi to visualize the data

The R language and ggobi are, respectively, free statistical and visualization systems that can be used to analyze and visualize multivariate data such as that provided in this study. While they can be used independently, they are also designed to work together - ggobi can be run from within R to visualize R dataframes (R's internal representation of a dataset, corresponding roughly to a table). While R can directly query databases to provide dataframes (described in more detail here), I'll first describe how to load external tables, such as those created by an external database query or exported from a spreadsheet.

Given such a table, named [other.16d.r5+6] for example:

card|ports|disks|rtype|ra|fs|MKFS|MOUNT|BONNIE|KERNEL|IOZONE|NCECAT|DBENCH
3ware|16|16|6|256|ext3|1507.12798|0.966668|158.819063|835.115628|905.026998|124.557138|0.0
areca|16|16|6|256|ext3|1754.609643|62.738649|218.925547|735.370265|729.249532|101.68025|0.0
areca|16|16|5|256|ext3|626.779672|104.035842|137.148965|705.861009|622.139284|0.0|148.997351
areca|16|16|5|256|ext3|1835.480147|135.926012|134.950901|730.114779|626.945034|0.0|150.213541
areca|16|16|5|256|ext3|1893.842949|81.943981|185.853155|727.25922|639.094527|0.0|0.0
etc

it's possible to slurp this into an R dataframe named df with the following command:

> df <- read.table("other.16d.r5+6",sep="|", header = TRUE)
# then load the ggobi library routines with:
> library("rggobi")
# then load the R dataframe into a ggobi object 'ggo' and launch ggobi with:
> ggo <- ggobi(df)

Once launched, you can use ggobi to interactively examine, scale, brush, identify, and plot in 1, 2, and 3 dimensions. Following is a screenshot of a ggobi 3D plot, allowing instant selection of 3 variables out of all possible ones. This makes it very easy to quickly tour data relationships.

More information about using R and ggobi together is available as a PDF here.

8.3. Software used in this study

This study was performed entirely using Open Source Software. The tests were conducted on the Storage Brick running the Kubuntu Feisty 64bit Linux distribution. The previously mentioned tests (Bonnie++, NCECAT, IOZONE, and of course the kernel compile) are all Open Source, as are the filesystem implementations. In analyzing the results, I used a combination of Perl scripts to parse the output, and consolidated the extracted data in an SQLite database. Once the data was in SQLite, I used sqlite3, gnuplot, R, and ggobi to extract, examine, analyze, and plot the data. The gimp and krita were used to add labels to the graphs where needed. The report itself was composed in plain text with asciidoc format codes and formatted using the elegant asciidoc. To see what it looks like, the asciidoc source text is here.

The script that performs these tests fs_bm.pl is placed in the public domain and is structured in a way that allows arbitrary tests to be embedded in the timing and reporting structure.

9. Acknowledgements

I wish to thank Don Capps (creator of the IOZone) and Stuart Rackham (author of asciidoc) for their help, as well as Richard Knott at THINKCP for providing good advice about the initial configuration of the Storage Brick, and for arranging the loan of the 16-port 3ware and Areca cards for this testing. Of course, thanks to Areca and 3ware for providing the cards without requiring NDAs or other forms of blockage of unflattering comment. Thanks to colleagues at NACS and UCI for their comments and critiques, but mistakes are mine alone.

Thanks for reading this far.

Harry Mangalam