Adding a new BeeGFS filesystem (FS)
When adding an entirely new BeeGFS to a cluster that already has one or more BeeGFS FSs, there are two parts to the process:
- configuring the server side (metadata, management, and storage services)
- adding it to the clients so that it does not interfere with the existing FS mounts.
BeeGFS MD server
The MD server requires a single-socket mobo, lots of RAM, and a fast bus to connect it to a fast FS optimized for high IOPS on tiny files. We use single-socket Intel CPUs (usually 8-16 cores), 128GB RAM, and (historically) a fast disk controller running a RAID10 of 6x 10K disks. IB faster than QDR requires PCIe v3 slots, which only Intel can provide (until Naples or Zen machines start appearing).
Now we would use NVMe SSDs in RAID1. The /beegfs FS does not need to be large, but it needs to have large inodes and fast storage. Some groups are making their entire /beegfs on SSDs, which is not a bad idea (see the NVMe note above) as long as the SSDs are qualified for enterprise use.
We assume you know how to install the required daemons. We run all the administrative daemons on the MD server.
yum install beegfs-meta beegfs-opentk-lib beegfs-utils beegfs-mgmtd beegfs-admon beegfs-common
the /beegfs MD filesystem
Our /beegfs uses an LSI controller and the 6x 500GB disks are arranged in a RAID10 (3 mirrored pairs, striped). The controller is using 32K stripes (optimized for small files). The FS is created as ext4 with large inodes, essentially as the BeeGFS wiki suggests:
mkfs.ext4 -i 2048 -I 1024 -J size=400 -O dir_index,filetype,^extents /dev/sdb
# where /dev/sdb is the RAID device
# it's mounted with the 'noatime' options as well on the std /beegfs
mount -onoatime,nodiratime,nobarrier /dev/sdb /beegfs
# and get the UUID of the new FS for /etc/fstab
# (note: blkid wants the block device, not the mountpoint)
blkid /dev/sdb
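To double-check that the large inodes actually took, you can read back the superblock; a minimal sanity check, assuming the same /dev/sdb device as above:
dumpe2fs -h /dev/sdb | egrep -i 'inode size|journal'
# 'Inode size' should report 1024, matching the -I option used above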
MD server kernel mods
Fraunhofer also suggests some kernel parameter mods:
# assuming the RAID is /dev/sdb
echo 128 > /sys/block/sdb/queue/nr_requests
echo deadline > /sys/block/sdb/queue/scheduler
# the above lines have to be appended to /etc/rc.local to survive reboots
echo "
# FhgFS Tuning (MetaData) - assumes block devs don't change
echo deadline > /sys/block/sdb/queue/scheduler
echo 128 > /sys/block/sdb/queue/nr_requests
" >> /etc/rc.local
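A quick way to confirm the settings took effect (again assuming /dev/sdb):
cat /sys/block/sdb/queue/scheduler     # the active scheduler is shown in [brackets]
cat /sys/block/sdb/queue/nr_requests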
And make the following edits:
- in /etc/beegfs/beegfs-meta.conf, set tuneNumWorkers = 64
- VERY important: since our cluster already has an existing BeeGFS, you need to change the port offset connPortShift to a non-zero (and >10) number. This variable has to be set in the following files on the MD server: beegfs-admon.conf, beegfs-meta.conf, beegfs-mgmtd.conf (see the sketch after the fstab example below).
- change the /etc/fstab to reflect the UUID of the array, with the correct mount options:
...
# /beegfs FS for the meta, mgmt, admon services
UUID=95294298-cbd0-46ec-ad0e-a8b96762ab55  /beegfs  ext4  rw,noatime,nodiratime,nobarrier,user_xattr  1 2
...
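For illustration, the connPortShift change looks like this; the value of 20 is just an example - pick any offset that doesn't collide with the ports of the existing BeeGFS:
# in each of beegfs-admon.conf, beegfs-meta.conf, beegfs-mgmtd.conf on the MD server
# (example offset; must be identical in all the config files of this FS)
connPortShift = 20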
And after all that's done, start up the required daemons
/etc/init.d/beegfs-admon start
/etc/init.d/beegfs-mgmtd start
/etc/init.d/beegfs-meta start

# to shut them down, reverse the order, tho this does not seem to be critical.
/etc/init.d/beegfs-meta stop
/etc/init.d/beegfs-mgmtd stop
/etc/init.d/beegfs-admon stop
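The logs are the quickest way to confirm the daemons came up and registered cleanly; a minimal check, assuming the default log locations:
/etc/init.d/beegfs-meta status
# the daemons write to /var/log/beegfs-*.log by default
tail /var/log/beegfs-mgmtd.log /var/log/beegfs-meta.log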
BeeGFS Storage Servers
The storage servers we use are vanilla 36-bay chassis with a single CPU socket, lots of RAM, and a good hardware LSI disk controller. We use XFS, but ZFS is a good choice as well, especially if you lean towards reliability over speed.
To install the BeeGFS bits for the storage servers:
yum install beegfs-utils beegfs-opentk-lib beegfs-storage beegfs-common
the /raid-X arrays
We try to allocate ~40TB arrays: with 3TB disks we use 2x 17-disk RAID6s; with 4TB disks, 3x 12-disk arrays. After much false economy, we've found there is nothing to be gained by using anything less than NAS-quality disks, and if you can spare the $, full enterprise disks are worthwhile. After the Backblaze reporting, we only use HGST disks.
After verifying all the disks via a long SMART test (example below), assemble them into RAID6 arrays with parameters that make sense for your use case. We use 128KB strips or greater, write-thru with BBU, blah, blah.
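For the SMART step, something along these lines per disk works (smartmontools assumed; /dev/sdX is a placeholder, and disks sitting behind an LSI controller may need the '-d megaraid,N' device type):
# start a long self-test, come back hours later and read the self-test log
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX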
NB: There is a LOT of gibberish written about strips, stripes, chunks, blocks, # of disks (and whether that number refers to data disks or data+parity disks), and the relationship with the underlying RAIDs and the mount parameters. I'm not going to address that too much, except that if you're using a hardware RAID system like our /dfs2, all the RAIDs are behind LSI hardware controllers and are therefore presented as generic /dev/sdX devices. The parameters that might have an effect with an mdadm RAID therefore are either ignored or intercepted and modified by the controller itself.
After the full initialization (we've tried LSI's fast initialization, but we've had some problems with it), make the FS with the parameters from the BeeGFS wiki:
# below, 'c d e' indicate the sdX devices presented by the RAID controller
IBDEV=ib0         # your Infiniband device
DEV=(x c d e)     # define DEV array so [1]=c, [2]=d, etc ([0]=x is omitted)
for ii in 1 2 3; do
   II=${DEV[$ii]}    # set to shorten
   mkfs.xfs -l version=2,su=128k -i size=512 /dev/sd${II}
   # where -i size is the (extended) inode size
   # and get the UUID at the same time
   UUID=`blkid /dev/sd${II}`
   # and then mount them (note, use original $ii to name /raid)
   mkdir -p /raid${ii}    # make sure the mountpoint exists
   mount -onoatime,nodiratime,logbufs=8,logbsize=256k,largeio,\
inode64,swalloc,allocsize=131072k,nobarrier /dev/sd${II} /raid${ii}
   # and set the kernel params for each device as well:
   echo deadline > /sys/block/sd${II}/queue/scheduler
   echo 4096 > /sys/block/sd${II}/queue/nr_requests
   echo 4096 > /sys/block/sd${II}/queue/read_ahead_kb
   # and append those lines to /etc/rc.local to make sure it's permanent
   echo "
   echo deadline > /sys/block/sd${II}/queue/scheduler
   echo 4096 > /sys/block/sd${II}/queue/nr_requests
   echo 4096 > /sys/block/sd${II}/queue/read_ahead_kb
   " >> /etc/rc.local
   echo "UUID for /dev/sd${II} = $UUID"
done

echo "Setting kernel params"
# and finally set the overall kernel params
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo connected > /sys/class/net/${IBDEV}/mode

# and write them to /etc/rc.local
echo "Writing kernel params to rc.local"
echo "
# overall kernel params
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo connected > /sys/class/net/${IBDEV}/mode
" >> /etc/rc.local
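A quick sanity check that the per-device and overall settings actually took effect (assuming the same sdc/sdd/sde device names as above):
for d in c d e; do
   cat /sys/block/sd${d}/queue/scheduler       # active scheduler shown in [brackets]
   cat /sys/block/sd${d}/queue/read_ahead_kb
done
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
cat /sys/kernel/mm/transparent_hugepage/enabled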
Infiniband setup
On a new server, the Infiniband should be set up, but if not, you may have to add the config files to bring up the IB AND to make sure that it's running with RDMA enabled.
The IB config file in /etc/sysconfig/network-scripts/ifcfg-ib0 should look like the following (except that the IP # should be changed, of course). Note that the large MTU setting depends on the connected-mode line that gets set above in the rc.local file: echo connected > /sys/class/net/ib0/mode
DEVICE=ib0
IPADDR=10.2.254.179
NETMASK=255.255.0.0
BOOTPROTO=none
ONBOOT=yes
MTU=65520     # <-- efficient transfer requires large MTU
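Then bring the interface up and verify the link; a minimal check, assuming the standard network scripts and the infiniband-diags utilities are installed:
ifup ib0
ip addr show ib0      # should show the 10.2.x.x address and mtu 65520
ibstat                # the port State should be 'Active'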
This is not enough to ensure that the storage node is running in RDMA mode. You also have to make sure that the ibacm daemon is disabled and that the rdma setup has been run (it should run automatically after the beegfs-opentk-lib package installation, but ...). Then restart the beegfs-storage server if it had already been started.
service ibacm stop
beegfs-setup-rdma
/etc/init.d/beegfs-storage restart
At this point, from a client, the output of the beegfs-net utility should show that all the storage nodes are running on RDMA.
NB: Until a client has an established connection to the metadata server, it will show up in the output of beegfs-net as Connections: <none>. If that's the case, just do a short ls -lR on the mountpoint to establish the connection. It will then show the type of connection.
meta_nodes
=============
dfm-2-1.local [ID: 42943]
   Connections: RDMA: 1 (10.2.255.139:8105);

storage_nodes
=============
dfs-2-3.local [ID: 1]
   Connections: RDMA: 3 (10.2.254.165:8103);
dfs-2-1.local [ID: 38261]
   Connections: RDMA: 2 (10.2.255.203:8103);
dfs-2-2.local [ID: 48946]
   Connections: RDMA: 3 (10.2.255.144:8103);
Storage server mods
As with the MetaData server, you need to make some similar specific edits.
- VERY important: since our cluster already has an existing BeeGFS, you need to change the port offset connPortShift to a non-zero (and >10) number (the same shift used on this FS's MD server). This variable has to be set in the following file on the storage server: beegfs-storage.conf
- if this is an ADDITIONAL storage server to an existing BeeGFS, there's no need to change this offset, unless the other servers in the same FS have been shifted. The new storage server has to match the other storage servers in this regard.
- change the /etc/fstab to reflect the UUID of the array, with the correct mount options, including the quota options (uqnoenforce,gqnoenforce).
...
# assuming 3 RAIDs
UUID=<use above-generated-UUID>  /raid1  xfs  rw,uqnoenforce,gqnoenforce,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier  1 2
UUID=<use above-generated-UUID>  /raid2  xfs  rw,uqnoenforce,gqnoenforce,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier  1 2
UUID=<use above-generated-UUID>  /raid3  xfs  rw,uqnoenforce,gqnoenforce,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier  1 2
...

# to verify the fstab lines, umount them, remount them, and check that
# they came up correctly
umount /raid1 /raid2 /raid3
mount -a
df -h | grep raid
/dev/sdc        55T   35M   55T   1% /raid1
/dev/sdd        55T   35M   55T   1% /raid2
/dev/sde        50T   35M   50T   1% /raid3
Metadata server mods for adding a new storage server
(Usually) on the metadata server, you'll have to explicitly allow the new server to be added, unless this is your (dangerous) default.
Modify the mgmtd config file to allow new storage servers to be accepted:
# in /etc/beegfs/beegfs-mgmtd.conf
# set to 'true' to accept new servers
sysAllowNewServers = true

# and then restart the mgmtd server
/etc/init.d/beegfs-mgmtd restart

# once the new storage server has been integrated, reset the
# sysAllowNewServers param to 'false' and restart the server again.
And then start the storage server:
/etc/init.d/beegfs-storage start
That should be it. The metadata server picks up the new probe from the storage server and brings it onboard. df -h almost immediately picks up the new space, and over a fairly short time the storage will be equalized across all the storage servers via normal deletion, creation, and editing of files.
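To confirm that the new node and its targets were registered, you can query the management daemon from a node that mounts the new FS and has the beegfs-utils package (a quick check, not required):
beegfs-ctl --listnodes --nodetype=storage    # the new storage node should appear
beegfs-ctl --listtargets                     # along with its /raidX targets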
The admon monitor
BeeGFS ships a Java-based utility that's very good for monitoring the system (not so good for setting it up; hence the above cmdlines). To start it, open an X11 connection to the MD server and set up an alias like:
alias admon='java -jar /opt/beegfs/beegfs-admon-gui/beegfs-admon-gui.jar >& /dev/null &'
# & append it to your ~/.bashrc.
alias admon >> ~/.bashrc
admon    # should launch the admin GUI util.
# the default passwords are 'admin'. Change them immediately.
You should be able to see the MD server and Storage servers up but quiescent at this point.
Modifying the Brenner setup-beegfs.sh config files
Adam Brenner has set up a script that will, if configured with the correct FS-specific dirs, do a complete removal and re-installation of all BeeGFS FSs in a well-controlled manner for either Ethernet or IB. The key is setting up these FS-specific dirs correctly and providing the necessary substitutions and mods.
The script lives here: /data/node-setup/setup-beegfs.sh
It currently supports 3 different BeeGFS FSs. It depends on:
- the /etc/hosts entries for the new BeeGFS MD server being entered into the hosts stanza
- the dirs in /data/node-setup/node-files/beegfs/configs/ being duplicated and then modified correctly for the new BeeGFS FS.
- these dirs specify each of the different BeeGFS FSs and contain config files specific for each. Currently it includes:
- dfs1.d dfs2.d fast-scratch.d lib (where the 'lib' is FS-independent)
- adding some extra test stanzas in the script in fairly obvious places (if there are currently 3 tests, add another).
- Each dir contains all the client files for each FS:
beegfs-client-autobuild.conf
beegfs-client.conf       <-- edit for connPortShift
beegfs-helperd.conf      <-- edit for connPortShift
beegfs-libopentk.conf
beegfs-mounts.conf       <-- edit for mountpoint
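For example, the per-FS edits might look like this (a sketch only: /mynewfs and the shift of 20 are hypothetical and must match what was configured on the new FS's servers):
# beegfs-mounts.conf: one line per mount -- "<mountpoint> <client config file>"
/mynewfs /etc/beegfs/beegfs-client.conf

# beegfs-client.conf and beegfs-helperd.conf: same port shift as the servers
connPortShift = 20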
Client-side installs
Once the above edits are done, you can run /data/node-setup/setup-beegfs.sh on each client that needs the mount; it will de-install and completely re-install the BeeGFS client software and then mount all the referenced BeeGFS FSs.
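Afterward, a quick check on the client (the mountpoint names depend on your beegfs-mounts.conf files):
mount -t beegfs     # lists all mounted BeeGFS FSs; the new one should appear
beegfs-net          # the meta and storage nodes should show up, ideally via RDMA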