
Adding a new BeeGFS filesystem (FS)

When adding an entirely new BeeGFS to a cluster that already has one or more BeeGFS FSs, there are 2 parts to the process:

  1. Configuring the server side (metadata, management, and storage services).
  2. Adding it to the clients so that it does not interfere with the existing FS mounts.

BeeGFS MD server

The MD server requires a single-socket motherboard, lots of RAM, and a fast bus connecting it to a fast FS optimized for high IOPS on tiny files. We use single-socket Intel CPUs (usually 8-16 cores), 128GB RAM, and (historically) a fast disk controller running a RAID10 of 6x10K disks. IB faster than QDR requires PCIe v3 slots, which for now only Intel provides (until Naples/Zen machines start appearing).

Today we would use NVMe SSDs in RAID1. The /beegfs FS does not need to be large, but it does need large inodes and fast storage. Some groups put their entire /beegfs on SSDs, which is not a bad idea (see the NVMe note above) as long as the SSDs are qualified for enterprise use.

We assume you know how to install the required daemons. We run all the administrative daemons on the MD server.

yum install beegfs-meta beegfs-opentk-lib beegfs-utils beegfs-mgmtd beegfs-admon beegfs-common 

the /beegfs MD filesystem

Our /beegfs uses an LSI controller, and the 6 disks are arranged as a 3x500GBx2 RAID10. The controller uses 32KB stripes (optimized for small files). The FS is created as ext4 with large inodes, essentially as the BeeGFS wiki suggests:

mkfs.ext4 -i 2048 -I 1024 -J size=400 -Odir_index,filetype,^extents /dev/sdb
# where /dev/sdb is the RAID device

# mount it with the noatime/nodiratime options on the standard /beegfs mountpoint
mount -onoatime,nodiratime,nobarrier /dev/sdb /beegfs

# and grab the UUID of the new FS for /etc/fstab (blkid wants the device, not the mountpoint)
blkid /dev/sdb
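
As an optional sanity check (a hedged extra step, not from the BeeGFS wiki), dumpe2fs can confirm the inode size that mkfs actually used:

# confirm the large inodes took; this should report 1024 to match the -I 1024 above
dumpe2fs -h /dev/sdb | grep -i 'inode size'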

MD server kernel mods

Fraunhofer also suggests some kernel parameter mods:

# assuming the RAID is /dev/sdb
echo      128 > /sys/block/sdb/queue/nr_requests
echo deadline > /sys/block/sdb/queue/scheduler

# the above lines have to be appended to /etc/rc.local to survive reboots
echo "
# FhgFS Tuning (MetaData) - assumes block devs don't change
echo deadline > /sys/block/sdb/queue/scheduler
echo 128 > /sys/block/sdb/queue/nr_requests
" >> /etc/rc.local

And make the following edits:

  • in /etc/beegfs/beegfs-meta.conf, change tuneNumWorkers = 64
  • VERY important: since our cluster already has an existing BeeGFS, you need to change the port offset connPortShift to a non-zero (and >10) number. This variable has to be set in the following files on the MD server: beegfs-admon.conf, beegfs-meta.conf, beegfs-mgmtd.conf (a sketch of this edit follows the fstab example below).
  • change the /etc/fstab to reflect the UUID of the array, with the correct mount options:
...
# /beegfs FS for the meta, mgmt, admon services
UUID=95294298-cbd0-46ec-ad0e-a8b96762ab55  /beegfs ext4  rw,noatime,nodiratime,nobarrier,user_xattr 1 2
...
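
A minimal sketch of the connPortShift edit, assuming the stock config files already contain a connPortShift line and using an arbitrary illustrative shift of 16 (any consistent value >10 that doesn't collide with your existing FS will do; it must match across every daemon and client of this FS):

# hypothetical example: shift all ports for the new FS by 16
for f in beegfs-admon.conf beegfs-meta.conf beegfs-mgmtd.conf; do
   sed -i 's/^connPortShift.*/connPortShift = 16/' /etc/beegfs/$f
done
grep connPortShift /etc/beegfs/*.conf   # verify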

And after all that's done, start up the required daemons

/etc/init.d/beegfs-admon  start
/etc/init.d/beegfs-mgmtd  start
/etc/init.d/beegfs-meta   start

# to shut them down, reverse the order, though this does not seem to be critical.
/etc/init.d/beegfs-meta   stop
/etc/init.d/beegfs-mgmtd  stop
/etc/init.d/beegfs-admon  stop

BeeGFS Storage Servers

The storage servers we use are vanilla 36-bay chassis with a single CPU socket, lots of RAM, and a good hardware LSI disk controller. We use XFS, but ZFS is a good choice as well, especially if you favor reliability over speed.

To install the BeeGFS bits for the storage servers:

yum install beegfs-utils beegfs-opentk-lib beegfs-storage beegfs-common

the /raid-X arrays

We aim for ~40TB arrays: with 3TB disks we use two 17-disk RAID6s; with 4TB disks we use three 12-disk arrays. After much false economy, we've concluded there is nothing to be gained by using anything less than NAS-quality disks, and if you can spare the money, full enterprise disks are worthwhile. After the Backblaze reliability reports, we only use HGST disks.

After verifying all the disks via a long SMART test, assemble them into RAID6 arrays with parameters that make sense for your use case. We use 128KB strips or greater, write-thru with BBU, etc.
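
A sketch of that SMART verification, assuming smartmontools is installed and the disks are visible as plain /dev/sdX devices (behind a MegaRAID/LSI controller you may instead need smartctl's '-d megaraid,N' addressing; the device range below is illustrative):

# start long self-tests on each candidate disk (the test runs on the drive itself)
for d in /dev/sd{c..t}; do smartctl -t long $d; done

# hours later, check overall health and the self-test log for each disk
for d in /dev/sd{c..t}; do echo "== $d =="; smartctl -H -l selftest $d; done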

NB: There is a LOT of gibberish written about strips, stripes, chunks, blocks, the number of disks (and whether that number means data disks or data+parity disks), and how all of that relates to the underlying RAIDs and the mount parameters. I won't address that much here, except to note that if you're using a hardware RAID system like our /dfs2, all the RAIDs sit behind LSI hardware controllers and are therefore presented as generic /dev/sdX devices. Parameters that would matter for an mdadm RAID are therefore either ignored or intercepted and modified by the controller itself.

After the full initialization (we've tried LSI's fast initialization, but we've had some problems with it), make the FS with the parameters from the BeeGFS wiki:

# below, 'c d e' indicate the sdX devices presented by the RAID controller 
IBDEV=ib0  # your Infiniband device
DEV=(x c d e); # define DEV array so [1]=c, [2]=d, etc ([0]=x is omitted)
for ii in 1 2 3; do
   II=${DEV[$ii]} # set to shorten
   mkfs.xfs  -l version=2,su=128k -i size=512 /dev/sd${II}
   # where -i size is the (extended) inode size 
   # and get just the UUID value at the same time, for /etc/fstab later
   UUID=$(blkid -o value -s UUID /dev/sd${II})
   # and then mount them (note, use original $ii to name /raid)
   mount -onoatime,nodiratime,logbufs=8,logbsize=256k,largeio,\
inode64,swalloc,allocsize=131072k,nobarrier /dev/sd${II} /raid${ii}
   # and set the kernel params for each device as well:
   echo deadline > /sys/block/sd${II}/queue/scheduler
   echo 4096     > /sys/block/sd${II}/queue/nr_requests
   echo 4096     > /sys/block/sd${II}/queue/read_ahead_kb
   # and append those lines to /etc/rc.local to make sure it's permanent
   echo "
echo deadline > /sys/block/sd${II}/queue/scheduler
echo 4096     > /sys/block/sd${II}/queue/nr_requests
echo 4096     > /sys/block/sd${II}/queue/read_ahead_kb
" >> /etc/rc.local
echo "UUID for /dev/sd${II} = $UUID"
done

echo "Setting kernel params"
# and finally set the overall kernel params
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo connected > /sys/class/net/${IBDEV}/mode


# and write them to /etc/rc.local
echo "Writing kernel params to rc.local"
echo "
# overall kernel params
echo never    > /sys/kernel/mm/transparent_hugepage/enabled
echo 5        > /proc/sys/vm/dirty_background_ratio
echo 10       > /proc/sys/vm/dirty_ratio
echo 50       > /proc/sys/vm/vfs_cache_pressure
echo 262144   > /proc/sys/vm/min_free_kbytes
echo connected > /sys/class/net/${IBDEV}/mode
" >> /etc/rc.local

Infiniband setup

On a new server, Infiniband should already be set up, but if not, you may have to add the config files to bring up the IB interface AND to make sure that it's running with RDMA enabled.

The IB config file in /etc/sysconfig/network-scripts/ifcfg-ib0 should look like the following (with the IP # changed, of course). Note that the large MTU requires connected mode, which is set by the line added to rc.local above: echo connected > /sys/class/net/ib0/mode

DEVICE=ib0
IPADDR=10.2.254.179 
NETMASK=255.255.0.0
BOOTPROTO=none
ONBOOT=yes
MTU=65520 # <-- efficient transfer requires large MTU
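
On RHEL/CentOS-family systems, connected mode can usually also be requested from the ifcfg file itself rather than relying only on the rc.local echo; whether your ifup scripts / NetworkManager version honors this key is an assumption worth verifying:

CONNECTED_MODE=yes # <-- optional; check that your ifup scripts support this key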

This is not enough to ensure that the storage node is running in RDMA mode. You also have to make sure that the ibacm daemon is disabled and that the RDMA setup has been run (it should run automatically after the beegfs-opentk-lib package installation, but ...). Then restart the beegfs-storage server if it had already been started.

service ibacm stop
beegfs-setup-rdma
/etc/init.d/beegfs-storage restart

At this point, from a client, the output of the beegfs-net utility should show all the storage nodes connected via RDMA.

NB: Until a client has an established connection to the metadata server, it will show up in the output of beegfs-net as Connections: <none>. If that's the case, just do a short ls -lR on the mountpoint to establish the connection; beegfs-net will then show the type of connection.

meta_nodes
=============
dfm-2-1.local [ID: 42943]
   Connections: RDMA: 1 (10.2.255.139:8105); 

storage_nodes
=============
dfs-2-3.local [ID: 1]
   Connections: RDMA: 3 (10.2.254.165:8103); 
dfs-2-1.local [ID: 38261]
   Connections: RDMA: 2 (10.2.255.203:8103); 
dfs-2-2.local [ID: 48946]
   Connections: RDMA: 3 (10.2.255.144:8103); 

Storage server mods

As with the MD server, you need to make some similar edits on each storage server.

  • VERY important: since our cluster already has an existing BeeGFS, you need to change the port offset connPortShift to a non-zero (and >10) number. On the storage server this variable has to be set in: beegfs-storage.conf
  • if this is an ADDITIONAL storage server to an existing BeeGFS, there's no need to change this offset, unless the other servers in the same FS have been shifted. The new storage server has to match the other storage servers in this regard.
  • change the /etc/fstab to reflect the UUID of the array, with the correct mount options, including the quota options (uqnoenforce,gqnoenforce).
 ...
# assuming the 3 RAIDs created above
UUID=<use above-generated-UUID> /raid1 xfs rw,uqnoenforce,gqnoenforce,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier 1 2
UUID=<use above-generated-UUID> /raid2 xfs rw,uqnoenforce,gqnoenforce,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier 1 2
UUID=<use above-generated-UUID> /raid3 xfs rw,uqnoenforce,gqnoenforce,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier 1 2
...

# to verify the fstab lines, umount them, remount them, and check that 
# they came up correctly
umount /raid1 /raid2 /raid3
mount -a
df -h | grep raid
/dev/sdc              55T   35M   55T   1% /raid1
/dev/sdd              55T   35M   55T   1% /raid2
/dev/sde              50T   35M   50T   1% /raid3

Metadata server mods for adding a new storage server

(Usually) on the metadata server, you'll have to explicitly allow the new server to be added, unless this is your (dangerous) default.

Modify the mgmtd config file to allow new storage servers to be accepted:

# in  /etc/beegfs/beegfs-mgmtd.conf
# set to 'true' to accept new server
sysAllowNewServers             = true
# and then restart the mgmtd server
/etc/init.d/beegfs-mgmtd restart

# once the new storage server has been integrated, reset the 
# sysAllowNewServers param to 'false' and restart the server again.

And then start the storage server:

/etc/init.d/beegfs-storage start

That should be it. The metadata server picks up the probe from the new storage server and brings it onboard. df -h picks up the new space almost immediately, and over a fairly short time usage evens out across all the storage servers through normal file creation, deletion, and modification.
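
A hedged verification sketch from a node with beegfs-utils installed (with multiple BeeGFS FSs on the cluster, beegfs-ctl may need to be pointed at the right one, e.g. via its --mount= option or this FS's client config):

# list the registered storage nodes and targets; the new server should appear
beegfs-ctl --listnodes --nodetype=storage
beegfs-ctl --listtargets
# show free space per target as BeeGFS sees it
beegfs-df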

The admon monitor

BeeGFS ships a Java-based utility that's very good for monitoring the system (not so good for setting it up, hence the command lines above). To start it, open an X11-forwarded connection to the MD server and define an alias:

alias admon='java -jar /opt/beegfs/beegfs-admon-gui/beegfs-admon-gui.jar >& /dev/null &'
# & append it to your ~/.bashrc.
alias admon >> ~/.bashrc
admon  # should launch the admin GUI util.

# the default passwords are 'admin'.  Change them immediately.

You should be able to see the MD server and Storage servers up but quiescent at this point.

Modifying the Brenner setup-beegfs.sh config files

Adam Brenner has set up a script that will, if configured with the correct FS-specific dirs, do a complete removal and re-installation of all BeeGFS FSs in a well-controlled manner for either Ethernet or IB. The key is setting up those FS-specific dirs correctly and providing the necessary substitutions and mods.

The script lives here: /data/node-setup/setup-beegfs.sh

It currently supports 3 different BeeGFS FSs. It depends on:

  • the /etc/hosts entries for the new BeeGFS MD server being entered into the hosts stanza
  • the dirs in /data/node-setup/node-files/beegfs/configs/ being duplicated and then modified correctly for the new BeeGFS FS.
    • these dirs specify each of the different BeeGFS FSs and contain config files specific for each. Currently it includes:
      • dfs1.d dfs2.d fast-scratch.d lib (where the 'lib' is FS-independent)
  • adding some extra test stanzas in the script in fairly obvious places (if there are currently 3 tests, add another).
  • Each dir contains all the client files for each FS (a sketch of these edits follows this list):
    beegfs-client-autobuild.conf
    beegfs-client.conf           <-- edit for connPortShift
    beegfs-helperd.conf          <-- edit for connPortShift
    beegfs-libopentk.conf
    beegfs-mounts.conf           <-- edit for mountpoint
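
A sketch of setting up a new FS-specific dir, using hypothetical names throughout (new-fs.d, a /new-fs mountpoint, a port shift of 16, and the client config path the script installs to); substitute the real values for your new FS:

cd /data/node-setup/node-files/beegfs/configs/
cp -a dfs2.d new-fs.d     # clone an existing FS's dir as a starting point
cd new-fs.d
# shift the client/helperd ports to match this FS's servers
sed -i 's/^connPortShift.*/connPortShift = 16/' beegfs-client.conf beegfs-helperd.conf
# also point sysMgmtdHost in beegfs-client.conf at the new FS's MD/mgmtd host
# beegfs-mounts.conf maps "<mountpoint> <client config file>", one FS per line
echo "/new-fs /etc/beegfs/beegfs-client.conf" > beegfs-mounts.conf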
    

Client-side installs

Once the above edits are done, you can run /data/node-setup/setup-beegfs.sh on each client that needs the mount; it will de-install and completely re-install the BeeGFS client software and then mount all the referenced BeeGFS FSs.
