v0.1, 03 Dec 2008

Note
Executive Summary

If you have to transfer data, transfer only that which is necessary. If you unavoidably have lots of data to transfer, consider having your institution set up a GridFTP node.

If GridFTP is not available, the fastest, easiest, user-mode, node-to-node method to move data for Linux and MacOSX is with bbcp.

If you use Windows, fdt is Java-based and will run there as well.

Note that bbcp and the similar bbftp can require considerable tuning to extract maximum bandwidth. If these applications do not work at expected rates, ESNet’s Guide to Bulk Data Transfer over a WAN is an excellent summary of the deeper network issues.

We all need to transfer data, and the amount of that data is increasing as the world gets more digital. If it’s not climate model data from the IPCC, it’s high energy particle physics data from the LHC, or audio & video streams from a performance recording.

The usual methods of transferring data (scp, http and ftp utilities (such as curl or wget) work fine when your data is in the MB range, but when you have very large collections of data there are some tricks that are worth mentioning.

1. Compression & Encryption

Whether to compress and/or encrypt your data in transit depends on the cost of doing so. For a modern desktop or laptop computer, the CPU(s) are usually not doing much of anything so the cost incurred in doing the compression/encryption is generally not even noticed. However on an otherwise loaded machine, it can be significant, so it depends on what has to be done at the same time. Compression can reduce the amount of data that needs to be transmitted considerably if the data is of a type that is compressible (text, XML, uncompressed images and music), however progressively such data is already compressed on the disk (in the form of jpeg or mp3 compression), and compressing already compressed data yields little improvement. Some compression utilities try to detect already-compressed data and skip it, so there’s often no penalty in requesting compression, but some utilities (like the popular Linux archiving tar) will not detect it correctly and waste lots of time trying.

As an extreme example, here’s the timing of making a tar archive of a large directory that consists of mostly already compressed data, using compression or not.

Using compression:

$ time tar -czpf /bduc/data.tar.gz /data
tar: Removing leading `/' from member names

real    201m38.540s
user    95m32.114s
sys     7m13.807s

tar file = 84,284,016,900 bytes

NOT using compression:

$ time tar -cpf /bduc/data.tar /data
tar: Removing leading `/' from member names

real    127m13.404s
user    0m43.579s
sys     5m35.437s

tar file = 86,237,952,000

It took more than 74 minutes (about 58%) longer using compression which gained us about 2GB less storage (2.3% decrease in size.) YMMV.

Similarly, there is a computational cost to encrypting and decrypting a text, but less so than with compression. scp uses ssh to do the underlying encryption and it does a very good job, but like the other single-TCP-stream utilities like curl and wget, it will only be able to push so much thru a connection.

2. Avoiding data transfer

The most efficient way to transfer data is not to transfer it at all. There are a number of utilities that can be used to assist in NOT transferring data. Some of them are listed below.

2.1. kdirstat

The elegant, open source kdirstat (and it’s ports to MacOSX Disk Inventory X and Windows Windirstat) are quick ways to visualize what’s taking up space on your disk so you can either exclude the unwanted data that needs to be copied or delete it to make more space. All of these are fully native GUI applications that show disk space utilization by file type and directory structure.

kdirstat-main.png

2.2. rsync

rsync, from the fertile mind of Andrew (samba) Tridgell, is an application that will synchronize 2 directory trees, transferring only blocks which are different.

For example, if you had recently added some songs to your 120 GB MP3 collection and you wanted to refresh the collection to your backup machine, instead of sending the entire collection over the network, rsync would detect and send only the new songs.

For example, the first time rsync is used to transfer a directory tree, there will be no speedup.

$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/6vxd7_10_2.pdf
FF/Advanced_Networking_SDSC_Feb_1_minutes_HJM_fw.doc
FF/Amazon Logitech $30 MIR MX Revolution mouse.pdf
FF/Atbatt.com_receipt.gif
FF/BAG_bicycle_advisory_group.letter.doc
FF/BAG_bicycle_advisory_group.letter.odt
 ...

sent 355001628 bytes  received 10070 bytes  11270212.63 bytes/sec
total size is 354923169  speedup is 1.00

but a few minutes later after adding danish_wind_industry.html to the FF directory

$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/danish_wind_industry.html

sent 63294 bytes  received 48 bytes  126684.00 bytes/sec
total size is 354971578  speedup is 5604.05

So the synchronization has a speedup of 5600-fold relative to the initial transfer.

Even more efficiently, if you had a huge database to back up and you had recently modified it so that most of the bits were identical, rsync would send only the blocks that contained the differences.

Here’s a modest example using a small binary database file:

$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db

sent 13580195 bytes  received 42 bytes  9053491.33 bytes/sec
total size is 13578416  speedup is 1.00

After the transfer, I update the database and rsync it again:

$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db

sent 632641 bytes  received 22182 bytes  1309646.00 bytes/sec
total size is 13614982  speedup is 20.79

There are many utilities based on rsync that are used to synchronize data on 2 sides of a connection by only transmitting the differences. The backup utility BackupPC is one.

The open source rsync is included by default with almost all Linux distributions. Versions of rsync exist for Windows as well, via Cygwin and DeltaCopy

Note
MacOSX

rsync is included with MacOSX as well but because of the Mac’s twisted history of using the using the AppleSingle/AppleDouble file format (remember those Resource fork problems?), the version of rsync (2.6.9) shipped with OSX versions up to Leopard will not handle older Mac-native files correctly. However, rsync version 3.x will apparently do the conversions correctly.

2.3. Unison

Unison is a slightly different take on transmitting only changes. It uses a bi-directional sync algorithm to unify filesystems across a network. Native versions exist for Windows as well as Linux/Unix and it is usually available from the standard Linux repositories.

From a Ubuntu or Debian machine, to install it would require:

$ sudo apt-get install unison

3. Streaming Data Transfer

3.1. bbcp

bbcp seems to be a very similar utility to bbftp below, with the exception that it does not require a remote server running. In this behavior, it’s much more like scp in that data transfer requires only user-executable copies on both sides of the connection. Short of access to a GridFTP site, this appears to be the fastest, most convenient single-node method for transferring data.

The code compiled & installed easily with one manual intervention

curl http://www.slac.stanford.edu/~abh/bbcp/bbcp.tar.Z |tar -xZf -
cd bbcp
# edit Makefile to change line 18 to: LIBZ       =  /usr/lib/libz.a
make
# there is no *install* stanza in the distributed 'Makefile'
cp bin/your_arch/bbcp ~/bin   # if that's where you store your personal bins.
hash -r   # or 'rehash' if using cshrc
# bbcp now ready to use.

bbcp can act very much like scp for simple usage:

$ time bbcp  file.633M   user@remotehost.subnet.uci.edu:/high/perf/raid/file
real    0m9.023s

The file transferred in under 10s for a 633MB file, giving >63MB/s on a Gb net. Note that this is over our very fast internal campus backbone. That’s pretty good, but the transfer rate is sensitive to a number of things and can be tuned considerably. If you look at all the bbcp options, it’s obvious that bbcp was written to handle lots of exceptions.

If you increase the number of streams (-s) from the default 4 (as above), you can squeeze a bit more bandwidth from it as well:

$ bbcp -P 10 -w 2M -s 10 file.4.2G hjm@remotehost.subnet.uci.edu:/userdata/hjm/
bbcp: Creating /userdata/hjm/file.4.2G
bbcp: At 081210 12:48:18 copy 20% complete; 89998.2 KB/s
bbcp: At 081210 12:48:28 copy 41% complete; 89910.4 KB/s
bbcp: At 081210 12:48:38 copy 61% complete; 89802.5 KB/s
bbcp: At 081210 12:48:48 copy 80% complete; 88499.3 KB/s
bbcp: At 081210 12:48:58 copy 96% complete; 84571.9 KB/s

or almost 85MB/s for 4.2GB which is very good sustained transfer.

Even traversing the CENIC net from UCI to SDSC is fairly good:

$ time bbcp -P 2 -w 2M -s 10 file.633M   user@machine.sdsc.edu:~/test.file

bbcp: Source I/O buffers (61440K) > 25% of available free memory (200268K); copy may be slow
bbcp: Creating ./test.file
bbcp: At 081205 14:24:28 copy 3% complete; 23009.8 KB/s
bbcp: At 081205 14:24:30 copy 11% complete; 22767.8 KB/s
bbcp: At 081205 14:24:32 copy 20% complete; 25707.1 KB/s
bbcp: At 081205 14:24:34 copy 33% complete; 29374.4 KB/s
bbcp: At 081205 14:24:36 copy 41% complete; 28721.4 KB/s
bbcp: At 081205 14:24:38 copy 52% complete; 29320.0 KB/s
bbcp: At 081205 14:24:40 copy 61% complete; 29318.4 KB/s
bbcp: At 081205 14:24:42 copy 72% complete; 29824.6 KB/s
bbcp: At 081205 14:24:44 copy 81% complete; 29467.3 KB/s
bbcp: At 081205 14:24:46 copy 89% complete; 29225.5 KB/s
bbcp: At 081205 14:24:48 copy 96% complete; 28454.3 KB/s

real    0m26.965s

or almost 30MB/s.

When making the above test, I noticed the disks to and from which the data was being written can have a large effect on the transfer rate. If the data is not (or cannot be) cached in RAM, the transfer will eventually require the data to be read from or written to the disk. Depending on the storage system, this may slow the eventual transfer if the disk I/O cannot keep up with the the network. On the systems that I used in the example above, I saw this effect when I transferred the data to the /home partition (on a slow IDE disk - see below) rather than the higher performance RAID system that I used above.

$ time bbcp -P 2  file.633M  user@remotehost.subnet.uci.edu:/home/user/nother.big.file
bbcp: Creating /home/user/nother.big.file
bbcp: At 081205 13:59:57 copy 19% complete; 76545.0 KB/s
bbcp: At 081205 13:59:59 copy 43% complete; 75107.7 KB/s
bbcp: At 081205 14:00:01 copy 58% complete; 64599.1 KB/s
bbcp: At 081205 14:00:03 copy 59% complete; 48997.5 KB/s
bbcp: At 081205 14:00:05 copy 61% complete; 39994.1 KB/s
bbcp: At 081205 14:00:07 copy 64% complete; 34459.0 KB/s
bbcp: At 081205 14:00:09 copy 66% complete; 30397.3 KB/s
bbcp: At 081205 14:00:11 copy 69% complete; 27536.1 KB/s
bbcp: At 081205 14:00:13 copy 71% complete; 25206.3 KB/s
bbcp: At 081205 14:00:15 copy 72% complete; 23011.2 KB/s
bbcp: At 081205 14:00:17 copy 74% complete; 21472.9 KB/s
bbcp: At 081205 14:00:19 copy 77% complete; 20206.7 KB/s
bbcp: At 081205 14:00:21 copy 79% complete; 19188.7 KB/s
bbcp: At 081205 14:00:23 copy 81% complete; 18376.6 KB/s
bbcp: At 081205 14:00:25 copy 83% complete; 17447.1 KB/s
bbcp: At 081205 14:00:27 copy 84% complete; 16572.5 KB/s
bbcp: At 081205 14:00:29 copy 86% complete; 15929.9 KB/s
bbcp: At 081205 14:00:31 copy 88% complete; 15449.6 KB/s
bbcp: At 081205 14:00:33 copy 91% complete; 15039.3 KB/s
bbcp: At 081205 14:00:35 copy 93% complete; 14616.6 KB/s
bbcp: At 081205 14:00:37 copy 95% complete; 14278.2 KB/s
bbcp: At 081205 14:00:39 copy 98% complete; 13982.9 KB/s

real    0m46.103s

You can see how the transfer rate decays as it approaches the write capacity of the /home disk.

3.2. bbftp

bbftp is a modification of the FTP protocol that enables you to open multiple simultaneous TCP streams to transfer data. It therefore allows you to sometimes bypass per-TCP restrictions that result from badly configured intervening machines.

In order to use it, you 'll need a bbftp client and server. Most places that recieve large amounts of data (SDSC, NCAR, other supercomputer centers, teragrid nodes) will already have a bbftp server running, but you can also compile and run the server yourself.

The more usual case is to run only the client. It builds very easily on Linux with just the typical curl/untar, cd, ./configure, make, make install dance:

$ curl http://doc.in2p3.fr/bbftp/dist/bbftp-client-3.2.0.tar.gz |tar -xzvf -
$ cd bbftp-client-3.2.0/bbftpc/
$ ./configure --prefix=/usr/local
$ make -j3
$ sudo make install

Using bbftp is more complicated than the usual ftp client because it has its own syntax:

To send data to a server:

$ bbftp -s -e 'put file.154M  /gpfs/mangalam/big.file' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : put file.154M /gpfs/mangalam/big.file
<< OK
160923648 bytes send in 7.32 secs (2.15e+04 Kbytes/sec or 168 Mbits/s)


the arguments mean:
-s  use ssh encryption
-e  'local command'
-E  'remote command' (not used above, but often used to cd on the remote system)
-u  'user_login'
-p  # use # parallel TCP streams
-V  be verbose

The data was sent at 21MB/s to SDSC thru 10 parallel TCP streams (but well below the peak bandwidth of about 90MB/s on a Gb network)

To get data from a server:

$ bbftp -s -e 'get /gpfs/mangalam/big.file from.sdsc' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : get /gpfs/mangalam/big.file from.sdsc
<< OK
160923648 bytes got in 3.46 secs (4.54e+04 Kbytes/sec or 354 Mbits/s)

I was able to get the data at 45MB/s, about half of the theoretical maximum.

As a comparison, because the remote reciever is running an old (2.4) kernel which does not handle dynamic TCP window scaling, scp is only able to manage 2.2MB/s to this server:

$ scp  file.154M mangalam@tg-login1.sdsc.teragrid.org:/gpfs/mangalam/junk
Password:
file.154M                                  100%  153MB   2.2MB/s   01:10

3.3. Fast Data Transfer (fdt)

Fast Data Transfer is an application for moving data quickly writ in Java so it can theoretically run on any platform. The performance results on the web page are very impressive, but in local tests, it was slower than bbcp and the startup time for Java (as well as its failure to work in scp mode (couldn’t find the fdt.jar, even tho it was in the CLASSPATH, required you to explicitly start the receiving FDT server (not hard - see below, but another step)) argue somewhat against it.

Starting the server is easy; it starts by default in server mode:

java -jar ./fdt.jar
# usual Java verbosity omitted

The client uses the same jarfile but a different syntax:

java -jar ./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu  ~/file.633M  -d /userdata/hjm

# where
# -ss 1M  ..... sets the TCP SO_SND_BUFFER size to 1 MB
# -P 10 ....... uses 10 parallel streams (default is 1)
# -c host ..... defines the remote host
# -d dir ...... sets the remote dir

The speed is certainly impressive. Much more than scp:

# scp done over the same net, about the same time

$ scp file.4.2G  remotehost.domain.uci.edu:~
hjm@remotehost's password: ***********
 file.4.2G                   100% 4271MB  25.3MB/s   02:49
                                          ^^^^^^^^
# using the default 1 stream:
$ java -jar fdt.jar -c remotehost.domain.uci.edu ../file.4.2G -d /userdata/hjm/
[transferred in 86s for *53MB/s* ]

# with 10 streams and a larger buffer:
$ java -jar fdt.jar -P 10 -bs 1M -c remotehost.domain.uci.edu ../file.4.2G -d /userdata/hjm/
[transferred in 68s for *66MB/s* with 10 streams]

But fdt is slower than bbcp. The following test was done at about the same time between the same hosts:

bbcp -P 10 -w 2M -s 10 file.4.2G hjm@remotehost.domain.uci.edu:/userdata/hjm/
bbcp: Creating /userdata/hjm/file.4.2G
bbcp: At 081210 12:48:18 copy 20% complete; 89998.2 KB/s
bbcp: At 081210 12:48:28 copy 41% complete; 89910.4 KB/s
bbcp: At 081210 12:48:38 copy 61% complete; 89802.5 KB/s
bbcp: At 081210 12:48:48 copy 80% complete; 88499.3 KB/s
bbcp: At 081210 12:48:58 copy 96% complete; 84571.9 KB/s

3.4. GridFTP

If you and your colleagues have to transfer data in the range of multiple GBs and you have to do it regularly, it’s probably worth setting up a GridFTP site. Because it allows multipoint, multi-stream TCP connections, it can transfer data at mulitple GB/s. However, it’s beyond the scope of this simple doc to describe its setup and use, so if this sounds useful, bother your local network guru/sysadmin.

3.5. netcat

netcat (aka nc) is installed by default on most Linux and MacOSX systems. It provides a way of opening TCP or UDP network connections between nodes, acting as an open pipe thru which you can send any data as fast as the connection will allow, imposing no additional protocol load on the transfer. Because of its widespread availability and it’s speed, it can be used to transmit data between 2 points relatively quickly, especially if the data doesn’t need to be encrypted or compressed (or if it already is).

However, to use netcat, you have to have login privs on both ends of the connection and you need to explicitly set up a sender that waits for a connection request on a specific port from the receiver. This is less convenient to do than simply initiating an scp or rsync connection from one end, but may be worth the effort if the size of the data transfer is very large. To monitor the transfer, you also have to use something like pv (pipeviewer); netcat itself is quite laconic.

How it works: On the sending end, you need to set up a listening port:

[send_host]: $ pv -pet honkin.big.file | nc -q 1 -l -p 1234 <enter>

This sends the honkin.big.file thru pv -pet which will display progress, ETA, and time taken. The command will hang, listening (-l) for a connection from the other end. The -q 1 option tells the sender to wait 1s after getting the EOF and then quit.

On the receiving end, you connect to the nc listener

[receive_host] $ nc host.domain.uci.edu 1234 |pv -b > honkin.big.file <enter>

(note: no -p to indicate port on the receiving side). The -b option to pv shows only bytes received.

Once the receive_host command is inititated, the transfer starts, as can be seen by the pv output on the sending side and the bytecount on the receiving side. When it finishes, both sides terminate the connection 1s after getting the EOF.

This arrangement is slightly arcane, but supports the unix tools philosophy which allows you to chain various small tools together to perform a task. While the above example shows the case for a sinle large file, it can also be modified only slightly to do recursive transfers, using tar, shown here recursively copying the local sge directory to the remote host.

[send_host]: $ tar -czvf - sge | nc -q 1 -l -p 1234
[receive_host] $  nc host.domain.uci.edu 1234 |tar -xzvf -

In this case, I’ve added the verbose flag (-v) to the tar command so using pv is redundant. It also uses tar’s built-in compression flag (-c) to compress as it transmits.

You could also bundle the 2 together in a script, using ssh to execute the remote command. etc, etc, etc, etc.

4. Latest version of this Document

The latest version of this document should always be here.