v1.7, Mar 1st, 2012
Executive Summary
If you have to transfer data, transfer only that which is necessary. If you unavoidably have TBs to transfer regularly, consider having your institution set up a GridFTP node. If GridFTP is not available, and you’ve not added a Globus Online endpoint, the fastest, easiest, user-mode, node-to-node method to move data for Linux and MacOSX is with bbcp. The only exception to this is for extremely large directory trees, for which bbcp is inefficient due to the time required for building the directory tree. For first-time transfers of multi-GB directory trees containing 10,000s of files, the use of tar & netcat seems to be the fastest way to move the data. If you use Windows, fdt is Java-based and will run there as well. Note that bbcp and the similar bbftp can require considerable tuning to extract maximum bandwidth. If these applications do not work at expected rates, ESNet’s Guide to Bulk Data Transfer over a WAN is an excellent summary of the deeper network issues. And everyone should know how to use rsync, which is available on most *nix systems and should be the default fallback for most data transfers.
We all need to transfer data, and the amount of that data is increasing as the world gets more digital. If it’s not climate model data from the IPCC, it’s high energy particle physics data from the LHC, or audio & video streams from a performance recording.
1. Compression & Encryption
Whether to compress and/or encrypt your data in transit depends on the cost of doing so. For a modern desktop or laptop computer, the CPU(s) are usually not doing much of anything so the cost incurred in doing the compression/encryption is generally not even noticed. However on an otherwise loaded machine, it can be significant, so it depends on what has to be done at the same time. Compression can considerably reduce the amount of data that needs to be transmitted if the data is of a type that is compressible (text, XML, uncompressed images and music). However, such data is increasingly already compressed on the disk (in the form of jpeg or mp3 compression), and compressing already-compressed data yields little improvement. Some compression utilities try to detect already-compressed data and skip it, so there’s often no penalty in requesting compression, but some utilities (like the popular Linux archiving tool tar) will not detect it correctly and waste lots of time trying.
As an extreme example, here’s the timing of making a tar archive of a large directory that consists of mostly already compressed data, using compression or not.
Using compression:
$ time tar -czpf /bduc/data.tar.gz /data
tar: Removing leading `/' from member names
real 201m38.540s
user 95m32.114s
sys 7m13.807s

tar file = 84,284,016,900 bytes
NOT using compression:
$ time tar -cpf /bduc/data.tar /data
tar: Removing leading `/' from member names
real 127m13.404s
user 0m43.579s
sys 5m35.437s

tar file = 86,237,952,000 bytes
It took more than 74 minutes (about 58%) longer using compression, which saved only about 2GB of storage (a 2.3% decrease in size). YMMV.
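One way to avoid that kind of surprise is to test-compress a small sample before committing to a multi-hour run. The following is only a sketch: sample_ratio is a hypothetical helper name (it is not part of tar or gzip), and the sample covers just the first few MB of the stream, so it is a rough indicator rather than a measurement of the whole data set.

```shell
# sample_ratio: read up to N MB from stdin, gzip that sample, and print the
# compressed size as a percentage of the sampled size.
# Hypothetical helper; assumes head -c, gzip, wc, mktemp and awk are available.
sample_ratio() {
    mb=${1:-16}                              # sample size in MB (default 16)
    tmp=$(mktemp) || return 1
    head -c "$((mb * 1048576))" > "$tmp"     # keep only the first N MB of stdin
    raw=$(wc -c < "$tmp")
    packed=$(gzip -c < "$tmp" | wc -c)
    rm -f "$tmp"
    # Near 100 means the data is effectively incompressible; small numbers mean
    # compression is probably worth the CPU time.
    awk -v r="$raw" -v p="$packed" \
        'BEGIN { if (r == 0) print 0; else printf "%d\n", 100 * p / r }'
}
```

For a single file, run sample_ratio 16 < file. For a directory, something like tar -cf - /data 2>/dev/null | sample_ratio 64 samples the start of the tar stream (tar may complain when the pipe closes early, hence the redirect). A result in the high 90s suggests skipping the -z flag.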
Similarly, there is a computational cost to encrypting and decrypting a text, but less so than with compression. scp uses ssh to do the underlying encryption and it does a very good job, but like the other single-TCP-stream utilities like curl and wget, it will only be able to push so much thru a connection.
2. Avoiding data transfer
The most efficient way to transfer data is not to transfer it at all. There are a number of utilities that can be used to assist in NOT transferring data. Some of them are listed below.
2.1. kdirstat
The elegant, open source kdirstat (and its ports to MacOSX, Disk Inventory X, and Windows, Windirstat) is a quick way to visualize what’s taking up space on your disk so you can either exclude unwanted data from what needs to be copied or delete it to make more space. All of these are fully native GUI applications that show disk space utilization by file type and directory structure.
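Where no GUI is available (for example on a remote server reached over ssh), a rough command-line approximation of the same idea can be put together from du and sort. This is a sketch only; biggest is a hypothetical helper name, and it ignores hidden (dot) entries because of the shell glob.

```shell
# biggest: list the N largest entries directly under a directory, sizes in KB.
# Hypothetical helper, not part of kdirstat or any of its ports.
biggest() {
    dir=${1:-.}
    n=${2:-10}
    # du -sk prints "size_in_KB<TAB>path" for each entry;
    # sort -rn puts the largest first.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}
```

Running biggest /data 20 would show the 20 largest files or subdirectories under /data, which is usually enough to decide what to exclude from a transfer.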
2.2. rsync
rsync, from the fertile mind of Andrew (samba) Tridgell, is an application that will synchronize 2 directory trees, transferring only blocks which are different.
rsync vs bbcp
bbcp can act similarly to rsync but will only checksum entire files, not blocks, so for sub-GB transfers, rsync is probably a better choice in general. For very large files or directory trees, bbcp may be a better choice due to its multi-stream protocol and therefore better bandwidth utilization. Note also that rsync is often used with ssh as the remote shell protocol. If this is the case and you’re using it to transfer large amounts of data, note that there is a known ssh bug with the static flow control buffers that cripples it for large data transfers. There is a well-maintained patch for ssh that addresses this at the High Performance SSH/SCP page. This is well worth pursuing if you use rsync or scp for large transfers.
For example, if you had recently added some songs to your 120 GB MP3 collection and you wanted to refresh the collection to your backup machine, instead of sending the entire collection over the network, rsync would detect and send only the new songs.
The first time rsync is used to transfer a directory tree, there will be no speedup:
$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/6vxd7_10_2.pdf
FF/Advanced_Networking_SDSC_Feb_1_minutes_HJM_fw.doc
FF/Amazon Logitech $30 MIR MX Revolution mouse.pdf
FF/Atbatt.com_receipt.gif
FF/BAG_bicycle_advisory_group.letter.doc
FF/BAG_bicycle_advisory_group.letter.odt
...
sent 355001628 bytes received 10070 bytes 11270212.63 bytes/sec
total size is 354923169 speedup is 1.00
but a few minutes later after adding danish_wind_industry.html to the FF directory
$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/danish_wind_industry.html
sent 63294 bytes received 48 bytes 126684.00 bytes/sec
total size is 354971578 speedup is 5604.05
So the second synchronization shows a speedup of about 5600-fold relative to resending the whole tree.
Even more efficiently, if you had a huge database to back up and you had recently modified it so that most of the bits were identical, rsync would send only the blocks that contained the differences.
Here’s a modest example using a small binary database file:
$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db
sent 13580195 bytes received 42 bytes 9053491.33 bytes/sec
total size is 13578416 speedup is 1.00
After the transfer, I update the database and rsync it again:
$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db
sent 632641 bytes received 22182 bytes 1309646.00 bytes/sec
total size is 13614982 speedup is 20.79
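The speedup figure in each of the rsync runs above is simply the total size of the data divided by the bytes that actually crossed the wire (sent plus received). A small sketch to reproduce the numbers from the listings; rsync_speedup is a hypothetical helper name, and the inputs are copied from the output shown above.

```shell
# rsync_speedup: total size / (bytes sent + bytes received), to 2 decimals.
# Arguments: total_size sent received
rsync_speedup() {
    awk -v t="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.2f\n", t / (s + r) }'
}

rsync_speedup 354923169 355001628 10070   # first FF transfer      -> 1.00
rsync_speedup 354971578 63294 48          # FF after adding a file -> 5604.05
rsync_speedup 13614982 632641 22182       # updated mlocate.db     -> 20.79
```

Read this way, the mlocate.db result says that only about 1/20 of the file had to be sent, even though the whole file had nominally changed.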
There are many utilities based on rsync that are used to synchronize data on 2 sides of a connection by only transmitting the differences. The backup utility BackupPC is one.
The open source rsync is included by default with almost all Linux distributions. Versions of rsync exist for Windows as well, via Cygwin and DeltaCopy.
MacOSX
rsync is included with MacOSX as well but because of the Mac’s twisted history of using the AppleSingle/AppleDouble file format (remember those Resource fork problems?), the version of rsync (2.6.9) shipped with OSX versions up to Leopard will not handle older Mac-native files correctly. However, rsync version 3.x will apparently do the conversions correctly.
2.3. Unison
Unison is a slightly different take on transmitting only changes. It uses a bi-directional sync algorithm to unify filesystems across a network. Native versions exist for Windows as well as Linux/Unix and it is usually available from the standard Linux repositories.
On an Ubuntu or Debian machine, installing it requires:
$ sudo apt-get install unison
3. Streaming Data Transfer
3.1. bbcp
bbcp seems to be a very similar utility to bbftp below, with the exception that it does not require a remote server running. In this behavior, it’s much more like scp in that data transfer requires only user-executable copies on both sides of the connection. Short of access to a GridFTP site, this appears to be the fastest, most convenient single-node method for transferring data.
The code compiled & installed easily with one manual intervention:
curl http://www.slac.stanford.edu/~abh/bbcp/bbcp.tgz |tar -xzf -
cd bbcp
make
# edit Makefile to change line 18 to: LIBZ = /usr/lib/libz.a
make
# there is no *install* stanza in the distributed 'Makefile'
cp bin/your_arch/bbcp ~/bin # if that's where you store your personal bins.
hash -r # or 'rehash' if using csh
# bbcp now ready to use.
bbcp can act very much like scp for simple usage:
$ time bbcp file.633M user@remotehost.subnet.uci.edu:/high/perf/raid/file
real 0m9.023s
The 633MB file transferred in just over 9s, or about 70MB/s on a Gb net. Note that this is over our very fast internal campus backbone. That’s pretty good, but the transfer rate is sensitive to a number of things and can be tuned considerably. If you look at all the bbcp options, it’s obvious that bbcp was written to handle lots of exceptions.
If you increase the number of streams (-s) from the default 4 (as above), you can squeeze a bit more bandwidth from it as well:
$ bbcp -P 10 -w 2M -s 10 file.4.2G hjm@remotehost.subnet.uci.edu:/userdata/hjm/
bbcp: Creating /userdata/hjm/file.4.2G
bbcp: At 081210 12:48:18 copy 20% complete; 89998.2 KB/s
bbcp: At 081210 12:48:28 copy 41% complete; 89910.4 KB/s
bbcp: At 081210 12:48:38 copy 61% complete; 89802.5 KB/s
bbcp: At 081210 12:48:48 copy 80% complete; 88499.3 KB/s
bbcp: At 081210 12:48:58 copy 96% complete; 84571.9 KB/s
or almost 85MB/s for 4.2GB, which is a very good sustained transfer rate.
Even traversing the CENIC net from UCI to SDSC is fairly good:
$ time bbcp -P 2 -w 2M -s 10 file.633M user@machine.sdsc.edu:~/test.file
bbcp: Source I/O buffers (61440K) > 25% of available free memory (200268K); copy may be slow
bbcp: Creating ./test.file
bbcp: At 081205 14:24:28 copy 3% complete; 23009.8 KB/s
bbcp: At 081205 14:24:30 copy 11% complete; 22767.8 KB/s
bbcp: At 081205 14:24:32 copy 20% complete; 25707.1 KB/s
bbcp: At 081205 14:24:34 copy 33% complete; 29374.4 KB/s
bbcp: At 081205 14:24:36 copy 41% complete; 28721.4 KB/s
bbcp: At 081205 14:24:38 copy 52% complete; 29320.0 KB/s
bbcp: At 081205 14:24:40 copy 61% complete; 29318.4 KB/s
bbcp: At 081205 14:24:42 copy 72% complete; 29824.6 KB/s
bbcp: At 081205 14:24:44 copy 81% complete; 29467.3 KB/s
bbcp: At 081205 14:24:46 copy 89% complete; 29225.5 KB/s
bbcp: At 081205 14:24:48 copy 96% complete; 28454.3 KB/s
real 0m26.965s
or almost 30MB/s.
When making the above test, I noticed that the disks from which the data is read and to which it is written can have a large effect on the transfer rate. If the data is not (or cannot be) cached in RAM, the transfer will eventually require the data to be read from or written to the disk. Depending on the storage system, this may slow the eventual transfer if the disk I/O cannot keep up with the network. On the systems that I used in the example above, I saw this effect when I transferred the data to the /home partition (on a slow IDE disk - see below) rather than the higher performance RAID system that I used above.
$ time bbcp -P 2 file.633M user@remotehost.subnet.uci.edu:/home/user/nother.big.file
bbcp: Creating /home/user/nother.big.file
bbcp: At 081205 13:59:57 copy 19% complete; 76545.0 KB/s
bbcp: At 081205 13:59:59 copy 43% complete; 75107.7 KB/s
bbcp: At 081205 14:00:01 copy 58% complete; 64599.1 KB/s
bbcp: At 081205 14:00:03 copy 59% complete; 48997.5 KB/s
bbcp: At 081205 14:00:05 copy 61% complete; 39994.1 KB/s
bbcp: At 081205 14:00:07 copy 64% complete; 34459.0 KB/s
bbcp: At 081205 14:00:09 copy 66% complete; 30397.3 KB/s
bbcp: At 081205 14:00:11 copy 69% complete; 27536.1 KB/s
bbcp: At 081205 14:00:13 copy 71% complete; 25206.3 KB/s
bbcp: At 081205 14:00:15 copy 72% complete; 23011.2 KB/s
bbcp: At 081205 14:00:17 copy 74% complete; 21472.9 KB/s
bbcp: At 081205 14:00:19 copy 77% complete; 20206.7 KB/s
bbcp: At 081205 14:00:21 copy 79% complete; 19188.7 KB/s
bbcp: At 081205 14:00:23 copy 81% complete; 18376.6 KB/s
bbcp: At 081205 14:00:25 copy 83% complete; 17447.1 KB/s
bbcp: At 081205 14:00:27 copy 84% complete; 16572.5 KB/s
bbcp: At 081205 14:00:29 copy 86% complete; 15929.9 KB/s
bbcp: At 081205 14:00:31 copy 88% complete; 15449.6 KB/s
bbcp: At 081205 14:00:33 copy 91% complete; 15039.3 KB/s
bbcp: At 081205 14:00:35 copy 93% complete; 14616.6 KB/s
bbcp: At 081205 14:00:37 copy 95% complete; 14278.2 KB/s
bbcp: At 081205 14:00:39 copy 98% complete; 13982.9 KB/s
real 0m46.103s
You can see how the transfer rate decays as it approaches the write capacity of the /home disk.
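Before blaming the network for a decay like this, it can help to get a rough idea of what the target disk can sustain. The sketch below is one crude way to do that with dd; disk_write_rate is a hypothetical helper name, the timing has only whole-second resolution, and because it writes zeros, storage that compresses or heavily caches data will look faster than it really is.

```shell
# disk_write_rate: rough sequential write speed of a directory's filesystem,
# printed in MB/s. Hypothetical helper; treat the result as an estimate only.
disk_write_rate() {
    dir=$1
    mb=${2:-1024}                       # amount of data to write, in MB
    f="$dir/.write_test.$$"
    start=$(date +%s)
    dd if=/dev/zero of="$f" bs=1048576 count="$mb" 2>/dev/null
    sync                                # push the data out of the RAM cache
    end=$(date +%s)
    rm -f "$f"
    # Whole-second timing: write enough data that the test runs for many
    # seconds. Elapsed times under 1s are clamped to 1s.
    awk -v m="$mb" -v t="$((end - start))" \
        'BEGIN { if (t < 1) t = 1; printf "%d\n", m / t }'
}
```

For example, disk_write_rate /home/user 2048 writes 2GB to the /home filesystem and prints the approximate MB/s. If that number is well below the network rate you saw to a faster disk, the disk is the likely bottleneck.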
bbcp can recursively copy directories with the -r flag. Like rsync, it first has to build a file list to send to the receiver, but unlike rsync, it doesn’t tell you that it’s doing that, so unless you use the -D (debug) flag, it looks like it has just hung. The time required to build the file list is of course proportional to the complexity of the recursive directory scan. It can also do incremental copies like rsync with the -a -k flags.
Note that bbcp is very slow at copying deep directory trees of small files. If you need to copy such trees, you should first tar up the trees and use bbcp to copy the tarball. Such an approach will increase the transfer speed enormously. The most recent version of bbcp can use the -N named pipes option to use external programs or pipes to feed the network stream.
root@cranky $ bbcp -a -k -p -P 5 -w 2M -s 10 -V -r home \
user@filer:/data/backups/
Source cranky.nacs.uci.edu using initial send window of 1048576
Target filer.nacs.uci.edu using initial recv window of 1048576
(silently builds file list; can take hours depending on size of dir)
...
bbcp_SNK 27260: Append signature file is /root/.bbcp/bbcp.10.103.1.1.90160021a89.I.0186.s19
bbcp_SNK 27260: Received from 10.103.1.1: 28871 f 9905808614325 644 8192 4e52a99d 4e52a99d /path/to/file...
bbcp_SNK 27260: Append signature file is /root/.bbcp/bbcp.10.103.1.1.90160021a89.I.0037.s12
bbcp_SNK 27260: Received from 10.103.1.1: 28872 f 9905808614326 644 8192 4e52a99d 4e52a99d /path/to/file...
bbcp_SNK 27260: Append signature file is /root/.bbcp/bbcp.10.103.1.1.90160021a89.I.0091.s13
bbcp_SNK 27260: Received from 10.103.1.1: 28873 f 9905808614327 644 8192 4e52a99d 4e52a99d /path/to/file...
...
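For the deep-trees-of-small-files case mentioned above, the tar-first approach might look like the following sketch. pack_and_send is a hypothetical helper name, the bbcp flags are simply the ones used in the earlier single-file examples, and you need enough free space next to the tree to hold the tarball.

```shell
# pack_and_send: tar up a directory tree, then copy the single tarball with bbcp.
# Hypothetical helper. Arguments: tree destination (e.g. user@host:/path/)
pack_and_send() {
    tree=$1
    dest=$2
    tarball="${tree%/}.tar"
    # -C keeps member names relative to the tree's parent directory.
    tar -cf "$tarball" -C "$(dirname "$tree")" "$(basename "$tree")" || return 1
    # Set BBCP='echo bbcp' to print the command instead of running it (dry run).
    ${BBCP:-bbcp} -P 10 -w 2M -s 10 "$tarball" "$dest"
}
```

After the copy, the tarball still has to be unpacked on the receiving side (tar -xf tree.tar) and removed on both ends.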
3.2. bbftp
bbftp is a modification of the FTP protocol that enables you to open multiple simultaneous TCP streams to transfer data. It therefore allows you to sometimes bypass per-TCP restrictions that result from badly configured intervening machines.
In order to use it, you’ll need a bbftp client and server. Most places that receive large amounts of data (SDSC, NCAR, other supercomputer centers, teragrid nodes) will already have a bbftp server running, but you can also compile and run the server yourself.
The more usual case is to run only the client. It builds very easily on Linux with just the typical curl/untar, cd, ./configure, make, make install dance:
$ curl http://doc.in2p3.fr/bbftp/dist/bbftp-client-3.2.0.tar.gz |tar -xzvf -
$ cd bbftp-client-3.2.0/bbftpc/
$ ./configure --prefix=/usr/local
$ make -j3
$ sudo make install
Using bbftp is more complicated than the usual ftp client because it has its own syntax:
To send data to a server:
$ bbftp -s -e 'put file.154M /gpfs/mangalam/big.file' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : put file.154M /gpfs/mangalam/big.file
<< OK
160923648 bytes send in 7.32 secs (2.15e+04 Kbytes/sec or 168 Mbits/s)

the arguments mean:
-s use ssh encryption
-e 'local command'
-E 'remote command' (not used above, but often used to cd on the remote system)
-u 'user_login'
-p # use # parallel TCP streams
-V be verbose
The data was sent at 21MB/s to SDSC thru 10 parallel TCP streams, well below the peak bandwidth of about 90MB/s on a Gb network.
To get data from a server:
$ bbftp -s -e 'get /gpfs/mangalam/big.file from.sdsc' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : get /gpfs/mangalam/big.file from.sdsc
<< OK
160923648 bytes got in 3.46 secs (4.54e+04 Kbytes/sec or 354 Mbits/s)
I was able to get the data at 45MB/s, about half of the theoretical maximum.
As a comparison, because the remote receiver is running an old (2.4) kernel which does not handle dynamic TCP window scaling, scp is only able to manage 2.2MB/s to this server:
$ scp file.154M mangalam@tg-login1.sdsc.teragrid.org:/gpfs/mangalam/junk
Password:
file.154M 100% 153MB 2.2MB/s 01:10
3.3. Fast Data Transfer (fdt)
Fast Data Transfer is an application for moving data quickly. It is written in Java so it can theoretically run on any platform. The performance results on the web page are very impressive, but in local tests it was slower than bbcp. The startup time for Java, its failure to work in scp mode (it couldn’t find the fdt.jar, even tho it was in the CLASSPATH), and the need to explicitly start the receiving FDT server (not hard, see below, but another step) argue somewhat against it.
Starting the server is easy; it starts by default in server mode:
java -jar ./fdt.jar # usual Java verbosity omitted
The client uses the same jarfile but a different syntax:
java -jar ./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu ~/file.633M -d /userdata/hjm
# where
# -ss 1M ..... sets the TCP SO_SND_BUFFER size to 1 MB
# -P 10 ....... uses 10 parallel streams (default is 1)
# -c host ..... defines the remote host
# -d dir ...... sets the remote dir
The speed is certainly impressive. Much more than scp:
# scp done over the same net, about the same time
$ scp file.4.2G remotehost.domain.uci.edu:~
hjm@remotehost's password: ***********
file.4.2G 100% 4271MB 25.3MB/s 02:49
^^^^^^^^
# using the default 1 stream:
$ java -jar fdt.jar -c remotehost.domain.uci.edu ../file.4.2G -d /userdata/hjm/
[transferred in 86s for *53MB/s* ]

# with 10 streams and a larger buffer:
$ java -jar fdt.jar -P 10 -bs 1M -c remotehost.domain.uci.edu ../file.4.2G -d /userdata/hjm/
[transferred in 68s for *66MB/s* with 10 streams]
But fdt is slower than bbcp. The following test was done at about the same time between the same hosts:
bbcp -P 10 -w 2M -s 10 file.4.2G hjm@remotehost.domain.uci.edu:/userdata/hjm/
bbcp: Creating /userdata/hjm/file.4.2G
bbcp: At 081210 12:48:18 copy 20% complete; 89998.2 KB/s
bbcp: At 081210 12:48:28 copy 41% complete; 89910.4 KB/s
bbcp: At 081210 12:48:38 copy 61% complete; 89802.5 KB/s
bbcp: At 081210 12:48:48 copy 80% complete; 88499.3 KB/s
bbcp: At 081210 12:48:58 copy 96% complete; 84571.9 KB/s
3.4. GridFTP
If you and your colleagues have to transfer data in the range of multiple GBs and you have to do it regularly, it’s probably worth setting up a GridFTP site. Because it allows multipoint, multi-stream TCP connections, it can transfer data at multiple GB/s. However, it’s beyond the scope of this simple doc to describe its setup and use, so if this sounds useful, bother your local network guru/sysadmin.
3.5. Globus Online
Globus Online is grid technology that has been wrapped in a web interface to enable mere humans to use the capabilities of the GridFTP system to transfer very large amounts of data very quickly between nodes that are part of this system. The tagline is "Move files fast. No IT required." The advantage is that when it’s set up, it works very well. The disadvantages are that the "No IT required" part is, at this point, fairly optimistic and that it will only work between registered sites (which you can add with the Globus Connect process), so it’s not very useful for ad hoc file transfers.
Snarky Point of Contention: The documentation overuses the word "seamlessly", which every computer user realizes is a contraction of "seamlessly if nothing goes wrong and your setup is exactly like mine and monkeys fly out my butt". YMMV.
However, if you have a set of endpoints that frequently need to transfer large amounts of data, this approach would be very useful. Especially if you run a cluster or other multi-user system, there is an associated utility called Globus Connect Multi-User which will allow all users of a registered endpoint to use the Globus transfer capabilities.
I presume it will be getting progressively easier to use as time passes.
3.6. netcat
netcat (aka nc) is installed by default on most Linux and MacOSX systems. It provides a way of opening TCP or UDP network connections between nodes, acting as an open pipe thru which you can send any data as fast as the connection will allow, imposing no additional protocol load on the transfer. Because of its widespread availability and its speed, it can be used to transmit data between 2 points relatively quickly, especially if the data doesn’t need to be encrypted or compressed (or if it already is).
However, to use netcat, you have to have login privs on both ends of the connection and you need to explicitly set up a listener that waits for a connection request on a specific port from the receiver. This is less convenient to do than simply initiating an scp or rsync connection from one end, but may be worth the effort if the size of the data transfer is very large. To monitor the transfer, you also have to use something like pv (pipeviewer); netcat itself is quite laconic.
How it works: On one end (the sending end, in this case), you need to set up a listening port:
[send_host]: $ pv -pet honkin.big.file | nc -q 1 -l 1234 <enter>
This sends the honkin.big.file thru pv -pet which will display progress, ETA, and time taken. The command will hang, listening (-l) for a connection from the other end. The -q 1 option tells the sender to wait 1s after getting the EOF and then quit.
On the receiving end, you connect to the nc listener
[receive_host] $ nc sender.net.uci.edu 1234 |pv -b > honkin.big.file <enter>
(note: no -p to indicate port on the receiving side). The -b option to pv shows only bytes received.
Once the receive_host command is initiated, the transfer starts, as can be seen by the pv output on the sending side and the bytecount on the receiving side. When it finishes, both sides terminate the connection 1s after getting the EOF.
This arrangement is slightly arcane, but supports the unix tools philosophy which allows you to chain various small tools together to perform a task. While the above example shows the case for a single large file, it can also be modified only slightly to do recursive transfers, using tar, shown here recursively copying the local sge directory to the remote host.
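A minimal sketch of that recursive variant, assuming the same nc and pv options as the single-file example above. send_tree and recv_tree are hypothetical names, sender.net.uci.edu and port 1234 are the placeholders already used above, and the -q option is not available in every netcat variant.

```shell
# Sender: stream a directory as a tar archive into a listening nc.
# pv -b shows only the byte count, since the size of a tar stream
# is not known in advance.
send_tree() {      # usage: send_tree sge 1234
    tar -cf - "$1" | pv -b | nc -q 1 -l "$2"
}

# Receiver: connect to the listener and unpack the stream into the
# current directory.
recv_tree() {      # usage: recv_tree sender.net.uci.edu 1234
    nc "$1" "$2" | tar -xf -
}
```

On the sending host you would cd to the parent of sge and run send_tree sge 1234; on the receiving host, cd to the destination and run recv_tree sender.net.uci.edu 1234.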
3.6.1. tar and netcat
The combination of these 2 crusty relics from the stone age of Unix is remarkably effective for moving data if you don’t need encryption. Since they impose very little protocol overhead on the data, the transfer can run at close to wire speed for large files. Compression can be added with the tar options of -z (gzip) or -j (bzip2).
The setup is not as trivial as rsync, scp, or bbcp, since it requires commands to be issued at both ends of the connection, but for large transfers, the speed payoff is non-trivial. For example, using a single rsync on a 10Gb private connection, we were getting only about 30MB/s, mostly because of many tiny files. Using tar/netcat, the average speed went up to about 100MB/s. And using multiple tar/netcat combinations to move specific subdirs, we were able to get an average of 500GB/hr, still not great (~14% of theoretical max), but about 5x better than rsync alone.
Note that you can set up the listener on either side. In this example, I’ve set the listener to the receiving side.
In the following example, the receiver is 10.255.78.10; the sender is 10.255.78.2.
First start the listener waiting on port 12378
[receive_host] $ nc -l receiver port_# | tar -xzf -
# eg
$ nc -l 10.255.78.10 12378 | tar -xzf -
# when the command is issued, the prompt hangs, waiting for the sender to start
Below, sender is the sender interface (by IP # or hostname) you want to use. Often a server will have many and you will want to use a specific one.
[send_host]: $ tar -czvf - dir_target | nc -s sender receiver port_#
# eg
$ tar -czvf - fmri_classic | nc -s 10.255.78.2 10.255.78.10 12378
In this case, I’ve added the verbose flag (-v) to the tar command on the sender side so using pv is redundant. It also uses tar’s built-in compression flag (-z) to compress/decompress as it transmits. Depending on the bandwidth available to you and the CPUs of the hosts, this may actually slow transmission. As noted above, it’s most effective on bandwidth-limited channels.
You could also bundle the 2 together in a script, using ssh to execute the remote command.
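One possible shape for such a script is sketched below. It is an illustration under several assumptions rather than a tested tool: tar_nc_push is a hypothetical name, it assumes passwordless ssh to the receiver, it uses the plain nc -l port listener form (some netcat variants want nc -l -p port, and some senders need -q 1 to exit at EOF), it omits compression, and the fixed sleep is a crude way to wait for the listener.

```shell
# tar_nc_push: start a tar/netcat listener on the receiver via ssh, then
# stream a local directory to it. Hypothetical helper.
# Arguments: local_dir receiving_host remote_dir [port]
tar_nc_push() {
    dir=$1
    rhost=$2
    rdir=$3                 # must not contain single quotes
    port=${4:-12378}
    # Start the remote listener in the background.
    ssh "$rhost" "cd '$rdir' && nc -l $port | tar -xf -" &
    sleep 2                 # crude: give the listener time to start
    # Send the tree with member names relative to its parent directory.
    tar -cf - -C "$(dirname "$dir")" "$(basename "$dir")" | nc "$rhost" "$port"
    wait                    # wait for the remote tar to finish unpacking
}
```

With the hosts from the example above, tar_nc_push fmri_classic 10.255.78.10 /data/incoming would unpack fmri_classic under /data/incoming on the receiver (the remote path here is made up for illustration).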
4. Latest version of this Document
The latest version of this document should always be here.