How to transfer large amounts of data via network
==================================================
by Harry Mangalam
//v0.1, 03 Dec 2008
//Harry Mangalam mailto:harry.mangalam@uci.edu[harry.mangalam@uci.edu]
// Convert this file to HTML & move it to its final dest with the command:
// asciidoc -a toc -a numbered HOWTO_move_data.txt; scp HOWTO_move_data.* moo:~/public_html

[NOTE]
.Executive Summary
============================================================================
<<avoid,Avoid data transfer>> as much as possible by detecting what does not
need to be transferred. If you have to transfer data, transfer only what is
necessary. If you unavoidably have *lots of data* to transfer, consider
having your institution set up a GridFTP node. Failing that, the fastest
user-mode, node-to-node way to move data is with bbcp.
============================================================================

We all need to transfer data, and the amount of that data is increasing as
the world gets more digital. If it's not climate model data from the IPCC,
it's high energy particle physics data from the LHC (or your MP3
collection). The usual methods of transferring data
(http://en.wikipedia.org/wiki/Secure_copy[scp], and
http://en.wikipedia.org/wiki/Http[http] and
http://en.wikipedia.org/wiki/Ftp[ftp] utilities such as
http://curl.haxx.se/[curl] or http://en.wikipedia.org/wiki/Wget[wget]) work
fine when your data is in the MB range, but when you have very large
collections of data, there are some tricks worth mentioning.

[[compression]]
Compression & Encryption
------------------------
Whether to compress and/or encrypt your data in transit depends on the cost
of doing so. On a modern desktop or laptop, the CPU(s) are usually close to
idle, so the cost of compression/encryption generally isn't even noticed.
On an otherwise loaded machine, however, it can be significant, so it
depends on what else has to be done at the same time.

Compression can considerably reduce the amount of data that needs to be
transmitted if the data is of a compressible type (text, XML, uncompressed
images and music). Increasingly, however, such data is already stored in
compressed form on disk, and compressing already-compressed data yields
little improvement. Most compression utilities try to detect
already-compressed data and skip it, so there is usually no penalty in
requesting compression, but some utilities will not detect it correctly and
waste a lot of time.

Similarly, there is a computational cost to encrypting and decrypting data,
but less so than with compression. 'scp' uses 'ssh' to do the underlying
encryption and it does a very good job, but like the other single-stream
utilities such as 'curl' and 'wget', it will only be able to push so much
thru a single connection.

[[avoid]]
Avoiding data transfer
----------------------
The most efficient way to transfer data is not to transfer it at all. There
are a number of utilities that can help you NOT transfer data. Some of them
are listed below.

[[kdirstat]]
kdirstat
~~~~~~~~
The elegant, open source http://kdirstat.sourceforge.net/[kdirstat] (and its
ports to MacOSX, http://www.derlien.com/[Disk Inventory X], and Windows,
http://windirstat.info/[Windirstat]) is a quick way to visualize what's
taking up space on your disk, so you can either exclude unwanted data from
the copy or delete it to make more space. All of these are fully native GUI
applications that show disk space utilization by file type and directory
structure.

image:kdirstat-main.png[kdirstat screenshot]
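If you are on a remote server without a GUI, a rough command-line
approximation of the same information is the following sketch (the starting
path and depth are placeholders to adjust to taste):

-------------------------------------------------------------------------------
# summarize disk usage (in KB) two levels deep, biggest directories first;
# the '~' starting point and '--max-depth=2' are only illustrative choices
$ du -xk --max-depth=2 ~ | sort -rn | head -20
-------------------------------------------------------------------------------

The '-x' keeps du from wandering onto other filesystems, which is usually
what you want when deciding what to copy.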
[[rsync]]
rsync
~~~~~
http://samba.anu.edu.au/rsync[rsync], from the fertile mind of Andrew
(http://us6.samba.org/samba/[samba]) Tridgell, is a protocol that can recurse
thru a directory tree, create 'rolling checksums' for the data it encounters,
and send only the changed data over the network to the remote rsync. For
example, if you had recently added a song to your 120 GB MP3 collection and
you wanted to refresh the collection on your backup machine, instead of
sending the entire collection over the network, rsync would detect and send
only the new songs.

The first time rsync is used to transfer a directory tree, there will be no
speedup:

-------------------------------------------------------------------
$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/6vxd7_10_2.pdf
FF/Advanced_Networking_SDSC_Feb_1_minutes_HJM_fw.doc
FF/Amazon Logitech $30 MIR MX Revolution mouse.pdf
FF/Atbatt.com_receipt.gif
FF/BAG_bicycle_advisory_group.letter.doc
FF/BAG_bicycle_advisory_group.letter.odt
...
sent 355001628 bytes  received 10070 bytes  11270212.63 bytes/sec
total size is 354923169  speedup is 1.00
-------------------------------------------------------------------

but a few minutes later, after adding 'danish_wind_industry.html' to the
'FF' directory:

-------------------------------------------------------------------
$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/danish_wind_industry.html
sent 63294 bytes  received 48 bytes  126684.00 bytes/sec
total size is 354971578  speedup is 5604.05
-------------------------------------------------------------------

So the synchronization has a speedup of about 5600-fold. Even more
efficiently, if you had a huge database to back up and you had recently
modified it so that most of the bits were identical, rsync would send only
the blocks that contained the differences. Here's a modest example using a
small binary database file:

-------------------------------------------------------------------
$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db
sent 13580195 bytes  received 42 bytes  9053491.33 bytes/sec
total size is 13578416  speedup is 1.00
-------------------------------------------------------------------

After the transfer, I update the database and rsync it again:

-------------------------------------------------------------------
$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db
sent 632641 bytes  received 22182 bytes  1309646.00 bytes/sec
total size is 13614982  speedup is 20.79
-------------------------------------------------------------------

There are many utilities based on rsync that synchronize data on the two
sides of a connection by transmitting only the differences; the backup
utility http://backuppc.sf.net[BackupPC] is one. The open source rsync is
included by default with almost all Linux distributions as well as Mac OSX.
Versions of rsync exist for Windows as well, via http://www.cygwin.com[Cygwin]
and http://www.aboutmyip.com/AboutMyXApp/DeltaCopy.jsp[DeltaCopy].
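Since rsync can compress in transit ('-z') and skip files that don't need to
travel ('--exclude'), it combines naturally with the advice in the previous
sections. A minimal sketch (the host 'moo' is from the examples above; the
'*.iso' pattern is just a placeholder):

-------------------------------------------------------------------
# compress on the wire and leave bulky disc images out of the copy;
# the exclude pattern is illustrative only
$ rsync -avz --exclude='*.iso' ~/FF moo:~
-------------------------------------------------------------------

Remember that '-z' costs CPU on both ends and buys very little for data that
is already compressed.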
[[unison]]
Unison
~~~~~~
http://www.cis.upenn.edu/~bcpierce/unison/[Unison] is a slightly different
take on transmitting only changes. It uses a bi-directional sync algorithm
to 'unify' filesystems across a network. Native versions exist for Windows
as well as Linux/Unix, and it is usually available from the standard Linux
repositories. On an Ubuntu or Debian machine, installing it requires only:

-------------------------------------------------------------------------------
$ sudo apt-get install unison
-------------------------------------------------------------------------------

[[streaming]]
Streaming Data transfer
-----------------------

[[gridftp]]
GridFTP
~~~~~~~
If you and your colleagues have to transfer data in the range of multiple
GBs and you have to do it regularly, it's probably worth setting up a
http://en.wikipedia.org/wiki/GridFTP[GridFTP] site. Because it allows
multipoint, multi-stream TCP connections, it can transfer data at multiple
GB/s. However, it's beyond the scope of this simple doc to describe its
setup and use, so if this sounds useful, bother your local network
guru/sysadmin.

[[bbcp]]
bbcp
~~~~
http://www.slac.stanford.edu/~abh/bbcp/[bbcp] is a very similar utility to
bbftp (see below), with the exception that it does not require a server
running on the remote side. In this it behaves much more like 'scp': data
transfer requires only user-executable copies of bbcp on both sides of the
connection. Compile and install the code as follows:

-------------------------------------------------------------------------------
curl http://www.slac.stanford.edu/~abh/bbcp/bbcp.tar.Z | tar -xZf -
cd bbcp
# edit the Makefile to change line 18 to:  LIBZ = /usr/lib/libz.a
make
# there is no *install* stanza in the distributed 'Makefile'
cp bin/your_arch/bbcp ~/bin  # if that's where you store your personal bins
hash -r                      # or 'rehash' if using csh
# bbcp is now ready to use
-------------------------------------------------------------------------------

'bbcp' can act very much like 'scp' for simple usage:

-------------------------------------------------------------------------------
$ time bbcp file.633M user@remotehost.subnet.uci.edu:/high/perf/raid/file

real    0m9.023s
-------------------------------------------------------------------------------

The 633MB file transferred in under 10s, giving >63MB/s on a Gb net. Note
that this is over our very fast internal campus backbone.

That's pretty good, but the transfer rate is sensitive to a number of things
and can be tuned considerably. If you look at
http://www.slac.stanford.edu/~abh/bbcp/[all the bbcp options], it's obvious
that 'bbcp' was written to handle lots of exceptions. If you increase the
number of streams (-s) from the default of 4 (used above), you can squeeze a
bit more bandwidth out of it:

-------------------------------------------------------------------------------
$ time bbcp -P 2 -w 2M -s 10 file.633M user@remotehost.subnet.uci.edu:/hiperstore/userdata/hjm/junk.pos
bbcp: Source I/O buffers (61440K) > 25% of available free memory (202708K); copy may be slow
bbcp: Creating /hiperstore/userdata/hjm/junk.pos
bbcp: At 081205 14:07:16 copy 23% complete; 97912.4 KB/s
bbcp: At 081205 14:07:18 copy 49% complete; 86859.7 KB/s
bbcp: At 081205 14:07:20 copy 78% complete; 87875.0 KB/s

real    0m8.467s
-------------------------------------------------------------------------------

or about 75MB/s, which is pretty good.
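If you settle on a combination of options that works well on your network,
it may be worth wrapping it in a small shell function so the same tuning is
applied every time. A minimal sketch ('bbpush' is a made-up name for
illustration):

-------------------------------------------------------------------------------
# put this in ~/.bashrc; the name 'bbpush' is only a placeholder
bbpush () {
    # progress every 2s, 2MB window, 10 streams -- the settings used above
    bbcp -P 2 -w 2M -s 10 "$@"
}
# usage: bbpush file.633M user@remotehost.subnet.uci.edu:/high/perf/raid/
-------------------------------------------------------------------------------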
Even traversing the CENIC net from UCI to SDSC is fairly good:

-------------------------------------------------------------------------------
$ time bbcp -P 2 -w 2M -s 10 file.633M user@machine.sdsc.edu:~/test.file
bbcp: Source I/O buffers (61440K) > 25% of available free memory (200268K); copy may be slow
bbcp: Creating ./test.file
bbcp: At 081205 14:24:28 copy 3% complete; 23009.8 KB/s
bbcp: At 081205 14:24:30 copy 11% complete; 22767.8 KB/s
bbcp: At 081205 14:24:32 copy 20% complete; 25707.1 KB/s
bbcp: At 081205 14:24:34 copy 33% complete; 29374.4 KB/s
bbcp: At 081205 14:24:36 copy 41% complete; 28721.4 KB/s
bbcp: At 081205 14:24:38 copy 52% complete; 29320.0 KB/s
bbcp: At 081205 14:24:40 copy 61% complete; 29318.4 KB/s
bbcp: At 081205 14:24:42 copy 72% complete; 29824.6 KB/s
bbcp: At 081205 14:24:44 copy 81% complete; 29467.3 KB/s
bbcp: At 081205 14:24:46 copy 89% complete; 29225.5 KB/s
bbcp: At 081205 14:24:48 copy 96% complete; 28454.3 KB/s

real    0m26.965s
-------------------------------------------------------------------------------

or almost 30MB/s.

When making the above tests, I noticed that the disks being read from and
written to can have a large effect on the transfer rate. If the data is not
(or cannot be) cached in RAM, it will eventually have to be read from or
written to disk, and depending on the storage system, this may slow the
transfer if the disk I/O cannot keep up with the network. On the systems
used above, I saw this effect when I transferred the data to the /home
partition (on a slow IDE disk; see below) rather than to the
higher-performance RAID system used above:

-------------------------------------------------------------------------------
$ time bbcp -P 2 file.633M user@remotehost.subnet.uci.edu:/home/user/nother.big.file
bbcp: Creating /home/user/nother.big.file
bbcp: At 081205 13:59:57 copy 19% complete; 76545.0 KB/s
bbcp: At 081205 13:59:59 copy 43% complete; 75107.7 KB/s
bbcp: At 081205 14:00:01 copy 58% complete; 64599.1 KB/s
bbcp: At 081205 14:00:03 copy 59% complete; 48997.5 KB/s
bbcp: At 081205 14:00:05 copy 61% complete; 39994.1 KB/s
bbcp: At 081205 14:00:07 copy 64% complete; 34459.0 KB/s
bbcp: At 081205 14:00:09 copy 66% complete; 30397.3 KB/s
bbcp: At 081205 14:00:11 copy 69% complete; 27536.1 KB/s
bbcp: At 081205 14:00:13 copy 71% complete; 25206.3 KB/s
bbcp: At 081205 14:00:15 copy 72% complete; 23011.2 KB/s
bbcp: At 081205 14:00:17 copy 74% complete; 21472.9 KB/s
bbcp: At 081205 14:00:19 copy 77% complete; 20206.7 KB/s
bbcp: At 081205 14:00:21 copy 79% complete; 19188.7 KB/s
bbcp: At 081205 14:00:23 copy 81% complete; 18376.6 KB/s
bbcp: At 081205 14:00:25 copy 83% complete; 17447.1 KB/s
bbcp: At 081205 14:00:27 copy 84% complete; 16572.5 KB/s
bbcp: At 081205 14:00:29 copy 86% complete; 15929.9 KB/s
bbcp: At 081205 14:00:31 copy 88% complete; 15449.6 KB/s
bbcp: At 081205 14:00:33 copy 91% complete; 15039.3 KB/s
bbcp: At 081205 14:00:35 copy 93% complete; 14616.6 KB/s
bbcp: At 081205 14:00:37 copy 95% complete; 14278.2 KB/s
bbcp: At 081205 14:00:39 copy 98% complete; 13982.9 KB/s

real    0m46.103s
-------------------------------------------------------------------------------

You can see how the transfer rate decays as it approaches the write capacity
of the /home disk.
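A quick way to check whether the destination disk, rather than the network,
is the bottleneck is to measure its raw write speed directly. A rough sketch
(the path and size are placeholders; 'oflag=direct' assumes GNU dd on Linux):

-------------------------------------------------------------------------------
# write ~1GB of zeros to the target filesystem, bypassing the page cache,
# then clean up; dd reports the sustained MB/s when it finishes
$ dd if=/dev/zero of=/home/user/dd.test bs=1M count=1000 oflag=direct
$ rm /home/user/dd.test
-------------------------------------------------------------------------------

If the rate dd reports is well below what the network can deliver, no
transfer tool will sustain more than that for long.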
[[bbftp]]
bbftp
~~~~~
http://doc.in2p3.fr/bbftp/[bbftp] is a modification of the FTP protocol that
enables you to open multiple simultaneous TCP streams to transfer data. This
sometimes lets you bypass per-connection TCP restrictions imposed by badly
configured intervening machines. Short of access to a GridFTP site, this
appears to be the best single-node method for transferring data.

In order to use it, you'll need a bbftp client and server. Most places that
receive large amounts of data (SDSC, NCAR, other supercomputer centers,
TeraGrid nodes) will already have a bbftp server running, but you can also
compile and run the server yourself. The more usual case is to run only the
client. It builds very easily on Linux with just the typical 'curl/untar,
cd, ./configure, make, make install' dance:

-------------------------------------------------------------------------------
$ curl http://doc.in2p3.fr/bbftp/dist/bbftp-client-3.2.0.tar.gz | tar -xzvf -
$ cd bbftp-client-3.2.0/bbftpc/
$ ./configure --prefix=/usr/local
$ make -j3
$ sudo make install
-------------------------------------------------------------------------------

Using bbftp is more complicated than the usual ftp client because it has its
own syntax. To send data to a server:

-------------------------------------------------------------------------------
$ bbftp -s -e 'put file.154M /gpfs/mangalam/big.file' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : put file.154M /gpfs/mangalam/big.file
<< OK
160923648 bytes send in 7.32 secs (2.15e+04 Kbytes/sec or 168 Mbits/s)

The arguments mean:
  -s    use ssh encryption
  -e    'local command' (the control command to execute; here, the put)
  -E    'remote command' (not used above, but often used to cd on the remote system)
  -u    login name
  -p #  use # parallel TCP streams
  -V    be verbose
-------------------------------------------------------------------------------

The data was sent at 21MB/s to SDSC thru 10 parallel TCP streams (well below
the peak bandwidth of about 90MB/s on a Gb network).

To get data from a server:

-------------------------------------------------------------------------------
$ bbftp -s -e 'get /gpfs/mangalam/big.file from.sdsc' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : get /gpfs/mangalam/big.file from.sdsc
<< OK
160923648 bytes got in 3.46 secs (4.54e+04 Kbytes/sec or 354 Mbits/s)
-------------------------------------------------------------------------------

I was able to 'get' the data at 45MB/s, about half of the theoretical
maximum. As a comparison, because the remote receiver is running an old
(2.4) kernel which does not handle dynamic TCP window scaling, scp is only
able to manage 2.2MB/s to this server:

-------------------------------------------------------------------------------
$ scp file.154M mangalam@tg-login1.sdsc.teragrid.org:/gpfs/mangalam/junk
Password:
file.154M                                     100%  153MB   2.2MB/s   01:10
-------------------------------------------------------------------------------
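If you transfer to the same server regularly, bbftp can also read its
control commands from a file instead of the command line (via '-i', if your
build supports it); a sketch with placeholder file and path names:

-------------------------------------------------------------------------------
$ cat bbftp.cmds            # 'bbftp.cmds' is a made-up name
cd /gpfs/mangalam
put file.154M big.file
$ bbftp -s -i bbftp.cmds -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
-------------------------------------------------------------------------------

Check 'bbftp -h' or the man page for the control commands your version
accepts.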
[[netcat]]
netcat
~~~~~~
http://netcat.sourceforge.net/[netcat] (aka 'nc') is installed by default on
most Linux and MacOSX systems. It provides a way of opening TCP or UDP
network connections between nodes, acting as an open pipe thru which you can
send any data as fast as the connection will allow, imposing no additional
protocol load on the transfer. Because of its widespread availability and
its speed, it can be used to transmit data between 2 points relatively
quickly, especially if the data doesn't need to be encrypted or compressed
(or if it already is). However, to use netcat, you have to have login privs
on both ends of the connection, and you need to explicitly set up a sender
that waits for a connection request on a specific port from the receiver.
This is less convenient than simply initiating an 'scp' or 'rsync'
connection from one end, but may be worth the effort if the data transfer is
very large. To monitor the transfer, you also have to use something like
'pv' (pipeviewer); netcat itself is quite laconic.

How it works: on the sending end, you set up a listening port:

-------------------------------------------------------------------------------
[send_host] $ pv -pet honkin.big.file | nc -q 1 -l -p 1234
-------------------------------------------------------------------------------

This sends 'honkin.big.file' thru 'pv -pet', which will display progress,
ETA, and time taken. The command will hang, listening (-l) for a connection
from the other end. The '-q 1' option tells the sender to wait 1s after
getting the EOF and then quit.

On the receiving end, you connect to the nc listener:

-------------------------------------------------------------------------------
[receive_host] $ nc host.domain.uci.edu 1234 | pv -b > honkin.big.file
-------------------------------------------------------------------------------

(Note: there is no '-p' to indicate the port on the receiving side.) The
'-b' option to 'pv' shows only the bytes received. Once the receive_host
command is initiated, the transfer starts, as can be seen by the pv output
on the sending side and the byte count on the receiving side. When it
finishes, both sides terminate the connection 1s after getting the EOF.

This arrangement is slightly arcane, but it follows the Unix tools
philosophy of chaining small tools together to perform a task. While the
above example shows the case of a single large file, it can be modified only
slightly to do recursive transfers using tar, shown here recursively copying
the local 'sge' directory to the remote host:

-------------------------------------------------------------------------------
[send_host] $ tar -czvf - sge | nc -q 1 -l -p 1234
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
[receive_host] $ nc host.domain.uci.edu 1234 | tar -xzvf -
-------------------------------------------------------------------------------

In this case, I've added the verbose flag (-v) to the tar command, so using
'pv' would be redundant. It also uses tar's built-in gzip flag (-z) to
compress the data as it transmits. You could also bundle the two ends
together in a script, using ssh to execute the remote command.

Latest version of this Document
-------------------------------
The latest version of this document should always be
http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html[here].