Mar 15, 2016
If you already know you want it, get it here: parsync+utils.tar.gz (contains parsync plus the kdirstat-cache-writer, stats, and scut utilities below)
Extract it into a dir on your $PATH and after verifying the other dependencies below, give it a shot.
parsync requires the following utilities to work:
ethtool - std Linux tool for probing ethernet interfaces. Install from repositories.
iwconfig - std Linux tool for probing wireless interfaces. Install from repositories.
ifstat - std Linux tool for extracting metrics from network interfaces. Install from repositories.
stats - self-writ Perl utility for providing descriptive stats on STDIN
scut - self-writ Perl utility like cut that allows regex split tokens
kdirstat-cache-writer (included in the tarball mentioned above), requires a non-default Perl utility: URI::Escape qw(uri_escape)
sudo yum install perl-URI # CentOS-like sudo apt-get install liburi-perl # Debian-like
parsync needs to be installed only on the SOURCE end of the transfer and uses whatever rsync is available on the TARGET. It uses a number of Linux- specific utilities so if you’re transferring between Linux and a FreeBSD host, install parsync on the Linux side. In fact, as currently written, it will only PUSH data to remote targets; it will not pull data as rsync itself can do. This will probably in the near future.
rsync is a fabulous data mover. Possibly more bytes have been moved (or have been prevented from being moved) by rsync than by any other application.
So what’s not to love?
For transferring large, deep file trees, rsync will pause while it generates lists of files to process. Since Version 3, it does this pretty fast, but on sluggish filesystems, it can take hours or even days before it will start to actually exchange rsync data.
Second, due to various bottlenecks, rsync will tend to use less than the available bandwidth on high speed networks. Starting multiple instances of rsync can improve this significantly. However, on such transfers, it is also easy to overload the available bandwidth, so it would be nice to both limit the bandwidth used if necessary and also to limit the load on the system.
parsync tries to satisfy all these conditions and more by:
allowing re-use of the cache files so generated.
doing crude loadbalancing of the number of active rsyncs, suspending and un-suspending the processes as necessary.
using rsync’s own bandwidth limiter (--bwlimit) to throttle the total bandwidth.
using rsync’s own vast option selection is available as a pass-thru (tho limited to those compatible with the --files-from option).
Only use for LARGE data transfers
The main use case for parsync is really only very large data transfers thru fairly fast network connections (>1Gb/s). Below this speed, a single rsync can saturate the connection, so there’s little reason to use parsync and in fact the overhead of testing the existence of and starting more rsyncs tends to worsen its performance on small transfers to slightly less than rsync alone.
Beyond this introduction, parsync’s internal help is about all you’ll need to figure out how to use it; below is what you’ll see when you type parsync -h. At version 1.5x (beta), there are still edge cases where parsync will fail or behave oddly, especially with small data transfers, so I’d be happy to hear of such misbehavior or suggestions to improve it.
Download the complete tarball of parsync, plus the required utilities (minus ifstat) here:
parsync+utils.tar.gz Unpack it, move the contents to a dir on your $PATH, chmod it executable, and try it out.
NB: parsync has a number of dependencies listed above.
Below is what you should see:
4. parsync help
parsync version 1.5 (beta) 03-15-2016 by Harry Mangalam <firstname.lastname@example.org> || <email@example.com> parsync is a Perl script that wraps Andrew Tridgell's miraculous 'rsync' to provide some load balancing and parallel operation across network connections to increase the amount of bandwidth it can use. parsync needs to be installed only on the SOURCE end of the transfer and only works in local SOURCE -> remote TARGET mode (it won't allow remote TARGET -> local SOURCE, emitting an error and exiting if attempted). It uses whatever rsync is available on the TARGET. It uses a number of Linux-specific utilities so if you're transferring between Linux and a FreeBSD host, install parsync on the Linux side. The only native rsync option that parsync uses is '-a (archive). If you need more, then it's up to you to provide them via '--rsyncopts'. parsync checks to see if the current system load is too heavy and tries to throttle the rsyncs during the run by monitoring and suspending / continuing them as needed. It uses the very efficient (also Perl-based) kdirstat-cache-writer from kdirstat to generate lists of files which are summed and then crudely divided into NP jobs by size. It appropriates rsync's bandwidth throttle mechanism, using '--maxbw' as a passthru to rsync's 'bwlimit' option, but divides it by NP so as to keep the total bw the same as the stated limit. It monitors and shows network bandwidth, but can't change the bw allocation mid-job. It can only suspend rsyncs until the load decreases below the cutoff. If you suspend parsync (^Z), all rsync children will suspend as well, regardless of current state. Unless changed by '--interface', it tried to figure out how to set the interface to monitor. The transfer will use whatever interface routing provides, normally set by the name of the target. It can also be used for non-host-based transfers (between mounted filesystems) but the network bandwidth continues to be (pointlessly) shown. [[NB: Between mounted filesystems, parsync sometimes works very poorly for reasons still mysterious. In such cases (monitor with 'ifstat'), use 'cp' for the initial data movement and a single rsync to finalize. I believe the multiple rsync chatter is interfering with the transfer.]] It only works on dirs and files that originate from the current dir (or specified via "--rootdir"). You cannot include dirs and files from discontinuous or higher-level dirs. ** the ~/.parsync files ** The ~/.parsync dir contains the cache (*.gz), the chunk files (kds*), and the time-stamped log files. The cache files can be re-used with '--reusecache' (which will re-use ALL the cache and chunk files. The log files are datestamped and are not NOT overwritten. ** Odd characters in names ** parsync will refuse to transfer some oddly named files. Filenames with embedded newlines, DOS EOLs, and some other odd chars will be recorded in the log files in the ~/.parsync dir. OPTIONS ======= [i] = integer number [f] = floating point number [s] = "quoted string" ( ) = the default if any --NP [i] (sqrt(#CPUs)) ................ number of rsync processes to start optimal NP depends on many vars. Try the default and incr as needed --startdir [s] (`pwd`) ................ the directory it works relative to --maxbw [i] (unlimited) .......... in KB/s max bandwidth to use (--bwlimit passthru to rsync). maxbw is the total BW to be used, NOT per rsync. --maxload [f] (NP+2) ........ max total system load - if sysload > maxload, sleeps an rsync proc for 10s --rsyncopts [s] ... options passed to rsync as a quoted string (CAREFUL!) this opt triggers a pause before executing to verify the command. --interface [s] ............. network interface to /monitor/, not nec use. default: `/sbin/route -n | grep "^0.0.0.0" | rev | cut -d' ' -f1 | rev` above works on most simple hosts, but complex routes will confuse it. --reusecache .......... don't re-read the dirs; re-use the existing caches --email [s] ..................... email address to send completion message (requires working mail system on host) --barefiles ..... set to allow rsync of individual files, as oppo to dirs --nowait ................ for scripting, sleep for a few s instead of wait --version ................................. dumps version string and exits --help ......................................................... this help Examples ======== -- Good example 1 -- % parsync --maxload=5.5 --NP=4 --startdir='/home/hjm' dir1 dir2 dir3 hjm@remotehost:~/backups where = "--startdir='/home/hjm'" sets the working dir of this operation to '/home/hjm' and dir1 dir2 dir3 are subdirs from '/home/hjm' = the target "hjm@remotehost:~/backups" is the same target rsync would use = "--NP=4" forks 4 instances of rsync = -"-maxload=5.5" will start suspending rsync instances when the 5m system load gets to 5.5 and then unsuspending them when it goes below it. It uses 4 instances to rsync dir1 dir2 dir3 to hjm@remotehost:~/backups -- Good example 2 -- % parsync --rsyncopts="--ignore-existing" --reusecache --NP=3 --barefiles *.txt /mount/backups/txt where = "--rsyncopts='--ignore-existing'" is an option passed thru to rsync telling it not to disturb any existing files in the target directory. = "--reusecache" indicates that the filecache shouldn't be re-generated, uses the previous filecache in ~/.parsync = "--NP=3" for 3 copies of rsync (with no "--maxload", the default is 4) = "--barefiles" indicates that it's OK to transfer barefiles instead of recursing thru dirs. = "/mount/backups/txt" is the target - a local disk mount instead of a network host. It uses 3 instances to rsync *.txt from the current dir to "/mount/backups/txt". -- Error Example 1 -- % pwd /home/hjm # executing parsync from here % parsync --NP4 --compress /usr/local /media/backupdisk why this is an error: = '--NP4' is not an option (parsync will say "Unknown option: np4") It should be '--NP=4' = if you were trying to rsync '/usr/local' to '/media/backupdisk', it will fail since there is no /home/hjm/usr/local dir to use as a source. This will be shown in the log files in ~/.parsync/rsync-logfile-<datestamp>_# as a spew of "No such file or directory (2)" errors = the '--compress' is a native rsync option, not a native parsync option. You have to pass it to rsync with "--rsyncopts='--compress'" The correct version of the above command is: % parsync --NP=4 --rsyncopts='--compress' --startdir=/usr local /media/backupdisk -- Error Example 2 -- % parsync --start-dir /home/hjm mooslocal firstname.lastname@example.org:/usr/local why this is an error: = this command is trying to PULL data from a remote SOURCE to a local TARGET. parsync doesn't support that kind of operation yet. The correct version of the above command is: # ssh to hjm@moo, install parsync, then: % parsync --startdir=/usr local hjm@remote:/home/hjm/mooslocal