This page describes some typical tools for transferring small to very large amounts of data from one host to another.
SMB / SFTP / FileZilla / Cyberduck
- User-friendly graphical interface instead of the command line
The easiest solution is to mount the groupshare on one computer and copy the files manually with drag and drop. This is recommended only for transfers of up to several gigabytes.
rsync
- Best all-rounder: works for small and large transfers
- Scans the files at the target location and copies only what is not yet present
- Allows resuming an interrupted transfer without re-copying all files
rsync -avP /path/to/local/folder/ email@example.com:/home/groupname/subfolder/
See man rsync for full documentation of all available options.
Globus Online & GridFTP
- Best solution to transfer several terabytes of data
- Uses the high-performance data transfer protocol GridFTP
- May not be supported by all universities.
- CSCS supports the command line (GridFTP with SSH authentication)
- CSCS supports Globus.org; endpoint: CSCS Globus Online Endpoint
via Globus web interface
- Create an account and log in at Globus.org
- Use our endpoint: D-PHYS ETH Zurich
via command line
Use globus-url-copy to copy data from CSCS to D-PHYS:
ssh <cscsuser>@ela.cscs.ch globus-url-copy -rst -cd -r -p 4 -cc 4 \
  sshftp://<cscsuser>@gridftp.cscs.ch/path/to/folder/ \
  sshftp://<dphysuser>@login.phys.ethz.ch/home/<groupshare>/path/to/subfolder/
Parallel rsync
- Spawns multiple rsync processes (one per subfolder) for faster transfers
- The speedup depends on the folder structure
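The per-subfolder approach can be sketched with standard tools. All folder names below are made up for the example, and cp -r stands in for rsync so the sketch runs locally; in a real transfer you would replace it with something like rsync -a "$dir" user@host:/target/.

```shell
# Sketch: one copy process per top-level subfolder, up to 4 in parallel.
# Folder names are made up; 'cp -r' stands in for rsync so this runs locally.
mkdir -p source/run1 source/run2 source/run3 target
echo data1 > source/run1/out.dat
echo data2 > source/run2/out.dat
echo data3 > source/run3/out.dat

# List the top-level subfolders and hand each one to its own copy process:
find source -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -P4 -I{} cp -r {} target/
```

Note that the parallelism (-P4) only helps if the data is spread over several subfolders of comparable size; one huge subfolder still dominates the transfer time.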
Clean up and document
Delete what is no longer needed. Document the folder structure and file locations for your future self and others.
Avoid too many files in the same directory
Don't store thousands of files in a single folder; this makes listing the folder contents very slow. Create a folder hierarchy to split the files up for faster access.
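One common way to build such a hierarchy is to bucket files by a prefix of their name, a minimal sketch with made-up file names:

```shell
# Sketch: split a flat folder with many files into subfolders keyed by a
# prefix of the file name (all names here are made up for the example).
mkdir -p flat
for i in $(seq -w 1 200); do echo x > "flat/sample_$i.dat"; done

for f in flat/*.dat; do
    name=$(basename "$f")
    prefix=$(printf '%s' "$name" | cut -c1-9)   # 'sample_00' ... 'sample_20'
    mkdir -p "sorted/$prefix"
    mv "$f" "sorted/$prefix/"
done
```

Any stable key works (name prefix, date, run number); the point is to keep each directory down to at most a few hundred entries.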
Combine small files into a single tar archive
Don't store your data in thousands of files of only a few kilobytes each. Even the smallest file allocates at least one filesystem block and therefore wastes disk space. Working with many small files also incurs a large overhead of system calls and disk seek operations. Combine such results into larger files using tar (with or without compression).
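A minimal sketch of the workflow, with made-up file names; single members can still be listed or extracted later without unpacking the whole archive:

```shell
# Sketch: pack many small result files into one compressed tar archive
# (file and folder names are made up for the example).
mkdir -p results
for i in $(seq 1 100); do echo "value $i" > "results/point_$i.txt"; done

tar -czf results.tar.gz results/     # one archive instead of 100 tiny files
rm -r results                        # keep only the archive

tar -tzf results.tar.gz | head -n 3           # list members without unpacking
tar -xzf results.tar.gz results/point_42.txt  # extract a single member
```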
Prefer binary formats to plain text
Consider optimized binary formats (e.g. HDF5) to store your data, but make sure they are open and well documented so you can still read your data in five years' time. Binary formats use less disk space and allow faster input/output than plain text files. Common formats have libraries for most programming languages to ease read and write operations.