Data Transfer

We describe typical tools for transferring small to large amounts of data from one host to another.

Small transfers

SMB / SFTP / FileZilla / Cyberduck

  • User-friendly graphical interface instead of command line
  • Cross-platform

The easiest solution is to mount the groupshare on your computer and copy the files manually with drag-and-drop. This is only recommended for transfers of up to several gigabytes.
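If you prefer the terminal, the same share can also be reached with the stock sftp client. A minimal sketch, using the login host from the rsync example below; user, group and file names are placeholders (put uploads a file, get downloads one):

sftp dphysuser@login.phys.ethz.ch
sftp> cd /home/groupname/subfolder
sftp> put localfile.dat
sftp> get remotefile.dat
sftp> exit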

rsync

  • Best all-rounder solution that works for small and large transfers.
  • Scans the files in the target location and only copies what is not yet present.
  • Allows resuming a transfer after an interruption, without having to re-copy all files.

Typical usage:

rsync -avP /path/to/local/folder/ dphysuser@login.phys.ethz.ch:/home/groupname/subfolder/

See man rsync for full documentation of all available options.
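The -P flag is shorthand for --partial --progress, so an interrupted transfer can simply be restarted with the same command and rsync will pick up where it left off. To preview what would be copied before transferring anything, add -n for a dry run; the paths below are the same placeholders as above:

rsync -avPn /path/to/local/folder/ dphysuser@login.phys.ethz.ch:/home/groupname/subfolder/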

Large transfers

Globus Online & GridFTP

  • Best solution to transfer several terabytes of data
  • Uses the high-performance data transfer protocol GridFTP
  • May not be supported by all universities.
    • CSCS supports the command line (GridFTP with SSH authentication)
    • CSCS supports Globus.org, endpoint: CSCS Globus Online Endpoint

via Globus web interface

  • Create an account and log in at Globus.org
  • Select File Transfer
  • Use our endpoint
    • D-PHYS ETH Zurich

via command line

Usage of globus-url-copy to copy data from CSCS to D-PHYS:

ssh <cscsuser>@ela.cscs.ch

globus-url-copy -rst -cd -r -p 4 -cc 4 \
  sshftp://<cscsuser>@gridftp.cscs.ch/path/to/folder/ \
  sshftp://<dphysuser>@login.phys.ethz.ch/home/<groupshare>/path/to/subfolder/

Further reading: Documentation by CSCS, Parameter descriptions

multirsync

  • GitHub
  • Spawns multiple rsync processes (one per subfolder) for faster transfers; a rough sketch of the idea is shown after this list
  • Speedup depends on folder structure
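The core idea can be sketched with standard tools alone. The following is an illustration, not the multirsync tool itself: it starts one rsync per top-level subfolder, at most four at a time, using the placeholder paths from the rsync example above. Files sitting directly in the top-level folder are not covered by this sketch.

cd /path/to/local/folder/
ls -d */ | xargs -P 4 -I {} \
  rsync -avP {} dphysuser@login.phys.ethz.ch:/home/groupname/subfolder/{}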

General Advice

Regular housekeeping

Delete what is no longer needed. Document folder structure and file locations for your future self and others.

Avoid too many files in the same directory

Don't store thousands of files in a single folder. This makes listing the contents of the folder very slow. Create a folder hierarchy to split the files up for faster access.
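As an illustration, files named after a numeric ID (hypothetical names) can be grouped into subfolders by the first two digits of that ID. A minimal bash sketch:

for f in result_*.dat; do
  id=${f#result_}          # strip the "result_" prefix
  dir=${id:0:2}            # first two digits of the ID
  mkdir -p "$dir" && mv "$f" "$dir/"
done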

Combine small files into single tar

Don't store your data in thousands of files of only a few kilobytes each. Even the smallest file allocates at least one filesystem block and therefore wastes disk space. Working with many small files also incurs a large overhead of system calls and disk seek operations. Combine such results into larger files using tar (with or without compression).
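For example, a folder of small result files can be packed into a single compressed archive, and individual files can still be listed or extracted later (folder and file names are placeholders):

tar czf results.tar.gz results/              # pack the whole folder, compressed
tar tzf results.tar.gz                       # list the archive contents
tar xzf results.tar.gz results/run_042.dat   # extract a single file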

Prefer binary formats to plain text

Consider optimized binary formats (e.g. HDF5) to store your data - but make sure they are open and well documented so you can still read your data in five years' time. They use less disk space and allow faster input/output than plain text files. Common formats have libraries for most programming languages to ease read and write operations.
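With the HDF5 command-line tools installed (typically packaged as hdf5-tools), a file can be inspected without writing any code; the file name is a placeholder:

h5ls -r data.h5     # recursively list all groups and datasets
h5dump -H data.h5   # print the metadata only, without the raw data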