Data Transferring

The bulk of this guide to transferring data to Rosalind concentrates on linux systems using a command line interface; much of it also applies to the mac command line. Graphical clients are available for linux, mac and windows operating systems. There are too many to cover in any detail and on the whole they should be intuitive to use, so rather than provide a step-by-step guide to each, I'll try to provide the connection information that users will need to connect to Rosalind with them.

For larger datasets on linux or mac operating systems (that is, data that are too large to store on a desktop system), I'd recommend using rsync from the command line, but that is by no means a hard and fast rule and you should feel free to use another tool if you prefer.

Transferring with the linux command line

What NOT to transfer with

Use of scp should be limited. When transferring multiple files or recursively copying directory trees (e.g. with scp -r), scp will often launch multiple threads. For file transfers over a network this gives little gain and in some cases can actually increase the overall transfer time. On a shared system it can also slow down the nodes at both ends to the extent that other users wait several seconds for even a simple directory listing.

rsync

Of the many ways to get your data from one storage system on to another in a linux environment, rsync is the command that I consider to be the most reliable and versatile. The advantage of rsync is that it's easy to copy directory hierarchies and, in its usual mode of operation, it will transfer only the difference between the source and the destination. That is to say, if a file or directory is identical at both ends, rsync won't waste time transferring it again. This is very useful if you're transferring a large directory tree, because if the transfer is interrupted all you have to do is re-run the same rsync command and it will pick up where it left off. Almost every linux distribution will have rsync already installed, and if not it's usually available via your local package management system.

The rsync command is usually run with the following basic pattern:

rsync <options> <source> <destination>

If the source and destination are on the same machine that you're running the command on then you just give it the path to the directory or file which is being transferred.

If you want to transfer to or from a remote machine, the form is

user@host:/path/to/destination

By default rsync will normally connect using the ssh protocol but there are ways to connect using other protocols, see the man page for details.

To transfer a directory hierarchy with rsync, the best way is to use the archive option “-a”. Combining it with the “-v” flag will produce a more verbose output.

To copy the directory /home/alan/skinMeta to the lustre file system for example, the following incantation could be run on the source system:

rsync -av /home/alan/skinMeta k1214122@10.202.64.28:/mnt/lustre/users/k1214122/

You can view the progress by adding the --progress option:

rsync -av --progress /home/alan/skinMeta k1214122@10.202.64.28:/mnt/lustre/users/k1214122/

If you wanted to copy from the remote system to the one you're on then you more or less just have to swap the source and destination components around i.e.

rsync -av --progress k1214122@10.202.64.28:/mnt/lustre/users/k1214122/skinMeta /home/alan/ 

The above command is probably the most common use of rsync.

The trailing slash

A trailing slash on the source often makes no difference in linux commands, but with rsync using the -a option it has an important effect. Without the “/”, the command

rsync -av /home/alan/skinMeta k1214122@10.202.64.28:/mnt/lustre/users/k1214122/

will make a directory called skinMeta at the destination and everything being transferred will be put into that directory. If you leave a trailing “/” on the source e.g.:

rsync -av /home/alan/skinMeta/ k1214122@10.202.64.28:/mnt/lustre/users/k1214122/

rsync won't bother to make a skinMeta directory at the destination but will instead put the contents of the skinMeta directory directly into the /mnt/lustre/users/k1214122/ directory.
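The difference can be tried out safely on a local machine before running a real transfer. The following is a minimal sketch using temporary directories; the names without_slash and with_slash are invented for the example.

```shell
#!/bin/sh
# Sketch only: all paths are temporary and hypothetical.
work=$(mktemp -d)
mkdir -p "$work/skinMeta" "$work/without_slash" "$work/with_slash"
echo "sample" > "$work/skinMeta/sample.txt"

# No trailing slash on the source: the skinMeta directory itself is created.
rsync -a "$work/skinMeta" "$work/without_slash/"

# Trailing slash on the source: only the contents of skinMeta are copied.
rsync -a "$work/skinMeta/" "$work/with_slash/"

# List both destinations to compare the resulting layouts.
find "$work/without_slash" "$work/with_slash" -type f
```

The first destination should end up containing skinMeta/sample.txt, the second just sample.txt.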

Screen

If you're transferring the data from a remote machine and expect the transfer to last a while it may be worthwhile running a screen session in case your network connection to the remote machine is interrupted.

Checksums

Checksums are a useful way to verify the integrity of your data. Usually you can tell whether a transfer has completed by looking at the size of a file. In very rare circumstances, however, data may become silently corrupted (an individual bit, or a small number of bits, may be inverted) while the file stays the same size. This will affect some data sets more than others. If a few numbers within a data set are significant, corruption of these can be a problem, and if a bit encoding a high-order part of a binary number is flipped it can change the value of that number by orders of magnitude. Raw data like images or genomic sequences are usually not sensitive to a small number of these errors.

Checksum algorithms were developed to test for these errors. When run from the command line they usually produce a string of numbers and letters. To test whether a file transfer was successful, run the checksum program on the source file and on the transferred file and compare the resulting strings. The algorithms are designed so that even if only a single bit or a small number of bits differ between the files, the resulting strings are radically different.

On linux, common command line checksum programs include md5sum, sha1sum and sha256sum.

[k1214122@login1(rosalind) ~]$ md5sum jdk-8u65-linux-x64.tar.gz
196880a42c45ec9ab2f00868d69619c0  jdk-8u65-linux-x64.tar.gz
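The comparison of source and transferred file can be scripted rather than done by eye. The following is a minimal sketch, assuming sha256sum is available; the file names are hypothetical and a plain cp stands in for the real transfer.

```shell
#!/bin/sh
# Sketch only: the file names are hypothetical and cp stands in for the transfer.
work=$(mktemp -d)
printf 'example data\n' > "$work/source.dat"
cp "$work/source.dat" "$work/copy.dat"

# sha256sum prints "<checksum>  <file name>"; keep only the checksum field.
src_sum=$(sha256sum "$work/source.dat" | cut -d ' ' -f 1)
dst_sum=$(sha256sum "$work/copy.dat"   | cut -d ' ' -f 1)

if [ "$src_sum" = "$dst_sum" ]; then
    echo "checksums match"
else
    echo "checksums differ: the copy should be transferred again"
fi
```

For a real transfer the two checksums would be computed on different machines. One common approach is to write them to a file at the source (for example sha256sum * > checksums.sha256), transfer that file along with the data, and verify at the destination with sha256sum -c checksums.sha256.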

Compression

Non-linux systems

Mac: rsync is available, as are many graphical client programs.

Windows: There are many graphical client programs available, such as FileZilla.

Client Settings

The most important things to know are:

username: Your Rosalind login
password: Your Rosalind password
protocol: ssh, sftp or rsync
port: 22

sftp and rsync are both configured to use the ssh protocol, so port 22 is the only one used. ftp is not allowed.

storage/datatransfer.txt · Last modified: 2016/11/03 17:40 by admin