How to transfer data: Difference between revisions

From RCSWiki

Revision as of 20:33, 29 July 2020

General

Linux and MacOS

While you can find transfer programs for MacOS and Linux that have a graphical point-and-click interface, both of these operating systems usually come with command-line transfer tools pre-installed: scp, rsync, and sftp. These are powerful and convenient tools that can handle any practical data transfer to and from our compute clusters.

File transfers should not be performed on the ARC login node. Instead, transfers should be performed on the ARC DTN (Data Transfer Node). Since the ARC DTN has the same shares as ARC, any files you transfer to the DTN will also be available on ARC.

scp: Secure Copy

scp is a secure and encrypted method of transferring files between machines via SSH. It is available on Linux and Mac computers by default and can be installed on Windows by installing the OpenSSH package.

The general format for the command is:

$ scp [options] source destination
  • The source and destination fields can be a local file / directory or a remote one.
  • The local location is a normal Unix path, absolute or relative and
  • The remote location has a format username@remote.system.name:file/path.
  • The remote relative file path is relative to the home directory of the username on the remote system.

You may see all the available options with scp by viewing the man page.
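
As a small illustration of the location formats listed above, the sketch below builds one relative and one absolute remote location and prints the scp commands that would use them. The account name, host, and paths are the placeholders already used on this page, and the commands are only echoed, so nothing is actually transferred.

```shell
#!/bin/sh
# Placeholder account and host, as used elsewhere on this page.
user="username"
host="arc-dtn.ucalgary.ca"

# No leading "/" after the colon: the path is relative to the
# home directory of "username" on the remote system.
remote_relative="${user}@${host}:projects/project1/output.dat"

# Leading "/" after the colon: an absolute path on the remote system.
remote_absolute="${user}@${host}:/desired/destination"

# Build and print the commands instead of running them (no network needed).
upload_cmd="scp data.dat ${remote_absolute}"
download_cmd="scp ${remote_relative} ."
echo "$upload_cmd"
echo "$download_cmd"
```

The same location strings work unchanged as the source or destination of an rsync command.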

Example Usage

Common operations are given below. On your desktop, to:

  • Transfer a single file (e.g. data.dat) to ARC:
desktop$ scp data.dat username@arc-dtn.ucalgary.ca:/desired/destination
  • Transfer all files ending with .dat to ARC:
desktop$ scp *.dat username@arc-dtn.ucalgary.ca:/desired/destination
  • Transfer an entire directory to ARC:
desktop$ scp -r my_data_directory/ username@arc-dtn.ucalgary.ca:/desired/destination

rsync

rsync is a utility for transferring and synchronizing files efficiently. Its efficiency comes from its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination.

rsync can be used to copy files and directories locally on a system or between multiple computers via SSH. Unlike scp, rsync is designed to synchronize two locations, so a partial transfer can be restarted by re-running rsync without losing progress. Resuming a partial transfer is not possible with scp.

The general format for the command is similar to scp:

$ rsync [options] source destination
  • The source and destination fields can be a local file / directory or a remote one.
  • The local location is a normal Unix path, absolute or relative and
  • The remote location has a format username@remote.system.name:file/path.
  • The remote relative file path is relative to the home directory of the username on the remote system.

You may see all the available options with rsync by viewing the man page.
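
The restart behaviour described above can be tried out locally before using it over the network. The sketch below is only an illustration under a few assumptions: rsync is installed, and the directory and file names are made up for the demonstration. It copies a small directory, re-runs the same command to show that nothing is sent again, then changes one file and shows that only that file is transferred. Note that for purely local copies rsync sends whole files by default, so this demonstrates the skipping of up-to-date files rather than the delta-transfer algorithm itself.

```shell
#!/bin/sh
# Local demonstration: re-running rsync only transfers what is needed.
# Assumes rsync is installed; all names below are made up for the demo.
if command -v rsync >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir "$work/src"
    echo "first"  > "$work/src/a.dat"
    echo "second" > "$work/src/b.dat"

    # First run: everything is copied. The trailing "/" on src/ means
    # "the contents of src", so the files land directly in dst/.
    rsync -a "$work/src/" "$work/dst/"

    # Second run: nothing has changed, so --itemize-changes lists nothing.
    second_run=$(rsync -a --itemize-changes "$work/src/" "$work/dst/")

    # Change one file and run again: only that file is sent.
    echo "more data" >> "$work/src/a.dat"
    third_run=$(rsync -a --itemize-changes "$work/src/" "$work/dst/")
    # Itemized lines starting with ">f" are files sent to the destination.
    changed_files=$(printf '%s\n' "$third_run" | grep -c '^>f')

    echo "second run output: '${second_run}'"
    echo "files sent on third run: ${changed_files}"
    rm -rf "$work"
else
    echo "rsync is not installed; skipping the demonstration"
fi
```

The same re-run approach applies to remote transfers: repeating the original rsync command after an interruption skips the files that already arrived intact.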

Example Usage

Common operations are given below. On your desktop, to:

  • Upload a single file (e.g. data.dat) from your workstation to ARC:
    desktop$ rsync -v data.dat username@arc-dtn.ucalgary.ca:/desired/destination

  • Upload all files matching a wildcard (e.g. ending in .dat):
    desktop$ rsync -v *.dat username@arc-dtn.ucalgary.ca:/desired/destination

  • Upload an entire directory (e.g. my_data to ~/projects/project2):
    desktop$ rsync -axv my_data username@arc-dtn.ucalgary.ca:projects/project2/

  • Upload more than one directory:
    desktop$ rsync -axv my_data1 my_data2 my_data3 username@arc-dtn.ucalgary.ca:/desired/destination

  • Download one file (e.g. output.dat) from ARC to the current directory on your workstation:
    ## Note the '.' at the end of the command, which refers to the current working directory on your computer
    desktop$ rsync -v username@arc-dtn.ucalgary.ca:projects/project1/output.dat .

  • Download one directory (e.g. outputs) from ARC to the current directory on your workstation:
    desktop$ rsync -axv username@arc-dtn.ucalgary.ca:projects/project1/outputs .
    

sftp -- secure file transfer protocol


sftp is a file transfer program, similar to ftp, which performs all operations over an encrypted ssh transport. It may also use many features of ssh, such as public key authentication and compression.


sftp has an interactive mode, in which sftp understands a set of commands similar to those of ftp. Commands are case insensitive.
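
For illustration, the sketch below writes a few typical sftp commands to a batch file and prints the command that would run them with sftp's -b (batch) option. The directory and file names are placeholders, and the sftp command is only printed, not executed. Batch mode has no user interaction, so it assumes non-interactive authentication such as SSH keys.

```shell
#!/bin/sh
# Write a batch file of sftp commands.
# The directory and file names are placeholders for illustration.
batch_file=$(mktemp)
cat > "$batch_file" <<'EOF'
cd projects/project1
lcd results
get output.dat
put data.dat
ls -l
bye
EOF

# The command that would run the batch; printed here rather than executed.
sftp_cmd="sftp -b $batch_file username@arc-dtn.ucalgary.ca"
echo "$sftp_cmd"
cat "$batch_file"
```

The same commands (cd, lcd, get, put, ls, bye) can be typed one at a time at the sftp> prompt in interactive mode.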

rclone -- rsync for cloud storage

Rclone is a command line program to sync files and directories to and from a number of on-line storage services.

Windows

Newer versions of Windows 10 (1903 and up) have SSH built in as part of the OpenSSH package. See above for how to use the scp and sftp commands from the cmd.exe shell or PowerShell.

MobaXterm is the recommended tool for remote access and data transfer on Windows.

MobaXterm

MobaXterm is a one-stop solution for most remote access work on a compute cluster or a Unix / Linux server. It provides many Unix-like utilities for Windows, including an SSH client and an X11 graphics server, as well as a graphical interface for data transfer operations.

Large Data transfers

Using screen and rsync

If you want to transfer a large amount of data from a remote Unix system to ARC, you can use rsync to handle the transfer. However, you will have to keep the SSH session from your workstation alive during the entire transfer. Very often this is inconvenient or not feasible.

To overcome this, you can run the rsync transfer inside a screen session on ARC. screen creates a terminal session that runs on ARC itself and keeps running after you disconnect, so you can reattach to it from a later SSH session from your workstation.

To start the transfer:

# Login to ARC
$ ssh username@arc.ucalgary.ca

# Start a screen session
$ screen

# Start the transfer.
$ rsync -axv ext_user@external.system:path/to/remote/data  .

# Now you can disconnect from ARC. Close the lid of your laptop or turn off the computer.

To check whether the transfer has finished:

# Login to ARC
$ ssh username@arc.ucalgary.ca

# Reconnect to the screen session
$ screen -r

# If the transfer has finished, close the screen session.
$ exit

Very large files

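If the files are large and the transfer speed is low, the transfer may fail before a file has been completely transferred. Re-running rsync may not help here: by default it discards a partially transferred file when the transfer is interrupted, so that file starts again from the beginning. (rsync's --partial option is meant to keep partial files, but this has not been tested recently.)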

The solution may be to split the large file into smaller chunks, transfer them using rsync and then join them on the remote system (ARC for example):

# Large file is 506MB in this example.
$ ls -l t.bin
-rw-r--r-- 1 drozmano drozmano 530308481 Jun  8 11:06 t.bin

# split the file:
$ split -b 100M t.bin t.bin_chunk.

# Check the chunks.
$ ls -l t.bin_chunk.*
-rw-r--r-- 1 drozmano drozmano 104857600 Jun  8 11:09 t.bin_chunk.aa
-rw-r--r-- 1 drozmano drozmano 104857600 Jun  8 11:09 t.bin_chunk.ab
-rw-r--r-- 1 drozmano drozmano 104857600 Jun  8 11:09 t.bin_chunk.ac
-rw-r--r-- 1 drozmano drozmano 104857600 Jun  8 11:09 t.bin_chunk.ad
-rw-r--r-- 1 drozmano drozmano 104857600 Jun  8 11:09 t.bin_chunk.ae
-rw-r--r-- 1 drozmano drozmano   6020481 Jun  8 11:09 t.bin_chunk.af

# Transfer the files:
$ rsync -axv t.bin_chunk.* username@arc.ucalgary.ca:

Then log in to ARC and join the files:

$ cat t.bin_chunk.* > t.bin

$ ls -l 
-rw-r--r-- 1 drozmano drozmano 530308481 Jun  8 11:06 t.bin

Success.
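
Matching file sizes are a good sign, but a checksum comparison is a stronger check that the joined file is identical to the original. The sketch below is a self-contained local illustration, not a record of a real transfer: it uses a small random stand-in file (about 1 MB instead of 506 MB), a made-up chunk size, and the POSIX cksum utility. Where available, a stronger hash such as sha256sum could be substituted.

```shell
#!/bin/sh
# Self-contained demonstration with a small stand-in file.
work=$(mktemp -d)
cd "$work" || exit 1
head -c 1000000 /dev/urandom > t.bin

# Record the checksum of the original file before splitting.
sum_before=$(cksum < t.bin)

# Split into 300000-byte chunks: t.bin_chunk.aa, t.bin_chunk.ab, ...
split -b 300000 t.bin t.bin_chunk.

# (In a real transfer, the chunks would be copied with rsync at this point.)

# Join the chunks. The shell expands the wildcard in sorted order,
# so the chunks are concatenated as aa, ab, ac, ...
cat t.bin_chunk.* > joined.bin
sum_after=$(cksum < joined.bin)

if [ "$sum_before" = "$sum_after" ]; then
    echo "checksums match"
else
    echo "checksums differ" >&2
fi
```

In a real transfer, the first checksum would be computed on the workstation and the second on ARC after joining, and the two values compared by eye or in a script.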