Tuesday, January 10, 2012

s3cmd sync HowTo

The s3cmd program can transfer files to and from Amazon S3 in two basic modes:

  1. Unconditional transfer: all matching files are uploaded to S3 (put operation) or downloaded back from S3 (get operation). This is similar to the standard Unix cp command, which also copies whatever it’s told to.
  2. Conditional transfer: only files that don’t exist at the destination in the same version are transferred by the s3cmd sync command. By default the MD5 checksum and file size are compared. This is similar to the Unix rsync command, with some exceptions outlined below.

Filename handling rules and some other options are common to both methods.

Filename handling rules

Sync, get and put all support multiple arguments for source files and one argument for the destination file or directory (optional in some cases of get). The source can be a single file or a directory, and multiple sources can be used in one command. Let’s have these files in our working directory:

~/demo$ find .
file0-1.msg
file0-2.txt
file0-3.log
dir1/file1-1.txt
dir1/file1-2.txt
dir2/file2-1.log
dir2/file2-2.txt

We can, for instance, upload one of the files to S3 and give it a different name:

~/demo$ s3cmd put file0-1.msg s3://s3tools-demo/test-upload.msg
file0-1.msg -> s3://s3tools-demo/test-upload.msg  [1 of 1]

We can also upload a directory with the --recursive parameter:

~/demo$ s3cmd put --recursive dir1 s3://s3tools-demo/some/path/
dir1/file1-1.txt -> s3://s3tools-demo/some/path/dir1/file1-1.txt  [1 of 2]
dir1/file1-2.txt -> s3://s3tools-demo/some/path/dir1/file1-2.txt  [2 of 2]

With directories there is one thing to watch out for: you can either upload the directory and its contents or just the contents. It all depends on how you specify the source.

To upload a directory and keep its name on the remote side, specify the source without the trailing slash:

~/demo$ s3cmd put -r dir1 s3://s3tools-demo/some/path/
dir1/file1-1.txt -> s3://s3tools-demo/some/path/dir1/file1-1.txt  [1 of 2]
dir1/file1-2.txt -> s3://s3tools-demo/some/path/dir1/file1-2.txt  [2 of 2]

On the other hand, to upload just the contents, specify the directory with a trailing slash:

~/demo$ s3cmd put -r dir1/ s3://s3tools-demo/some/path/
dir1/file1-1.txt -> s3://s3tools-demo/some/path/file1-1.txt  [1 of 2]
dir1/file1-2.txt -> s3://s3tools-demo/some/path/file1-2.txt  [2 of 2]

Important — in both cases just the last part of the path name is taken into account. In the case of dir1 without trailing slash (which would be the same as, say, ~/demo/dir1 in our case) the last part of the path is dir1 and that’s what’s used on the remote side, appended after s3://s3…/path/ to make s3://s3…/path/dir1/….

On the other hand, dir1/ with a trailing slash (which would be the same as ~/demo/dir1/, trailing slash again) is actually similar to saying dir1/*, i.e. it expands to the list of the files in dir1. In that case the last parts of the path names are the filenames (file1-1.txt and file1-2.txt) without the dir1/ directory name. So the final S3 paths are s3://s3…/path/file1-1.txt and s3://s3…/path/file1-2.txt respectively, both without the dir1/ member in them. I hope it’s clear enough; if not, ask on the mailing list or send me a better wording ;-)
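The trailing-slash rule can be modelled in a few lines of Python. This is only an illustrative sketch of the naming rule described above, not s3cmd's actual code; the function name remote_uri and its parameters are made up for this example, and the destination is assumed to end with a slash.

```python
import posixpath


def remote_uri(source, relative_file, dest_prefix):
    """Map one local file to its remote URI (hypothetical helper).

    source        -- the directory argument as typed, e.g. "dir1" or "dir1/"
    relative_file -- file path relative to that directory, e.g. "file1-1.txt"
    dest_prefix   -- destination URI, assumed to end with "/"
    """
    if source.endswith("/"):
        # Trailing slash: only the contents are uploaded, directory name dropped.
        base = ""
    else:
        # No trailing slash: the last path component is kept on the remote side.
        base = posixpath.basename(source)
    return dest_prefix + posixpath.join(base, relative_file)


dest = "s3://s3tools-demo/some/path/"
print(remote_uri("dir1", "file1-1.txt", dest))
print(remote_uri("dir1/", "file1-1.txt", dest))
```

The first call keeps dir1/ in the remote path and the second drops it, matching the two put -r examples above.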

The above examples were built around the put command. A bit more powerful is sync; the path name handling is the same as was just explained. The important difference is that sync first checks the list and details of the files already present at the destination, compares them with the local files and only uploads the ones that either are not present remotely or have a different size or MD5 checksum. If you ran all the above examples, you’ll get output similar to the following from a sync:

~/demo$ s3cmd sync  ./  s3://s3tools-demo/some/path/
dir2/file2-1.log -> s3://s3tools-demo/some/path/dir2/file2-1.log  [1 of 2]
dir2/file2-2.txt -> s3://s3tools-demo/some/path/dir2/file2-2.txt  [2 of 2]

As you can see, only the files that we haven’t uploaded yet, that is those from dir2, were now synced. Now modify, for instance, dir1/file1-2.txt and see what happens. In this run we’ll first check with --dry-run to see what would be uploaded. We’ll also add --delete-removed to get a list of files that exist remotely but are no longer present locally (or perhaps just have different names here):

~/demo$ s3cmd sync --dry-run --delete-removed ~/demo/ s3://s3tools-demo/some/path/
delete: s3://s3tools-demo/some/path/file1-1.txt
delete: s3://s3tools-demo/some/path/file1-2.txt
upload: ~/demo/dir1/file1-2.txt -> s3://s3tools-demo/some/path/dir1/file1-2.txt
WARNING: Exitting now because of --dry-run

So there are two files to delete: those that were uploaded without the dir1/ prefix in one of the previous examples. There is also one file to be uploaded: dir1/file1-2.txt, the file that we’ve just modified.
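Conceptually, the --delete-removed list is the set of remote names that have no local counterpart. The following sketch illustrates that idea only and is not s3cmd's implementation; the function name and the file lists are chosen for this example.

```python
def remote_only(local_keys, remote_keys):
    """Return the remote names that have no matching local file, sorted."""
    return sorted(set(remote_keys) - set(local_keys))


# Names relative to ~/demo/ locally and to the destination prefix remotely.
local = [
    "dir1/file1-1.txt",
    "dir1/file1-2.txt",
    "dir2/file2-1.log",
    "dir2/file2-2.txt",
]
# The two extra remote names stand for the earlier upload done with "dir1/".
remote = local + ["file1-1.txt", "file1-2.txt"]

for key in remote_only(local, remote):
    print("delete:", key)
```

Only the two names without the dir1/ prefix are reported, which mirrors the dry-run output above.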

Sometimes you don’t want to compare checksums and sizes of the remote vs local files and only want to upload those that are new. For that use the --skip-existing option:

~/demo$ s3cmd sync --dry-run --skip-existing --delete-removed ~/demo/ s3://s3tools-demo/some/path/
delete: s3://s3tools-demo/some/path/file1-1.txt
delete: s3://s3tools-demo/some/path/file1-2.txt
WARNING: Exitting now because of --dry-run

See? Nothing to upload in this case because dir1/file1-2.txt already exists in S3. Its content differs, indeed, but --skip-existing only checks for the file’s presence, not its content.
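The difference between the default comparison and --skip-existing can be sketched as a single decision function. This is a simplified model based on the description above (size and MD5 by default, presence only with --skip-existing), not s3cmd's real code; the function name and the dictionary layout of the remote entry are assumptions made for the example.

```python
import hashlib


def needs_upload(local_data, remote_entry, skip_existing=False):
    """Decide whether a local file would be uploaded (hypothetical helper).

    local_data   -- the local file content as bytes
    remote_entry -- None if the object is missing remotely, otherwise a
                    dict with "size" and "md5" keys (assumed layout)
    """
    if remote_entry is None:
        return True   # not present remotely: always upload
    if skip_existing:
        return False  # present remotely: presence alone is enough
    if len(local_data) != remote_entry["size"]:
        return True   # size differs
    return hashlib.md5(local_data).hexdigest() != remote_entry["md5"]


old = b"old content"
new = b"new content"  # same length as old, different bytes
remote = {"size": len(old), "md5": hashlib.md5(old).hexdigest()}

print(needs_upload(new, remote))                      # default comparison
print(needs_upload(new, remote, skip_existing=True))  # presence check only
```

With the default comparison the modified file is scheduled for upload; with skip_existing=True it is not, just like dir1/file1-2.txt in the two dry runs.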

Download from S3

Download from S3 with get and sync works pretty much along the same lines as explained above for upload. All the same rules apply and I’m not going to repeat myself. If in doubt, run your command with --dry-run. If still in doubt, ask on the mailing list for help :-)

Filtering with --exclude / --include rules

Once the list of source files is compiled, it is filtered through a set of exclude and include rules, in this order. That’s quite a powerful way to fine-tune your uploads or downloads: you can for example instruct s3cmd to back up your home directory but not the JPG pictures (exclude pattern), except those whose names begin with a capital M and contain a digit. Those you do want to back up (include pattern).

S3cmd has one exclude list and one include list. Each can hold any number of filename match patterns. For instance, in the exclude list the first pattern could be “match all JPG files” and the second one “match all files beginning with letter A”, while the include list may hold just one pattern (or none, or two hundred) saying “match all GIF files”.

There are a number of options available to put patterns into these lists.

  • --exclude / --include: standard shell-style wildcards; enclose them in single quotes to avoid their expansion by the shell. For example --exclude 'x*.jpg' will match x12345.jpg but not abcdef.jpg.
  • --rexclude / --rinclude: regular expression versions of the above, and a much more powerful way to create match patterns. I realise most users have no clue about RegExps, which is sad. Anyway, if you’re one of them and can get by with shell-style wildcards, just use --exclude/--include and don’t worry about --rexclude/--rinclude. Or read some tutorial on RegExps; such knowledge will come in handy one day, I promise ;-)
  • --exclude-from / --rexclude-from / --(r)include-from: instead of having to supply all the patterns on the command line, write them into a file and pass that file’s name as a parameter to one of these options. For instance --exclude '*.jpg' --exclude '*.gif' is the same as --exclude-from pictures.exclude where pictures.exclude contains these three lines:
    # Hey, comments are allowed here ;-)
    *.jpg
    *.gif
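To make the equivalence concrete, here is a small sketch of how such a pattern file could be turned into a list of patterns. It assumes that lines starting with # are comments and that blank lines are ignored; the function name read_patterns is made up for this example and the parsing details of s3cmd itself may differ.

```python
def read_patterns(text):
    """Extract match patterns from the text of a pattern file.

    Assumption for this sketch: "#" starts a comment line and blank
    lines carry no pattern.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


pictures_exclude = """# Hey, comments are allowed here ;-)
*.jpg
*.gif
"""

print(read_patterns(pictures_exclude))
```

The resulting list holds the same two patterns that would otherwise be passed as two separate --exclude options.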

All these parameters are equal in the sense that a file excluded by an --exclude-from rule can be put back into the game by, say, an --rinclude rule.

One example to demonstrate the theory…

~/demo$ s3cmd sync --dry-run --exclude '*.txt' --include 'dir2/*' . s3://s3tools-demo/demo/
exclude: dir1/file1-1.txt
exclude: dir1/file1-2.txt
exclude: file0-2.txt
upload: ./dir2/file2-1.log -> s3://s3tools-demo/demo/dir2/file2-1.log
upload: ./dir2/file2-2.txt -> s3://s3tools-demo/demo/dir2/file2-2.txt
upload: ./file0-1.msg -> s3://s3tools-demo/demo/file0-1.msg
upload: ./file0-3.log -> s3://s3tools-demo/demo/file0-3.log
WARNING: Exitting now because of --dry-run

The upload line for dir2/file2-2.txt shows a file that has a .txt extension, i.e. matches an exclude pattern, but because it also matches the 'dir2/*' include pattern it is still scheduled for upload.
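The exclude-then-include order can be reproduced with Python's fnmatch module. This is a sketch of the filtering logic as described in this section, not a copy of s3cmd's code; the function name filter_files is invented for the example, and patterns are matched against paths relative to the source directory.

```python
from fnmatch import fnmatchcase


def filter_files(paths, exclude=(), include=()):
    """Split paths into (kept, excluded) using shell-style patterns.

    A path is dropped only if it matches an exclude pattern and does
    not match any include pattern.
    """
    kept, excluded = [], []
    for path in paths:
        is_excluded = any(fnmatchcase(path, pat) for pat in exclude)
        is_included = any(fnmatchcase(path, pat) for pat in include)
        if is_excluded and not is_included:
            excluded.append(path)
        else:
            kept.append(path)
    return kept, excluded


files = [
    "dir1/file1-1.txt",
    "dir1/file1-2.txt",
    "dir2/file2-1.log",
    "dir2/file2-2.txt",
    "file0-1.msg",
    "file0-2.txt",
    "file0-3.log",
]

kept, excluded = filter_files(files, exclude=["*.txt"], include=["dir2/*"])
for path in excluded:
    print("exclude:", path)
for path in kept:
    print("upload:", path)
```

Note that in fnmatch the * wildcard also matches the / character, which is why '*.txt' catches files inside dir1 in this sketch.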

This --exclude / --include filtering is available for put, get and sync. In the future del, cp and mv will support it as well.

Monday, January 9, 2012

s3cmd

s3cmd is an intuitive way to work with Amazon's S3 on the command line.

Install s3cmd

$ sudo apt-get install s3cmd 

Configure s3cmd

$ s3cmd --configure
Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3
Access Key: XXXXXXXXXXXXXX
Secret Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password: XXXXX
Path to GPG program [/usr/bin/gpg]:

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP and can't be used if you're behind a proxy
Use HTTPS protocol [No]: yes

New settings:
  Access Key: XXXXXXXXXXXXXX
  Secret Key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  Encryption password: XXXXX
  Path to GPG program: /usr/bin/gpg
  Use HTTPS protocol: True
  HTTP Proxy server name:
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n]
Please wait...
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Success. Encryption and decryption worked fine :-)

Save settings? [y/N] y
Configuration saved to '/home/saltycrane/.s3cfg'

List all your buckets

$ s3cmd ls 

List contents of your bucket

$ s3cmd ls s3://mybucket 

Upload a file (and make it public)

$ s3cmd -P put /path/to/local/file.jpg s3://mybucket/my/prefix/file.jpg 

Delete a file

$ s3cmd del s3://mybucket/my/prefix/file.jpg 

Get help

$ s3cmd --help 
Usage: s3cmd [options] COMMAND [parameters]
S3cmd is a tool for managing objects in Amazon S3 storage. It allows for making and removing "buckets" and uploading, downloading and removing "objects" from these buckets.  
Options:   
-h, --help            show this help message and exit   
--configure           Invoke interactive (re)configuration tool.   
-c FILE, --config=FILE   Config file name. Defaults to /home/saltycrane/.s3cfg   
--dump-config         Dump current configuration after parsing config files and command line options and exit.   
-n, --dry-run         Only show what should be uploaded or downloaded but don't actually do it. May still perform S3 requests to get bucket listings and other information though (only for file transfer commands)   
-e, --encrypt         Encrypt files before uploading to S3.   
--no-encrypt          Don't encrypt files.   
-f, --force           Force overwrite and other dangerous operations.   
--continue            Continue getting a partially downloaded file (only for [get] command).   
--skip-existing       Skip over files that exist at the destination (only for [get] and [sync] commands).   
-r, --recursive       Recursive upload, download or removal.   
-P, --acl-public      Store objects with ACL allowing read for anyone.   
--acl-private         Store objects with default ACL allowing access for you only.   
--delete-removed      Delete remote objects with no corresponding local file [sync]   
--no-delete-removed   Don't delete remote objects.   
-p, --preserve        Preserve filesystem attributes (mode, ownership,timestamps). Default for [sync] command.   
--no-preserve         Don't store FS attributes   
--exclude=GLOB        Filenames and paths matching GLOB will be excluded from sync
--exclude-from=FILE   Read --exclude GLOBs from FILE
--rexclude=REGEXP     Filenames and paths matching REGEXP (regular expression) will be excluded from sync
--rexclude-from=FILE  Read --rexclude REGEXPs from FILE
--include=GLOB        Filenames and paths matching GLOB will be included even if previously excluded by one of --(r)exclude(-from) patterns
--include-from=FILE   Read --include GLOBs from FILE
--rinclude=REGEXP     Same as --include but uses REGEXP (regular expression) instead of GLOB
--rinclude-from=FILE  Read --rinclude REGEXPs from FILE
--bucket-location=BUCKET_LOCATION   Datacentre to create bucket in. Either EU or US (default)
-m MIME/TYPE, --mime-type=MIME/TYPE   Default MIME-type to be set for objects stored.
-M, --guess-mime-type   Guess MIME-type of files by their extension. Falls back to default MIME-Type as specified by --mime-type option
--add-header=NAME:VALUE   Add a given HTTP header to the upload request. Can be used multiple times. For instance set 'Expires' or 'Cache-Control' headers (or both) using this options if you like.
--encoding=ENCODING   Override autodetected terminal and filesystem encoding (character set). Autodetected: UTF-8
--list-md5            Include MD5 sums in bucket listings (only for 'ls' command).
-H, --human-readable-sizes   Print sizes in human readable form (eg 1kB instead of 1234).
--progress            Display progress meter (default on TTY).
--no-progress         Don't display progress meter (default on non-TTY).
--enable              Enable given CloudFront distribution (only for [cfmodify] command)
--disable             Enable given CloudFront distribution (only for [cfmodify] command)
--cf-add-cname=CNAME  Add given CNAME to a CloudFront distribution (only for [cfcreate] and [cfmodify] commands)
--cf-remove-cname=CNAME   Remove given CNAME from a CloudFront distribution (only for [cfmodify] command)
--cf-comment=COMMENT  Set COMMENT for a given CloudFront distribution (only for [cfcreate] and [cfmodify] commands)
-v, --verbose         Enable verbose output.
-d, --debug           Enable debug output.
--version             Show s3cmd version (0.9.9) and exit.

Commands:
  Make bucket
      s3cmd mb s3://BUCKET
  Remove bucket
      s3cmd rb s3://BUCKET
  List objects or buckets
      s3cmd ls [s3://BUCKET[/PREFIX]]
  List all object in all buckets
      s3cmd la
  Put file into bucket
      s3cmd put FILE [FILE...] s3://BUCKET[/PREFIX]
  Get file from bucket
      s3cmd get s3://BUCKET/OBJECT LOCAL_FILE
  Delete file from bucket
      s3cmd del s3://BUCKET/OBJECT
  Synchronize a directory tree to S3
      s3cmd sync LOCAL_DIR s3://BUCKET[/PREFIX] or s3://BUCKET[/PREFIX] LOCAL_DIR
  Disk usage by buckets
      s3cmd du [s3://BUCKET[/PREFIX]]
  Get various information about Buckets or Files
      s3cmd info s3://BUCKET[/OBJECT]
  Copy object
      s3cmd cp s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]
  Move object
      s3cmd mv s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]
  Modify Access control list for Bucket or Files
      s3cmd setacl s3://BUCKET[/OBJECT]
  List CloudFront distribution points
      s3cmd cflist
  Display CloudFront distribution point parameters
      s3cmd cfinfo [cf://DIST_ID]
  Create CloudFront distribution point
      s3cmd cfcreate s3://BUCKET
  Delete CloudFront distribution point
      s3cmd cfdelete cf://DIST_ID
  Change CloudFront distribution point parameters
      s3cmd cfmodify cf://DIST_ID

See program homepage for more information at http://s3tools.org