This special remote type stores file contents in Amazon Glacier.

To use it, you need to have glacier-cli installed.

The unusual thing about Amazon Glacier is the multiple-hour delay before data can be retrieved from it. To deal with this, commands like "git-annex get" ask Glacier to start the retrieval process, and then fail because the data is not yet available. You can then wait approximately four hours and re-run the same command; this time, it will actually download the data.
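
The wait-and-retry cycle described above can be scripted. The following is only a minimal sketch, not part of git-annex: the function name get_with_retry, its parameters, and the default of three attempts are hypothetical, and it assumes that "git annex get" exits with a non-zero status while the data is still unavailable.

```python
import subprocess
import time


def get_with_retry(path, run=subprocess.call, sleep=time.sleep,
                   wait_seconds=4 * 60 * 60, max_attempts=3):
    """Re-run "git annex get" until it succeeds or attempts run out.

    Hypothetical helper. Returns the 1-based number of the attempt that
    succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        # Exit status 0 means the content was downloaded. A failure is
        # expected on the first attempt, which only starts the retrieval.
        if run(["git", "annex", "get", path]) == 0:
            return attempt
        if attempt < max_attempts:
            # Roughly four hours by default, per the delay described above.
            sleep(wait_seconds)
    return None
```

Calling get_with_retry("foo/bar/baz") would block for several hours between attempts, so a scheduled job that simply re-runs the command later may be a better fit in practice. The run and sleep parameters exist only so the loop can be exercised without contacting Glacier.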

configuration

The standard environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are used to supply login credentials for Amazon. You need to set these only when running git annex initremote, as they will be cached in a file only you can read inside the local git repository.

A number of parameters can be passed to git annex initremote to configure the Glacier remote.

  • encryption - Required. Either "none" to disable encryption (not recommended), or a value that can be looked up (using gpg -k) to find a gpg encryption key that will be given access to the remote, or "shared" which allows every clone of the repository to access the encrypted data (use with caution).

    Note that additional gpg keys can be given access to a remote by running enableremote with the new key id. See encryption.

  • embedcreds - Optional. Set to "yes" to embed the login credentials inside the git repository, which allows other clones to also access them. This is the default when gpg encryption is enabled; the credentials are stored encrypted and only those with the repository's keys can access them.

    It is not the default when using shared encryption, or no encryption. Think carefully about who can access your repository before using embedcreds without gpg encryption.

  • datacenter - Defaults to "us-east-1".

  • vault - By default, a vault name is chosen based on the remote name and UUID. This can be specified to pick a vault name.

  • fileprefix - By default, git-annex places files in a tree rooted at the top of the Glacier vault. When this is set, it's prefixed to the filenames used. For example, you could set it to "foo/" in one special remote, and to "bar/" in another special remote, and both special remotes could then use the same vault.
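
As a rough illustration of how these parameters fit together, the sketch below assembles an initremote command line and checks that the credential variables are set. The helper names initremote_command and missing_credentials are hypothetical, the remote name and vault name in the usage note are placeholders, and type=glacier is assumed to be the value that selects this special remote, since it is not stated on this page.

```python
import os


def initremote_command(name, encryption, **params):
    """Build the argv list for "git annex initremote" (hypothetical helper)."""
    # type=glacier is an assumption; encryption is the one required parameter.
    argv = ["git", "annex", "initremote", name, "type=glacier",
            "encryption=" + encryption]
    # Optional parameters (embedcreds, datacenter, vault, fileprefix)
    # are appended as key=value pairs, in sorted order.
    for key in sorted(params):
        argv.append("%s=%s" % (key, params[key]))
    return argv


def missing_credentials(environ=os.environ):
    """Return the names of the AWS credential variables that are unset."""
    required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    return [name for name in required if not environ.get(name)]
```

For example, initremote_command("glacier", "none", vault="myvault", fileprefix="foo/") yields an argument list that could be passed to subprocess.call once missing_credentials() returns an empty list.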

The glacier-cli tool seems to have been abandoned, and there are a number of outstanding issues with it. boto has a glacier tool, but it doesn't seem to include caching, which seems to be something git annex needs.

Looking through the PRs, it seems like we should build a tool specifically tailored to git annex's needs. It seems that there are at least three of us willing to hack on this if it's in Python. I'm not sure any of us knows haskell, though...

I'm the glacier-cli author. It is not abandoned!

glacier-cli is supposed to map to Glacier exactly, so that it is compatible with all other tools. Most of the outstanding PRs break this essential behaviour, so I have not merged them. Many of the feature requests and bugs relate to the upstream boto library, which is just about the best maintained client library that exists for AWS on any platform (and Amazon have adopted it now, IIRC). I have written appropriate reviews on all the PRs.

If there is specific behaviour that git-annex needs, then I am happy to accept PRs for this, provided that they do not break the ability (and default) for glacier-cli to talk to Glacier natively without an extra layer of interpretation. If an extra layer of interpretation is needed (e.g. forbidding duplicate "keys"), then this needs to be an option, or wrapped in a separate tool, or written into git-annex's Glacier special remote.

Comment by basak Fri May 17 08:35:10 2013

Hi! :)

The main issue I'm hitting is the "Multiple rows were found for one()" error. I think I get this when git-annex tries to upload the same file twice (which may be a bug in git-annex, which could apply de-duplication earlier), but I think I also get it when trying to upload a file whose upload I've canceled in the past.

I don't quite understand what git-annex needs here, and I totally understand that you're writing a general-purpose tool. But there does seem to be an issue that git-annex needs fixed one way or another.

I'm happy to try fixing it myself if you can help me understand what's going on (I didn't quite understand your review in the PR), but if I'm the only person in the world using git-annex to back up to glacier, that scares me a little!

copy foo/bar/baz (checking glacier...) Traceback (most recent call last):
  File "/home/jlebar/code/glacier-cli/glacier", line 694, in <module>
    App().main()
  File "/home/jlebar/code/glacier-cli/glacier", line 680, in main
    args.func(args)
  File "/home/jlebar/code/glacier-cli/glacier", line 579, in archive_checkpresent
    last_seen = self.cache.get_archive_last_seen(args.vault, args.name)
  File "/home/jlebar/code/glacier-cli/glacier", line 157, in get_archive_last_seen
    result = self._get_archive_query_by_ref(vault, ref).one()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2182, in one
    "Multiple rows were found for one()")
sqlalchemy.orm.exc.MultipleResultsFound: Multiple rows were found for one()

Let's discuss this in a bug. I've created http://git-annex.branchable.com/bugs/Glacier_remote_uploads_duplicates/

Comment by basak Wed May 22 18:10:32 2013

You are not the only one, Justin. I am just getting into git-annex and I am setting up a glacier remote as I write this.

Comment by http://id.clacke.se/ Mon Jun 3 09:03:57 2013