(Hi, this is paulproteus@debian, AKA Asheesh).

I've been enjoying using git-annex to archive my data.

It's great that, by using git-annex and the SHA1 backend, I get a space-saving kind of deduplication through the symbolic links.

I'm looking for a way to filter files before they get added to the annex, so that I don't add new files whose content is already in the annex. That would help me with personal file organization.

There doesn't seem to be one, so I'm filing this wishlist bug in the hope that such a thing might exist someday. What I would really like to do is:

  • $ git annex add --no-add-if-already-present .
  • $ git commit -m "Slurping in some photos I found on my old laptop hard drive"

And then I'd do something like:

  • $ git clean -f

to remove the files that didn't get annexed in this run. That way, only one filename would ever point to a particular SHA1.

I want this because I have copies of various files of mine (photos, in particular) scattered across various hard disks. If this feature existed, I could comfortably toss them all into one git annex that grew, bit by bit, to store all of these files exactly once.

(I would be even happier for "git annex add --unlink-duplicates .")

(Another way to do this would be to "git annex add" them all, and then use a "git annex remove-duplicates" that could prompt me about which files are duplicates of each other, and then I could pipe that command's output into xargs git rm.)

(As I write this, I realize it's possible to do this by parsing the destination of the symlink.)

Hey Asheesh, I'm happy you're finding git-annex useful.

So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames pointing at that content.

Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/other_dir/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.
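A minimal sketch of such a script, assuming that every annexed file is a symlink and that the basename of its target identifies the content. The function name list_duplicate_links is made up for illustration, and the sketch only prints the extra names instead of removing anything:

```shell
# list_duplicate_links DIR
# Hypothetical helper: print every symlink under DIR whose target basename
# (assumed to be the annex key) was already seen on an earlier symlink.
# The first name, in sorted path order, is kept for each key.
# Paths containing tabs or newlines are not handled.
list_duplicate_links() {
    # Skip the .git directory, list symlinks only, in a stable order.
    find "$1" -name .git -prune -o -type l -print | LC_ALL=C sort |
    while IFS= read -r link; do
        # readlink also works for symlinks whose content is not present.
        key=$(basename "$(readlink "$link")")
        printf '%s\t%s\n' "$key" "$link"
    done |
    # Print the path for the second and later occurrences of each key.
    awk -F '\t' 'seen[$1]++ { print $2 }'
}
```

Running list_duplicate_links on the top of the repository and reviewing the output before passing any of those names to git rm seems the safer order of operations.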

Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)

So, what if you want to remove the unnecessary copies? Well, there's a really simple way:

cd /media/usb-1
git remote add other-disk /media/usb-0
git annex add
git annex drop

This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as "other-disk"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.

In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring trust to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple "git annex drop" on usb-1.

Comment by http://joey.kitenet.net/ Thu Jan 27 18:29:44 2011

I really do want just one filename per file, at least for some cases.

For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.

I hope that makes things clearer!

For now I'm just doing this:

  • paulproteus@renaissance:/mnt/backups-terabyte/paulproteus/sd-card-from-2011-01-06/sd-cards/DCIM/100CANON $ for file in *; do hash=$(sha1sum "$file" | cut -d' ' -f1); if ls /home/paulproteus/Photos/in-flickr/.git-annex | grep -q "$hash"; then echo already annexed ; else flickr_upload "$file" && mv "$file" "/home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk" && (cd /home/paulproteus/Photos/in-flickr/2011-01-28/from-some-nested-sd-card-bk && git annex add . && git commit -m ...) ; fi; done
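The "is this content already annexed?" check in that loop can be pulled out on its own. This is only a sketch under the same assumption the loop makes, namely that the SHA1 of an annexed file shows up in the names of the entries of some directory; the function name already_annexed and its arguments are made up for illustration:

```shell
# already_annexed FILE KEYDIR
# Hypothetical helper: succeed (exit 0) if some entry name in KEYDIR
# contains the SHA1 of FILE, fail (exit 1) otherwise.
already_annexed() {
    # sha1sum prints "<hash>  <filename>"; keep only the hash.
    hash=$(sha1sum "$1" | cut -d ' ' -f 1)
    # An empty hash would match every name, so treat it as an error.
    [ -n "$hash" ] || return 2
    ls "$2" | grep -q -F -- "$hash"
}
```

With that, the body of the loop reduces to something like: if already_annexed "$file" /home/paulproteus/Photos/in-flickr/.git-annex; then echo already annexed; else upload and add; fi.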

(Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)

For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.

I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.

Perhaps:

git annex drop --by-contents

could let me remove a file from git-annex if the contents are available through a different name. (Right now, "git annex drop" requires that both the name and the contents match.)

-- Asheesh.
