
I do my backups with rsync from the root directory, as is right and proper. One of my machines has a dozen git repositories scattered around its file system in various places. This machine's backup target is at the other end of a not-fast connection.
And, git's data model seems to be "I'm going to continuously make 100% changes to multiple 50+ MB files, to ensure that there's nothing incremental about your incremental backups."
Is there any way to tune git's file usage to be less egregious? Do "git gc" and "git repack" make this worse, or better?
You're about to suggest that I not back up my git repositories, but just do a dozen different git checkouts on the backup-target machine instead. No. Let's just leave it at "no" so that I don't have to explain the several different ways in which that suggestion is stupid.
Kinda missing CVS right now.
Update: Unless I missed something, there have been 3½ suggestions here:
Turn off pack files and gc entirely, which will cause small files to accumulate for every future change, and will eventually make things get slow. gc.auto 0, gc.autopacklimit 0.
Set the maximum pack size to a smaller number, so that no pack file gets too large, and subsequent layers of diffs get bundled into smaller pack files. pack.packSizeLimit.
Dissenting opinion on #2: That doesn't do what you think it does, it just slices a single large pack file into N different files with the same bits in them, so you haven't saved anything.
If you already have one gigantic pack file, create a .keep file next to it. New pack files will appear but they will be diffs against that saved one, and thus smaller.
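(Concretely, that seems to boil down to something like this - a sketch, path hypothetical, not verified by me:

for p in /path/to/repo/.git/objects/pack/pack-*.pack; do touch "${p%.pack}.keep"; done

i.e. drop a foo.keep next to each existing foo.pack that git should leave alone.)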
I guess option #4 is the only practical one?
Hm, if you're still using --delete, isn't that not-incremental?
Although, I guess the concern is less about that and more about the fact that it takes longer since git rearranges and modifies a lot of files, causing the transfer itself to take longer?
You can reduce the maximum size of the pack files created by git-repack with, say,
git config pack.packSizeLimit 5m
IIRC, old pack files shouldn't get modified by later repacks.
I should clarify that git-repack is run by git-gc and git-gc is auto-run every n commits. (And probably has five other triggers I don't know about.) So you'll have pack files whether you want them or not.
This (tuning the maximum pack size) sounds like your best option; by default git will periodically do garbage collection and pack loose objects (ie, not already packed) into pack files (with git repack I believe). I too believe existing, full, pack files won't get changed once they're full, unless a manual from-scratch repack is done -- but with a default of "unlimited" possibly they're never considered full/unchangeable.
As mentioned by someone else below, this automatic gc/pack can be turned off ("git config --global gc.auto 0"), but over time that will cause busy git repositories to be slower to use than they might otherwise have been (because there are lots of tiny files to work with), as well as have a bunch of unneeded files lying around. But combined with an occasional manual gc/pack it might be worth it.
Ewen
That option doesn't do what you think it does.
git just repacks everything into new packs with new names that happen to be individually smaller than the maximum size. It'll replace a big pack file with several smaller (and collectively slightly larger) pack files, and rewrite all of them on every gc.
The option is designed for filesystems that can't ever have files bigger than some magic size, e.g. 2GB or 4GB.
Ugh.
Git Bundle (https://schacon.github.io/git/git-bundle.html) makes this less terrible in some ways, but you've gotta special case your git repos, which is stupid.
gc and repack supposedly make the deltas smaller, but I haven't tested this out.
repack should smush everything into a pack file and gc should remove unreachable things. Perhaps you'd be better off not having pack files, but I don't know if there's a way to do the opposite of repack.
I read the question a bit lazily, so I thought of bup:
Bup uses a rolling checksum to chop files at recurring places. I think the distribution might be exponential or something like that, so you might get hideous outliers, but for the most part it seems to do the right thing.
How about what appears to be the right and proper way: make git bundles and transfer those instead. Git bundles don't have to start all the way from the root commit, too. Of course, if this sounds like manually performing a git push, you'd be right. I suppose you do get to skip the part where you have to apply the bundles to a working directory until it's convenient/necessary/you want to ensure that you can restore from such a backup.
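For the record, the bundle dance would be roughly this - a sketch, where the 'last-backup' tag is something you'd invent to mark what has already been shipped:

git bundle create full.bundle --all
git bundle create incr.bundle last-backup..master
git tag -f last-backup master

On the backup machine you'd git clone full.bundle repo once, then git pull ../incr.bundle master thereafter (or just hoard the bundle files themselves as the backup).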
Alternatively, investigate the ways you can use to manage large 50 MB+s files outside of git while having them versioned anyway. git-annex is an example.
Yes, either thing requires changes to your workflow or to others around you. I am sorry.
Do keep in mind that git gc and repack commands might destroy your reflog. (That's the thing you can use to undo botched merges and rebases after you've completed them.) Also, if you run them from cron and you happen to be manipulating these repositories at that very moment, nasal demons may occur.
You just suggested the exact thing I said not to suggest.
You do, I'm afraid, need to enumerate why you think using git's own transfer mechanisms is stupid - it's really not obvious.
The downside that comes to my mind is that you'd lose un-named branch heads - the nameless things that git rebase leaves behind, and that git gc cleans up for you, and that git reflog so helpfully lets you play with. This is probably the very thing you want to backup, unless your backups are really for catastrophic failure cases only.
So what I'm really thinking is that your question should be:
How do I backup the reflog so I don't need to ever really expire it?
I'd love to know the answer to that question, too.
But I don't know whether you're concerned with other issues I've not picked up on. They're just not as obvious - to me at least - as you seem to think.
No.
Adding a dozen magic .git directories to the --exclude list is completely unacceptable and I'm not wasting my time explaining why.
Like I said at the top.
Oh, OK, so you don't care about how useful the backups are, you just want the process you're already using to be efficient.
Surely --exclude .git is enough to avoid transferring the git repositories entirely, and just transfers the working directory, as CVS usage would give you? No need to list them all explicitly.
Any repos you have uniquely on the machine (if any) you can just backup history manually using git push as needed.
You're still making those mouth-sounds, you might wanna look into that.
No, I'm really trying to help find solutions to your problem.
Unfortunately, your problem is roughly:
"I have chosen my backup solution. It is not suitable for all my data, because it's too inefficient."
So my first suggestion, along with most people's, would have been to change your chosen solution - but you explicitly ruled out the sane options there. I made a stab at why, but you shot me down without further explanation.
Based on this, I assumed you were entirely immovable on how the backups worked (ie, it must be a single, simple, rsync command), so tried to work with this. You make some snarky comment.
The basic problem is that you're trying to backup a database with entirely unsuitable commands, then complaining this doesn't work.
I can't speak for jwz but what this sounds like to me is: filesystems have a well-defined notion of change and there are really good tools which exploit that notion to efficiently copy the filesystem on the assumption that changes are small.
Now you install some tool which doesn't like using the filesystem in a natural way ('because performance' but in fact 'because stupid') and causes the changes seen by the first tool to appear to be enormous.
Well, OK, so you exclude all this from what the first tool sees, and use the second tool's notion of change to back it up. Well, OK: it turns out that the second tool doesn't natively support backups in the sense you need them, but you can write a bunch of support stuff which will use the things it can do to make them, and all this will probably work. And now you've lost a few days of your life, your backup system is about twice as complicated as it was and you need an elaborate procedure to do a restore which never gets properly tested.
And now you get a third tool which has the same problem, and so on. Pretty soon your backup system is a tangled mass of special cases which requires continual care and feeding and you try not to think about what would happen if you actually need to do a restore.
Sometime later you realise there's a clever trick. If all these too-clever-for-the-filesystem tools actually are doing it for performance then they can't be touching very many disk blocks. So, if you completely ignore the whole filesystem, but snoop the disk traffic at the block I/O level, then you can do efficient backups by watching which blocks get changed. Of course, you need expensive hardware to do that, but this turns out to be a small cost compared to the nightmare of misdesign that has been inflicted on the filesystem by these fancy tools.
And of course, that is exactly what people with enough money do: it turns out the best way of backing up a filesystem is to ignore it completely, if you can afford to do that. Now that I'm not in that world I can see it as funny, but it's not.
I do wonder if rsync from an LVM snapshot might get a more coherent picture, not just of Git but of everything. I know that at least PostgreSQL can reliably recover from TARs of snapshots -- it's identical (from Postgres' point of view) to recovering from a power failure.
Of course, that may involve moving all of the data out of the way just to set up LVM...
...and setting up an entire extra storage abstraction just to deal with Git would be, obviously, insane.
It would. But somehow we're all fine with the idea that some huge extra mechanism is just fine to back up a database.
Thanks for a well-reasoned reply.
I think we're in broad agreement; except that I don't think git is particularly special here, or stupid in its usage of the filesystem.
Pretty much any database (and I use the term very loosely) has similar properties WRT rsync-style backups. When rsync was initially released, I saw people trying to clone/backup IMAP servers with it, and it was just as much of a disaster. This is all OK, since one hopes that the number of different systems is fairly small. It's not ideal, I agree, but I assert it has never been the case that you could back everything up with rsync anyway.
The good news is that most things haven't gone the briefly fashionable route of using their own raw disk partitions for storage, so low-level filesystem snapshot backups do still work. If efficiency is a problem, though, you'll want a different strategy - also if you actually want a backup that's useful outside of catastrophic disaster recovery.
My original point was that git's own transfer mechanisms are pretty selective, and cannot be used to backup the reflog, which - at least in my opinion - is the thing I would want backed up. This is a problem I see as worth solving; whereas "make everything work with a naive backup based on rsync" just seems like trying to hammer in a screw - that just wastes your time and annoys the screw.
Yes, I don't think git is unusual in this sense. Where I differ is that I think all of these applications are written by people who are too stupid to understand the implications of what they are doing (including git, and definitely including all the big database systems). There are two kinds of stupidity involved: in the case of git I assume the author understood the implications but just did not care because they did not matter to him; in the case of the big database systems I think that they are written based on a combination of what is momentarily convenient ('it's 1980, filesystem performance is really poor, and now it will always be 1980') and just plain not understanding performance well.
As with git, they typically also do the nice trick of having their own backup/replication mechanism only do part of the job you need, so you still need the filesystem level backups, thus maximising the complexity of your backup system.
There is no good reason why rsync (or any other straightforward filesystem-metadata-based incremental tool) should not work really well. Of course they won't work really well.
Git is not backup. It is not doing an inadequate job of backing stuff up; if you are using it for backup the problem is that.
And lo, there is a solution to jwz's problem: tune the git packing behaviour as per other comments, and bonus, you can set your preferences globally with git config --global, too.
There really is little left to complain about.
Hi there!
Maybe I'm suggesting something as stupid as using git pull as incremental backup: uploading only diffs and commit history.
When will someone invent source control that just works and doesn't require the dev team be goddamn geniuses in order to not inevitably turn the entire repo into a dumpster fire?
Yes...I find the spoof git documentation indistinguishable from the actual git documentation.
Should see my hacky setup...
My iMac with my main iTunes lib lives on a Pegasus R6 array (now 20TB RAID-5) with tier-1 backups to a synology via rsync. Tier-2 backups are two 2x6TB (12TB stripe) LaCie 2Bigs that are Time Machine drives, rotated every few weeks to a locked desk drawer at work (off-site backups!).
My Linux box I similarly back up to the synology via iscsi and rsnapshot for tier-1, and then tier-2 using a simple USB adapter, swapping between 2 disks (in static bags) like the 2Bigs above. I just live with the added .git bits in rsync... but my repos are small and don't update all that much.
Then I also made just my /etc directory (and homedir) git repos that are git pulled daily to my linode instance.
In the end, though, I've got multiple copies of my data, in multiple locations, so it would take a hell of a disaster, or screw-up on my part to really lose anything.
Sounds like you resent the volume of data being transferred, likely because git is making packs in the background. This is useful for saving disk space and for some transfer operations, but it may not be very amenable to the incremental backups rsync is trying to do.
What if you set this variable, that tells git to never ever make packs automatically?
git config --global gc.auto 0
Then you'll just have loose objects sitting around, which rsync knows all about. You can still run git gc manually if the disk space gets egregious.
There is a scary Shannonbeast truth in here about the costs of both minimizing the storage cost of a long history of many files & making it amenable to a 'naive' incremental transport protocol.
Jason +1. In addition:
- you'll want to rsync the refs directories ahead of the objects directories. This will yield usable backups in most/all situations.
- run a git GC on each repo ensuring it doesn't overlap with your rsync backup. Use flock or whatever
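A sketch of the flock idea (lock and repo paths invented; the point is that the gc job and the rsync job grab the same lock so they can never overlap):

flock /var/lock/repo-backup git -C /path/to/repo gc
flock /var/lock/repo-backup rsync -a /path/to/repo backuphost:/backups/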
For most situations this Just Works. The corner cases where it doesn't are the ones involving fancy git configs, which of course you won't use...
Fwiw my name is in the git credits.
Hi jwz! I actually read your post, and have personal experience running the things you're asking about.
You're basically SOL. Git is not interested in your use case or sympathetic to your attitude or network setup.
Glad to be of lazyhelp!
Hooray!
Based on reading through the git config man page (https://www.kernel.org/pub/software/scm/git/docs/git-config.html), I recommend setting the following options:
git config --global gc.auto 0
git config --global gc.autopacklimit 0
I have not tried doing this, but if I'm reading the man page right then this will reduce or eliminate git's tendency to automatically pack entities into big pack files. This will make your repositories get dog slow over time as they accumulate huge numbers of separate files, but it also means the repositories only ever gain new files over time instead of having existing ones rewritten. You'll still end up with pack files as a result of pulling in changes from external repositories, but I don't think that's necessarily a problem for rsync. If it is, you can set
git config --global transfer.unpackLimit 1000000000
which should make git unpack everything (unless it has more than a billion entities) when you fetch stuff.
Again, I have not tried this myself (I like pack files, I don't want my git repos to be dog slow)
Oh, hey, Jason already gave this solution. I'm lazy and didn't read all the comments.
On the rsync side you might try adding the --fuzzy flag. And while this would make you very sad, you're likely to find that putting together a job to find and copy most new files with something like 'tar -jcf - ... | ssh ...' will save time and bandwidth - rsync was rather inefficient for copying new files (especially lots of small ones) when I last researched it ~ 8 years ago.
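By the tar-over-ssh thing I mean something in this spirit - a sketch, with new-stuff/ standing in for whatever you've determined is new - rather than letting rsync enumerate everything:

tar -jcf - new-stuff/ | ssh backuphost 'tar -jxf - -C /backups'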
What blows my mind is that Git managed to fuck this up. _Nearly_ everything Git stores is immutable: packs are named after a hash of their contents, loose objects likewise. Only refs, the reflog, and config are mutable in place, and there are well-known strategies for safely updating those, and for sequencing those updates relative to other writes to the .git directory. Most of the core Git committers are filesystem people, who should by all rights understand this stuff at a deep level. And yet, here we are.
Your source control system is our learning experience.
git packs deltas in (mostly) reverse chronological order, for most of the same reasons RCS did. This makes checkouts of branch tips fast because they read data in sequential order off of your hard drive and you don't spend a lot of time waiting for deltas to be computed.
This optimization was brought to you by a bunch of filesystem geeks. You're welcome.
A side-effect of this is that when you add one line of code to a file, the next repack will move that line of code toward the front of the pack. This in turn changes all deltas derived from the newer revisions to their ancestors, and rewrites the entire pack to save maybe five whole bytes of disk space.
As you say, git people are filesystem people so they designed this right.
Copy the refs first (the refs dir, and the packed-refs file if you have it) and the objects dir second; that'll ensure correct backups even if taken mid-operation. If you don't want "mutating packs" because it goes against the rsync grain, there are settings for that. And there are the .keep markers, too.
If you can tolerate doing things minimally git-style, use git mirror.
And if you don't want any of this, maybe use monotone. Like git but store everything in an sqlite file.
All the whining here is rather idiotic. CVS was a fantastic stopped clock that gave you broken states of your code in as many ways as you might want. You can still run it. I wrote a CVS server that uses git as backend, so get your cvs-style checkouts and... Yeah rsync will run over that. Good luck with the sanity of CVS repo rsync'd mid-operation.
This advice was wrong both times it was repeated.
If you copy refs first, and the transfer is interrupted, the receiving repo will be broken because the refs will point to objects that weren't copied yet.
If you copy objects first, and the repo is modified on the sending side during the transfer, the receiving repo will be broken because the refs will point to objects that didn't exist when objects were copied.
Transfer interruption is more likely and harder to control than git repo changes concurrent with the backup (presumably jwz runs both himself, so he can avoid doing both at the same time). The lesser of the two evils is to transfer the objects before the refs, which is what rsync does anyway because it copies things in alphabetical order.
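If you wanted to force that order explicitly instead of relying on rsync's alphabetical walk, it would look something like this (paths invented):

rsync -a /path/to/repo/.git/objects backuphost:/backups/repo/.git/
rsync -a --exclude=objects /path/to/repo/.git/ backuphost:/backups/repo/.git/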
There is no right answer here. There is only using backup tools that work atomically, or learning how to pick SHA1 hashes out of .git/logs to fix broken references if the origin drive fails in the middle of a git commit during a backup.
I was reading that advice and wondering WTF people were saying as I was certain that refs/ references stuff in objects/
> If you copy refs first, and the transfer is interrupted, the receiving repo will be broken
rsync never promised atomicity, so a broken rsync results in a broken state at destination. That's hardly a problem with my recommendation.
Now, you _can_ use rsync options that help with atomicity. I am talking about --link-dest (see https://blog.interlinked.org/tutorials/rsync_time_machine.html). So if you use rsync in a failsafe fashion, combined with copying refs ahead of objects, you have a failsafe backup.
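The --link-dest pattern is roughly this (a sketch; directory names invented), so that each run looks like a complete snapshot with unchanged files hard-linked against the previous one:

new=/backups/$(date +%Y-%m-%d)
rsync -a --delete --link-dest=/backups/latest /path/to/stuff/ "$new"/
ln -nsf "$new" /backups/latest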
I would combine the above for sanity, with the efficiency tricks of Zigo's cronjob strategy (disable auto repack, run a cronjob that repacks infrequently, another cronjob that adds .keep to large packs).
This is a really annoying thing with virtualbox images, too. Boot a win7 virtualbox, immediately shut it down. Apparently the entire file has changed as far as rsync can tell.
There are ways to tell git to push to many remote repos at once. https://github.com/RichiH/vcsh
I had this problem, except on a somewhat larger scale--about two orders of magnitude larger. Watching 5GB pack files getting sloshed around network links all day is no fun.
Disabling gc isn't really helpful since git turns into "save every version of all your files with gzip under random names" if it can't make pack files. If your packs are already at 50MB now then this will be painfully slow (although if you're used to CVS then maybe your pain threshold is higher than most). rsync will not be able to use any of its bandwidth-saving tricks to help you since it can't guess what files to use as a basis for deltas. So leave gc enabled, and teach git to not rewrite big pack files.
If you already have the big pack files copied with rsync, that's great. Leave them there, but create a file next to them with a '.keep' suffix (e.g. .git/objects/pack/pack-da39a3ee5e6b4b0d3255bfef95601890afd80709.keep). git will never touch the corresponding pack file again.
git will keep creating new pack files, but they will be much smaller, since they'll contain only things you changed since the old packs were created. Repeat the process as necessary when the .pack files grow.
I automated this with a cron job that looks for files matching the pattern '"pack-", 40 hex digits, ".pack"', and drops a .keep file next to the pack files above a threshold size (4GB in my case, but whatever size causes you pain is fine).
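Roughly this, if anyone wants to copy it (a sketch assuming GNU find; adjust the repo path and the size threshold to whatever hurts you):

find /path/to/repo/.git/objects/pack -regextype posix-extended \
  -regex '.*/pack-[0-9a-f]{40}\.pack' -size +4G \
  -exec sh -c 'for p; do touch "${p%.pack}.keep"; done' _ {} +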
That is a very good strategy.
Hmm. If enough people want this it's probably reasonable to teach git how to do this of its own accord. "Is this pack file larger than X? Yes? Then always treat it as kept".
The situation is so fucked up that people who make Git hosting tools recommend disabling writes during backups.
How about that.
I'd point out that this is the case with almost any source control software: if an operation that modifies the metadata runs while a backup is in progress and not all of the changes get copied, the copy will be out of sync. AFAIK this is a problem for RCS, CVS, SVN (traditional and modern metadata formats) and probably all others.
If you have any suggestions for version control software whose checked-out metadata isn't corrupted if not all of the files modified in an operation are copied during a backup, I'm all ears. (Software that can recover from this without copying anything from another checkout is acceptable.)
At least git(olite) has an option to protect you against this.
In all seriousness, since every OS/FS has _for a long time now_ supported snapshots, who takes any sort of backup without using snapshot functionality? I use snapshots on my laptop, for god sakes... just for simple 7zip local backups of the local mailstore (Thunderbird profile directory, TB has a habit of eating it if the IMAP connection gets too slow then leaving you to download 7GB of mail over HSDPA while in the middle of some desert shithole)
If you aren't considering snapshot functionality for helping with backups you aren't trying to solve the problem seriously.
With snapshots you're still in the same race with a consistent internal application state; you're just running a lot faster.
Snapshots are a tool that allows the problem to be solved practically. In my case, if I wanted to back up the local mailstore I had to keep my email client closed for the duration of the backup, which could be ~45 minutes (7zip archive updates are slow). As the backup ran in a minimised state I'd often end up opening the client again while the backup was still running, trashing it. In Jamie's server environment this equates to annoying downtime, possibly hours.
With snapshots you reduce the downtime to a barely-noticeable interval - just enough time to take the snapshot, which is (approximately) fixed regardless of the snapshot size, and anyway about 10s in my experience.
Yes, you have to write-lock the DB for 10s. This is not a problem, it is an inevitable part of the solution.
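For the record, the snapshot dance is roughly this - sketched with LVM, volume and mount names invented; any filesystem-level snapshot works the same way - and the write-lock only has to cover the lvcreate:

lvcreate --snapshot --size 10G --name backup-snap /dev/vg0/home
mount /dev/vg0/backup-snap /mnt/backup-snap
rsync -a /mnt/backup-snap/ backuphost:/backups/home/
umount /mnt/backup-snap
lvremove -f /dev/vg0/backup-snap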
Ok, I know you aren't interested in this, but if you are going to use a git-protocol solution, use `git clone --mirror` / `git push --mirror`.
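That is, something like this (a sketch; paths and hostnames invented):

# on the backup machine, once:
git clone --mirror workstation:/path/to/repo /backups/repo.git
# thereafter, either pull from the backup side:
git -C /backups/repo.git remote update --prune
# or push from the workstation:
git push --mirror backuphost:/backups/repo.git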
Less advice and more peanut-gallery questioning; is it either/or, with packs? Ie if you turn off the auto repacking, will it leave the existing large packs, but then accumulate further diffs as small files, or will it say 'well there's this giant pack why don't I just putz with that'? Because this suggests an incremental strategy of 'turn off autopack, then do a manual pack before full backups' but I'm guessing somebody would have already tried that and possibly got burned....
I use bup to bla bla bla, so I currently have a git repo that's upwards of 56GiB of data in pack files. After bup does its thing we tell git to do a repack and land .keep files next to packs that are nearing 2GiB. This isn't terrible since we (almost) never delete objects from the repository, and if we had to we could remove the .keep files for the packs that are sufficiently new to hold the objects we want to delete and just gc within those.
The upshot is that our automatic rsync runs transfer at most 2G of data that will be deleted later. Not ideal, but not nearly as bad as the full repository every time. You could, of course, change the threshold down to 10M or so for doing the same stunt for smaller repositories.
I hope that is helpful?
You know what else I love about git? It's 2015 and it doesn't understand lock files.
Yeah, I'm not supposed to have a web app writing to a directory at the same time that a cron job might also be writing there, not the git way, yadda yadda yadda, but come on! Who writes software that, when it sees its own lock file responds with PANIC! ABORT! instead of "just wait for the fucking lock to clear before proceeding"?
Also, it's apparently an insoluble problem to determine the difference between "the lock is legitimately held" and "a previous run crashed without releasing the lock". Because nobody has ever solved that problem before. And they imply here that it's possible for git to crash without releasing its lock. Because nobody has ever solved that problem before.
Huh... I see that same thing now and then. I've never actually seen a crashed/failed git command that resulted in the index.lock being left behind, either. It's like git just "forgets" to remove it once in a while.
Wow. That is seriously amateur hour stuff, right there.
Is there a "proper" way to handle lockfiles? In my experience it often seems like a very ad-hoc thing, and a web search gives me a lot of possibilities but no clear answer. On /doc/java.html you say link() "is the only reliable way to implement file locking" but I didn't find any detail. I could guess but I'll probably miss something (eg. I'm not sure which operations work over NFS).
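My best guess at the link()-based pattern would be something like this - untested, and do_the_work is a stand-in:

tmp=/tmp/mylock.$$
echo $$ > "$tmp"
if ln "$tmp" /tmp/mylock 2>/dev/null; then
    # creating the hard link fails if /tmp/mylock already exists - atomically, even over NFS
    do_the_work
    rm -f /tmp/mylock
fi
rm -f "$tmp"

Detecting a stale lock left by a crashed holder is presumably where it gets interesting.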
Although for an application that's mostly append-only, with files named after their SHA hashes, I question whether locking should ever really be needed.
Well, if you're writing a database anyway (which source control systems are) you could try implementing the MVCC pattern that any DB worth trusting your data to has implemented. Since I notice certain high profile kernel developers generally pour scorn on DBMS programmers I expect that's not going to happen.
I should point out that the git repo (or object store) does implement MVCC. However the git index does not.
Can you explain why you think that having post-commit hooks for your git repo that automatically punts your data to a backup location is stupid?
What specifically do you miss or what problems does it create for you when you use that technique?
Thanks!
Nevermind. Found your reply buried in another thread here.
To throw out a radical idea: have you considered using a version control system that doesn't suffer from the limitations with repacking?
Mercurial scales to hundreds of thousands of files. Its internal storage format is based on append-only files for most data, which is very rsync friendly. Mercurial repositories tend to be a little bigger than Git repos because similar content across files doesn't get the benefit of shared compression (packfiles). But if you are aiming for rsync friendliness, Mercurial wins.
Of course, you don't have much control over what other people use for version control. So I guess you may be stuck with forcing packfiles to not rewrite so often.