CVS

Dear Lazyweb, I have a CVS question.

Please stop laughing.

I know, it's terrible, but I have no desire to spend the time to switch to something else, and try and figure out how to not lose literal decades of change history. So stop. Staaahhhp.

My CVS question is this:

Sometimes when I do "cvs update" on a directory tree that has like 150k files in it, it takes a minute or two, but sometimes it takes an hour and seems to be laboriously diffing each file or something. So there's a "go fast" mode and it's randomly going into a "go slow" mode. I don't understand why this happens, or how to get it to tell me that it is happening, or how to make it happen less often or at least more predictably. This is with the repo on a server that is not on my local network, ssh, -z9.

It doesn't even seem to be pegging my DSL. I think it's doing something stupid on the server side.

Ideas?


61 Responses:

  1. Phil Nelson says:

    I have no good answer for your actual question, so instead I will answer a question you did not ask, and in fact, probably explicitly asked the opposite of: Look into git cvsimport, this SO thread should help http://stackoverflow.com/questions/584522/how-to-export-revision-history-from-mercurial-or-git-to-cvs
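
    For the curious, the basic import is roughly one command (the CVSROOT path and module name here are placeholders):

    % git cvsimport -v -d :ext:you@cvshost:/path/to/cvsroot -C myproject-git myproject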

    It could very easily take your weekend. But it will be worth it.

    runs away

    • Owen Jacobson says:

      Using SCM A as a frontend for SCM B doesn't end well, even if you understand both SCM systems deeply. (Needing to do so is not a feature.)

      Migrating from SCM A to SCM B requires being in control of the project (or, in a pinch, being willing to host and maintain a fork). I have a hard time believing that our host controls the source repositories of projects he cares about and has them live outside of his own network. XScreensaver, for example, has no public source repo at all -- just public source releases as tarballs. Presumably, whatever SCM system he uses lives entirely on boxes he controls.

      This smells like a problem with someone else's code, not with his own, and suggesting they switch SCM systems just to solve his personal problems is unlikely to succeed.

      In short, your mindless proselytism does exactly nothing to address the problem.

    • Kevin says:

      I don't think jwz appreciates suggestions that end: "use up your entire weekend to solve a problem that sometimes takes an hour of your time".

    • It probably wouldn't take your weekend, though. I've converted several repositories from SVN to mercurial, from git to mercurial and back, and whatnot; IME most repositories convert very easily. I suspect that CVS (no experience converting it) - due to its limited feature set - may actually be easier, since you're unlikely to encounter weird stuff like tree conflicts and unusual layouts that plague SVN. It's certainly worth a shot; there really is a good chance it'll be little more than one command - especially if you're interested in the history for history's sake, and don't necessarily care that the past is a little messier than ideal (since you'll rarely be looking at it anyhow).

      Typically the things that break are stuff like SCM macros, externals, that kind of thing.

  2. Ewen McNeill says:

    Your description feels like a "working set (sometimes) doesn't fit in RAM" issue. Various (older) file systems also had problems with directories with lots of files in them (for definitions of "lots" in the order of thousands); ext[34] have some flags which add directory indexing, which helps with "large" directories.

    Is it possible that sometimes there is something else taking more of the RAM than at other times, and that's causing it to drift between "working set fits in RAM" and "swapping/re-reading from disk regularly"? I'd certainly be watching RAM/disk access stats on the server when it was running slowly, looking for clues. And/or strace on the server-side cvs process.
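
    Something like this on the server while a slow update is in progress should show it (the pid being whatever the server-side cvs process happens to be):

    % vmstat 5
    % strace -f -tt -p <cvs-server-pid> -o /tmp/cvs-trace.log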

    Ewen

    PS: It is possible, but often non-trivial, to convert CVS repositories to various other things without (much) loss of revision history (no loss if the revision history has always been maintained in "normal" manners, with no hand editing). The various methods for doing so all basically suck; expect to spend a couple of days getting a non-trivial CVS repository converted.

    • phuzz says:

      Perhaps try running something on the server that uses up most of the RAM, and then run cvs update on a small set of changes and see if it still comes out slow.
      Or just stick more RAM in the box; it's usually quite cheap as long as you don't need some now-obscure type (like DDR2 FB-DIMMs).

  3. gryazi says:

    Only-vaguely-useful response but I suspect one of the FreeBSD Deep Ones should be able to guess at an answer, unless they've migrated to something more horrible recently.

    (Gratuitous annoying tangential question: Why is everything other than cvsweb - by which I mean gitweb - so incredibly user-hostile when you just want to dip in and check the state of some code or where some magic numbers are buried?)

    • Ben says:

      FreeBSD migrated to svn internally a while ago. Recently they have started switching off the CVS mirrors.

  4. MK says:

    Wait, but you can import CVS repository into SVN and preserve full history.

  5. Zygo says:

    Everything about CVS is terrible. Here are just a few examples relevant to your situation:

    Merges are server-side (like SVN), but so are diffs. CVS will upload your complete files to the server so it can send you a diff back. Everything else kids use these days does diffs against data on the local disk, and can upload deltas on commit, or download deltas on update.

    CVS timestamps are so imprecisely defined (localtime without time zone or fractional seconds) that you'll be uploading even when you wouldn't otherwise need to. Timestamps are the only local meta-data CVS can use to avoid having to talk to the server about what's on your disk. SVN has assorted stat() data and SHA1 hashes. Git has the whole repo right there on local disk, crushed by some obscene hundreds-to-one compression ratio.

    The CVS protocol is basically a network transport shoved sideways into the middle of the cvs command, so it does RPC calls where newer SCMs generate and process streams of data. There are round trips, plural, per file. A tiny increase in your DSL latency can be multiplied hundreds of thousands of times for the working tree sizes you mentioned. You won't saturate your DSL connection because the CVS processes spend all their time waiting for each other.

    I know it's not what you want to hear, but for your own sanity you should seriously consider dumping that repo out into something designed in 2003 or later (ten years old ought to be new enough). The newer SCMs are better, and most have facilities for easily importing history out of other SCMs as long as you don't go crazy with branches. The problems you will have getting data out of CVS are due to the "even when it's working, it's so vague it's broken" design of CVS--they'll go away once you've moved the data into something else.

    • Ian Young says:

      Not Helping: Really, having had to "manage" CVS and SVN for projects and teams, I assure you that it will be better in the long run to just use cvs2svn now, today, and deal with the short-term hurt of learning to use Git or SVN (if you have years of the CVS habit, SVN is the methadone you can probably cope with best).

      And even then, SVN can still be slow. You really want Git. Hell, I really want Git, but it's scary and strange and involves pushing and pulling and cloning or some shit what is this i dont even.

      • hattifattener says:

        Mercurial has many of the nice points of git (decentralized, fairly compact and fast, popular, etc), but a much simpler mental model. For a project that's not too large I prefer it to git.

        • What's wrong with large mercurial projects? There are still a few large projects around using it. It's slightly less disk-space efficient, but otherwise it's not that different scaling-wise...

          The #1 reason to prefer git is that it's pervasive, and the differences don't really matter that much anyhow.

          • hattifattener says:

            Nothing, really— git's advantage seems to be in the tools it has for managing flows of changes in a large, ad-hoc-structured community of people all working on the same codebase. I don't have experience doing that with either git or mercurial, but from observing, e.g., Linux kernel development, that seems to be a thing.

            Most projects will never be large enough for that to be relevant, though.

            OTOH, lots of people are intimidated away from using git because of its complexity, and if Mercurial can offer the same basic benefits without quite as much confusing description of refs and blobs and such, that's great.

            Alternately, people can use github — I think much of git's popularity surge can be attributed to the fashionableness of github — which ironically leads to dropping most of git's complexity and power, and treating it as a centralized VCS again...

            • Nick Lamb says:

              For that last part: The trick there is that because git still is a distributed version control system you don't have to care. The power to handle multiple remote repos or pull patches out of a mail queue lives in every individual local repository, a corporate github install doesn't change that at all. It's like the existence of $EDITOR. The fact that weenies can set it to Notepad or something doesn't take anything away from the power of Emacs or vi.

            • I don't think there's much difference there. The workflow's a little different by default, but both support fully distributed work easily. Perhaps it matters once you get as large as the kernel; however, other large projects (such as e.g. firefox) use mercurial, so I don't think there's much relevant difference.

              There are differences. I just don't think they're large enough to make much difference. Somebody could say Git's a little faster; mercurial's a little easier; git's tagging is simpler; mercurial's branches are more flexible... I just don't think it's all that relevant (people just focus on whatever the differences are, so they appear larger than life...). I really think whatever you use, you'll be happy, and converting between the two is rather easy if you're not. Never tried bazaar, might be OK too - just whatever you do, get off of CVS :-).

  6. Charles says:

    One idea from the FreeBSD committer's guide is to set up SSH ControlMaster for the host where your CVS repo lives. The example is at the bottom of http://freebsd.unixtech.be/doc/en_US.ISO8859-1/articles/committers-guide/cvs.operations.html
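
    Roughly, it's a stanza like this in ~/.ssh/config (the host name is a placeholder):

    Host cvs-server.example.com
      ControlMaster auto
      ControlPath ~/.ssh/cm-%r@%h:%p
      ControlPersist 10m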

    Using -R during updates might also help. It pretends that the repo is read-only, which should be fine for updates, and comes with the man page comment "Using -R can also considerably speed up checkouts over NFS." I'm not sure what the internal cause of that would be, but it's worth a try if you're not using it already.

    • jwz says:

      Or not:

      % cvs -qR update -dP
      cvs: WARNING: Read-only repository access mode selected via `cvs -R'
      Using this option to access a repository which some users write to may cause intermittent sandbox corruption.
      cvs [update aborted]: Read-only repository feature unavailable with remote roots

      • Charles says:

        Wow. I know I have used -R in the past without such warnings, but it may only have been through an anonymous pserver. The internet is now suggesting to me that the flag may even have been silently ignored at the time.

        If the speed issue is related to creating/deleting lockfiles, you could also try putting a Lockdir into the CVSROOT/config file that points to a speedy memory-backed filesystem. http://www.network-theory.co.uk/docs/cvsmanual/config.html has details about that. Not having a repo that exhibits your symptoms, I don't know if it would help, but if it irritates you enough to lazyweb it, it seems painless enough to try it as a solution.
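
        If you do try it, it's a one-line change in CVSROOT/config on the server; the path here is just an example of a memory-backed directory, which would have to exist and be writable by the cvs users:

        LockDir=/dev/shm/cvs-locks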

  7. MattyJ says:

    Rather than argue about SCM systems, which jwz explicitly forbade, here's a suggestion. It might not be your problem, but way back when, I used to run into this.

    At any time did you happen to use update with the '-r tag' option? This option is sticky and will keep being used, perhaps at times you don't expect, until you explicitly update back to the head. Scanning tags is expensive since CVS is a filesystem/archive-based system, and updating using a tag touches each of those 100K+ files.

    I think 'cvs status' will show you which sticky options you have active. I don't think there's a way to remove individual sticky options but you can remove them all using (finally has to consult manual) 'cvs update -A', then re-create the ones you actually need.
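
    Roughly (the file name is just an example):

    % cvs status some-file.c | grep Sticky
    % cvs update -A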

    There are probably 100 other things that can do it, but 15 years ago when I was a CVS admin, this was a common one.

  8. Marc says:

    If you have something messing with your time stamps (or perhaps time zones) on either end then that could cause CVS to do a lot of extra work.

    If you weren't you, I'd make sure you weren't using a silly file system that has strange time stamp semantics.

  9. Vincent says:

    http://cvs2svn.tigris.org/, for when you do feel like you have a weekend to spend on it.

    I mean, you'll probably discover all the history you've lost because CVS can corrupt its revision history without you knowing, but w/e.

    I've used this on a few repos. You lose functionality (in that each file is treated like a special easily corruptible snowflake), but CVS history can be represented in SVN.

    And I know you'll say "but my repo history isn't corrupted!" - CVS has no integrity checking. You won't know until you check out each and every revision of every file.

    • Malcolm says:

      This is the scenario I've seen this behaviour in, too: a server that had a misconfig so the timezone kept switching between right and weird. If the server thinks it's ahead of the client, it will stat every single file.

      To see what's happening in more depth than you normally care about, the -t ("trace") option to cvs on the client side can be useful. It will certainly show if there are timestamp mismatches or excessive stat'ing going on. Downside is you'll need to catch it in the act.
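
      For example, something like this while it's being slow (the -n keeps it from actually changing anything while you watch):

      % cvs -t -n update -dP 2>&1 | less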

  10. Kevin Lyda says:

    Have you noticed a pattern in the lag?

    Specifically I wonder if it happens at certain times of day. If a memory hungry cron job (or cgi script or whatever) has run on your server recently it might have wiped out the file cache and forced the cvs server process to read everything from disk.

    It's been a while, but IIRC the cvs server does actually have to read the rcs files on disk for an update; it can't just depend on the history file - which for a project of the age you're discussing must be rather large anyway.

    I know you probably hate people mentioning other VCSs in this thread. I apologise for being one of them.

    This is a rather short intro for git: http://eagain.net/articles/git-for-computer-scientists/ Someone could probably write a similar one for hg or bzr. I tend to use git, but I'm sure the others are fine.

    I'm going to suggest two things. First, I think you'd like a dvcs because your repo would be faster and smaller. In this case a git pull would send and receive less data as well as need to read less data off disk. Second, I'm free this weekend. If you send me a tarball of your CVS repository, I'll import it to git or hg (your pick) and send that back to you along with any scripts I wrote for the import. I would neither share nor keep your source. In fact I wouldn't look at it. I might need to look at commit messages.

    I just recently migrated a large svn repo into git at work and have all that in my head. I've never done a cvs migration to git, but I did migrate one to svn which I later migrated to hg. The main issue is how much you've edited your history file - if you did that a lot, it limits the quality of the import.

    Anyway, happy to give it a try. kevin@ie.suberic.net is my email address and note you're 8 hours behind me.

  11. Iain Hartley says:

    Maybe look into reposurgeon. ESR claims that it does a pretty decent job of bringing stuff into the 21st century, and if it doesn't he seems to be looking for repos that have pathological issues so he can find a way to deal with them (http://esr.ibiblio.org/?p=4745)

    • Ken Kennedy says:

      I know migration is not on the table, but I second the reposurgeon suggestion now that it's on the table. ESR is indeed going out of his way to preserve CVS metadata with this tool.

    • Kevin Lyda says:

      Generally esr makes me despair for the fate of our species, but this looks interesting. Thanks.

  12. When I used CVS, the repository lived on an NFS share, so it wasn't networked CVS in the typical sense. After two hard drives holding the repo failed, I started investigating the impact of invoking cvs update (which my homemade continuous integration tool ran every 90 seconds). I was horrified to discover that every cvs update was creating and deleting three lock files in every directory of the repo (create write lock, create read lock, delete write lock, do reading, create write lock, delete read lock, delete write lock). You mention having 150k files, but not how many directories it takes to hold them. Whether you're doing your update locally or over the network, I would expect the repo locking requirements to be the same. I don't know what would make your repository file system occasionally slow to create and destroy all of those lock files.

    Epilogue: I changed my CI kludge to check the modification date of a commit log file every 90 seconds, and to skip the full update if the log file hadn't changed.

  13. nathan says:

    Is the remote machine virtualized? I have a similar problem with a remote SCM machine, and at least in my case, the shared storage backend is both not capable of sufficient IOPS and the iSCSI network connection between the host and the storage is congested.

  14. Chris Yeh says:

    > I think it's doing something stupid on the server side.

    Why yes, yes it is. I can't find the original article, but the issue is that when you do an update, the CVS server has to do extra work on its side. Using CVSTMP, it'll re-create the entire directory hierarchy below the point where you issued the update, populate it with CVS/Entries files, and then compare them with what you have in your local client. On a sufficiently large tree with slow file system performance, this can make updates orders of magnitude slower.

    I distinctly recall the 'randomly updates take forever' problem. Sometimes this is caused by the server stupidity above, or something causes the date/time stamps on your local CVS/Entries files to change, thereby causing the aforementioned abuse of CVSTMP.

    There are two solutions to this problem, one of which we implemented on the original CVS servers at mozilla.org (hat tip, Noah). The first solution is to make CVSTMP be on RAM disk so that you never suffer the update penalty on slow disks.

    The second solution is better: never use 'cvs update'. A 'cvs checkout' on top of an existing tree produces the same result as an updated source tree, but without the need for the server to create the entire CVS/Entries bullshit to walk through and compare. As an additional point of trivia, the old tinderbox scripts did this for exactly this reason.
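
    For the second approach, refreshing an existing tree looks roughly like this (module name is a placeholder; run it from the directory above your working tree):

    % cd /path/above/your/tree
    % cvs -z9 checkout -P mymodule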

    • Kevin Lyda says:

      Ah! I would assume that CVSTMP is almost always going to be /tmp. So depending on your /tmp cleanout policy, that tree would need to be rebuilt regularly. And since /tmp is almost always a tmpfs, it won't survive reboots.

      Obviously the need to walk a large tree is painful, but this would be even worse. So yes, more likely the cause.

  15. moof says:

    How many files do you have in the various Attic/ directories on the server? If you're willing to back up somewhere and then purge those, that can save quite a lot of time depending on how many of them there were. This is even more pronounced if you have an awful lot of subdirectories with nothing but Attic/; because CVS is rather stupid about that situation, it'll locally mkdir and then rm them some time later (assuming you're using cvs up -dP). This may require some surgery on various CVS/Entries files, but it's usually not that big a deal.

    • jwz says:

      Deleting the Attics would totally defeat the purpose of using version control at all.

      • moof says:

        In that case, if you're the only person using the repo, using cvs without -dP might help; most of the random delays I've seen have been due to the mkdir/rmdir nonsense. I also seem to recall that using a ramdisk for CVS's tmpdir on the server side can also speed things along quite a bit.

  16. phessler says:

    At the OpenBSD project, we also have a very large, very old repository in cvs. We do two main things to improve cvs performance: a very, very fast /tmp on the server, and cvsync to make a local clone of the repo on each workstation.

    The only problem with the cvsync solution is remembering to use -d when committing; otherwise you commit to your clone instead of the master server. But for everything else, a local cvsync makes things far, *far* nicer.
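
    Day to day that looks roughly like this, assuming the working copy was checked out from a cvsync'd local mirror and the master lives on cvs.example.org (all names made up):

    # updates hit the local mirror
    % cvs update -dP
    # commits have to name the master explicitly
    % cvs -d :ext:you@cvs.example.org:/cvs commit -m 'fix the thing'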

    • Stefan Bethke says:

      Back in the day before FreeBSD switched to SVN, that was the recommended practice for committers as well (apart from FreeBSD using CVSup). Worked really well.

    • grェ says:

      I suspect this may be the most useful advice on this thread, but only because OpenBSD devs still actively use CVS. :)

  17. hexmonkey says:

    Can I be an unhelpful ass in a different way by suggesting that everyone check the Urban Dictionary definition of "pegging" and then re-read the penultimate sentence in the OP?

  18. emf says:

    I have to N-th the "git cvsimport" suggestion. I know you don't want to hear it. Really, I know. Updating things that currently work for basically no reason is bullshit. But, you need a version control intervention.

    CVS. My god, you're starting to become Jerry "Chaos Manor" Pournelle. ("Well, I installed this 300 MB hard disk in the Novell server on my roof, and now my backups won't run when it's raining outside! I've even tried an RS422 mux downlink over the doorbell wiring! All I want to do is write my book!")

  19. Zygo says:

    See? Everything you've read above goes away as soon as you stop using CVS.

    Porting 20-year-old graphics hacks to the iPhone with a homebrew GL-over-GLES2.0 layer: awesome.

    Still using CVS for version control in 2013: WTF, dude.

    • MattyJ says:

      I'm all for distributed source control (git) but can someone please explain to me how that's useful to a one-man shop? Or even a several-man shop? Or specifically, jwz? Why bother with the overhead if you're not likely to use the main feature of the tool?

      jwz is a dirty, old school guy, he requires dirty, old school source control.

      • tkil says:

        Without commenting on whether our gracious host would use these capabilities, here's why I use git in my "one man shop":

        1. Distributed, disconnected operation. Which means full capabilities to fork, branch, merge, even if I'm in the wilderness or on a plane.

        2. Deep copy with full history everywhere. On each machine I clone to, I get full history, so I get distributed backup essentially for free.

        3. Effortless branching / merging. Fantastic tools like gitk for visualizing and managing branching.

        4. Effortless renaming / deletion / refactoring, including moving directories around, without needing to explicitly explain or code any of it.

        5. If you add more developers (either proprietary or open-source), there's github, which is a pretty awesome resource. (Not perfect, of course, but still quite amazing, and it's free for open source stuff.)

        Having said that, git is not the only game in town. Mercurial offers most of those, and is a bit more of a polished / all-in-one solution (built-in web server, etc). Bazaar is also a player in the same space, but it sounds like it might be somewhat abandoned these days. (There are some oddballs like Fossil out there, too, that offer a different set of tools...)

        But any of those are vastly more flexible than CVS or SVN, and I have used that flexibility often enough to appreciate it.

        (As others have mentioned, there's been a lot of work in making good cvs-to-git importers, too. A sample import should cost only a few minutes of setup, a few hours of runtime (depending on size/complexity of repo), and then some time examining the results.)

      • Zygo says:

        Why bother with the overhead if you're not likely to use the main feature of the tool?

        You have this backwards. Git is a well designed and implemented version control system for a single creative developer which is also suitable for large-scale distributed and formalized collaboration.

        The overhead is negative. Git is smaller and faster than CVS. The parts of Git that are "distributed" are the handful of commands for pushing data from your machine to some other repo, and for one-man use cases they are roughly equivalent to "cvs update" and the second half of "cvs commit". Git also comes with automation for tasks like "suck in a bunch of patches from my email mbox" which one-man development teams might actually use.

        The compression algorithm in git is a work of art. For source code it gets a couple of orders of magnitude better compression than Subversion (I've converted some one- and two-man SVN source repos from 2GB on disk to 8MB), but at the same time it's optimized to provide fast checkouts of branch tips and HEAD (without seeks if you're still using spinning disks). For a CVS to Git transition you can probably cut disk usage by a factor of 1000 or more, unless your repo is full of MP3 files or similar (and even then, Git understands how to delta-compress an MP3 much better than CVS does). If you're using a remote repo over a slow network the compression algorithm will do a few round-trips and then stream tiny amounts of delta-compressed data in either direction.

        Your whole repo can probably fit in RAM, which means you can do nasty brute-force repo history searches in seconds instead of hours (e.g. "in which revisions was the string "FrobFoo" added or removed from any file?"). You can put it on your small-but-expensive SSD media so that it's fast even with a cold cache.
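
        That particular search is a one-liner, by the way (pickaxe; "FrobFoo" being the string from the example above):

        % git log -SFrobFoo --oneline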

        With the disk space you save, you can keep multiple copies of your favorite repos lying around as backups. You can tell if the copies are intact by cloning them at any time. If a single bit goes bad on disk, git will notice, and complain about it loudly. This is useful because one-man shops tend to have disks old enough to lose a few bits here and there without telling anyone. Over a 20-year life span some data loss is nearly certain, so it's important to check for it regularly so you can recover it while recent correct backups still exist.

        Stuff that takes minutes in Subversion or hours in CVS takes seconds in Git. Status queries, updates, commits, branching, tagging, sending data to other machines, searching the logs--all fast. For a medium-sized repo (half a GB, 30,000 files) common operations have sub-second run times with a warm RAM cache. The performance is so different between Git and CVS that it changes the way you think about using SCM tools.

        You can rewrite history without permanently cluttering your repo. When I'm exploring the solution space of a problem, I have a Makefile rule that literally commits anything it finds in the source directory after any successful build. When I've got something I want to keep, I collapse all the automatic commits into a single commit with a log message that makes it look like I did a small number of predetermined things intentionally, instead of trying out ideas until I could decide what the end result should look like. If I prefer an idea I had an hour ago over what I have now, I can just check out some older version, without messing around with switching SVN repos or building ad-hoc shell scripts to make temporary copies in random places.
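
        The collapse step is just something like this, where 12 stands in for however many automatic commits you want to squash:

        % git reset --soft HEAD~12
        % git commit -m 'Add the frobnicator, intentionally and on the first try'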

        If you need truly immutable history (e.g. for security or auditing purposes), it's a repo config option away. You can have a single server remote repo like CVS if you want, and except for Linux kernel developers I bet most Git users do exactly that.

        • MattyJ says:

          Interesting. Thanks. I was only trying to be a little snide, now I feel like a dick when reading serious answers. :)

          I maintain a commercial RCS for a medium-sized shop and 90% of the 'we need git!' people don't really know why, nor can they articulate it. The other 10% speak in a language above my plane of existence. This helps.

          In any case, wasn't trying to sound pro-CVS either, but there's a reason some of us wear t-shirts we had in high school, even though there are better, newer t-shirts for sale. jwz has a lot of old t-shirts.

            • Some more anecdotal evidence: one repo I maintain, with 15k commits, has around 50% smaller checkouts (!) using mercurial rather than SVN. So despite dragging along the _entire_ history of every file and every change, as opposed to just the previous version, the checkout is much smaller - the efficiency difference is just that large. (And in reality, the difference is even bigger than it first appears, since I'm counting the working copy - which is identical - in both as well.)

            Having said that, I've also encountered repos with significant binary assets, and these don't fit the model well at all - every version is stored as a separate copy, and since these are usually compressed, delta-coding often helps little. There are extensions to mercurial and for git that mitigate the problem, but they do so by undermining the DVCS advantages, so there isn't as much reason to switch.

  20. martin langhoff says:

    Author/maintainer of various importers to git here, and of git-cvsserver, which emulates CVS on the wire. I'll spare you the "switch to git". You know you have to :-) -- but you need help now with this. OK. I know (or kinda remember) the protocol.

    The most likely issue is something causing network latency or packet loss. The protocol is very chatty, so transient network issues causing latency can be very visible, more than you'd expect.

    The SSH tunnelling makes it more opaque to diagnostics. And perhaps the -z9 (which will be using large-ish blocks for compression) is making it worse. Try low -z values, say -z1 to -z5?

    Maybe you can try to see if rsync (which has its own chatty protocol) has the same issue. rsync the CVS repo to a tmpdir, then rsync it again telling rsync to ignore timestamps, so it does its chatty rolling-checksum thing.
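
    Concretely, something like this (host and paths made up):

    % rsync -az you@cvshost:/path/to/cvsroot/ /tmp/cvs-copy/
    % rsync -az --ignore-times you@cvshost:/path/to/cvsroot/ /tmp/cvs-copy/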

    Or you can try the trace flags to see what's going on in a slow update, if you want to diagnose it further.

    FS issues (if you have directories with thousands of entries) may be a problem on the remote server, especially if there are lots of users, but you seem to be the only user. So, unlikely. You can test for this over ssh, I am sure.

    CVS locks, again, may be an issue, but they time out real slow, so you'd be setting the house on fire. And they hit multi-user servers, so again unlikely to be you.

    If all fails, come back here and whack a few "switch to git"-ers. Helps with the frustration thing.

    • Zygo says:

      rsync isn't exactly chatty. The protocol was designed from the beginning to avoid waiting for network latency.

      Except for the initial protocol option negotiation, it's three processes, two of which spew data at the next process in the stream as fast as the network can carry it, without bothering to listen for replies to what they're sending downstream. rsync has trouble coping with disasters like running out of disk space because neither process that sees the "please stop I'm full" message is the process that is pushing file data down the pipe.

      The newer protocol versions are a bit more sophisticated than this, sending file lists at the same time as checksum and file data, and coping with "disk full" errors without using Unix signals. That just makes rsync care even less about latency than it did five years ago.

      CVS is nothing like that efficient. One thread on each end, synchronous send/receive/reply protocol. CVS repo replication tools are better, but they were designed by people who were familiar with and borrowed ideas from rsync.

  21. Mike Hoye says:

    I just googled keywords and now I'm going to answer a question you didn't ask about software you don't use and then tell you to switch to something else I don't understand but recognize and this one time I went fishing and there was a river and it had a source but there wasn't any control and I had a rock but I guess "rocks" aren't recent enough technology for you hurf durf so maybe you should file locks semaphor key-value prion reboot and cruft old trace wires code protocol bits hang fast compress old modern sum checked compile. I mean everyone knows that and look I'm keyword-smart.

    In conclusion, switch distros.

  22. deathdrone says:

    If it's thrashing so bad that RAM is being moved to disk, you should just be able to do a "top" and see that bad-CVS is using very little CPU compared to good-CVS (because bad-CVS is spending all of its time waiting on the disk). If this is the case, I guess "buy more RAM" might help?

    If bad-CVS and good-CVS are both using the same amount of CPU, it still might be thrashing in your processor cache. I know there's a linux command that shows you how many page-faults you're getting per second. I think it's vmstat? If bad-CVS is showing way more page-faults than good-CVS, then I guess it's thrashing more. Maybe "buy a better processor" would help?

    Otherwise, you could just try attaching to bad-CVS using gdb, and using Ctrl-C + bt a few times to print out random backtraces. If there's a really obvious bottleneck somewhere, this should give you a lot of clues about where it is, even if your shit is compiled with -O3 -S0.

    You say it's spending a lot of time in "diff." The fundamental problem that diff tries to solve is actually NP-HARD, I think. The "edit distance" problem or something. I'm guessing this means that there are some really nasty cases which cause diff to run forever? Maybe if you have the same block repeated over and over again in your code, I dunno. There might be a diff --stupid option somewhere which forces it to only use sub-exponential algorithms.

    Funny thing, almost all of the regex engines that exist in the wild nowadays actually have exponential worst case run times, even if you don't use any back-references.

    • deathdrone says:

      If it is a problem with some diff edge case, then it's probably going to be going slow on the same specific file every time. If you knew which file it was, you could probably bypass it somehow. If you're having problems figuring out which file diff is looking at, you could try using strace if you're on linux, and look for the last open(2) syscall.
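
      Roughly (the pid being whatever the slow cvs process is):

      % strace -f -e trace=open -p <cvs-pid>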

    • njs says:

      Diff as used in VCS systems is not NP-hard; there are good dynamic programming tricks. It's pretty fast, even -- worst-case O(NP) where N is the size of the input and P is a measure of how much changed.