cy/LazyFetch - Mercurial

Lazy Fetching is the idea that you can only pull the revision history that is relevant to your own local changes. This cuts down on bandwidth and disk usage, and makes it easier for new developers. They can simply get a copy of the code, add a patch, and push that patch to others, or have them pull it. Normally you have to request all revisions (or changesets) all the way back to the very first one ("very first" being a relatively arbitrary distinction), before you can commit new ones. It's pretty important that you not be required to do so, especially for very large repositories with long histories.

My concept of a repository is a linked list of revisions, each one composing the contents of the working directory at a particular moment in time. Changesets come in because for small changes, having only the information needed to transition between revisions takes up a lot less space than having both revisions. So for this article at least, a revision will be a full copy of all checked in files, and a changeset will be a diff between two revisions. Let me know if that's horribly wrong or missing some crucial detail.

My idea is that you can have an arbitrary "first revision" that may have previous revisions to it, but you don't need to know that to do useful things with it. You should be able to push and pull changesets to that revision just as if it were an initial revision, even to repositories that treat it as an intermediate revision. These are the scenarios I can think of, with a rough explanation of how the two peers would interact.

Example repository:

revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
revision B - file foo.txt containing the lines "foo" and "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
revision C - file foo.txt containing the line "bar"
changeset B->D - starting with revision B, delete the line "foo"
revision D - file foo.txt containing the line "bar"
changeset B->E - starting with revision B, delete the line "bar"
revision E - file foo.txt containing the line "foo"

* If you have revision A, and all the changesets, then by applying them one at a time you can get to any of the revisions. * If you have revision A, revision B, and B->C, then by applying B->C to B you can get revision C. * If you have revision A, and A->C, but not anything about revision B, you can still get revision C by applying A->C to revision A. * If you have revisions A and C, you already have revision C, so you could check out either one without applying any changesets. B can't be checked out until you have either A->B or C->B.

It is worthy to note that revision D is exactly the same as revision C, so the two would end up having the same hash, and thus cannot have different names. Same with versions A and E. Effectively, once you attempted to calculate changeset B->current, it would recognize that the working directory hashes to C, and give you B->C, whereas if the hash does not already exist, it would give you B->D, and a new revision D. It is also worth noting that while revision D is identical to revision C, the changesets to reach it, A->C and B->C are not equivalent. Each would be a different diff, with a different hash. Changesets B->E and B->A would be the same, though in that case E would never exist, since you'd just get existing revision A when you tried to create E.

So this is a more accurate summation of the example repository:

revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
revision B - file foo.txt containing the lines "foo" and "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
revision C - file foo.txt containing the line "bar"
changeset B->C - starting with revision B, delete the line "foo"
changeset B->A - starting with revision B, delete the line "bar"

Now, where's the initial revision of this repository? It's completely circular! You may as well say A is the root, or B, or C. That's the key to lazy fetching, that there is no initial revision to a repository. If you took the current revision from one repository, and the initial revision from another, you could make the former the parent of the latter, simply by computing the changeset from the first revision, to the last one. Suddenly your "initial revision" is not an initial revision. But despite rebasing in this fashion, the contents of the initial revision itself remain unchanged, so it will have the same hash, and rebase to the same hash. I can arbitrarily say in that repository that A is the initial revision, but any of the revisions could be considered its initial state.

This is why it's important to have lazy fetching in any repository management system. Because what revision is the initial revision is arbitrary, and the concept of repository ignores that all revisions and changesets all fit in a single patch space. There's no reason you couldn't merge two totally different repositories, simply by making changesets between revisions they contain. Arbitrarily denying that functionality because the repositories don't start from the same revision just makes merging projects harder. In effect, every repository can be a branch of every other repository, and the model where it cannot is a flawed model that does not encompass all situations, or even all useful situations.

Since the "initial" revision can always have previous revisions, that makes any log of revisions an unbounded list. And to deal with unbounded lists, you use lazy operations. Thus, lazy fetching is necessary to prevent an arbitrary limitation on repositories that encumbers everyone involved.

If I guess right, mercurial does something like the following:

revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
changeset B->C - starting with revision B, delete the line "foo"
changeset B->A - starting with revision B, delete the line "bar"

and calls it a history of 4 changesets. No revisions exist beyond the initial one, and calculating a working tree is applying those changesets to reach a particular revision. It might also have backwards changesets, so A<-B (B->A) instead of A->B.

revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
changeset A<-B - starting with revision B, delete the line "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
changeset A<-C - starting with revision C, delete the line "bar" and add the line "foo"
changeset B->C - starting with revision B, delete the line "foo"
changeset B<-C - starting with revision C, add the line "foo"
changeset B->A - starting with revision B, delete the line "bar"
changeset B<-A - starting with revision A, add a line after "foo" containing "bar"

That way the one single revision you keep could be A, B, or C, since you have changesets that move from any one revision to any other. So you could keep revision C for instance, and check it out immediately instead of having to apply the changesets to reach it from A. Obviously there are equivalent changesets, but I don't know if mercurial optimizes those out.

So, taking the above pseudo-mercurial repository, if you just chop off the former part and have

changeset B->C - starting with revision B, delete the line "foo"
changeset B->A - starting with revision B, delete the line "bar"

...then you can't compute any sort of working directory at all. You don't have B, which is needed to apply any B-> changesets. And you can't get B since for instance you have no revision A, and A->B, or C and C->B.

Doing that is pretty dumb and I think that's why people say shallow clones, or lazy fetching, is impossible. If the ONLY revision you can have is A, then you need all changesets from A to anywhere in order to get... anywhere. Is that not the case? Is there another reason these "shallow clones" don't work well for certain things, like for instance making any contributions or changes at all that others can actually use? If so please educate me because it really puzzles me.

What I want to do is enable the possibility of having a different initial revision, simply ignoring the requirement that A be the initial revision. The changesets you calculate will still be the same, but you won't need all changesets in order to do anything useful. Furthermore you don't need the entire "future" of a project to commit and push changes to it. Even being totally ignorant of the future, your changeset would be identical to if you knew of future changesets, checked out an earlier revision, and made changes to it. Ultimately all you need is a revision to work on, and instructions for retrieving or calculating previous and next versions.

"Instruction to get the next value in a list" is the essence of lazy stream algorithms, where instead of having the entire list, you only need the front of that list, and the instructions at the end. As they beat into your head in lisp circles, '(a b c d) can be converted to and from '(a . (lambda () '(b c d)). So an infinite list '(a b c d ... forever ...) that cannot be stored in finite computer memory, can be represented by the finite list '(a . calculate-next). That finite structure can represent an infinite list, because calculate-next returns a tuple of '(b calculate-next-next), and calculate-next-next returns '(c calculate-etc), and so on indefinitely (and recursively, most likely).

Applying that to revisions and changesets if you had revision I, and nothing previous, and still wanted to go back, you could make a list of revisions '(I . find-rest), and find-rest would go to a remote server and request the changeset that results in whatever is before I. The server would have the changeset H->I, thus knowing that I want I->H, and by calculating both H and I, be able to produce H->I. After it gives me that, my list of revisions would be '(I I->H . find-rest). The order of this list being from present to past. Requesting revisions from past to present works the same way, where I specify my revisions as '(C . find-rest), the server sends me C->D giving me '(C C->D find-rest) and so on until I have H->I.

I never need an absolute initial revision, because the first find-rest lets me request past revisions as needed, on demand, when I check them out locally. I never need an absolute final version, because if I want to accept someone else's changes I simply use the second find-rest to add on to my local list of changesets. Then I can merge my branch with theirs or even rebase as desired, using normal mercurial functions.

And since I never need an absolute initial revision, I can do work on the latest revision, and not disturb anyone's repository that way, since the changeset I produce based on the latest revision will be identical to the changeset I produce based on the "arbitrary initial revision with all previous changesets applied".

What I mean is this. When the remote peer you are fetching from sends you B->C and B->A, it will either have revision B, or have a way to compute it. Nobody needs to store any changesets for which the parent revision is missing, so if the remote peer can't get B, then it won't have any B-> changesets. When you fetch, the official way is sending you A, A->B, B->C and B->A. No problems there. When the remote just sends you B->C and B->A it doesn't work, and you can't calculate any revision at all. But what the remote should do is send you B->C, B->A and also B. Sure it's more expensive to also send you B but it is necessary, and it is less expensive than sending you the entire history.

The remote can calculate B by applying A->B to A. It knows to calculate B, because you requested to fetch everything from revision B on upward. If you request the latest revision, it can just send you A, and no changesets at all. You would, in your invocation of hg, be able to specify the base revision, and how many revisions after it to fetch, so you could even get the remote to send you revision B, and revision B->C, but not the latest changeset B->A.

In the vast majority of cases you would want all future revisions from the base revision you choose, but with for instance extremely ununified projects like the linux kernel (cough), you might not want everyone's idea of "the latest revision" to be your idea of "the latest revision."

The following repository is a bit easier for demonstrating why changesets do not all need to be replicated on all machines for full mercurial functionality:

revision A
changeset A->B
changeset B->C
changeset C->D
changeset D->E
changeset E->F
changeset F->G
changeset G->H
changeset H->I

Let's say I only want to work on the latest revision. I ask the server for keyword tip, and all changesets after it (in case any are committed while I request), and it goes and pseudo-checks-out revision I, then sends that to me and nothing else. Now my repository looks like this:

revision I

Pretty simple. Now I work on it and create a changeset:

revision I
changeset I->J

Now I push back to the server. I send it I->J and now it looks like this:

revision A
changeset A->B
changeset B->C
changeset C->D
changeset D->E
changeset E->F
changeset F->G
changeset G->H
changeset H->I
revision I
changeset I->J

Let's rewind to before I pulled anything or made any changes. If instead of only requesting the current revision, I request all of them to the beginning, the server sends everything to me. That gives me this repository:

revision A
changeset A->B
changeset B->C
changeset C->D
changeset D->E
changeset E->F
changeset F->G
changeset G->H
changeset H->I

I then check out revision I, by applying all those changesets, then work on I. Now my repository looks like this:

revision A
changeset A->B
changeset B->C
changeset C->D
changeset D->E
changeset E->F
changeset F->G
changeset G->H
changeset H->I
revision I
changeset I->J

Now I push to the server, by sending it I->J. Now the server's repository looks like this:

revision A
changeset A->B
changeset B->C
changeset C->D
changeset D->E
changeset E->F
changeset F->G
changeset G->H
changeset H->I
changeset I->J

In both scenarios, my changeset I->J is exactly identical. It's the same changes, to the same revision after all! So there is no need for me to request all changesets back to the "initial" revision, since I can produce the exact same changeset with only an intermediary calculated revision I. There is no reason whatsoever that the server would reject my I->J changeset, when I requested I instead of "everything," since from its perspective, it ends up in the same state no matter what my repository looks like. (With possibly more calculated revisions hanging around, but those can be deleted when optimizing for space as opposed to speed.) If I only want to work on the latest revision, the remote server does not need to send me A, A->B, B->C, C->D, D->E, E->F, F->G, G->H and H->I. That could be a lot of changes, and take up bandwidth and disk space. Instead the remote end could calculate B by applying A->B to A. Then apply B->C to that and so forth, ending up with a revision representing the current most recent version: revision I. And then the remote server only has to send I, and no changesets.