Lazy fetching is the idea that you can pull only the revision history that is relevant to your own local changes. This cuts down on bandwidth and disk usage, and makes things easier for new developers. They can simply get a copy of the code, add a patch, and push that patch to others, or have them pull it. Normally you have to request all revisions (or changesets) all the way back to the very first one ("very first" being a relatively arbitrary distinction) before you can commit new ones. It's pretty important that you not be required to do so, especially for very large repositories with long histories.
My concept of a repository is a linked list of revisions, each one capturing the contents of the working directory at a particular moment in time. Changesets come in because, for small changes, storing only the information needed to transition between two revisions takes up a lot less space than storing both revisions. So for this article at least, a revision will be a full copy of all checked-in files, and a changeset will be a diff between two revisions.
Example repository:
- revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
- revision B - file foo.txt containing the lines "foo" and "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
- revision C - file foo.txt containing the line "bar"
changeset B->D - starting with revision B, delete the line "foo"
- revision D - file foo.txt containing the line "bar"
changeset B->E - starting with revision B, delete the line "bar"
- revision E - file foo.txt containing the line "foo"
If you have revision A, and all the changesets, then by applying them one at a time you can get to any of the revisions.
If you have revision A, revision B, and B->C, then by applying B->C to B you can get revision C.
If you have revision A, and A->C, but not anything about revision B, you can still get revision C by applying A->C to revision A.
If you have revisions A and C, you already have revision C, so you could check out either one without applying any changesets. B can't be checked out until you have either A->B or C->B.
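The cases above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: a revision is the tuple of lines in foo.txt, a changeset is a list of (operation, text) pairs, and `apply_changeset` is a hypothetical helper rather than part of any real version control system. For simplicity, "add" appends to the end of the file.

```python
# Toy model (assumption): a revision is the tuple of lines in foo.txt,
# and a changeset is a list of (operation, text) pairs.

def apply_changeset(revision, changeset):
    """Return the revision produced by applying each operation in order."""
    lines = list(revision)
    for op, text in changeset:
        if op == "add":
            lines.append(text)      # simplification: always add at the end
        elif op == "delete":
            lines.remove(text)      # raises ValueError if the parent is wrong
        else:
            raise ValueError(f"unknown operation: {op}")
    return tuple(lines)


rev_a = ("foo",)
a_to_b = [("add", "bar")]
a_to_c = [("delete", "foo"), ("add", "bar")]
b_to_c = [("delete", "foo")]

# Reach C by walking A -> B -> C, or directly with A -> C.
rev_b = apply_changeset(rev_a, a_to_b)
rev_c_via_b = apply_changeset(rev_b, b_to_c)
rev_c_direct = apply_changeset(rev_a, a_to_c)
```

Both routes end at the same contents, which is why it does not matter which path through the changesets you happen to hold.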
It is worth noting that revision D is exactly the same as revision C, so the two would end up with the same hash. The same goes for revisions A and E. In effect, when you calculate the changeset B->current, the system would recognize that the working directory hashes to C and give you B->C; only if that hash did not already exist would it give you B->D and a new revision D. It is also worth noting that although revision D is identical to revision C, the changesets that reach it, A->C and B->C, are not equivalent. Each would be a different diff, with a different hash. Changesets B->E and B->A would be the same, though in that case E would never exist, since you'd just use the existing revision A.
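A rough illustration of the hashing argument, assuming SHA-256 over a JSON encoding of the file's lines. The encoding and the helper names `revision_id` and `changeset_id` are assumptions made for this sketch; the article does not specify a hash scheme.

```python
import hashlib
import json


def revision_id(lines):
    """Content hash of a revision (assumed encoding: JSON list of lines)."""
    return hashlib.sha256(json.dumps(list(lines)).encode("utf-8")).hexdigest()


def changeset_id(parent_id, child_id, ops):
    """Hash of a changeset: its parent, its child, and the diff itself."""
    payload = json.dumps([parent_id, child_id, ops])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rev_a, rev_b, rev_c = ["foo"], ["foo", "bar"], ["bar"]
rev_d, rev_e = ["bar"], ["foo"]  # same contents as C and A respectively

a_to_c = changeset_id(revision_id(rev_a), revision_id(rev_c),
                      [["delete", "foo"], ["add", "bar"]])
b_to_c = changeset_id(revision_id(rev_b), revision_id(rev_c),
                      [["delete", "foo"]])
b_to_a = changeset_id(revision_id(rev_b), revision_id(rev_a),
                      [["delete", "bar"]])
b_to_e = changeset_id(revision_id(rev_b), revision_id(rev_e),
                      [["delete", "bar"]])

same_revision = revision_id(rev_c) == revision_id(rev_d)
different_changesets = a_to_c != b_to_c
```

Under these assumptions D collapses into C and E into A, while A->C and B->C stay distinct because their parents and diffs differ.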
So this is a more accurate summation of the example repository:
- revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
- revision B - file foo.txt containing the lines "foo" and "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
- revision C - file foo.txt containing the line "bar"
changeset B->C - starting with revision B, delete the line "foo"
changeset B->A - starting with revision B, delete the line "bar"
Now, where's the initial revision of this repository? It's completely circular! You may as well say A is the root, or B, or C. That's the key to lazy fetching: there is no initial revision of a repository. If you took the current revision from one repository and the initial revision from another, you could make the former the parent of the latter simply by computing the changeset from the former to the latter. Suddenly your "initial revision" is not an initial revision. But despite rebasing in this fashion, the contents of the initial revision itself remain unchanged.
If I guess right, Mercurial does something like the following:
- revision A - file foo.txt containing the line "foo"
changeset A->B - starting with revision A, add a line after "foo" containing "bar"
changeset A->C - starting with revision A, delete the line "foo" and add the line "bar"
changeset B->C - starting with revision B, delete the line "foo"
changeset B->A - starting with revision B, delete the line "bar"
and calls it a history of 4 changesets.
If you just chop off the former part and have
changeset B->C - starting with revision B, delete the line "foo"
changeset B->A - starting with revision B, delete the line "bar"
then you can't compute any sort of working directory. You don't have B, which is needed to apply any B-> changesets. And you can't get B, since you have neither revision A nor A->B.
However, whatever source you are fetching from that sends you B->C and B->A will either have revision B or a way to compute it. Nobody needs to store any changesets for which the parent revision is missing. So when you pull, instead of sending you A, A->B, B->C and B->A, the remote end can calculate B itself by applying A->B to A. Then it can send you B, B->C and B->A, and you will be able to apply either of those changesets because you have a pristine copy of B. The remote end could keep B around or not, but the important part is that it can send B to you, instead of being able to send only revision A and all the changesets emerging from it.
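Here is a minimal sketch of that exchange, assuming revisions are named by letter and stored as tuples of lines. `materialize` is a hypothetical helper that walks changesets backwards until it finds a stored revision; it is not how any particular system is known to work.

```python
def apply_changeset(revision, changeset):
    """Apply (operation, text) pairs to a tuple of lines."""
    lines = list(revision)
    for op, text in changeset:
        if op == "add":
            lines.append(text)
        elif op == "delete":
            lines.remove(text)
    return tuple(lines)


def materialize(target, revisions, changesets, seen=None):
    """Return the contents of `target`, or None if it cannot be computed."""
    if target in revisions:
        return revisions[target]
    seen = set() if seen is None else seen
    seen.add(target)  # guards against the circular changeset graph
    for (parent, child), ops in changesets.items():
        if child == target and parent not in seen:
            base = materialize(parent, revisions, changesets, seen)
            if base is not None:
                return apply_changeset(base, ops)
    return None


# The remote end stores only revision A plus all four changesets.
server_revisions = {"A": ("foo",)}
server_changesets = {
    ("A", "B"): [("add", "bar")],
    ("A", "C"): [("delete", "foo"), ("add", "bar")],
    ("B", "C"): [("delete", "foo")],
    ("B", "A"): [("delete", "bar")],
}

# With only the B-> changesets and no revision, nothing can be computed.
wanted = {key: ops for key, ops in server_changesets.items() if key[0] == "B"}
stranded = materialize("C", {}, wanted)

# The remote end computes B and ships it alongside the B-> changesets.
shipped_b = materialize("B", server_revisions, server_changesets)
client_revisions = {"B": shipped_b}
client_c = materialize("C", client_revisions, wanted)
client_a = materialize("A", client_revisions, wanted)
```

The client never sees revision A as stored data or the A->B changeset, yet it can still check out A, B, and C.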
If revisions are nodes, changesets form a graph connecting the nodes.
Given the following repository:
- revision A
changeset A->B
changeset B->C
changeset C->D
changeset D->E
changeset E->F
changeset F->G
changeset G->H
changeset H->I
if I only want to work on the latest revision, the remote server does not need to send me A, A->B, B->C, C->D, D->E, E->F, F->G, G->H and H->I. That could be a lot of changes, and take up bandwidth and disk space. Instead the remote end could calculate B by applying A->B to A. Then apply B->C to that and so forth, ending up with a revision representing the current most recent version. And then the remote server only has to send I, and no changesets.
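As a sketch of what the remote end would do, the chain can be folded into a single tip revision. The file contents here are invented for illustration (each changeset just appends one line), and `apply_changeset` is the same hypothetical helper used throughout.

```python
from functools import reduce


def apply_changeset(revision, changeset):
    """Apply (operation, text) pairs to a tuple of lines."""
    lines = list(revision)
    for op, text in changeset:
        if op == "add":
            lines.append(text)
        elif op == "delete":
            lines.remove(text)
    return tuple(lines)


# Assumed contents: revision A has one line, and each of the eight
# changesets A->B ... H->I appends one more line.
rev_a = ("line A",)
chain = [[("add", f"line {name}")] for name in "BCDEFGHI"]

# The remote end applies A->B, then B->C, and so on, to reach I.
tip = reduce(apply_changeset, chain, rev_a)

# What actually travels over the wire: one revision and no changesets.
shallow_clone = {"revisions": {"I": tip}, "changesets": {}}
```

The cost of replaying the history is paid once on the remote end instead of in every client's bandwidth and disk space.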
Suppose I make a change I->J and commit it. Now I've got
- revision I
changeset I->J
If I wanted to publish those changes back to the original repository, doing so would be trivial. After verifying that they have (or can get) revision I, I send them changeset I->J. They store it. They update tip to J. Simple. And now if anyone fetches they'll get revision J, even though I only sent changeset I->J.
Suppose the above server goes down, and two programmers are collaborating directly on the latest version: I have my I->J change, and they have their I->K change. Neither of us knows the progression it took to reach I. But both of us have revision I, since that's the tip we requested back when the server was running. And that means that by exchanging only our changesets, we can both calculate revisions I, J, and K. That's committing, pushing and pulling, without having access to anything besides shallow copies. If the server comes up again, we can still push our changesets to that server. Nothing would get messed up by doing so. I just send I->J, and they send I->K, and the server now has two new changesets.
Suppose a change I make produces a revision identical to an earlier one. I have only revision I, and my change I->J produces revision J, which is coincidentally identical to revision B. The letter for each revision actually stands for a content hash, though. When I create a revision, I store it by its content hash, and when I create a changeset, it lists the content hashes of its parent and child. So I cannot create revision J, if revision J is identical to revision B, without calling them both by the same content hash. When I say I->J, "J" in this case is always and must be identical to "B". I call it I->J but it's <something like I>-><something like B>, and when I push I->J to the central server it will (rightly) consider that patch to be I->B.
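The serverless exchange might look like the following sketch. `ShallowRepo`, its `add` method, and the file contents are all assumptions made for illustration; revisions are keyed by a SHA-256 content hash, which is one plausible choice rather than anything the article prescribes.

```python
import hashlib
import json


def rev_id(lines):
    """Content hash of a revision (assumed encoding: JSON list of lines)."""
    return hashlib.sha256(json.dumps(list(lines)).encode("utf-8")).hexdigest()


def apply_changeset(revision, ops):
    lines = list(revision)
    for op, text in ops:
        if op == "add":
            lines.append(text)
        elif op == "delete":
            lines.remove(text)
    return tuple(lines)


class ShallowRepo:
    """A repository that starts from a single tip revision and no history."""

    def __init__(self, tip):
        self.revisions = {rev_id(tip): tuple(tip)}
        self.changesets = {}

    def add(self, parent_id, ops, expected_child=None):
        """Commit locally, or accept a peer's changeset when expected_child is set."""
        child = apply_changeset(self.revisions[parent_id], ops)  # KeyError if parent missing
        child_id = rev_id(child)
        if expected_child is not None and child_id != expected_child:
            raise ValueError("changeset does not produce the advertised child")
        self.revisions[child_id] = child
        self.changesets[(parent_id, child_id)] = ops
        return (parent_id, ops, child_id)


rev_i = ("foo", "bar")              # hypothetical contents of revision I
mine, theirs = ShallowRepo(rev_i), ShallowRepo(rev_i)
i = rev_id(rev_i)

i_to_j = mine.add(i, [("add", "my line")])        # my commit
i_to_k = theirs.add(i, [("add", "their line")])   # their commit

mine.add(*i_to_k)     # I pull their changeset
theirs.add(*i_to_j)   # they pull mine
```

After the exchange both sides hold the same three revisions and two changesets, and either could later replay those changesets to the server.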
Thus I don't have to have a copy of revision B, or any changesets leading up to it, to prevent my revision J from being a second revision of the same thing as B with no connections.
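A small sketch of why the names cannot diverge, again assuming SHA-256 content hashes and invented file contents. The client below never holds revision B, yet the changeset it pushes names B as its child.

```python
import hashlib
import json


def rev_id(lines):
    """Content hash of a revision (assumed encoding: JSON list of lines)."""
    return hashlib.sha256(json.dumps(list(lines)).encode("utf-8")).hexdigest()


# Hypothetical contents: the server knows B and I, among others.
rev_b = ("foo", "bar")
rev_i = ("foo", "bar", "baz")
server_revisions = {rev_id(rev_b): rev_b, rev_id(rev_i): rev_i}
server_changesets = {}

# Client side: only revision I is available. Deleting "baz" happens to
# reproduce B's contents, so "J" gets B's hash automatically.
rev_j = tuple(line for line in rev_i if line != "baz")
pushed = (rev_id(rev_i), rev_id(rev_j), [("delete", "baz")])

# Server side: the child hash already exists, so only the changeset is new.
parent_id, child_id, ops = pushed
server_changesets[(parent_id, child_id)] = ops
server_revisions.setdefault(child_id, rev_j)
```

The server ends up with the same two revisions as before plus one new edge, I->B, so the "new" revision is connected to the existing history for free.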