Note:
This page is primarily intended for developers of Mercurial.
Note:
This page is no longer relevant but is kept for historical purposes.
State: Abandoned. Proponent: PeterArrenbrecht
Shallow Clones
This is my plan for adding what I (PeterArrenbrecht) call shallow clones to Mercurial. These are clones that do not pull the entire history from a server, but only a subset starting at a given revision. The idea is that one often does not need the entire history to work on new features, only a recent part of it. So shallow clones can save bandwidth and disk space.
The work is related to TrimmingHistory and OverlayRepository, but the approach is different. In contrast to the punching in TrimmingHistory, I do not want to have to keep the entire index around. Only the part of it that we're interested in.
The plan is a work in progress and does not have any official blessing (other than mpm having considered the basic idea "interesting" on IRC).
1. Definitions
We use node and changeset interchangeably.
A shallow clone is a clone that contains only nodes which are descendants of a single node, called its root or root node. The root's descendants are called wanted, all others are called unwanted or absent (to differentiate them from missing ones, which simply haven't been pulled yet).
A non-shallow clone is called a proper clone.
A shallow clone is a child of another if its root is contained in the other's wanted nodes. The other is called the parent. Both are called related. Clones are called unrelated if their wanted node sets are disjoint. They are called twins if their roots are equal. For related clones, the child's root is called the shallower root.
In this context, a revision is a changeset present in a particular clone. It can thus be identified by its number (or rev). A changeset is a single commit in the full repository history. It can be identified by its ID.
A revision is called partial if some of its data has been altered to accommodate references to unwanted data. It is called shallow if either of its parent references has been nulled.
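These definitions can be sketched in a few lines; the `children` map and the function names here are illustrative, not Mercurial's internal API:

```python
def wanted_nodes(root, children):
    """Return the root plus all of its descendants (the wanted set).

    `children` maps each node to a list of its child nodes.
    """
    wanted = {root}
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in wanted:
                wanted.add(child)
                stack.append(child)
    return wanted

def are_related(root_a, root_b, children):
    """Two clones are related iff one root is wanted by the other;
    they are twins when the roots are equal, unrelated otherwise."""
    return (root_b in wanted_nodes(root_a, children)
            or root_a in wanted_nodes(root_b, children))
```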
2. End User View
You obtain a shallow clone by running
hg clone {--root <rev>} src tgt
where <rev> is the chosen root (if you specify multiple roots, the effective root is their closest common ancestor).
A shallow clone appears to you mostly as if the project's history had always started at the given root. If you query history, it will simply end at the root. If the history contains merges with nodes that are not children of the root, the merge nodes will appear to be normal, linear changes against just the parent that is a child of the root (they thus become shallow revisions).
But changeset IDs remain the same as for a proper clone. So you can still pull from and - crucially - push back to the full repository (and they can pull from and push to you).
However, there are a number of restrictions:
- You cannot push (and they cannot pull) shallow revisions to another clone unless that clone has the same root. Otherwise shallow clones could end up with less history than they actually wanted.
- You cannot merge nodes where the paths to the closest common ancestor traverse a shallow revision. Otherwise the merge could choose a different parent than it would in a proper clone. You can override this restriction using the --shallow-ancestor option for hg merge.
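The merge restriction can be approximated with a conservative ancestor walk. This is a sketch under assumed data structures (`parents` maps a node to its parent list, `shallow` is the set of shallow revisions), not the actual hg merge logic:

```python
def ancestors(node, parents):
    """All ancestors of node, including node itself."""
    seen = set()
    stack = [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, []))
    return seen

def merge_allowed(a, b, parents, shallow):
    """Refuse the merge if any node strictly between a/b and their
    common ancestors is shallow: with a nulled parent link, the walk
    might otherwise pick a different ancestor than a proper clone."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    common = anc_a & anc_b
    between = (anc_a | anc_b) - common
    return not any(n in shallow for n in between)
```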
2.1. Older Servers
The tentative target release for shallow clones is Mercurial 1.4.
Your shallow clone needs a 1.4 server for:
- the initial clone,
- pulling nodes,
- pushing to another shallow clone, and, by consequence of the restrictions listed above,
- pushing shallow revisions.
But it can interact with an older server for pushing proper revisions.
Pulling proper revisions from an older server would be nice. I currently doubt we can do it. But if I'm wrong, I guess it would be hit or miss: The client would simply query the server for an update and might, when unbundling the response, detect that it contains merge nodes that happen to be shallow in the local context. If so, it would then abort.
Similarly, you can only unbundle bundles from 1.4 servers. When bundling proper revisions for an older server, you can specify the older bundle format via hg bundle --version pre1.4.
The upshot is that you can obtain a shallow clone from a modern mirror of an older server's repository and still contribute back directly to the project, without going through the mirror again, both with direct push and via bundles.
2.2. Obtaining More History
If you later decide that your shallow clone is too shallow (you need more history), you have to perform a two-step process:
- reclone from a sufficiently deep (and up-to-date) repository at the newly desired root, and
- pull your own commits from your too-shallow clone into this new, deeper one.
By sufficiently up-to-date I mean the clone source should contain all of the history that your current shallow clone has, except for your own commits. This process is then guaranteed to work, since your own commits can never be shallow. (If necessary, this could be turned into an in-place operation which moves the old .hg/store/ away, then reclones to a new .hg/store/ and pulls from the moved one.)
3. Concepts
3.1. Root
A shallow clone is established by cloning another repository with a non-null root. A special case is cloning another shallow clone, which implies inheriting the latter's root, unless overridden to something even more shallow.
A shallow clone's root ID is stored in the repository and advertised over the wire.
3.2. Handling of Unwanted Changesets
When part of history is unwanted, we are going to encounter references to absent changesets. This happens in parent and link (changeset) references. Left alone, such references will lead to accesses to the absent information in much of Mercurial's code. Rather than teaching all these locations to handle absent information, I chose to transparently change the references so code looking at the theoretically partial data will actually think it's all there is.
A reference to an unwanted changeset is handled as follows:
- Second parent: set to null.
- First parent: set to the second parent; set the second to null.
- Link: redirect to the root of the clone (can only happen in the initial revision of a file).
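The three rules above can be sketched as follows (NULL_ID matches Mercurial's null node; the function names and the `wanted` sets are illustrative):

```python
NULL_ID = b"\0" * 20  # Mercurial's null node ID

def localize_parents(p1, p2, wanted):
    """Rewrite parent references that point at unwanted changesets.

    Rules: an unwanted second parent is nulled; an unwanted first
    parent is replaced by the second parent, which is then nulled.
    Returns the localized pair plus a "partial" flag.
    """
    partial = False
    if p2 != NULL_ID and p2 not in wanted:
        p2 = NULL_ID
        partial = True
    if p1 != NULL_ID and p1 not in wanted:
        p1, p2 = p2, NULL_ID
        partial = True
    return p1, p2, partial

def localize_link(link, root, wanted):
    """Redirect a link to an unwanted changeset to the clone's root."""
    return link if link in wanted else root
```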
When changing revision data, we do not change the revision's ID. This, of course, means that the computed ID (the hash over the revision's data) no longer matches the stored ID. Code that tests for this has to be adapted.
We do, however, keep the original values alongside each partial revision. The format for this is yet to be determined. It could be an extension to the revlog's index, or a separate file, or whatever. The original values allow us to still support hg verify. Crucially, they also allow us to always send complete data over the wire.
This means that most of Mercurial's code will continue to work unchanged. One source of confusion is the list of changed files for the root node. It only lists the files that were changed in the original changeset. However, all other files already present in but unchanged by the root node now also point to the root as their linkrev. So they look like they popped into existence out of nowhere. (This might confuse hg bundle --all, for instance.)
4. Pulling and Pushing
Pushing is simply the analogous inverse of pulling. So we consider the cases where we pull into target repository T from source repository S. So T is local, S might not be. We say a revision is source shallow with respect to S's root, or target shallow with respect to T's root.
Guidelines:
- We need to extend commands on the wire and a new bundle format.
- We don't allow pulling from an older server.
- We don't allow pulling source shallow revisions into a deeper target.
- We don't allow pulling from a source that is not related to the target.
4.1. Determine shared root
First we determine the shared root of S and T, which is the shallower of the two roots. S advertises its root, root(S), in its capabilities (we need to add a client API to extract data-carrying capabilities for this). T knows its root, root(T). If S is an older server, we know its root is null.
In T:
- If root(S) is null, then root(T) is the shared root.
- If root(S) is present in T, then root(S) is the shared root.
- If root(S) is a parent of root(T), then root(T) is the shared root.
- We ask S for the shared root, passing it root(T).
In S, with known root(T):
- If root(T) is present in S, then root(T) is the shared root.
- If root(T) is a parent of root(S), then root(S) is the shared root.
- S and T are (currently) unrelated.
If S and T are currently unrelated, we cannot pull. I say currently unrelated because we could have the situation where one is conceptually a child of the other, but we cannot know. This happens when there is a repo like -stable and we have a shallow clone of -crew rooted at a node only present in -crew. Then our shallow -crew clone is a child of -stable, but just between -stable and our clone, we cannot tell. It doesn't matter either, because -stable cannot contain nodes that we want (they would have to be descendants of our root, which it does not have).
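The negotiation above might be sketched as one function per side; `t_wanted`/`s_wanted` are the locally present node sets, `parents_of` stands in for parent lookup (including the recorded real parents of a shallow root), and `ask_s` models the wire round-trip:

```python
NULL_ID = b"\0" * 20

def shared_root_in_t(root_s, root_t, t_wanted, parents_of, ask_s):
    """Client (T) side of the shared-root negotiation."""
    if root_s == NULL_ID:            # older or proper server
        return root_t
    if root_s in t_wanted:           # S's root is the shallower one
        return root_s
    if root_s in parents_of(root_t):
        return root_t
    return ask_s(root_t)             # fall back to asking S

def shared_root_in_s(root_t, root_s, s_wanted, parents_of):
    """Server (S) side, given T's root.

    Returns the shared root, or None if S and T are (currently)
    unrelated, in which case pulling is refused.
    """
    if root_t in s_wanted:
        return root_t
    if root_t in parents_of(root_s):
        return root_s
    return None
```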
4.2. Determine incoming nodes
We can now assume that S and T are related and we know the shared root. So we run the normal machinery using T.findincoming() etc, but we tell it about the target and the shared root. Maybe it suffices to pass the shared root as a base, maybe we need to teach T.findincoming() and its counterparts in S to filter unwanted nodes.
4.2.1. Ideas for pulling from older servers
On initial clone, we pass our root in
- the base argument of T.findcommonincoming() and
- the bases argument of S.changegroupsubset().
On subsequent pulls, we cannot do this as it would violate semantics of base for T.findcommonincoming() and lead to too much data in S.changegroupsubset(). So on pulls, the old server will likely send unwanted data if there is a parallel, unwanted branch to our root. It might work OK if we truncated history at a central merge point (for example, Mercurial at 1.0).
4.3. Send Outgoing Nodes
We tell S.changegroupsubset() about the target and the shared root.
4.3.1. Send the used root in the header
We send the target root in the header of the new bundle format. This allows sanity checks later on.
For explicit hg bundle, we send our own root, unless overridden by the --root option.
4.3.2. Always send real information
Even when sending source shallow revisions, we always send the full, real information. So we never send the nulled parent IDs, but the real ones. The partial view should remain a purely local affair.
If we encounter a source shallow revision and the shared root is not equal to the target root, we abort.
We send a new per-revision flag for each target shallow revision, again for sanity checks in the target. The source assumes a revision is target shallow if it is shallow with respect to the shared root.
4.3.3. Send only deltas against revs known to the target
From the shared root, S builds the set of all wanted nodes it knows. Then, when sending deltas, it only sends deltas against wanted nodes.
If the shared root is S's root, then the wanted set is simply all of its nodes, so there's no need to build it. But it still needs to be careful to only send deltas against parents it knows. A node could have a source-absent real parent1, so even if the real parent1's ID is sent, the delta would have to be against the source-known parent2 (which, in this case, must be wanted).
We need an extension of the bundle format so we can specify what the delta is against.
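The base-selection rule can be sketched as a small helper; this is an illustration of the rule above, not Mercurial's actual delta machinery:

```python
NULL_ID = b"\0" * 20

def choose_delta_base(p1, p2, source_wanted):
    """Pick a delta base for an outgoing revision.

    Only parents that are wanted with respect to the shared root may
    serve as a base; if neither qualifies, return NULL_ID, meaning
    the revision is sent as a full text.
    """
    for p in (p1, p2):
        if p != NULL_ID and p in source_wanted:
            return p
    return NULL_ID
```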
4.3.4. Send filerevs which are (possibly) absent in the target
When a filerev was originally introduced by an unwanted changeset, but is now mentioned in the manifest of a wanted changeset, it could possibly be absent in the target. Cases in point:
- The initial clone of the root node. Its manifest will typically list tons of files that weren't changed by the root node's changeset.
- A merge with an unwanted branch that brings in files unchanged from one of the unwanted branch's nodes.
- However, a change that reverts a file to a version we already had in an earlier, possibly unwanted node always results in a new filerev, so no problem there.
More formally, we call a filerev possibly target absent if its linkrev points to a target unwanted changeset.
For the initial clone of the root node, S knows when it is sending the target's root. So it simply sends all filerevs mentioned in the root's manifest, not just those in the changelist.
For non-root nodes getting pulled, only merge nodes can refer to possibly absent filerevs. And of these only merges where one of the parents is unwanted.
Mercurial does not record files taken unchanged from either parent in a merge node. So we have to look at the manifest. We do not want to send all possibly absent filerevs in the merge manifest, as we might end up resending tons of filerevs the target already got in the initial clone.
A simple way around this is to only send possibly absent filerevs that are not mentioned in the target root's manifest. Let's call these still possibly absent filerevs. (This breaks if the target root is absent in the source because the source is a child of the target. But we disallow pulling source shallow revisions from children, and a possibly absent filerev can only be introduced by a source shallow revision.)
We can further restrict the set of still possibly absent filerevs to only those mentioned in the unwanted merge parent's manifest (if we know it) which are not mentioned in the other parent's manifest. This could reduce the set considerably if we first brought in a merge with an unwanted branch with many new files, and then another merge with a later version of the same branch.
Note that we might still end up sending possibly absent filerevs that, in fact, are already present in the target. Mercurial already ignores such duplicate revs when adding revision data. I expect this to be sufficiently rare to not be a serious bandwidth problem. It happens if a different, unrelated merge already brought over some of the files.
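The narrowing described above could look like this; manifests are modeled as path-to-filerev-ID dicts, and the names are hypothetical:

```python
def still_possibly_absent(merge_manifest, root_manifest,
                          unwanted_parent_manifest=None,
                          wanted_parent_manifest=None):
    """Filerevs from a merge manifest that may be absent in the target.

    Start from all filerevs not already in the target root's manifest;
    if the unwanted merge parent's manifest is known, further narrow to
    filerevs it contributed that the wanted parent did not.
    """
    candidates = {
        path: frev for path, frev in merge_manifest.items()
        if root_manifest.get(path) != frev
    }
    if unwanted_parent_manifest is not None:
        candidates = {
            path: frev for path, frev in candidates.items()
            if unwanted_parent_manifest.get(path) == frev
            and (wanted_parent_manifest or {}).get(path) != frev
        }
    return candidates
```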
When sending possibly absent filerevs, we send the real and the target shallow linkrev in the new bundle format, together with a flag.
Aside: In a previous version of shallow clones I had the receiver explicitly request such absent filerevs when it detected them. This has problems. If another pulled node modifies one of the absent filerevs, then the request comes too late. It adds additional network roundtrips. And it breaks the idea of hg incoming --bundle because the bundle might be incomplete. The current approach avoids this at the cost of sometimes sending superfluous data.
4.4. Add Incoming Nodes
The key changes here are how unwanted references are detected and corrected.
4.4.1. Null out unknown parent IDs
When a filerev mentions a parent ID we don't already know, we abort if the following check fails:
- The bundle's root is the same as ours, and
- the flag indicating a target shallow revision is set in the revision being unbundled.
Otherwise, the node is flagged as being partial, the unwanted parent is set to null, and the original, real parent ID is recorded.
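A sketch of this check combined with the parent localization from section 3.2; the exception name, return layout, and flag plumbing are assumptions:

```python
class ShallowBundleError(Exception):
    """Raised when an unknown parent cannot be explained by shallowness."""

NULL_ID = b"\0" * 20

def add_incoming_parents(p1, p2, known, bundle_root, our_root, shallow_flag):
    """Validate and localize parents while unbundling.

    `known` is the set of node IDs already present locally. An unknown
    parent is only acceptable when the bundle was built for our root
    and the sender flagged this revision as target shallow.
    """
    real = (p1, p2)
    partial = False
    for p in (p1, p2):
        if p != NULL_ID and p not in known:
            if bundle_root != our_root or not shallow_flag:
                raise ShallowBundleError("unknown parent %r" % p)
            partial = True
    if partial:
        # null out the unknown parents, keeping the real IDs on record
        if p2 != NULL_ID and p2 not in known:
            p2 = NULL_ID
        if p1 != NULL_ID and p1 not in known:
            p1, p2 = p2, NULL_ID
    return p1, p2, partial, real
```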
4.4.2. Handle unknown linkrevs
When we encounter a changed linkrev for a possibly absent filerev, we also test that the bundle's root is the same as ours. We then flag the node as partial, use the changed linkrev and record the original one.
5. Verify
Verify checks partial nodes against the recorded original information. In verbose mode, it notes partial nodes it encounters.
What new attack vectors against verify do partial revisions open up? The goal would have to be to sneak in bad code under the umbrella of a trusted changeset ID. So:
- I give you bad code in a partial revision with a trusted ID.
- I have to make sure the bad code checks against the two parent IDs I supply and say are the real ones, and the trusted ID.
- One of them has to be known to you. But the other one I can choose at will. You will never notice unless you try to push this somewhere else where the fake parent ID would be unknown.
- So I must find a fake second parent ID that makes my bad code hash properly to the trusted ID.
The parents are hashed first: hash(p1 + p2 + text) = ID.
So, to succeed, I would have to be able to solve hash(p1 + x + badcode) = trustedID, or else hash(x + p2 + badcode) = trustedID, for x.
The former certainly suggests that then I would also be able to solve hash(p1 + p2 + '# ' + x + '\n' + badcode) = trustedID for x. Meaning I could attack regular changesets as well, which we have to assume I cannot.
The latter is a bit more dubious, but if controlling the start of the hashed text is an advantage, then controlling its end likely is, too, so I'd likely be able to solve hash(p1 + p2 + badcode + '\n# ' + x) = trustedID too, leading to the same argument.
I am not a cryptographer, but this leads me to believe shallow revisions are no less secure than regular ones.
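For reference, the node hash the argument relies on can be sketched as follows. Note that Mercurial actually hashes the two parent IDs in sorted order (see revlog.hash), a detail elided above that does not change the argument:

```python
import hashlib

def node_hash(text, p1, p2):
    """Compute a revision's node ID as sha1(parents + text).

    The smaller parent ID is hashed first, so swapping p1 and p2
    yields the same node ID.
    """
    a, b = (p1, p2) if p1 < p2 else (p2, p1)
    s = hashlib.sha1(a)
    s.update(b)
    s.update(text)
    return s.digest()
```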
6. TODOs
- Merge rules
- Proposed wire format
- Proposed revlog format