Problem

Clones of some projects are already in the GB range. For example, the main Mozilla repo is ~2GB, and storing only the latest revision still requires ~670MB. Users often don't need the full repo; a small set of directories is often enough. The full history may not be useful either, especially for directories the user does not use.

Solution

Narrow Clone

The user will be able to say something like ‘hg clone --narrow --include browser/’ and get filelogs only for files within the requested directory/directories. With certain filelogs missing, any operation involving them has to be avoided.

Files outside the narrow spec cannot be changed in the dirstate. Copies/renames cannot be recorded with a file outside the narrow spec as either source or target.
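The restriction above can be sketched as a prefix test. This is only an illustration, assuming the narrow spec is a plain list of directory prefixes; the names in_narrow_spec and check_copy are hypothetical and are not part of Mercurial.

```python
def in_narrow_spec(path, includes, excludes=()):
    """Return True if path is under an include prefix and under no exclude prefix."""
    def under(prefix):
        prefix = prefix.rstrip("/")
        return path == prefix or path.startswith(prefix + "/")
    return any(under(p) for p in includes) and not any(under(p) for p in excludes)


def check_copy(source, dest, includes, excludes=()):
    """Raise ValueError if either end of a copy/rename is outside the narrow spec."""
    for path in (source, dest):
        if not in_narrow_spec(path, includes, excludes):
            raise ValueError("%s is outside the narrow spec" % path)
```

Matching on whole path components (prefix plus "/") keeps a directory such as browserx/ from being treated as part of browser/.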

Copies/renames will not be traced to outside the narrow spec. Instead, files moved into the narrow spec will appear to have been created from scratch at the revision where they were copied/renamed. Similarly, files moved out of the narrow spec will appear to have been deleted. This will also impact merge. For example, for a file that was moved out of the narrow spec remotely and modified locally, we would normally move the modified version to the new location; in a narrow clone, we would instead report a changed/deleted conflict. If the file was instead copied to outside the narrow spec, we would normally update the copied version; in a narrow clone, we would silently ignore the copy (because we wouldn’t know about it).
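As a rough illustration of how a copy or rename would look from inside a narrow clone, the sketch below classifies the operation by which of its ends fall inside the narrow spec. The function names and the returned tuples are hypothetical, and the narrow spec is assumed to be a list of directory prefixes.

```python
def inside(path, includes):
    """True if path lies under one of the include prefixes."""
    prefixes = [p.rstrip("/") for p in includes]
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def visible_effect(source, dest, includes, rename=True):
    """Describe what a narrow clone would see for a copy or rename."""
    src_in, dst_in = inside(source, includes), inside(dest, includes)
    if src_in and dst_in:
        return ("rename" if rename else "copy", source, dest)
    if dst_in:
        return ("add", dest)       # looks like a file created from scratch
    if src_in and rename:
        return ("remove", source)  # looks like a plain deletion
    return None                    # nothing visible, e.g. a copy to outside
```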

‘hg diff’ will not include paths outside of narrow spec. There might instead be just a diff header containing the nodeid before and after. That would also let us make the diff lossless, in the sense that ‘hg import <(hg export $rev)’ would work and successfully apply changes outside the narrow spec, assuming there were no conflicts in these paths. In order to get ‘hg import’ to work, we would need to update dirstate to allow it to store the nodeid and flags to use at commit time.

Much like ‘diff’, with a bundle produced with ‘hg bundle’ or ‘hg strip’, one should be able to losslessly restore the bundled or stripped revisions. Since the necessary information is in the manifest, this may just mean not including the filelogs in the bundle.

Plain 'hg log' will still show changesets that touch only files outside the narrow spec. So will 'hg log -p', but it will contain no diffs for paths outside the narrow spec.

Merge should ideally work as long as there are no conflicts outside of the narrow spec. Initially, it will refuse merges that change files outside of the narrow spec (compared to '.'). Since 'hg rebase' is usually used for moving local changes onto upstream changes, it usually won't change files outside the narrow spec, so it will just work, even initially.
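The initial, conservative merge rule could look roughly like the following. It assumes the caller already has the list of paths the merge would change relative to '.', and that the narrow spec is a list of directory prefixes; outside_paths and check_merge are hypothetical names.

```python
def outside_paths(changed, includes):
    """Return the changed paths that fall outside the narrow spec, sorted."""
    prefixes = [p.rstrip("/") for p in includes]

    def inside(path):
        return any(path == p or path.startswith(p + "/") for p in prefixes)

    return sorted(f for f in changed if not inside(f))


def check_merge(changed, includes):
    """Refuse (raise ValueError) if the merge would change files outside the spec."""
    bad = outside_paths(changed, includes)
    if bad:
        raise ValueError(
            "merge changes files outside the narrow spec: " + ", ".join(bad))
```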

Just like other commands, ‘hg verify’ will have to avoid looking for filelogs for files outside the narrow spec.

The clone can be widened or narrowed by adding more paths to include or exclude. Narrowing simply deletes local filelogs, while widening means getting additional (full) filelogs from the server.

Regarding exchange, the first step would be to make sure that a narrow clone can pull from and push to a full clone. As long as any pushed changesets only touch files within the narrow spec, this should not be a problem in theory. Exchange between two narrow clones gets a little trickier. In general, the sender needs to have the filelogs for any paths involved in the changesets that are also in the receiver’s narrow spec.
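The general condition for exchange between two narrow clones can be sketched as a set computation: of the paths touched by the exchanged changesets, those inside the receiver's narrow spec must also be inside the sender's. The helper names are hypothetical and the specs are assumed to be lists of directory prefixes.

```python
def inside(path, includes):
    """True if path lies under one of the include prefixes."""
    prefixes = [p.rstrip("/") for p in includes]
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def missing_filelogs(touched, sender_includes, receiver_includes):
    """Paths the receiver needs filelogs for but the sender cannot provide."""
    needed = {f for f in touched if inside(f, receiver_includes)}
    return sorted(f for f in needed if not inside(f, sender_includes))
```

An empty result means the sender holds every filelog the receiver would need. A full clone as sender always satisfies this, which is why exchange with a full clone is the simpler first step.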

With holey history

A plain narrow clone as described above will have fewer filelogs than a full clone, but it will still have the full changelog and manifest revlog. These can be large on their own (~500MB in the case of the Mozilla repo). The holey history extension extends the narrow clone extension and trims the changelog and manifest revlog according to the narrow spec.

The client will pass its narrow spec to the server on clone, push and pull (as it already does for narrow clone). The server will squash sections of history that do not touch the requested paths into "ellipsis" commits. These commits will have the nodeid of their tip-most commit and their parent will be the parent of their bottom-most commit. They are restricted to having a single parent. Their diff will be the combined diff from their parent to the tip-most commit. To the client, these commits will look just like any other commit, except that they will have some indication that they are "ellipsis" commits, and their contents won't verify.
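A minimal sketch of the squashing step, under strong simplifying assumptions: history is strictly linear (no merges), the narrow spec is a list of directory prefixes, and heads get no special treatment. Commit, Narrowed and squash_linear are hypothetical names for illustration, not the server's actual implementation.

```python
from collections import namedtuple

Commit = namedtuple("Commit", "node parent files")
Narrowed = namedtuple("Narrowed", "node parent ellipsis")


def touches(files, includes):
    """True if any file lies under one of the include prefixes."""
    prefixes = [p.rstrip("/") for p in includes]
    return any(f == p or f.startswith(p + "/") for f in files for p in prefixes)


def squash_linear(history, includes):
    """Collapse runs of commits that do not touch the narrow spec.

    history is a list of Commit, oldest first, where each commit's parent
    is the previous commit in the list.
    """
    result = []
    run = []  # current run of commits not touching the narrow spec

    def flush():
        if run:
            # nodeid of the tip-most commit, parent of the bottom-most commit
            result.append(Narrowed(run[-1].node, run[0].parent, True))
            del run[:]

    for commit in history:
        if touches(commit.files, includes):
            flush()
            result.append(Narrowed(commit.node, commit.parent, False))
        else:
            run.append(commit)
    flush()
    return result
```

Because an ellipsis commit keeps the nodeid of its tip-most commit, the regular commit that follows it can keep its original parent pointer unchanged.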

When a plain narrow clone gets widened (or narrowed), it only involves downloading more filelogs (or discarding filelogs). With holey history, the changelog and manifest revlog will also need to be updated in most cases, because ellipsis nodes may become regular nodes or vice versa. This is probably fine to do by simply replacing the changelog and manifest revlog. Existing filelogs’ linkrevs will also need to be updated. For these, it seems too costly to download the entire filelogs just to get new linkrevs. Instead, we should be able to replace the linkrev fields in the index file (violating the usual “append-only” assumption).
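The linkrev update amounts to translating revision numbers from the old changelog to the new one via the changeset nodeid. The sketch below shows only that mapping on in-memory lists. It does not touch the real revlog index format, and remap_linkrevs is a hypothetical name. It also assumes that every changeset a kept filelog links to still exists in the new changelog.

```python
def remap_linkrevs(linkrevs, old_changelog, new_changelog):
    """Translate filelog linkrevs from old to new changelog revision numbers.

    old_changelog and new_changelog are lists of changeset nodeids indexed
    by revision number (stand-ins for the real changelog).
    """
    new_rev = {node: rev for rev, node in enumerate(new_changelog)}
    # A KeyError here means a linked changeset vanished from the new changelog.
    return [new_rev[old_changelog[rev]] for rev in linkrevs]
```

For example, if widening turns one ellipsis commit back into two regular commits, every later revision number shifts by one, and linkrevs pointing at those later changesets must shift with them.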

With narrow manifests

For projects where even a single manifest is too large to work with, the narrow extension can be combined with ManifestShardingPlan. This introduces several new complications, as explained below.

TODO

NarrowClonePlan (last edited 2015-02-18 23:57:55 by MartinVonZweigbergk)