Size: 7279
Comment:
|
Size: 8179
Comment: Refactored "data structures" into their own pages for easier linking
|
Deletions are marked like this. | Additions are marked like this. |
Line 1: | Line 1: |
= Mercurial's design = | (for details from a source code point of view, see WhatGoesWhere) |
Line 3: | Line 3: |
The guts of Mercurial. | [[TableOfContents]] |
Line 7: | Line 7: |
Mercurial uses a few fundamental objects. | '''Nodeids''' are unique ids that represent the contents of a file ''and'' its position in the project history. See ["Nodeid"]. |
Line 9: | Line 10: |
=== Nodeids === | A '''revlog''', for example {{{.hg/data/somefile.d}}}, is the most important data structure and represents all versions of a file. See ["Revlog"]. |
Line 11: | Line 13: |
Nodeids are unique ids that represent the contents of a file /and/ its position in the project history. For now they are computed using the SHA1 hash function, which generates 160 bits (40 hex digits). If you modify a file, commit the change, and then modify it to restore the original contents, the contents are the same but the history is different, so the file will get a new nodeid. This history-sensitivity is obtained by calculating the nodeid from the concatentation of the parent nodeids with the file's contents. |
A '''manifest''' describes the state of a project by listing each file and its nodeid to specify which version. See ["Manifest"]. |
Line 20: | Line 16: |
=== Revlogs === | A '''changeset''' lists all files changed in a checkin along with a change description and metadata like user and date. See ["ChangeSet"]. |
Line 22: | Line 19: |
A '''revlog''', for example ''.hg/data/somefile.d'', is the most important data structure and represents all versions of a file. Each version is stored compressed in its entirety or stored as a compressed binary delta (difference) relative to the preceeding version in the revlog. Whether to store a full version is decided by how much data would be needed to reconstruct the file. This system ensures that Mercurial does not need huge amounts of data to reconstruct any version of a file, no matter how many versions are stored. |
== Putting it all together == |
Line 31: | Line 21: |
The reconstruction requires a single read, if Mercurial knows when and where to read. Each revlog therefore has an '''index''', for example ''.hg/data/somefile.i'', containing one fixed-size record for each version. The record contains: |
We now have enough information to see how this all works together. To look up a given revision of a file: |
Line 36: | Line 23: |
* the nodeid of the file version * the nodeids of its parents * how many bytes to read from the revlog * the offset in the revlog saying where to begin reading |
* look up the changeset in the changeset index * reconstruct the changeset data from the revlog * look up the manifest nodeid from the changeset in the manifest index * reconstruct the manifest for that revision * find the nodeid for the file in that revision * look up the revision of that file in the file's index * reconstruct the file revision from the revlog |
Line 41: | Line 31: |
With one read of the index to fetch the record and then one read of the revlog, Mercurial can in time proportional to the file size reconstruct any version of a file. |
If we want to go the other way and find the changeset associated with a given file revision, we follow the linkrev. |
Line 45: | Line 33: |
So that adding a new version requires only O(1) seeks, the revlogs and their indices are append-only. |
{{{ |
Line 48: | Line 35: |
Revlogs are also used for '''manifests''' and '''changesets'''. | .--------linkrev-------------. v | .---------. .--------. .--------. |changeset| .->|manifest| .->|file |---. |index | | |index | | |index | |--. `---------' | `--------' | `--------' | | | | | | | `-------' | V | V | V `-------' .---------. | .--------. | .---------. |changeset|-' |manifest|-' |file | |data | |data | |revision | `---------' `--------' `---------' |
Line 50: | Line 48: |
=== Manifests === | }}} |
Line 52: | Line 50: |
A manifest describes the state of a project by listing each file and its nodeid to specify which version. Recreating a particular state means simply looking up its manifest and reconstructing the listed file versions from their revlogs. The manifest is conceptually a file. All of its versions, which collectively represent the entire project history -- are stored in a revlog (see the file ''.hg/00manifest.d'') and an associated index (''.hg/00manfesti.i''). |
== Tracking Working Directory State == |
Line 60: | Line 52: |
=== Changesets === | The other piece of Mercurial is the working directory. Mercurial tracks various information about the working directory: |
Line 62: | Line 54: |
A changeset lists all files changed in a checkin along with a change description and metadata like user and date. It also contains a nodeid of the resulting manifest. Given enough history information, a changeset can be converted into a manifest. Given enough history information and the metadata, a manifest can be converted into a changeset. So manifests are reundant, but maintaining both representations of the project history speeds up Mercurial. |
* what revision(s) are currently checked out * what files have been copied or renamed * what files are controlled by Mercurial For each file that Mercurial controls, we record the following information: * its size * its mode * its modification time * its "state" The states that are tracked are: * n - normal * a - added * r - removed * m - 3-way merged With this information, we can quickly determine what files in the working directory have changed. Here's an example of the dirstate: {{{ $ hg parents -q 1224:cc61d366bc3b $ hg debugstate n 644 168 08/19/05 17:42:17 .hgignore n 644 412 08/20/05 01:46:57 .hgtags n 644 1328 08/26/05 23:22:20 CONTRIBUTORS n 644 17992 06/30/05 10:19:51 COPYING n 644 459 08/24/05 00:38:20 MANIFEST.in n 644 512 08/24/05 00:35:02 Makefile n 644 232 06/30/05 10:19:51 PKG-INFO n 644 2736 08/20/05 00:48:33 README n 644 1676 06/30/05 10:19:51 comparison.txt n 644 3711 12/31/69 15:59:59 contrib/bash_completion n 711 1305 07/01/05 15:04:52 contrib/buildrpm n 755 8300 08/16/05 16:03:59 contrib/convert-repo n 644 69 06/30/05 10:19:51 contrib/git-viz/git-cat-file n 644 69 06/30/05 10:19:51 contrib/git-viz/git-diff-tree n 644 69 06/30/05 10:19:51 contrib/git-viz/git-rev-list n 644 69 06/30/05 10:19:51 contrib/git-viz/git-rev-tree n 644 457 06/30/05 10:19:51 contrib/git-viz/hg-viz n 755 8039 08/16/05 16:03:59 contrib/hgit n 755 40043 06/30/05 10:19:51 contrib/hgk ... }}} |
Line 72: | Line 104: |
These data structures are designed for '''merging''', the fundamental operation in a distributed SCM that encourages branching. |
=== Branching and merging === Every working directory is potentially a branch and every user effectively works in their own branch. When Mercurial checks out a branch, it remembers the changeset that directly led to it so that the next checkin will have the correct parent. To merge two branches, you pull their heads into the same repository, check out one of them and merge the other, and then check in (commit) the result once you're happy with the merge. The resulting checkin has two parents. Mercurial decides when a merge is necessary by first determining whether the working directory contains uncommitted changes. This determination effectively turns the working directory into a branch of the checked-in version on which it is based. If the working directory is a direct ancestor or descendent of the second version that we're attempting to checkout, Mercurial replaces the working-directory version with the new version. Otherwise it merges the two versions. |
Line 88: | Line 137: |
=== Branching and merging === Every working directory is potentially a branch and every user effectively works in their own branch. When Mercurial checks out a branch, it remembers the changeset that directly led to it so that the next checkin will have the correct parent. To merge two branches, you check out their heads into the same working directory, which also performs a merge, and then check in (commit) the result once you're happy with the merge. The resulting checkin has two parents. Mercurial decides when a merge is necessary by first determining whether the working directory contains uncommitted changes. This determination effectively turns the working directory into a branch of the checked-in version on which it is based. If the working directory is a direct ancestor or descendent of the second version that we're attempting to checkout, Mercurial replaces the working-directory version with the new version. Otherwise it merges the two versions. |
|
Line 119: | Line 148: |
}}} | |
Line 121: | Line 149: |
{{{ | |
Line 138: | Line 165: |
Line 158: | Line 184: |
/1. Searching for new roots/. This part begins by finding all new heads and searching backwards from those heads to the first unknown nodes in their respective branches. These nodes are the roots used to calculate the '''changegroup''': the set of all changesets starting at those roots. Mercurial takes pains to make this search efficient in both bandwidth and round-trips. |
1. Searching for new roots. * This part begins by finding all new heads and searching backwards from those heads to the first unknown nodes in their respective branches. These nodes are the roots used to calculate the '''changegroup''': the set of all changesets starting at those roots. Mercurial takes pains to make this search efficient in both bandwidth and round-trips. 1. Pulling a changegroup. * Once the roots are found, the changegroup can be transferred as a single streaming transfer. This transfer is organized as an ordered set of deltas for changesets, manifests, and files. Large chunks of deltas can be directly added to the repository without unpacking so the pull is quick. |
Line 165: | Line 189: |
/2. Pulling a changegroup/. Once the roots are found, the changegroup can be transferred as a single streaming transfer. This transfer is organized as an ordered set of deltas for changesets, manifests, and files. Large chunks of deltas can be directly added to the repository without unpacking so the pull is quick. |
See WireProtocol for more details. == References == * ''"Towards a Better SCM: Revlog and Mercurial"'', Matt Mackall ([attachment:Presentations/ols-mercurial-paper.pdf PDF]) * ''[http://hgbook.red-bean.com/hgbookch4.html#x8-640004 "Behind the scenes"]'' in ''[http://hgbook.red-bean.com/hgbook.html "Distributed revision control with Mercurial"]'', Bryan O’Sullivan see also: ["Presentations"] |
(for details from a source code point of view, see WhatGoesWhere)
Data structures
Nodeids are unique ids that represent the contents of a file and its position in the project history. See ["Nodeid"].
A revlog, for example .hg/data/somefile.d, is the most important data structure and represents all versions of a file. See ["Revlog"].
A manifest describes the state of a project by listing each file and its nodeid to specify which version. See ["Manifest"].
A changeset lists all files changed in a checkin along with a change description and metadata like user and date. See ["ChangeSet"].
Putting it all together
We now have enough information to see how this all works together. To look up a given revision of a file:
- look up the changeset in the changeset index
- reconstruct the changeset data from the revlog
- look up the manifest nodeid from the changeset in the manifest index
- reconstruct the manifest for that revision
- find the nodeid for the file in that revision
- look up the revision of that file in the file's index
- reconstruct the file revision from the revlog
If we want to go the other way and find the changeset associated with a given file revision, we follow the linkrev.
.--------linkrev-------------. v | .---------. .--------. .--------. |changeset| .->|manifest| .->|file |---. |index | | |index | | |index | |--. `---------' | `--------' | `--------' | | | | | | | `-------' | V | V | V `-------' .---------. | .--------. | .---------. |changeset|-' |manifest|-' |file | |data | |data | |revision | `---------' `--------' `---------'
Tracking Working Directory State
The other piece of Mercurial is the working directory. Mercurial tracks various information about the working directory:
- what revision(s) are currently checked out
- what files have been copied or renamed
- what files are controlled by Mercurial
For each file that Mercurial controls, we record the following information:
- its size
- its mode
- its modification time
- its "state"
The states that are tracked are:
- n - normal
- a - added
- r - removed
- m - 3-way merged
With this information, we can quickly determine what files in the working directory have changed.
Here's an example of the dirstate:
$ hg parents -q 1224:cc61d366bc3b $ hg debugstate n 644 168 08/19/05 17:42:17 .hgignore n 644 412 08/20/05 01:46:57 .hgtags n 644 1328 08/26/05 23:22:20 CONTRIBUTORS n 644 17992 06/30/05 10:19:51 COPYING n 644 459 08/24/05 00:38:20 MANIFEST.in n 644 512 08/24/05 00:35:02 Makefile n 644 232 06/30/05 10:19:51 PKG-INFO n 644 2736 08/20/05 00:48:33 README n 644 1676 06/30/05 10:19:51 comparison.txt n 644 3711 12/31/69 15:59:59 contrib/bash_completion n 711 1305 07/01/05 15:04:52 contrib/buildrpm n 755 8300 08/16/05 16:03:59 contrib/convert-repo n 644 69 06/30/05 10:19:51 contrib/git-viz/git-cat-file n 644 69 06/30/05 10:19:51 contrib/git-viz/git-diff-tree n 644 69 06/30/05 10:19:51 contrib/git-viz/git-rev-list n 644 69 06/30/05 10:19:51 contrib/git-viz/git-rev-tree n 644 457 06/30/05 10:19:51 contrib/git-viz/hg-viz n 755 8039 08/16/05 16:03:59 contrib/hgit n 755 40043 06/30/05 10:19:51 contrib/hgk ...
Merging
Branching and merging
Every working directory is potentially a branch and every user effectively works in their own branch. When Mercurial checks out a branch, it remembers the changeset that directly led to it so that the next checkin will have the correct parent.
To merge two branches, you pull their heads into the same repository, check out one of them and merge the other, and then check in (commit) the result once you're happy with the merge. The resulting checkin has two parents.
Mercurial decides when a merge is necessary by first determining whether the working directory contains uncommitted changes. This determination effectively turns the working directory into a branch of the checked-in version on which it is based. If the working directory is a direct ancestor or descendent of the second version that we're attempting to checkout, Mercurial replaces the working-directory version with the new version. Otherwise it merges the two versions.
Graph merging
Merging a pair of directed acyclic graphs (DAGs) -- the family tree of the file history -- requires determining whether nodes in different graphs correspond. Comparing the node contents (or hashes of the contents) is incorrect because it ignores the history.
However, using the nodeid avoids this error because the nodeid describes the file's contents and its graph position relative to the root. A merge simply checks whether each nodeid in graph A is in graph B and vice versa (for example using a hash table), and Mercurial adds the new nodes to the append-only revlog.
Merging manifests
To merge manifests, first compare them and decide which files need to be added, deleted, and merged.
For each file to be merged, perform a graph merge and resolve conflicts as above. It's important to merge files using per-file DAGs rather than just changeset-level DAGs as this diagram illustrates:
M M1 M2 AB |`-------v M2 clones M (mainline) aB AB file A is change in mainline |`---v AB' file B is changed in M2 | aB / | M1 clones M | ab/ | M1 changes B | ab' | M1 merges from M2, changes to B conflict | | A'B' M2 changes A `---+--.| | a'B' M2 merges from mainline, changes to A conflict `--.| ??? depending on which ancestor we choose, we will have to redo A hand-merge, B hand-merge, or both but if we look at the files independently, the merge is easy
The result is a merged version in the working directory, waiting for checkin.
Rollback
When committing or merging, Mercurial adds the changeset entry last. Mercurial keeps a transaction log of the name of each file touched and its length prior to the transaction. On abort, it truncates each file to its prior length. This simplicity is one benefit of making revlogs append-only. The transaction journal also allows an undo operation.
Merging between repositories
A key feature of Mercurial is its ability to merge between independent repositories in a decentralized fashion. Each repository can act as a read-only server or as a client. A client pulls from the server all branches that it has not seen and adds them to its graph. This pull is done in two steps:
- Searching for new roots.
This part begins by finding all new heads and searching backwards from those heads to the first unknown nodes in their respective branches. These nodes are the roots used to calculate the changegroup: the set of all changesets starting at those roots. Mercurial takes pains to make this search efficient in both bandwidth and round-trips.
- Pulling a changegroup.
- Once the roots are found, the changegroup can be transferred as a single streaming transfer. This transfer is organized as an ordered set of deltas for changesets, manifests, and files. Large chunks of deltas can be directly added to the repository without unpacking so the pull is quick.
See WireProtocol for more details.
References
"Towards a Better SCM: Revlog and Mercurial", Matt Mackall ([attachment:Presentations/ols-mercurial-paper.pdf PDF])
[http://hgbook.red-bean.com/hgbookch4.html#x8-640004 "Behind the scenes"] in [http://hgbook.red-bean.com/hgbook.html "Distributed revision control with Mercurial"], Bryan O’Sullivan
see also: ["Presentations"]