What limits does Mercurial have?
Mercurial currently assumes that single files, indices, and manifests can fit in memory for efficiency.
There should otherwise be no limits on file name length, file size, file contents, number of files, or number of revisions. See also BigRepositories for the sizes of some example repositories.
The network protocol is big-endian.
File names cannot contain the null character or newlines. Committer addresses cannot contain newlines.
Mercurial is primarily developed for UNIX systems, so some UNIXisms may be present in ports.
Mercurial encodes filenames (see CaseFolding, CaseFoldingPlan, fncacheRepoFormat) when storing them in the repository. Most notably, uppercase characters in filenames are encoded as two characters in the filename in the repository ("FILE" → "_f_i_l_e").
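As a rough illustration of the uppercase encoding mentioned above, here is a minimal sketch. The function names are hypothetical, and the rule that a literal underscore is doubled is an assumption added so the mapping can be reversed; the real store encoding also handles reserved characters and long paths, which this sketch ignores.

```python
def encode_uppercase(name: str) -> str:
    """Escape uppercase letters as '_' + lowercase (hypothetical helper)."""
    out = []
    for ch in name:
        if ch == "_":
            # Assumption: a literal underscore is doubled to stay reversible.
            out.append("__")
        elif "A" <= ch <= "Z":
            out.append("_" + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


def decode_uppercase(encoded: str) -> str:
    """Reverse encode_uppercase by reading escape pairs left to right."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "_":
            nxt = encoded[i + 1]
            out.append("_" if nxt == "_" else nxt.upper())
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this sketch, encode_uppercase("FILE") gives "_f_i_l_e", so two names that differ only in case never collide on a case-insensitive filesystem.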
How does Mercurial store its data?
The fundamental storage type in Mercurial is a revlog. A revlog is the set of all revisions of a named object. Each revision is either stored compressed in its entirety or as a compressed binary delta against the previous version. The decision of when to store a full version is made based on how much data would be needed to reconstruct the file. This lets us ensure that we never need to read huge amounts of data to reconstruct an object, regardless of how many revisions of it we store.
In fact, we should always be able to do it with a single read, provided we know when and where to read. This is where the index comes in. Each revlog has an index containing a special hash (nodeid) of the text, hashes for its parents, and where and how much of the revlog data we need to read to reconstruct it. Thus, with one read of the index and one read of the data, we can reconstruct any version in time proportional to the object size.
Similarly, revlogs and their indices are append-only. This means that adding a new version is also O(1) seeks.
Revlogs are used to represent all revisions of files, manifests, and changesets. Compression for typical objects with lots of revisions can range from 100 to 1 for things like project makefiles to over 2000 to 1 for objects like the manifest.
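The storage scheme described above can be sketched with a toy, in-memory revlog. This is not Mercurial's on-disk format: the class name, the prefix/suffix delta, and the threshold of twice the text size for forcing a full snapshot are all assumptions chosen to keep the example short.

```python
import zlib


def make_delta(old: bytes, new: bytes):
    """Describe `new` as: shared prefix of `old`, new middle bytes, shared suffix."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, zlib.compress(new[prefix:len(new) - suffix])


def apply_delta(old: bytes, delta) -> bytes:
    prefix, suffix, middle = delta
    return old[:prefix] + zlib.decompress(middle) + old[len(old) - suffix:]


class ToyRevlog:
    """Append-only list of revisions (hypothetical, for illustration only)."""

    def __init__(self):
        # One entry per revision: (base_rev, chain_size, payload).
        # base_rev == own revision number means a full compressed snapshot.
        self.entries = []

    def add(self, text: bytes) -> int:
        rev = len(self.entries)
        if rev > 0:
            base, chain_size, _ = self.entries[-1]
            delta = make_delta(self.read(rev - 1), text)
            new_chain = chain_size + len(delta[2])
            # Assumed policy: keep the delta only while the data needed to
            # rebuild the text stays within twice the text size.
            if new_chain <= 2 * len(text):
                self.entries.append((base, new_chain, delta))
                return rev
        full = zlib.compress(text)
        self.entries.append((rev, len(full), full))
        return rev

    def read(self, rev: int) -> bytes:
        # The index entry says where the chain starts; replay deltas from there.
        base = self.entries[rev][0]
        text = zlib.decompress(self.entries[base][2])
        for r in range(base + 1, rev + 1):
            text = apply_delta(text, self.entries[r][2])
        return text
```

Because a full snapshot is forced whenever the chain grows too large, the amount of data replayed by read stays bounded relative to the size of the text, which is the property the paragraphs above rely on.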
How does Mercurial handle binary files?
Core Mercurial tracks but never modifies file content, and it is thus binary safe. See BinaryFiles for more discussion of commands which interpret file content, e.g. merge, diff, export and annotate.
What about Windows line endings vs. Unix line endings?
See Win32TextExtension for techniques which automatically convert Windows line endings into Unix line endings when committing files to the repository, and convert back again when updating the workspace. This is not default Mercurial behaviour, and requires users to edit their configuration files to turn it on. Adopting this policy on line endings probably implies enabling a hook to prevent non-compliant commits from getting into your repository, which in turn forces people contributing code to enable the extension.
What about keyword replacement (i.e. $Id$)?
See KeywordExtension.
How are Mercurial diffs and deltas calculated?
Mercurial diffs are calculated rather differently than those generated by the traditional diff algorithm (but with output that's completely compatible with patch of course). The algorithm is an optimized C implementation based on Python's difflib, which is intended to generate diffs that are easier for humans to read rather than be 'minimal'. This same algorithm is also used for the internal delta compression.
In the course of investigating delta compression algorithms, we discovered that this implementation was simpler and faster than the competition in our benchmarks and also generated smaller deltas than the theoretically 'minimal' diffs of the traditional diff algorithms. This is because the traditional algorithm assumes the same cost for insertions, deletions, and unchanged elements.
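Since the algorithm is described as being based on Python's difflib, the idea can be illustrated with difflib itself. This is a sketch of a line-level delta built from SequenceMatcher, not Mercurial's C implementation; the helper names and the delta representation are hypothetical.

```python
import difflib


def line_delta(old, new):
    """Return (start, end, replacement_lines) for every non-matching region."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_line_delta(old, delta):
    """Rebuild the new text from the old text and a delta from line_delta."""
    out = []
    pos = 0
    for start, end, replacement in delta:
        out.extend(old[pos:start])   # unchanged lines cost nothing in the delta
        out.extend(replacement)
        pos = end
    out.extend(old[pos:])
    return out
```

Note that unchanged regions are referenced only by position and never stored, so in this representation they are free, while inserted lines have to be stored in full.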
How are manifests and changesets stored?
A manifest is simply a list of all files in a given revision of a project along with the nodeids of the corresponding file revisions. So grabbing a given version of the project means simply looking up its manifest and reconstructing all the file revisions pointed to by it.
A changeset is a list of all files changed in a check-in along with a change description and some metadata like user and date. It also contains a nodeid to the relevant revision of the manifest.
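The relationship between changesets, manifests, and file revisions can be sketched with plain dictionaries. The Changeset class, its field names, and the checkout helper are hypothetical and only mirror the description above; the real structures are stored in revlogs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Changeset:
    manifest_id: str      # nodeid of the manifest revision for this check-in
    user: str
    date: str
    files: tuple          # names of the files changed in this check-in
    description: str


def checkout(changeset, manifests, filelogs):
    """Look up the manifest, then fetch every file revision it points to.

    manifests: manifest nodeid -> {file name: file nodeid}
    filelogs:  file name -> {file nodeid: file contents}
    """
    manifest = manifests[changeset.manifest_id]
    return {name: filelogs[name][nodeid] for name, nodeid in manifest.items()}
```

The manifest lists every file in the project revision, while the changeset's file list names only what changed, so a checkout goes through the manifest rather than through the changeset's file list.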
How do Mercurial hashes get calculated?
Mercurial hashes both the contents of an object and the hashes of its parents to create an identifier that uniquely identifies an object's contents and history. This greatly simplifies merging of histories because it avoids the graph cycles that can occur when an object is reverted to an earlier state.
All file revisions have an associated hash value (the nodeid). These are listed in the manifest of a given project revision, and the manifest hash is listed in the changeset. The changeset hash (the changeset ID) is again a hash of the changeset contents and its parents, so it uniquely identifies the entire history of the project to that point.
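A minimal sketch of such a hash follows. It assumes SHA1 over the two parent hashes followed by the text, with the parents sorted first and a missing parent represented by twenty zero bytes; those details and the names NULL_ID and node_id are assumptions for illustration, not something stated above.

```python
import hashlib

# Assumption: a missing parent is represented by twenty zero bytes.
NULL_ID = b"\0" * 20


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the parents (in sorted order) followed by the contents."""
    first, second = sorted((p1, p2))
    digest = hashlib.sha1(first)
    digest.update(second)
    digest.update(text)
    return digest.digest()


# Reverting to earlier contents still yields a new identifier, because the
# parent differs, so the history graph cannot loop back on itself.
r0 = node_id(b"version 1\n")
r1 = node_id(b"version 2\n", r0)
r2 = node_id(b"version 1\n", r1)
```

Here r0 and r2 have identical contents but different identifiers, which is the cycle-avoidance property described above.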
What checks are there on repository integrity?
Every time a revlog object is retrieved, it is checked against its hash for integrity. It is also incidentally double-checked by the Adler32 checksum used by the underlying zlib compression.
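The two layers of checking on retrieval can be sketched as follows. For brevity this example hashes only the contents (the parents are left out), and the function names are hypothetical.

```python
import hashlib
import zlib


def store(text: bytes):
    """Return (expected hash, compressed payload) for a piece of content."""
    return hashlib.sha1(text).digest(), zlib.compress(text)


def retrieve(expected: bytes, payload: bytes) -> bytes:
    """Decompress and verify; fail loudly if either check does not pass."""
    # zlib verifies its own Adler-32 trailer and raises zlib.error on mismatch.
    text = zlib.decompress(payload)
    if hashlib.sha1(text).digest() != expected:
        raise ValueError("integrity check failed: hash mismatch")
    return text
```

Corruption of the compressed bytes is normally caught by zlib first; the hash comparison catches anything that still decompresses cleanly but no longer matches the recorded identifier.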
Running 'hg verify' decompresses and reconstitutes each revision of each object in the repository and cross-checks all of the index metadata with those contents.
But this alone is not enough to ensure that someone hasn't tampered with a repository. For that, you need cryptographic signing.
How does signing work with Mercurial?
Take a look at the hgeditor script for an example. The basic idea is to use GPG to sign the manifest ID inside the changelog entry. The manifest ID is a recursive hash of all of the files in the system and their complete history, and thus signing the manifest hash signs the entire project contents.
What about hash collisions? What about weaknesses in SHA1?
The SHA1 hashes are large enough that the odds of accidental hash collision are negligible for projects that could be handled by the human race. The known weaknesses in SHA1 are currently still not practical to attack, and Mercurial will switch to SHA256 hashing before that becomes a realistic concern.
Collisions with the "short hashes" are not a concern as they're always checked for ambiguity and are still long enough that they're not likely to happen for reasonably-sized projects (< 1M changes).
See also: https://www.mercurial-scm.org/pipermail/mercurial/2009-April/025526.html by Matt Mackall.
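The ambiguity check on short hashes, and a rough feel for the odds, can be sketched as below. The helper names are hypothetical, and the figure of 48 bits (12 hexadecimal digits) for a short hash is an assumption used only in the example call.

```python
import math


def resolve_prefix(prefix: str, node_ids):
    """Return the single full hash starting with prefix, or raise."""
    matches = [n for n in node_ids if n.startswith(prefix)]
    if not matches:
        raise KeyError(f"unknown revision {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous identifier {prefix!r}")
    return matches[0]


def collision_probability(count: int, bits: int) -> float:
    """Birthday-bound estimate that any two of `count` random hashes collide."""
    return 1.0 - math.exp(-count * (count - 1) / (2.0 * 2.0 ** bits))


# Assumption: short hashes are 12 hex digits, i.e. 48 bits.
odds = collision_probability(1_000_000, 48)
```

Under these assumptions the estimate for a million changesets is on the order of a fraction of a percent, and even when a collision does occur the ambiguity check turns it into an explicit error rather than a silent mix-up.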
How does "hg commit" determine which files have changed?
If hg commit is called without file arguments, it commits all files that have "changed" (see commit). Note, however, that Mercurial does not detect a modification that changes neither the file's modification time nor its size. This is by design; see also issue618 and DirState.
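The consequence of checking only time and size can be seen in a small sketch. This is not the DirState implementation: the Recorded type and both helpers are hypothetical, and a real tool may fall back to comparing contents in cases this example does not cover.

```python
import os
from typing import NamedTuple


class Recorded(NamedTuple):
    size: int
    mtime: int


def record(path: str) -> Recorded:
    """Remember the size and modification time (whole seconds) of a file."""
    st = os.stat(path)
    return Recorded(st.st_size, int(st.st_mtime))


def looks_changed(path: str, recorded: Recorded) -> bool:
    """Cheap check: the file counts as changed only if size or mtime differs."""
    return record(path) != recorded
```

A file rewritten with different bytes of the same length, and with its modification time restored, would pass this check unnoticed, which is exactly the limitation described above.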
What is the difference between rollback and strip?
They overlap a bit, but are really quite different:
rollback will remove the last transaction.
- Transactions are a concept often found in databases. In Mercurial we start a transaction when certain operations are run, such as commit, push, pull... When the operation finishes successfully, the transaction is marked as complete. If an error occurs, the transaction is "rolled back" and the repository is left in the same state as before.
You can manually trigger a rollback with hg rollback. This will undo the last transactional command. If a pull command brought 10 new changesets into the repository on different branches, then hg rollback will remove them all.
Please note: there is no backup when you rollback a transaction!
strip will remove a changeset and all its descendants.
- The changesets are saved as a bundle, which you can apply again if you need them back.
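The transaction behaviour described under rollback can be sketched for append-only files: remember each file's length before the operation and truncate back to it on failure. The ToyTransaction class is hypothetical and is only one simple way to get "the same state as before"; it is not a description of Mercurial's journal format.

```python
import os


class ToyTransaction:
    """Record file lengths up front; truncate appended data away on error."""

    def __init__(self, paths):
        self.lengths = {path: os.path.getsize(path) for path in paths}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.rollback()
        return False  # let the original error propagate

    def rollback(self):
        for path, length in self.lengths.items():
            with open(path, "r+b") as handle:
                handle.truncate(length)
```

Truncation is sufficient here only because the files are append-only; like hg rollback, this sketch keeps no backup of what it discards.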