Scaling Mercurial
Mercurial can scale from single-developer projects up to massive codebases and huge developer teams.
For an example of a large-scale deployment, you can check the recent writeup by Durham Goode from Facebook: Scaling Mercurial at Facebook.
Scaling is not a problem with a single root cause. Instead, there are various patterns that can lead to separate scaling issues.
See BigRepositories for a list of large repositories.
1. Concern: Many commits
In an active repository, the number of commits (changesets) grows without bound over time. This poses some scaling problems.
A standard Mercurial install and clone maintain a full copy of the repository and all of its history. This is similar to how other distributed version control systems (like Git) work.
1.1. Impact
- Repositories take longer to clone and pull (because they have more data)
- Iterating over all commits takes longer (because there are more)
1.2. When to expect problems
The number of commits alone is unlikely to be a significant scaling issue. You will likely hit issues with manifests or file data size first.
Mercurial repositories with a few hundred thousand commits exist. As of April 2014, Mozilla's mozilla-central repository is close to 200,000 commits with no apparent scaling problems due to commit volume alone. There are known to be private repositories at other companies that have over 100,000 additional commits and they don't have scaling problems.
1.3. Solutions
- Use the remotefilelog extension on the server and client to help reduce clone and pull times.
- Use a modern Mercurial version. Scaling issues are always being addressed and running the newest stable release is a good bet to have the best performance.
- Mercurial has planned support for shallow clone, which will enable clients to only fetch N changesets as opposed to all of them.
2. Concern: Many files
Repositories with tens or hundreds of thousands of files pose different scaling challenges compared to repositories with tens, hundreds, or even thousands of files.
2.1. Impact
- Clone and pull times will increase (more data to transfer)
- Filesystem has to manage more files (because Mercurial uses a separate file for each file under version control)
- hg status takes longer (more files to check)
- Repository operations become filesystem and I/O bound
- Manifest revision resolution becomes slow (impacts operations that need to iterate over the set of files in each commit)
2.2. When to expect problems
Mercurial should scale to tens of thousands of files without any modification (provided your system has a decent filesystem and I/O performance).
Mozilla's mozilla-central repository has close to 94,000 files as of April 2014 and is growing steadily.
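To see where a given repository sits relative to the figures quoted above, its changesets and tracked files can be counted from the command line. The following is a minimal Python sketch, assuming `hg` is on the PATH; the helper names `hg_lines` and `repo_size` are hypothetical, and the `run` parameter exists only so the helper can be exercised without a real repository.

```python
import subprocess


def hg_lines(args, repo=".", run=subprocess.run):
    """Run an hg command inside `repo` and return its non-empty output lines."""
    result = run(
        ["hg", "--cwd", repo, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def repo_size(repo=".", run=subprocess.run):
    """Count changesets and tracked files, the two dimensions discussed so far."""
    return {
        # One output line per changeset in the local repository.
        "changesets": len(hg_lines(["log", "--template", r"{node}\n"], repo, run)),
        # One output line per file in the manifest of the tip revision.
        "files": len(hg_lines(["manifest", "-r", "tip"], repo, run)),
    }
```

Calling `repo_size("/path/to/repo")` (a placeholder path) returns a dictionary such as `{"changesets": ..., "files": ...}` that can be compared with the mozilla-central numbers. On a very large repository the commands themselves may take a while to run.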
2.3. Solutions
- Use the remotefilelog extension on the server and client to help reduce clone and pull times.
- Use the hgwatchman extension on clients to make hg status faster.
- Mercurial has planned support for narrow clone. This will allow clients to clone only a subset of files.
3. Concern: Large files
Repositories with large files (measured in the megabytes, tens of megabytes, or even hundreds of megabytes) may pose scaling challenges.
With a standard Mercurial configuration, the entire content of the repository is cloned to all clients. That means if you check in an incompressible 100 MB binary file, each client will need to transfer that 100 MB file on every clone. If you check in a completely new version of that file that varies completely from the previous version (read: delta compression won't work), clients will need to pull the original version and the new version - pulling a total of 200 MB.
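The arithmetic above generalizes to any number of versions. The following Python sketch models the worst case only, assuming every version is incompressible and shares nothing with its predecessor; the function name and the version counts in the loop are illustrative, not measurements.

```python
def clone_transfer_mb(file_size_mb, versions):
    """Worst-case data a full clone must fetch for one file.

    Assumes no version compresses or deltas against another, so the
    history costs file_size_mb for every stored version.
    """
    return file_size_mb * versions


# The two cases from the text: one version, then a second unrelated version.
assert clone_transfer_mb(100, 1) == 100
assert clone_transfer_mb(100, 2) == 200

# Illustrative growth for a file that keeps being replaced.
for n in (1, 2, 10, 50):
    print(f"{n:>3} versions -> {clone_transfer_mb(100, n):>5} MB per full clone")
```

Real repositories usually do better than this, because many files compress or delta well. The worst case is what matters for binaries that change completely on each commit.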
3.1. Impact
- Repository size on disk can grow significantly
- Clone and pull times take longer (more data to transfer)
- hg update takes longer (due to managing more bytes)
- I/O becomes more of a bottleneck
3.2. When to expect problems
This depends on your environment. If all the clients of a repository are on fast Ethernet (100 Mbps or faster) and have ample, fast storage, the impact of large files will be felt much less than if you are trying to support clients on dial-up Internet connections.
Either way, keeping every old version of a binary file on every client is arguably unnecessary and wasteful.
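A rough estimate makes the difference between environments concrete. The sketch below computes an idealized transfer time for the 200 MB example; it ignores latency, protocol overhead, and compression, and the dial-up and gigabit figures are assumed link speeds chosen for illustration.

```python
def transfer_seconds(size_mb, link_mbps):
    """Idealized transfer time: megabytes * 8 bits, divided by megabits per second."""
    return size_mb * 8 / link_mbps


# Link speeds in Mbps. Only the 100 Mbps figure comes from the text;
# the other two are assumptions added for comparison.
LINKS_MBPS = {
    "dial-up (56 kbps, assumed)": 0.056,
    "fast Ethernet (100 Mbps)": 100,
    "gigabit Ethernet (1000 Mbps, assumed)": 1000,
}

for name, mbps in LINKS_MBPS.items():
    secs = transfer_seconds(200, mbps)
    print(f"{name}: {secs:,.0f} s ({secs / 3600:.2f} h)")
```

Under these assumptions, 200 MB takes about 16 seconds on fast Ethernet and close to 8 hours on dial-up, which is why the same repository can feel fine to one team and unusable to another.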
3.3. Solutions
- Use the LargeFiles extension on the server and client to minimize data transfer during clones and pulls.
- Use the remotefilelog extension on the server and client to help reduce clone and pull times.
- Move all clients to faster networks (if possible)
- Consider not storing large files in Mercurial (the option of last resort)
For many consumers, the largefiles or remotefilelog extensions should suffice.
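Enabling the largefiles extension is a configuration change. As a hedged illustration, the Python sketch below renders an hgrc fragment that turns on the extension and sets its `minsize` and `patterns` options. The threshold and patterns are placeholder values rather than recommendations, and the helper name `render_largefiles_hgrc` is hypothetical.

```python
import configparser
import io


def render_largefiles_hgrc(minsize_mb=10, patterns=("*.iso", "*.zip")):
    """Render an hgrc fragment enabling the largefiles extension.

    minsize_mb and patterns are illustrative values: newly added files at
    least that large (in MB), or matching a pattern, are treated as largefiles.
    """
    cfg = configparser.ConfigParser(interpolation=None)
    # An empty value is how hgrc enables an extension shipped with Mercurial.
    cfg["extensions"] = {"largefiles": ""}
    cfg["largefiles"] = {
        "minsize": str(minsize_mb),
        "patterns": " ".join(patterns),
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


print(render_largefiles_hgrc())
```

The rendered text can be appended to a repository's `.hg/hgrc` or to a user-level hgrc. The extension generally needs to be enabled on both the server and the clients, as noted in the list above.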
4. Concern: Many heads
If your repository has many heads (bookmarked, on named branches, or anonymous), this could pose scaling problems.
4.1. Impact
- Manifest revlog could grow in size faster than with linear repository history.
- Clone and pull times may take longer due to larger manifests.
- Manifest resolution could take longer, impacting some operations examining the sets of files in each commit.
4.2. When to expect problems
Whether you run into problems with many heads depends on the number of heads and how they are used. A repository with short-lived, non-updated heads will behave differently from a repository with a dozen heads that are continuously being committed to.
Mozilla's Try repository (a repository where Firefox developers push changes to test in Mozilla's test automation infrastructure) can reach over 10,000 heads with few problems. That is on a repository that already has 200,000+ changesets and nearly 100,000 files. However, certain repository operations slow down as the number of heads goes past 10,000. Heads in this repository are typically created and then sit idle forever. If you think of the repository as a tree (as in nature), the Try repository has a large trunk with thousands of very small twigs branching out for 1 to 10 commits (on average). The scaling issues with this repository seem to stem mostly from the raw number of heads as opposed to the impact heads have elsewhere in Mercurial.
A repository with many actively-developed heads runs into different scaling concerns. Due to the way Mercurial encodes its data, having many concurrently-developed heads may lead to manifests blowing up in size. This will increase clone and pull times and consume additional disk space.
4.3. Solutions
- Upgrade to the latest Mercurial client. Scaling issues are always being addressed. Changes in Mercurial 3.0 addressed some problems related to large numbers of heads.
- Consider generaldelta encoding. Be aware of the potential impact on clone and pull times.
- Consider using multiple repositories. Instead of having 1 repository with all your heads, split out related heads into their own repositories. Clients can always pull from each repository and unify the repositories if they need to. This is the approach Mozilla takes.
- Consider transferring and storing repository heads as bundle files instead of pushing them directly to a repository. This is a solution to the 10,000-heads problem. Unfortunately, as of April 2014, there aren't any listed extensions to make this plug-and-play, so you may need to roll your own blob storage and Mercurial extensions.
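The bundle-file approach in the last item can be sketched with Mercurial's `hg bundle` and `hg unbundle` commands. The Python below is only an outline of a home-grown workflow, not an existing extension: the helper names, the storage directory, and the choice of base revision are all assumptions, and the `run` parameter is there so the sketch can be exercised without a repository.

```python
import os
import subprocess


def bundle_head_cmd(head, base, bundle_path):
    """Command that writes the changesets leading to `head`, minus everything
    already reachable from `base`, into a bundle file."""
    return ["hg", "bundle", "--rev", head, "--base", base, bundle_path]


def unbundle_cmd(bundle_path):
    """Command that applies a stored bundle to the current repository."""
    return ["hg", "unbundle", bundle_path]


def archive_head(head, base, store_dir, run=subprocess.run):
    """Store one head as `<store_dir>/<head>.hg` instead of pushing it.

    `store_dir` stands in for whatever blob storage is used (hypothetical).
    """
    bundle_path = os.path.join(store_dir, f"{head}.hg")
    run(bundle_head_cmd(head, base, bundle_path), check=True)
    return bundle_path
```

A client that needs one of the archived heads would fetch the bundle file and run the command returned by `unbundle_cmd`. The main repository then only carries the trunk, while each short-lived head lives in its own file.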