Scaling Mercurial

Mercurial can scale from single-developer projects up to massive codebases and huge developer teams.

For an example of a large-scale deployment, see the writeup by Durham Goode of Facebook: Scaling Mercurial at Facebook.

Scaling is not a problem with a single root cause. Instead, there are various patterns that can lead to separate scaling issues.

See BigRepositories for a list of large repositories.

1. Concern: Many commits

For active repositories, the number of commits/changesets grows without bound over time. This poses some scaling problems.

A standard Mercurial clone keeps a full copy of the repository and all of its history. This is similar to how other distributed version control systems (like Git) work.

1.1. Impact

1.2. When to expect problems

The number of commits alone is unlikely to be a significant scaling issue. You will likely hit issues with manifest size or file data size first.

Mercurial repositories with a few hundred thousand commits exist. As of April 2014, Mozilla's mozilla-central repository is close to 200,000 commits with no apparent scaling problems due to commit volume alone. Private repositories at other companies are known to have over 100,000 more commits than that, also without scaling problems.
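To compare your own repository against the figures above, a minimal sketch along these lines may help. It assumes the `hg` executable is on your PATH. It relies on local revision numbers being zero-based, so the revision number of `tip` plus one approximates the number of changesets stored locally. The function names are hypothetical and not part of Mercurial.

```python
import subprocess


def commit_count_from_tip_rev(tip_rev_text):
    """Convert the zero-based revision number of tip into a changeset count.

    An empty repository reports -1 (the null revision), which yields 0.
    """
    return int(tip_rev_text.strip()) + 1


def approximate_commit_count(repo_path="."):
    """Ask hg for the local revision number of tip (hypothetical helper)."""
    result = subprocess.run(
        ["hg", "log", "-r", "tip", "--template", "{rev}"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return commit_count_from_tip_rev(result.stdout)
```

Calling `approximate_commit_count("/path/to/repo")` inside a clone should return a number you can set against the roughly 200,000 commits cited for mozilla-central.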

1.3. Solutions

2. Concern: Many files

Repositories with tens or hundreds of thousands of files pose different scaling challenges compared to repositories with tens, hundreds, or even thousands of files.

2.1. Impact

2.2. When to expect problems

Mercurial should scale to tens of thousands of files without any modification (provided your system has a decent filesystem and I/O performance).

Mozilla's mozilla-central repository has close to 94,000 files as of April 2014 and is growing steadily.

2.3. Solutions

3. Concern: Large files

Repositories with large files (measured in the megabytes, tens of megabytes, or even hundreds of megabytes) may pose scaling challenges.

With a standard Mercurial configuration, the entire content of the repository is cloned to all clients. That means if you check in an incompressible 100 MB binary file, each client will need to transfer that 100 MB file on every clone. If you then check in a new version of that file that differs entirely from the previous one (read: delta compression won't help), clients will need to pull both the original version and the new version, for a total of 200 MB.
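The arithmetic in this example can be sketched as a worst-case estimate. The function below is purely illustrative (its name is hypothetical) and assumes every version is incompressible and shares nothing with its predecessor, so neither compression nor delta storage saves any space.

```python
def worst_case_clone_mb(version_sizes_mb):
    """Sum the sizes of all stored versions of a file, in megabytes.

    Assumption: each version is incompressible and unrelated to the
    previous one, so a full clone transfers every version in full.
    """
    return sum(version_sizes_mb)


# One 100 MB file with two completely different versions.
total_mb = worst_case_clone_mb([100, 100])
print(total_mb)
```

Real repositories usually do better than this bound, since many binary formats compress or delta at least partially.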

3.1. Impact

3.2. When to expect problems

This depends on your environment. If all the clients of a repository are on Fast Ethernet (100 Mbps or faster) and have ample, fast storage, the impact of large files will not be felt as much as it would be if you were trying to support clients on dial-up Internet connections.

Either way, most clients rarely need old versions of large binary files, so transferring and storing every version is arguably wasteful.
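To make the difference concrete, here is a rough back-of-the-envelope sketch. It ignores protocol overhead and latency, treats 1 MB as 10^6 bytes and 1 Mbps as 10^6 bits per second, and assumes 56 kbps for dial-up, a figure that is not stated in the text above. The function name is hypothetical.

```python
def transfer_seconds(size_mb, link_mbps):
    """Idealized time to move size_mb megabytes over a link_mbps link."""
    return size_mb * 8 / link_mbps


# The 200 MB of file data from the large-file example.
ethernet_s = transfer_seconds(200, 100)    # 100 Mbps Ethernet
dialup_s = transfer_seconds(200, 0.056)    # assumed 56 kbps dial-up

print(f"Ethernet: {ethernet_s:.0f} s")
print(f"Dial-up:  {dialup_s / 3600:.1f} h")
```

Under these assumptions the same data takes seconds on a local network and several hours over dial-up.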

3.3. Solutions

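For many users, the largefiles or remotefilelog extension should suffice. largefiles keeps large files outside the normal history and fetches only the versions a client actually needs, while remotefilelog fetches file history from the server on demand.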
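Before reaching for either extension, it can help to see which files in a working directory are large at all. The following is a minimal sketch, not part of Mercurial: the function name and the threshold are assumptions, and it walks the filesystem rather than asking Mercurial which files are tracked.

```python
import os


def find_large_files(root, threshold_bytes):
    """Return (relative_path, size) pairs for files of at least threshold_bytes.

    Results are sorted largest first. The .hg metadata directory is skipped.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune Mercurial's own storage so it is never descended into.
        dirnames[:] = [d for d in dirnames if d != ".hg"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may dangle
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                hits.append((os.path.relpath(path, root), size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

For example, `find_large_files(".", 10 * 1024 * 1024)` lists files of 10 MiB or more. With the largefiles extension enabled, such files can be added with `hg add --large`.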

4. Concern: Many heads

If your repository has many heads (whether bookmarked, on named branches, or anonymous), this could cause scaling problems.

4.1. Impact

4.2. When to expect problems

Whether you run into problems with many heads depends on the number of heads and how they are used. A repository with many short-lived heads that are never updated will behave differently from a repository with a dozen heads that are continuously being committed to.

Mozilla's Try repository (a repository where Firefox developers push changes to test in Mozilla's test automation infrastructure) can reach over 10,000 heads with few problems. That is on a repository that already has 200,000+ changesets and nearly 100,000 files. However, certain repository operations slow down as the number of heads goes past 10,000. Heads in this repository are typically created and then left idle forever. If you think of the repository as a tree (as in nature), the Try repository has a large trunk with thousands of very small twigs branching out for 1 to 10 commits (on average). The scaling issues with this repository seem to stem mostly from the raw number of heads as opposed to the impact heads have elsewhere in Mercurial.
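To see where a repository stands relative to the 10,000-head mark, a small sketch like the following can count topological heads. It assumes `hg` is on your PATH and uses `hg heads --topo` with a template that prints one node hash per line. The function names are hypothetical.

```python
import subprocess


def count_heads(heads_output):
    """Count non-empty lines, one per head node hash."""
    return sum(1 for line in heads_output.splitlines() if line.strip())


def topological_head_count(repo_path="."):
    """Run hg in repo_path and count its topological heads."""
    result = subprocess.run(
        ["hg", "heads", "--topo", "--template", "{node}\n"],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    # hg heads exits with status 1 when no heads are found,
    # for example in an empty repository.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return count_heads(result.stdout)
```

This only measures the raw number of heads. It says nothing about how actively each head is developed, which is the separate concern for manifest growth.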

A repository with many actively-developed heads runs into different scaling concerns. Due to the way Mercurial encodes its data, having many concurrently-developed heads may lead to manifests blowing up in size. This will increase clone and pull times and consume additional disk space.

4.3. Solutions

ScaleMercurial (last edited 2015-03-31 12:59:02 by rcl)