This document represents a proposal only.
See also AtomicRepositoryLayoutPlan and ReadLockPlan for related topics.
Problem Statement
Mercurial's existing store model is based on separate file(s) per logical tracked path. If you have 100,000 files under version control, there will be at least 100,000 files (plus directory entries) in the .hg/store directory. The unbounded growth in the number of files in the .hg directory poses a number of problems:
- higher probability of stat cache eviction -> lowered performance
- various filesystems don't efficiently create thousands of files (most notably NTFS - see issue4889)
- various filesystems don't efficiently delete thousands of files
- In server environments where dozens, hundreds, or even thousands of repos are stored, extreme inode count can be problematic
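To get a rough sense of how many entries a given repository's store contributes, the count can be measured directly. The following is a minimal sketch, not part of Mercurial: the helper name count_store_entries is hypothetical, and it simply walks whatever directory it is given (here assumed to be .hg/store) and tallies files and directories.

```python
import os
import sys


def count_store_entries(store_path):
    """Return (file_count, dir_count) for everything below store_path.

    Hypothetical helper for illustration only. A path that does not
    exist yields (0, 0), because os.walk() produces nothing for it.
    """
    files = 0
    dirs = 0
    for _root, dirnames, filenames in os.walk(store_path):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs


if __name__ == "__main__":
    # Assumption: run from the root of a repository, or pass the store
    # path explicitly as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else os.path.join(".hg", "store")
    n_files, n_dirs = count_store_entries(path)
    print(f"{path}: {n_files} files, {n_dirs} directories")
```

On a server hosting many repositories, summing these counts across all stores gives an approximate lower bound on the inodes consumed by store data.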
It's worth explicitly calling out performance problems on Windows. On that platform, unbundling mozilla-central is several minutes slower than on other platforms because closing file handles that have been appended to is slow on NTFS. This can be mitigated by doing I/O on separate threads, but the underlying problem is still there.
There are also related problems stemming from the separate file per logical tracked path storage model and the (append-only) store format of today:
- redundancy in storage for copied/moved files
- obtaining a consistent view/snapshot of the store (for e.g. stream bundle generation) requires walking and stat()ing the entire store while a write lock is held. More files -> longer lock
  - this is related to read locks
  - this is related to how the changelog files are updated last as part of the transaction
- data from obsolete/hidden changesets lingers forever
  - wastes space
  - provides little to no value (value reaches 0 over time)
  - adds overhead because repository filter needs to track more and more omitted revisions
  - requires linkrev adjustment, which adds overhead
The goal of this plan is to establish a new repository store format that relies on fewer files and doesn't suffer from some of the inefficiencies and scaling concerns of the existing model.