Revision 1 as of 2015-11-26 19:11:37
Size: 2244
Comment: initial

Size: 3840
Comment: document advantages of existing store format
This document represents a proposal only.
See also AtomicRepositoryLayoutPlan and ReadLockPlan for related topics.
Problem Statement
Mercurial's existing store model is based on separate file(s) per logical tracked path. If you have 100,000 files under version control, there will be at least 100,000 files (plus directory entries) in the .hg/store directory. The unbounded growth in the number of files in the .hg directory poses a number of problems:
- higher probability of stat cache eviction -> lowered performance
- various filesystems don't efficiently create thousands of files (most notably NTFS - see issue4889)
- various filesystems don't efficiently delete thousands of files
- In server environments where dozens, hundreds, or even thousands of repos are stored, extreme inode count can be problematic
It's worth explicitly calling out performance problems on Windows. On that platform, unbundling mozilla-central is several minutes slower than on other platforms because closing file handles that have been appended to is slow on NTFS. This can be mitigated by doing I/O on separate threads, but the underlying problem is still there.
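The threading mitigation mentioned above can be sketched as follows. This is a minimal illustration, not Mercurial's implementation: the function name `append_and_close_async` and the `max_workers` default are hypothetical. It only shows the shape of the idea, which is to hand each `close()` call to a worker thread so the main thread can keep writing.

```python
from concurrent.futures import ThreadPoolExecutor


def append_and_close_async(paths, payload, max_workers=4):
    """Append payload to every path, deferring close() to worker threads.

    Hypothetical helper for illustration only. Returns the number of
    files that were written.
    """
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path in paths:
            fh = open(path, "ab")
            fh.write(payload)
            # Hand the (potentially slow) close to a worker so the main
            # thread can move on to the next file immediately.
            futures.append(pool.submit(fh.close))
        for future in futures:
            # Wait for each close and re-raise any error it hit.
            future.result()
    return len(futures)
```

The total amount of close work is unchanged; it is only overlapped with other work, which matches the observation that the underlying problem is still there.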
There are also related problems from the separate file per logical tracked path storage model and the (append-only) store format of today:
- redundancy in storage for copied/moved files
- obtaining a consistent view/snapshot of the store (e.g. for stream bundle generation) requires walking and stat()ing the entire store while a write lock is held. More files -> longer lock
  - this is related to read locks
  - this is related to how the changelog files are updated last as part of the transaction
- data from obsolete/hidden changesets lingers forever
  - wastes space
  - provides little to no value (value reaches 0 over time)
  - adds overhead because repository filter needs to track more and more omitted revisions
  - requires linkrev adjustment, which adds overhead
- transaction close overhead is proportional to number of files being modified
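To make the snapshot cost in the list above concrete, here is a rough sketch of a store walk. It assumes that recording each file's current size is enough to bound later reads, which leans on the append-only property; the function name `snapshot_store` is hypothetical and lock acquisition is not modeled.

```python
import os


def snapshot_store(store_root):
    """Walk store_root and record the current size of every file.

    Hypothetical helper for illustration only. Returns a dict mapping
    each path (relative to store_root) to its size in bytes. The walk
    performs one stat() per file, so the time a caller would need to
    hold a lock grows with the number of files.
    """
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(store_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, store_root)
            sizes[rel] = os.stat(full).st_size
    return sizes
```

In this simplified model, a reader that stops at the recorded sizes ignores anything appended after the walk. The work is linear in the file count regardless of how little data changed.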
The goal of this plan is to establish a new repository store format that relies on fewer files and doesn't suffer from some of the inefficiencies and scaling concerns of the existing model.
Advantages of Existing Store Format
Before we design a new store format, it's worth calling out what the existing store model gets right:
- Constant time lookup
  - We know which file(s) data is in because the tracked path under version control maps to an explicit filename in the store. There is little to no cost to finding where data is located.
  - Bounded lookup time has performance implications for nearly every repository operation. This is especially true for operations that read the entire repository, such as clones or heavy pulls.
  - Data from the changelog and manifests is generally more important to performance than files. But files are important for operations like diff generation and bundling (which is required server side for servicing pulls).
  - Git's storage model (loose object files + multiple packfiles) has scaling problems because data for a specific object could be in an unbounded number of locations. As the number of storage locations grows, performance worsens. Periodic repacks consolidate the storage into packfiles or even a single packfile. But on large repositories, repacks are time and resource prohibitive. They can be especially painful for server operators.
- Append only files and sequential I/O
  - Writes are append only. Reads are mostly sequential. This is good for I/O performance (especially on magnetic disks) and is good for caching.
- No garbage collection
  - Scaling garbage collection is hard and resource intensive. See Git.
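The lookup contrast described in the list above can be illustrated with a small sketch. Both functions are hypothetical. `direct_location` assumes a simplified `data/<tracked path>.i` layout and ignores the filename encoding a real store applies; `find_in_containers` models a layout where an object may live in any of several containers, using plain dicts as stand-ins.

```python
import os


def direct_location(store_root, tracked_path):
    """Return the single candidate location for a tracked path.

    Simplified assumption: "<store>/data/<tracked path>.i". No search
    is needed because the path itself determines the filename.
    """
    return os.path.join(store_root, "data", *tracked_path.split("/")) + ".i"


def find_in_containers(containers, key):
    """Probe containers in order and return (data, probes).

    Each container is a plain dict standing in for one storage
    location. The number of probes grows with the number of
    containers that must be checked before the key is found.
    """
    probes = 0
    for container in containers:
        probes += 1
        if key in container:
            return container[key], probes
    return None, probes
```

With the direct mapping, the cost of locating data does not depend on how many files the store holds. With the multi-container model, a miss costs one probe per container, which is the growth that consolidation into fewer containers is meant to limit.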