Note:
This page is primarily intended for developers of Mercurial.
Adding a cache for dirstate
Status: Project
Main proponents: Raphaël Gomès
This is a speculative project and does not represent any firm decisions on future behavior.
This page describes the plan and status for adding a new caching mechanism for dirstate.status.
1. Goal
Doing the opendir/readdir/closedir loop on the tracked files and the rest of the working directory is the most expensive operation in status. Reducing the number of readdirs has a huge potential for speed improvement.
In addition, the underlying status operation is used by various core commands like hg commit or hg diff, so any performance improvement to dirstate.status will have a wide impact.
Case in point, Valentin Gatien-Baron introduced the first prototype for hg status in Rust in early 2019, which had such a cache: it removed about 80% of the status time in his use-case.
2. Detailed description
2.1. General principle
We cache all traversed directories' mtimes at a nanosecond precision and whether they are cache-able (see "What to cache") to avoid having to check them twice, once during dirstate stat'ing which is inevitable, and once during the unknown phase. See "Caveats" for its limitations.
2.2. Expected performance gains
The difference can already be seen on a clean repo (Netbeans' at 100k files):
Without cache: avg 230ms
With cache: avg 166ms
This was run on a laptop with 4 physical cores, which is a good representation of the average user's configuration.
On a different repository with a lot of ignored files and a beefier 16-core machine:
Without cache: 430ms
With cache: 90ms
This is a massive improvement that scales better the larger the repository. It should also scale better with repos that have more files, especially many files in few directories.
2.3. Overview
2.3.1. Caveats
This cache is of limited portability since it relies on the mtime of directories changing when files are added/removed in them, which does not work on all pairs of OS/filesystem.
It also relies on fs-level nanosecond precision of mtime which - while not fundamental - makes it possible to assume that every change results in a different mtime (which is not the case at second granularity). This simplifies the code, and makes the performance simpler to understand.
We need to have a reliable (and hopefully fast) way of checking those rules. Merely checking the mtime is not sufficient because the OS may have nanosecond precision in the disk cache irrespective of the precision on disk.
The simplest solution might be to just keep an array of compatible filesystems (for now we only support Linux in Rust anyway) and make a statvfs call to check.
2.3.2. What to cache
This cache will store directories that only contain files that are:
- in the dirstate
- not in "added" state
- ignored
We cannot cache files in "added" state since we can just hg forget them, they disappear from the dirstate without invalidating the cache.
Caching "removed" files is fine, since running hg add on a removed file will only put it back in its previous state in the dirstate.
Caching modified (merged), and normal files is fine, since any update to them has to touch the mtime.
We *could* cache the unknown files, but it would make the format more complicated, and probably make the performance worse overall for a limited interest. Keep a clean working directory and your .hgignore up-to-date and you'll be fine.
To recap: we are caching directories containing only ignored, normal, removed, or merged (modified) files.
2.3.3. Operation
This cache is used when listing unknown files (bare hg status) and when *not* listing ignored files. It is updated during the traversal recursively
Creating/updating a directory cache entry is storing:
- its mtime at nanosecond precision
- its name
- whether it only contains cache-able files (see "What to cache")
- its direct sub-directories in the same fashion, tree-style
The cache is updated on-disk all at once, this is not an append-only cache.
During traversal, each directory is looked up in the cache. If its mtime is unchanged and it only contains cache-able files, it has already been stat'ed during the dirstate phase. Otherwise fall back to stat'ing and update the cache entry.
2.4. Implementation
For now, I intend for this cache to be Rust-only. It's an optimization, only has local impact and is not extremely portable (see above). Porting it to the Python implementation later might be worth the effort.
2.4.1. Cache integrity and invalidation
The cache should be still valid if:
- the dirstate parents haven't changed
the contents of all ignore files hasn't changed
- the root folder hasn't changed filesystems (statvfs call?)
We can check this with a (probably SHA2-blake2) hash of everything in this list, although it might be interesting to store the fs separately.
2.4.2. Proposed binary format
Everything in big-endian.
- 1 byte: version of the cache
- The literal bytes "dirs-traversal-cache": more obvious hexdumps
- 32 bytes: integrity hash
- Recursively:
- 2 bytes: length of the node path
<length> bytes: said path
- 8 bytes: mtime
- 1 byte: used as a boolean to signify if the node should be skipped
- 4 bytes: number of sub-directories
- for each sub-directory in the node:
- recurse until no sub-directories
3. Roadmap
Write code to serialize/de-serialize the cache on disk (rewrite in progress)
Log debug information if there are issues with the cache
- {./} Write Rust unit tests for empty/good/corrupt cases
- {./} Use the cache inside traversal code
Check that test suite passes, possibly write more integration tests (in progress)
Do some benchmarks to see the performance improvements
Add config options to enable/disable the cache
Add debug command to print the cache (in progress)
Pass the config option to Rust
4. See Also
Git has an untracked cache that caches... untracked files. While its purpose is different (we do not cache any files, much less untracked ones), maybe some of their bug fixes and implementation details can be borrowed from.