#pragma section-numbers 2

<<Include(A:dev)>>

= Adding a cache for dirstate =

'''Status: Project'''

'''Main proponents: Raphaël Gomès'''

/!\ This is a speculative project and does not represent any firm decisions on future behavior.

This page describes the plan and status for adding a new caching mechanism for
`dirstate.status`.

== Goal ==

Doing the `opendir`/`readdir`/`closedir` loop on the tracked files and the rest of the working directory is the most
expensive operation in `status`. Reducing the number of readdirs has a
huge potential for speed improvement.

In addition, the underlying `status` operation is used by various core commands
like `hg commit` or `hg diff`, so any performance improvement to
`dirstate.status` will have a wide impact.

Case in point, Valentin Gatien-Baron introduced the first prototype for
`hg status` in Rust in early 2019, which had such a cache: it removed
about 80% of the status time in his use-case.

== Detailed description ==

=== General principle ===

We cache all traversed directories' mtimes at a nanosecond precision and
whether they are cache-able (see "What to cache") to avoid having to check them
twice, once during dirstate stat'ing which is inevitable, and once during the
unknown phase. See "Caveats" for its limitations.

=== Expected performance gains ===

The difference can already be seen on a clean repo (Netbeans' at 100k files):

  * Without cache: `avg 230ms`
  * With cache: `avg 166ms`

This was run on a laptop with 4 physical cores, which is a good representation
of the average user's configuration.

On a different repository with a lot of ignored files and a beefier 16-core
machine:

  * Without cache: `430ms`
  * With cache: `90ms`

This is a massive improvement that scales better the larger the repository. It
should also scale better with repos that have more files, especially many files
in few directories.

=== Overview ===

==== Caveats ====
This cache is of limited portability since it relies on the `mtime` of
directories changing when files are added/removed in them, which does not
work on all pairs of OS/filesystem.

It also relies on fs-level nanosecond precision of `mtime` which - while not
fundamental - makes it possible to assume that every change results
in a different `mtime` (which is not the case at second granularity).
This simplifies the code, and makes the performance simpler to understand.

We need to have a reliable (and hopefully fast) way of checking those rules.
Merely checking the mtime is not sufficient because the OS may have nanosecond
precision in the disk cache irrespective of the precision on disk.

The simplest solution might be to just keep an array of compatible filesystems
(for now we only support Linux in Rust anyway) and make a `statvfs` call to
check.

==== What to cache ====

This cache will store directories that only contain files that are:
  * in the dirstate
  * not in "added" state
  * ignored

We cannot cache files in "added" state since we can just `hg forget` them,
they disappear from the dirstate without invalidating the cache.

Caching "removed" files is fine, since running `hg add` on a removed file will
only put it back in its previous state in the dirstate.

Caching modified (merged), and normal files is fine, since any update
to them has to touch the mtime.

We *could* cache the unknown files, but it would make the format more
complicated, and probably make the performance worse overall for a limited
interest. Keep a clean working directory and your `.hgignore` up-to-date and
you'll be fine.

To recap: we are caching directories containing only ignored, normal, removed,
or merged (modified) files.

==== Operation ====

This cache is used when listing unknown files (bare `hg status`) and when *not*
listing ignored files. It is updated during the traversal recursively

Creating/updating a directory cache entry is storing:
  * its mtime at nanosecond precision
  * its name
  * whether it only contains cache-able files (see "What to cache")
  * its direct sub-directories in the same fashion, tree-style

The cache is updated on-disk all at once, this is not an append-only cache.

During traversal, each directory is looked up in the cache. If its mtime is
unchanged and it only contains cache-able files, it has already been stat'ed
during the dirstate phase. Otherwise fall back to stat'ing and update the cache
entry.

=== Implementation ===
For now, I intend for this cache to be Rust-only. It's an optimization, only
has local impact and is not extremely portable (see above). Porting it to the
Python implementation later might be worth the effort.

==== Cache integrity and invalidation ====

The cache should be still valid if:
  * the dirstate parents haven't changed
  * the contents of all `ignore` files hasn't changed
  * the root folder hasn't changed filesystems (statvfs call?)

We can check this with a (probably SHA2-blake2) hash of everything in this
list, although it might be interesting to store the fs separately.

==== Proposed binary format ====
Everything in big-endian.

  * 1 byte: version of the cache
  * The literal bytes "dirs-traversal-cache": more obvious hexdumps 
  * 32 bytes: integrity hash
  * Recursively:
      * 2 bytes: length of the node path
      * <length> bytes: said path
      * 8 bytes: mtime
      * 1 byte: used as a boolean to signify if the node should be skipped
      * 4 bytes: number of sub-directories
      * for each sub-directory in the node:
          * recurse until no sub-directories

== Roadmap ==

 * (./) Write code to serialize/de-serialize the cache on disk (rewrite in progress)
 * (./) Log debug information if there are issues with the cache
 * {./} Write Rust unit tests for empty/good/corrupt cases
 * {./} Use the cache inside traversal code
 * {X} Check that test suite passes, possibly write more integration tests (in progress)
 * {X} Do some benchmarks to see the performance improvements
 * {X} Add config options to enable/disable the cache
 * {X} Add debug command to print the cache (in progress)
 * {X} Pass the config option to Rust

== See Also ==

 * Git has an [[https://git-scm.com/docs/git-update-index#_untracked_cache|untracked cache]] that caches... untracked files. While its purpose is different (we do not cache any files, much less untracked ones), maybe some of their bug fixes and implementation details can be borrowed from.

----
CategoryDeveloper CategoryNewFeatures