Note:
This page is primarily intended for developers of Mercurial.
This document is a proposal only. It is intended only for repositories that need it for performance or scaling reasons.
The current dirstate format is very simple: it is a list that stores, for every file, the filename, the state of the file, its modification date and its mode. It also stores the source for tracked copies and moves. Handling such a simple format is also simple: we read the whole dirstate into memory, process it there, and when we want to write it we rewrite the whole file.
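The read-everything, rewrite-everything model described above can be sketched as follows. This is a simplified illustration, not the actual on-disk layout: the record header `_HEADER`, the function names `read_all` and `write_all`, and the convention of storing the copy source after a NUL byte are assumptions made for this example.

```python
import struct

# Hypothetical per-entry header: state (1 byte), mode, mtime, stored-name length.
_HEADER = struct.Struct(">clll")


def write_all(entries):
    """Serialize every entry; changing one file still rewrites everything."""
    chunks = []
    for name, (state, mode, mtime, copysource) in sorted(entries.items()):
        # Assumption: a copy source, if any, follows the name after a NUL byte.
        stored = name if copysource is None else name + b"\0" + copysource
        chunks.append(_HEADER.pack(state, mode, mtime, len(stored)))
        chunks.append(stored)
    return b"".join(chunks)


def read_all(data):
    """Parse the whole blob into a dict; there is no way to read one entry."""
    entries = {}
    pos = 0
    while pos < len(data):
        state, mode, mtime, length = _HEADER.unpack_from(data, pos)
        pos += _HEADER.size
        stored = data[pos:pos + length]
        pos += length
        name, _sep, copysource = stored.partition(b"\0")
        entries[name] = (state, mode, mtime, copysource or None)
    return entries
```

Both functions are linear in the total number of tracked files, regardless of how many files an operation actually touched.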
Unfortunately, for big repos the dirstate can easily exceed 50 megabytes. For some operations that involve multiple dirstate reads and writes, such as rebase, dirstate handling accounts for 30% of the execution time. While there are other efforts to limit the working directory size (sparse extension, narrow clones), sometimes the user may want to have the whole big working directory.
Before hgwatchman was created, there was little need for a better format, because every status operation iterated over all files in the repo anyway. Now that we know which files have changed, we can check just those files in the dirstate.
To handle such a big dirstate faster, we need to store it in a more organized format:
- we should be able to modify it without rewriting the whole dirstate, as many operations touch only a small portion of the files in a big repo
- it should be sorted to allow us to do many read operations faster (for example finding files with nonnormal status or files without mtime in dirstate)
- it should also incorporate the other lists of files that we generate from the dirstate, so we don't have to regenerate them on every dirstate read (filefoldmap, dirs, copymap etc)
I'm trying to implement a prototype of such a dirstate with SQLite as the backend storage. I'm currently implementing it as an extension, which will soon be available in the https://bitbucket.org/facebook/hg-experimental repo
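As a rough illustration of how an SQLite backend could address the requirements listed above, here is a minimal sketch using Python's standard `sqlite3` module. It is not the prototype's code: the table and column names, the helper functions, and the use of a NULL `mtime` to mean "no mtime recorded" are all assumptions made for this example.

```python
import sqlite3


def open_dirstate(path=":memory:"):
    """Open (or create) a hypothetical SQLite-backed dirstate."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"  # the primary key keeps lookups by name indexed
        " state TEXT NOT NULL,"
        " mode INTEGER,"
        " mtime INTEGER,"          # assumption: NULL means no mtime recorded
        " copysource TEXT)"
    )
    # Secondary index so that state queries need not scan every row.
    db.execute("CREATE INDEX IF NOT EXISTS files_state ON files (state)")
    return db


def update_files(db, entries):
    """Write only the given rows instead of rewriting the whole dirstate."""
    with db:  # one transaction, committed on success
        db.executemany(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", entries
        )


def nonnormal(db):
    """Names of files with a nonnormal state or without an mtime."""
    rows = db.execute(
        "SELECT name FROM files"
        " WHERE state != 'n' OR mtime IS NULL ORDER BY name"
    )
    return [row[0] for row in rows]


def copymap(db):
    """Copy destinations mapped to their sources, read straight from the table."""
    return dict(
        db.execute(
            "SELECT name, copysource FROM files WHERE copysource IS NOT NULL"
        )
    )
```

With this layout, an operation that touches a handful of files calls `update_files` with just those rows, and derived views such as `copymap` become queries rather than structures rebuilt on every read.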