Note:
This page is primarily intended for developers of Mercurial.
Dirstate Format Improvements Plan
1. Current state:
- The dirstate is stored in a file and reflects the state of all the files in the repository tracked by hg
- The dirstate is written in an arbitrary order (Python dict iteration order)
- Status spends most of its time reading, parsing, and using the dirstate, and status is used by many frequently run commands. Here is what it does:
  - 3.1: Get the list of files that changed recently from hg watchman
  - 3.2: Read the entire dirstate and store it in memory
  - 3.3: Iterate through the dirstate and check the status of each file returned by watchman. We are interested in two questions:
    - 3.3.1: Is the file in the dirstate?
    - 3.3.2: Is the file modified/added/removed?
- For a dirstate of about 100 MB, it takes 5s to build and write it and 350ms to read it
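The status flow described above can be sketched roughly as follows. This is a simplified illustration, not Mercurial's actual code: the function name `classify_changed` and the plain-dict representation of the dirstate are assumptions made for the example, and the list of changed files stands in for whatever watchman returns.

```python
def classify_changed(dirstate, changed_files):
    """Classify files reported as changed against an in-memory dirstate.

    dirstate: dict mapping filename -> one-letter state
              ('n' normal, 'a' added, 'r' removed, 'm' merged).
    changed_files: iterable of filenames (stand-in for watchman output).
    Both shapes are assumptions made for this sketch.
    """
    result = {"untracked": [], "added": [], "removed": [], "check": []}
    for name in changed_files:
        state = dirstate.get(name)       # question 1: is it in the dirstate?
        if state is None:
            result["untracked"].append(name)
        elif state == "a":               # question 2: added / removed / other
            result["added"].append(name)
        elif state == "r":
            result["removed"].append(name)
        else:
            # Tracked file that changed on disk: needs a closer comparison
            # to decide whether it is really modified.
            result["check"].append(name)
    return result
```

Note that the whole dirstate has to be loaded into `dirstate` before the loop can start, even when only a handful of files changed.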
2. Improvement plan:
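We can improve steps 3.2 (reading the entire dirstate), 3.3.1 (checking whether a file is in the dirstate), and 3.3.2 (checking whether a file is modified/added/removed).
Step 3.1 (getting the list of changed files from watchman) is not easy to improve.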
3.3.1
- Build a bloom filter with all the filenames, store it at the beginning of the dirstate file, and use it to check whether files are in the dirstate.
The bloom filter takes about 5s to be built on a large repository at Facebook with 0.001% precision (building the dirstate takes about 5s as well). We can rebuild the bloom filter semi-regularly (every day/week?) to ensure that we don't have too many false positives stemming from files getting untracked.
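As an illustration of the idea, here is a minimal bloom filter sketch. It is not the proposed implementation: the class name `BloomFilter`, the SHA-1 based double hashing, and the reading of "0.001% precision" as a target false-positive rate of 0.00001 are all assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical bloom filter over filenames (bytes)."""

    def __init__(self, expected_items, false_positive_rate):
        n = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.num_bits = max(
            8, int(math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        )
        self.num_hashes = max(1, int(round(self.num_bits / n * math.log(2))))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, name):
        # Derive k bit positions from one digest (double hashing).
        digest = hashlib.sha1(name).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, name):
        # False means "definitely not tracked"; True means "probably tracked".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(name)
        )
```

A filter like this cannot forget a filename once it has been added, which is why untracked files turn into false positives until the filter is rebuilt.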
This requires a dirstate format change, so we need to discuss it with the other people who want dirstate format changes in order to bundle all the changes at once.
Alternatively, we can sort the dirstate to answer the 3.3.1 question without doing 3.2 (reading the entire dirstate). We need data to know whether it is worth it.
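If the entries are sorted by filename, the membership question can be answered with a binary search instead of a full scan. The sketch below shows this on an in-memory sorted list. The helper name `is_tracked` is hypothetical, and a real on-disk version would also need some way to locate variable-length entries by index, which is not addressed here.

```python
import bisect


def is_tracked(sorted_names, name):
    """Return True if name is present in the sorted list of filenames."""
    i = bisect.bisect_left(sorted_names, name)
    return i < len(sorted_names) and sorted_names[i] == name
```

For example, `is_tracked([b"a", b"b/c", b"d"], b"b/c")` examines O(log n) entries rather than all of them.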
3.3.2
- We can store the modified/added/removed file entries at the top of the dirstate to short-circuit the iteration over all the entries. A diff is out for review. The improvement seems to cut hg status time by about 50% on a large repository at Facebook.
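A rough sketch of this layout, under simplifying assumptions: entries are `(filename, state)` pairs, "non-normal" means any state other than `'n'`, and a count of non-normal entries is stored alongside the entry list. The function names `split_entries` and `read_nonnormal` are hypothetical and do not come from the diff under review.

```python
def split_entries(dirstate):
    """Order entries so that non-normal ones come first.

    dirstate: dict mapping filename -> one-letter state.
    Returns (number of non-normal entries, ordered list of entries).
    """
    nonnormal = sorted((f, s) for f, s in dirstate.items() if s != "n")
    normal = sorted((f, s) for f, s in dirstate.items() if s == "n")
    return len(nonnormal), nonnormal + normal


def read_nonnormal(count, entries):
    """Read only the leading non-normal entries, skipping the rest."""
    return dict(entries[:count])
```

With this ordering, a reader that only cares about modified/added/removed files can stop after `count` entries instead of iterating over the whole dirstate.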
3.2
- If we implement all the changes described above, we no longer need to read the entire dirstate
3. New on-disk format
The new dirstate format will look like the previous format with the addition of:
- Version of the format (as we will have more than one type of dirstate format)
- Awareness of directories / tree-structured / stem compression
- Checksums for files in lookup state so we don't have to visit revlogs
- Sorted order
- Number of entries (to avoid guessing the size of the dictionary to hold the dirstate)
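To make the version and entry-count fields concrete, here is a purely hypothetical header sketch. The field widths, byte order, and names (`HEADER`, `write_header`, `read_header`) are assumptions for illustration only; the actual layout is still to be decided.

```python
import struct

# Hypothetical header: 2-byte format version, 4-byte entry count, big-endian.
HEADER = struct.Struct(">HI")


def write_header(version, num_entries):
    """Serialize the hypothetical header."""
    return HEADER.pack(version, num_entries)


def read_header(data):
    """Return (version, num_entries, offset of the first entry)."""
    version, num_entries = HEADER.unpack_from(data, 0)
    return version, num_entries, HEADER.size
```

A reader can check the version before parsing anything else, and knows up front how many entries to expect.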