Note:
This page is primarily intended for developers of Mercurial.
Computed Index
Status: Project
Main proponents: Pierre-YvesDavid
This is a speculative project and does not represent any firm decisions on future behavior.
Add an intermediate layer of data storage, computed from the source of truth, but with stronger guarantees than the current caches.
This page has light content to start the discussion; some of the implementation details are more advanced than what is described here.
1. Goal
1.1. The problem
Currently there are two main places where we record data: the store and the cache. The store is the source of truth. The cache holds data computed from the store that may or may not be present and relevant to the current state of the repository. We never guarantee that the data in the cache are valid, which means we can freely add new cache entries and update their formats without requiring a repository format upgrade and new entries in requires. However, this also means that each cache file needs to implement its own validation mechanism and that some important caches can remain out of date.
This last part is problematic for multiple reasons:
* For some repositories, some data are too expensive to recompute from scratch when the cache is not up to date. For example, an invalid branchmap or tag's node cache can take minutes to recompute on some large repositories. Read-only operations typically do not update the cache on disk, so they pay this cost on each invocation, making the repository practically unusable.
* The most common cache validation key (tiprev, tipnode) is flawed, so a cache can appear valid without being valid.
* The items necessary to validate a cache can lead to increased storage (e.g. branchcache) or extra computation time (e.g. branchmap).
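As an illustration of the second point, here is a minimal toy model of one way a (tiprev, tipnode) key can be fooled. It assumes a filtered view in which a revision below the tip becomes hidden; this may not be the only failure mode the page has in mind. All names (`cache_key`, `expensive_summary`, the node strings) are hypothetical and are not Mercurial internals.

```python
# Toy model only: revisions are integers, nodes are short made-up strings.

def cache_key(visible_revs, nodes):
    """Return the (tiprev, tipnode) pair for a given view of the repository."""
    tiprev = max(visible_revs)
    return (tiprev, nodes[tiprev])


def expensive_summary(visible_revs):
    """Stand-in for a costly computation that depends on every visible revision."""
    return sorted(visible_revs)


nodes = {0: "a1", 1: "b2", 2: "c3"}

# The cache is filled while revisions 0, 1 and 2 are all visible.
view_before = {0, 1, 2}
cached_key = cache_key(view_before, nodes)
cached_value = expensive_summary(view_before)

# Revision 1 becomes hidden. The tip (revision 2) is untouched.
view_after = {0, 2}

# The key still matches, so the cache looks valid...
key_still_matches = cache_key(view_after, nodes) == cached_key
# ...but the cached value no longer describes the current view.
value_is_stale = expensive_summary(view_after) != cached_value
```

In this model the key carries no information about anything below the tip, which is why extra validation items (with their storage or computation cost) end up being needed.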
1.2. The Proposed Solution
We introduce a third space: the "Index" space. The index contains data fully derived from the store, but guaranteed to be in sync at the end of each transaction. Each change to the index needs to come with an associated change to requires, to make sure clients keep it up to date.
Note: since the data will be spread across multiple files, we will still need some way to validate that we read consistent data (all from the same transaction). However, the mechanism can be much simpler.
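The role of requires in this proposal can be sketched as follows: a client that does not know how to maintain an index refuses to touch the repository rather than leaving the index out of date. This is a simplified model, not Mercurial's actual code. The requirement name `exp-computed-index` and the `UnsupportedRequirement` exception are hypothetical; the other requirement names are existing ones used only as sample data.

```python
class UnsupportedRequirement(Exception):
    """Hypothetical error raised when a client lacks a needed feature."""


def check_requirements(requires_text, supported):
    """Parse a requires file (one entry per line) and reject unknown entries."""
    wanted = {line.strip() for line in requires_text.splitlines() if line.strip()}
    missing = sorted(wanted - supported)
    if missing:
        raise UnsupportedRequirement(
            "repository requires unknown features: " + ", ".join(missing)
        )
    return wanted


# Content of the requires file after an index was introduced (hypothetical entry last).
REPO_REQUIRES = "dotencode\nfncache\nrevlogv1\nstore\nexp-computed-index\n"

OLD_CLIENT = {"dotencode", "fncache", "revlogv1", "store"}
NEW_CLIENT = OLD_CLIENT | {"exp-computed-index"}

# A new client opens the repository and is expected to keep the index in sync.
opened = check_requirements(REPO_REQUIRES, NEW_CLIENT)

# An old client is rejected before it can write anything.
try:
    check_requirements(REPO_REQUIRES, OLD_CLIENT)
    old_client_rejected = False
except UnsupportedRequirement:
    old_client_rejected = True
```

The trade-off is the one described in the problem statement: caches can evolve without touching requires, while every index format change has to be announced there.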
2. Detailed description
We want to add an index and a windex directory, with the associated vfs. Some of the existing caches could be migrated there (list TBD). Some of the new features we write could go directly there.
We want to use append-only-friendly storage as much as possible, as this makes transaction consistency easier. Having extra data (from an in-progress or later transaction) at the end of a file can be harmless if properly detected. This is also a good opportunity to introduce a repository-wide identifier of the current state of the repository.
Some of the data currently in caches could move directly inside the revlog indexes.
If we use more append-only files, we need good handling of strip and rollback.
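A minimal sketch of the append-only idea, assuming the committed size of each file is recorded somewhere at transaction close. Readers ignore any bytes past that size, and rollback is a truncation back to it. The file name `example.idx`, the record layout and the helper names are hypothetical.

```python
import os
import tempfile


def append_record(path, payload):
    """Append bytes to an index file and return the resulting file size."""
    with open(path, "ab") as fh:
        fh.write(payload)
    return os.path.getsize(path)


def read_committed(path, committed_size):
    """Read only the part of the file covered by the last closed transaction."""
    with open(path, "rb") as fh:
        return fh.read(committed_size)


def rollback(path, committed_size):
    """Drop data appended by an aborted transaction."""
    os.truncate(path, committed_size)


workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "example.idx")

# First transaction closes: remember how much of the file is valid.
committed = append_record(index_path, b"rev0;rev1;")

# A second transaction is in progress and has already appended data.
size_in_progress = append_record(index_path, b"rev2;")

# A concurrent reader only sees the committed prefix.
visible = read_committed(index_path, committed)

# The second transaction aborts: truncate back to the committed size.
rollback(index_path, committed)
size_after_rollback = os.path.getsize(index_path)
```

Strip is harder than this sketch suggests, since it can remove data from the middle of what was already committed, which is why it needs dedicated handling.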
3. Roadmap
* indexvfs and windexvfs
* having a "pointer file" atomically updated by the transaction to get a consistent view of the repository
* investigate current caches that could become indexes:
  * either as new files in index
  * or directly in the changelog indexes
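The pointer file item can be sketched as below, assuming the pointer records the valid size of each index file and is swapped into place with an atomic rename. The JSON layout, the file names and the helper names are assumptions made for illustration, not a proposed format.

```python
import json
import os
import tempfile


def write_pointer(directory, state):
    """Write the new pointer to a temporary file, then atomically replace the old one."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="pointer-", suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh, sort_keys=True)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace swaps the file in one step: readers see the old or the new pointer.
    os.replace(tmp_path, os.path.join(directory, "pointer"))


def read_pointer(directory):
    """Return the sizes that were valid at the end of the last transaction."""
    with open(os.path.join(directory, "pointer")) as fh:
        return json.load(fh)


index_dir = tempfile.mkdtemp()

# End of a first transaction: record the valid size of each (hypothetical) index file.
write_pointer(index_dir, {"branchmap.idx": 10, "tags.idx": 4})
first_view = read_pointer(index_dir)

# End of a second transaction: both sizes are published together.
write_pointer(index_dir, {"branchmap.idx": 15, "tags.idx": 9})
second_view = read_pointer(index_dir)

leftover_tmp_files = [n for n in os.listdir(index_dir) if n.endswith(".tmp")]
```

Because all sizes are published in a single rename, a reader never combines one index file from one transaction with another index file from a different transaction.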