Note:
This page is primarily intended for developers of Mercurial.
Revlog version 2
Status: Project
Main proponents: Raphaël Gomès, Pierre-Yves David, Simon Sapin
This is a speculative project and does not represent any firm decisions on future behavior.
There is already a planned V2 format, but it's been left untouched for a while, and I'm struggling to find much info about what it was trying to achieve. If any of the developers involved in that effort want to join in and tell us more about their findings, it would be very much appreciated.
1. Goal
The current revlog format (RevlogNG, henceforth Revlogv1 or v1) has some known limitations and carries too little information for some of the features Mercurial offers to perform well enough. A new version of the revlog will try to add the missing information and fix as many of the identified performance issues as possible, while not introducing new ones.
2. Detailed description
2.1. Identified issues with v1
- No support for hash version
For the SHA1 -> SHA2 (and future) hash transitions, we need to store a version number. Is it worth having it per revision, or per revlog, or per repo?
- No support for files larger than 4GB (will this always be LFS territory?)
There were also previous discussions on the subject:
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/093657.html
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-May/097960.html
2.2. New features
- Support for sidedata
- Initial testing shows much better performance in copytracing workflows when copytracing info is stored in sidedata directly in the revlog. Other mechanisms might want to leverage sidedata, so we need a sound base for future work.
- Support for unified revlog
Like for the hash version and sidedata, this is not something that will necessarily need to be tackled at the same time as RevlogV2, but we need to allow for this to be implemented without a separate format change.
- Having a revlog-level way of signaling persistent nodemap usage would be nice.
2.3. Implied requirements
V2 implies generaldelta and sparse-revlog
2.4. V1 format
For reference purposes:
- a: 6 bytes: offset -- This is how far into the data file we need to go to find the appropriate delta
- b: 2 bytes: flags
- c: 4 bytes: compressed length -- Once we are offset into the data file, this is how much we read to get the delta
- d: 4 bytes: uncompressed length -- This is just an optimization. It's the size of the file at this revision
- e: 4 bytes: base revision -- The last revision where the entire file is stored.
- f: 4 bytes: link revision -- Another optimization. Which revision is this? Which commit is this?
- g: 4 bytes: parent 1 revision -- Revision of parent 1 (e.g., 12, 122)
- h: 4 bytes: parent 2 revision -- Revision of parent 2
- i: 32 bytes: nodeid -- A unique identifier, also used in verification (hash of content + parent IDs)
Flattened out, we have:
aaaaaa bb cccc dddd eeee ffff gggg hhhh iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
64 bytes total
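The field list above can be sketched with Python's `struct` module. This is an illustration of the listed widths only, not Mercurial's actual parsing code: the big-endian byte order, the choice to carry fields a and b in a single 64-bit integer (48 bits of offset, 16 bits of flags), the signed 32-bit integers, and the names `V1_ENTRY`, `pack_v1` and `unpack_v1` are all assumptions made for the example.

```python
import struct

# Assumed layout: one 64-bit integer holding offset (48 bits) and flags
# (16 bits), six signed 32-bit integers (c to h), then 32 bytes of nodeid.
V1_ENTRY = struct.Struct(">Q6i32s")


def pack_v1(offset, flags, comp_len, uncomp_len, base_rev, link_rev, p1, p2, node):
    """Pack one hypothetical v1 index entry into 64 bytes."""
    offset_flags = (offset << 16) | flags
    # struct pads a short nodeid (e.g. a 20-byte SHA-1) with NUL bytes.
    return V1_ENTRY.pack(
        offset_flags, comp_len, uncomp_len, base_rev, link_rev, p1, p2, node
    )


def unpack_v1(data):
    """Unpack 64 bytes into a dict keyed by field name."""
    (offset_flags, comp_len, uncomp_len,
     base_rev, link_rev, p1, p2, node) = V1_ENTRY.unpack(data)
    return {
        "offset": offset_flags >> 16,
        "flags": offset_flags & 0xFFFF,
        "comp_len": comp_len,
        "uncomp_len": uncomp_len,
        "base_rev": base_rev,
        "link_rev": link_rev,
        "p1": p1,
        "p2": p2,
        "node": node,
    }
```

For example, `unpack_v1(pack_v1(1024, 0, 100, 250, 0, 5, 4, -1, b"\x11" * 20))` round-trips the values, with -1 standing in for a missing second parent.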
2.5. V2 format
There is no information we want to get rid of from v1 in v2 that I know of, so this change will be purely additive (modulo potential reordering):
- a: 6 bytes: offset -- This is how far into the data file we need to go to find the appropriate delta (so revlogs are currently capped at 281TB)
- b: 2 bytes: flags (currently we use ###)
- c: 4 bytes: compressed length -- Once we are offset into the data file, this is how much we read to get the delta
- d: 4 bytes: uncompressed length -- This is just an optimization. It's the size of the file at this revision
- e: 4 bytes: base revision -- The last revision where the entire file is stored.
- f: 4 bytes: link revision -- Another optimization. Which revision is this? Which commit is this? (used by manifestlog and filelog)
- g: 4 bytes: parent 1 revision -- Revision of parent 1 (e.g., 12, 122)
- h: 4 bytes: parent 2 revision -- Revision of parent 2
- i: 32 bytes: nodeid -- A unique identifier, also used in verification (hash of content + parent IDs)
- j: 8 bytes: UnifiedRevlog identifier
- k: 4 bytes: rank -- Number of changesets under this one (this one included). Useful for various graph algorithms
- l: 8 bytes: sidedata offset -- How far into the data file we need to go to find the sidedata (we could also use a+c, but that makes things more constrained)
- m: 4 bytes: compressed sidedata length -- Once we are offset into the data file, this is how much we read to get the sidedata
- 88 bytes total
- adding 8 bytes of padding might be useful to align an entry to 96 bytes. Maybe it would be useful to have more flags too.
aaaaaa bb cccc dddd eeee ffff gggg hhhh iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii jjjjjjjj kkkk llllllll mmmm xxxxxxxx
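As a sanity check on the sizes above, the proposed entry can be sketched with Python's `struct` module. Nothing here is a decided format: the big-endian byte order, the packing of a and b into one 64-bit integer, the unsigned types chosen for j to m, the trailing 8 padding bytes, and the names `V2_ENTRY`, `pack_v2` and `unpack_v2` are assumptions made for the example.

```python
import struct

# v1 fields (a to i), then j: 8 bytes, k: 4 bytes, l: 8 bytes, m: 4 bytes.
V2_ENTRY_UNPADDED = struct.Struct(">Q6i32sQIQI")
# Same layout with 8 trailing padding bytes, giving 96 bytes per entry.
V2_ENTRY = struct.Struct(">Q6i32sQIQI8x")

# A 6-byte offset can address 2**48 bytes, roughly 281 TB.
MAX_OFFSET = 2**48 - 1


def pack_v2(offset, flags, comp_len, uncomp_len, base_rev, link_rev,
            p1, p2, node, unified_id, rank, sd_offset, sd_len):
    """Pack one hypothetical v2 index entry into 96 bytes."""
    if not 0 <= offset <= MAX_OFFSET:
        raise ValueError("offset does not fit in 6 bytes")
    return V2_ENTRY.pack(
        (offset << 16) | flags, comp_len, uncomp_len, base_rev, link_rev,
        p1, p2, node, unified_id, rank, sd_offset, sd_len,
    )


def unpack_v2(data):
    """Unpack 96 bytes into a dict keyed by field name."""
    names = ("offset_flags", "comp_len", "uncomp_len", "base_rev",
             "link_rev", "p1", "p2", "node", "unified_id", "rank",
             "sd_offset", "sd_len")
    fields = dict(zip(names, V2_ENTRY.unpack(data)))
    offset_flags = fields.pop("offset_flags")
    fields["offset"] = offset_flags >> 16
    fields["flags"] = offset_flags & 0xFFFF
    return fields
```

Under these assumptions `V2_ENTRY_UNPADDED.size` is 88 and `V2_ENTRY.size` is 96, matching the totals listed above.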
3. Roadmap
TODO: various steps that need to be performed