Note:
This page is primarily intended for developers of Mercurial.
Revlog version 2
Status: Project
Main proponents: Raphaël Gomès, Pierre-Yves David, Simon Sapin
This is a speculative project and does not represent any firm decisions on future behavior.
There is already a planned V2 format, but it's been left untouched for a while, and I'm struggling to find much info about what it was trying to achieve. If any of the developers involved in that effort want to join in and tell us more about their findings, it would be very much appreciated.
1. Goal
The current revlog format (RevlogNG, henceforth Revlogv1 or v1) has some known limitations and carries too little information for some of the features Mercurial offers to perform well enough. A new version of the revlog will try to add the missing information and fix as many of the identified performance issues as possible, while not introducing new ones.
2. Detailed description
2.1. Identified issues with v1
- No support for hash version
For the SHA1 -> SHA2 (and future) hash transitions, we need to store a version number. Is it worth having it per revision, or per revlog, or per repo?
- No support for files larger than 4GB (will this always be LFS territory?)
2.2. New features
- No support for sidedata
- Initial testing shows much better performance in copytracing workflows when storing copytracing info in sidedata directly in the revlog. Other mechanisms might want to leverage sidedata, so we need a sound base for future work.
- No support for unified revlog
As with the hash version and sidedata, this is not something that necessarily needs to be tackled at the same time as RevlogV2, but we need to allow for it to be implemented without a separate format change.
- Having a revlog-level way to signal persistent nodemap usage would be nice.
2.3. Implied requirements
V2 implies generaldelta and sparse-revlog
2.4. V1 format
For reference purposes:
- a: 6 bytes: offset -- This is how far into the data file we need to go to find the appropriate delta
- b: 2 bytes: flags
- c: 4 bytes: compressed length -- Once we are offset into the data file, this is how much we read to get the delta
- d: 4 bytes: uncompressed length -- This is just an optimization. It's the size of the file at this revision
- e: 4 bytes: base revision -- The last revision where the entire file is stored.
- f: 4 bytes: link revision -- Another optimization. Which revision is this? Which commit is this?
- g: 4 bytes: parent 1 revision -- Revision of parent 1 (e.g., 12, 122)
- h: 4 bytes: parent 2 revision -- Revision of parent 2
- i: 32 bytes: nodeid -- A unique identifier, also used in verification (hash of content + parent IDs)
Flattened out, we have:
aaaaaa bb cccc dddd eeee ffff gggg hhhh iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
64 bytes total
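As a minimal sketch of how a 64-byte entry with these field widths could be decoded, the snippet below round-trips one record with Python's `struct` module. Big-endian byte order, signed 32-bit integers, and reading the offset and flags together as one 8-byte integer are assumptions of this sketch; the names `V1Entry`, `parse_v1_entry` and `pack_v1_entry` are hypothetical and are not Mercurial's API.

```python
import struct
from collections import namedtuple

# Assumed encoding: big-endian; offset (6 bytes) and flags (2 bytes) are read
# together as one unsigned 64-bit integer, followed by six signed 32-bit
# integers (fields c to h) and the 32-byte nodeid (field i).
V1_ENTRY = struct.Struct(">Qiiiiii32s")
assert V1_ENTRY.size == 64  # matches the "64 bytes total" above

V1Entry = namedtuple(
    "V1Entry",
    "offset flags compressed_length uncompressed_length "
    "base_rev link_rev p1_rev p2_rev nodeid",
)


def parse_v1_entry(raw):
    """Decode one 64-byte index entry into a V1Entry."""
    offset_flags, clen, ulen, base, link, p1, p2, node = V1_ENTRY.unpack(raw)
    # The upper 48 bits hold the offset, the lower 16 bits hold the flags.
    return V1Entry(offset_flags >> 16, offset_flags & 0xFFFF,
                   clen, ulen, base, link, p1, p2, node)


def pack_v1_entry(entry):
    """Encode a V1Entry back into its 64-byte form."""
    offset_flags = (entry.offset << 16) | entry.flags
    return V1_ENTRY.pack(offset_flags, entry.compressed_length,
                         entry.uncompressed_length, entry.base_rev,
                         entry.link_rev, entry.p1_rev, entry.p2_rev,
                         entry.nodeid)
```

Treating `a` and `b` as a single 8-byte integer keeps the format string simple, since `struct` has no native 6-byte integer code.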
2.5. V2 format
There is no information we want to get rid of from v1 in v2 that I know of, so this change will be purely additive (modulo potential reordering):
- a: 6 bytes: offset -- This is how far into the data file we need to go to find the appropriate delta (so revlogs are currently capped at 281TB max)
- b: 2 bytes: flags (currently we use ###)
- c: 4 bytes: compressed length -- Once we are offset into the data file, this is how much we read to get the delta
- d: 4 bytes: uncompressed length -- This is just an optimization. It's the size of the file at this revision
- e: 4 bytes: base revision -- The last revision where the entire file is stored.
- f: 4 bytes: link revision -- Another optimization. Which revision is this? Which commit is this? (used by manifestlog and filelog)
- g: 4 bytes: parent 1 revision -- Revision of parent 1 (e.g., 12, 122)
- h: 4 bytes: parent 2 revision -- Revision of parent 2
- i: 6 bytes: sidedata offset -- How far into the data file we need to go to find the sidedata (we could also use a+c, but that makes things more constrained)
- j: 4 bytes: compressed sidedata length -- Once we are offset into the data file, this is how much we read to get the sidedata
- k: 4 bytes: rank -- Number of changesets under this one (this one included). Useful for various graph algorithms
- x: 32 bytes: nodeid -- A unique identifier, also used in verification (hash of content + parent IDs)
- y: 8 bytes: UnifiedRevlog identifier (maybe only 5 bytes?)
- 86 bytes total (try to do 128 bytes for alignment purposes?)
- Adding 10 bytes of padding might be useful to align things on 96 bytes. This would also leave room for more flags.
aaaaaa bb cccc dddd eeee ffff gggg hhhh iiiiii jjjj kkkk xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx yyyyyyyy
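The size arithmetic in this proposal can be checked with a short sketch. The field names below are hypothetical labels for the letters in the layout, the widths are the ones proposed in the list, and none of this represents a finalized format.

```python
# (letter, hypothetical name, width in bytes), as proposed in the list above
V2_FIELDS = [
    ("a", "offset", 6),
    ("b", "flags", 2),
    ("c", "compressed_length", 4),
    ("d", "uncompressed_length", 4),
    ("e", "base_rev", 4),
    ("f", "link_rev", 4),
    ("g", "p1_rev", 4),
    ("h", "p2_rev", 4),
    ("i", "sidedata_offset", 6),
    ("j", "sidedata_compressed_length", 4),
    ("k", "rank", 4),
    ("x", "nodeid", 32),
    ("y", "unified_revlog_id", 8),
]


def entry_size(fields):
    """Total entry size in bytes."""
    return sum(width for _, _, width in fields)


def padding_to(size, alignment):
    """Bytes of padding needed to round size up to a multiple of alignment."""
    return -size % alignment


def flatten(fields):
    """Render the one-letter-per-byte layout used on this page."""
    return " ".join(letter * width for letter, _, width in fields)


def max_offset(width_bytes):
    """Number of distinct byte offsets an unsigned field of this width can address."""
    return 2 ** (8 * width_bytes)


total = entry_size(V2_FIELDS)
print(total)                   # 86
print(padding_to(total, 96))   # 10
print(padding_to(total, 128))  # 42
print(max_offset(6))           # 281474976710656 bytes, about 281 TB
print(flatten(V2_FIELDS))
```

Keeping the fields in a table like this makes it cheap to re-evaluate the total and the padding if, for example, the UnifiedRevlog identifier shrinks to 5 bytes.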
3. Roadmap
TODO: various steps that need to be performed