Note:
This page is primarily intended for developers of Mercurial.
'fncache2' repo format
This is a proposed evolution of the fncache format that addresses some of its shortcomings and complexity.
Contents
1. Some background: how fncache works, and what's tricky
The fncache format uses two encoding mechanisms to safely and portably handle metadata names.
- The "basic" encoding is used for pathnames below a limit of 120 bytes.
- The "hashed" encoding is used for longer names.
Basic encoding is quite easy to implement efficiently, but hashed encoding is complex and difficult to understand.
It's not immediately easy to see from the Python implementation of the hashed encoder, but a major difficulty with the scheme is that it *must* be implemented using multiple passes, extra code paths, and copies, due to its data dependencies and differences from basic encoding.
- The SHA-1 hash must be computed over the dir-encoded name, which means that we have to dir-encode a path before we do anything else.
- We need a separate code path specially for dir-encoding.
- The lower case encoding scheme is different than the one used for basic encoding, so it too must have a separate implementation.
- We then aux-encode the lower-encoded text.
Once that's done, there's an additional very complicated copy-and-fixup step (named "hashmangle" in the patch that implements this algorithm in C). This may truncate and tweak every path component, and it then has to do lots of further surgery to glue together all the parts together with the SHA-1 hash and the original suffix.
We have five passes over the name here: direncode, hash, lowerencode, auxencode, mangle. When implemented in C with an eye on efficiency, the final "mangle" step is particularly tricky.
2. A somewhat simpler hashing scheme
We retain the existing basic encoding scheme. For longer names:
- Compute the hash (presumably still SHA-1) of the original pathname, probably without its extension (otherwise ".i" and ".d" files will hash differently, and will be less likely to be laid out contiguously on disk.)
- Basic-encode the original pathname, using the encoding code we already have. Either stop or truncate at 200 bytes.
- Truncate each path component and fix up any dangling spaces or dots that arise.
- Append the hash to the end of the result of step 3.
- Tack the original extension (".i" or ".d") back onto the end.
This gives us three passes: hash, basic encode, fix up; and then 42 bytes of extra work to append the hash and extension.
3. An even simpler hashing scheme
_maxstorepathlen = 120 _winres3 = ('aux', 'con', 'prn', 'nul' _winres4 = ('com', 'lpt') _encchar = ("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" "~!~#$%&'()~+,-~~0123456789~;~=~~" "@abcdefghijklmnopqrstuvwxyz[~]^_" "`abcdefghijklmnopqrstuvwxyz{~}~~" "~abcdefghijklmnopqrstuvwxyz{~}~~" "~!~#$%&'()~+,-~~0123456789~;~=~~" "@abcdefghijklmnopqrstuvwxyz[~]^_" "`abcdefghijklmnopqrstuvwxyz{~}~~") def _foldencode(f): # preserves size f = ''.join([_encchar[ord(c)] for c in f]) l = len(f) if l == 3 and f in _winres3: f = f[:2] + '~' if (l == 4 and f[3] <= '9' and f[3] >= '0' and f[:3] in _winres4): f = f[:3] + '~' return f def _cutdirs(path): spaceleft = _maxstorepathlen - 45 parts = [] for s in path.split('/'): if len(s) > 8: s = s[:8] if parts: spaceleft -= 1 # for '/' spaceleft -= len(s) if spaceleft < 0: break parts.append(s) return '/'.join(map(_foldencode, parts)) def _hybridencode2(f): ef = _pathencode(f) if ef is None: f = f[5:] # cut off "data/" prefix, suffix = f[:-2], f[-2:] digest = _sha(prefix).hexdigest() ef = 'dh/' + _cutdirs(prefix) + digest + suffix return ef
The cutdirs function in C
static const Py_ssize_t maxstorepathlen = 120; static const char encchar[256] = "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" "~!~#$%&'()~+,-~~0123456789~;~=~~" "@abcdefghijklmnopqrstuvwxyz[~]^_" "`abcdefghijklmnopqrstuvwxyz{~}~~" "~abcdefghijklmnopqrstuvwxyz{~}~~" "~!~#$%&'()~+,-~~0123456789~;~=~~" "@abcdefghijklmnopqrstuvwxyz[~]^_" "`abcdefghijklmnopqrstuvwxyz{~}~~"; /* this encoding folds */ static inline char encodechar(char c) { return encchar[0xff & c]; } static Py_ssize_t _cutdirs(char *dest, Py_ssize_t destlen, size_t destsize, const char *src, Py_ssize_t len) { Py_ssize_t i = 0, spaceleft = maxstorepathlen - 45 + 1; char seg[8]; int seglen = 0; uint32_t cmp; while (i < len && spaceleft > 0) { if (src[i] == '/' || src[i] == '\0') { if (seglen != 0) { if (seglen == 3) { cmp = seg[0] << 16 | seg[1] << 8 | seg[2]; if ( cmp == 0x617578 /* aux */ || cmp == 0x636f6e /* con */ || cmp == 0x70726e /* prn */ || cmp == 0x6e756c /* nul */) seg[2] = '~'; } else if (seglen == 4 && seg[3] <= '9' && seg[3] >= '0') { cmp = seg[0] << 16 | seg[1] << 8 | seg[2]; if ( cmp == 0x636f6d /* com0..9 */ || cmp == 0x6c7074 /* lpt0..9 */) seg[3] = '~'; } memcopy(dest, &destlen, destsize, &seg, seglen); seglen = 0; } charcopy(dest, &destlen, destsize, src[i++]); spaceleft--; } else if (seglen == sizeof(seg)) { i++; } else { seg[seglen++] = encodechar(src[i++]); spaceleft--; } } return destlen; }
Patches: http://selenic.com/pipermail/mercurial-devel/2012-September/044718.html