Note:
This page is primarily intended for developers of Mercurial.
Perfect Hash Index Plan
Status: Proof of Concept
Main proponents: JoergSonnenberger
This is a speculative project and does not represent any firm decisions on future behavior.
The plan is to provide a general-purpose, read-only, persistent hash table.
Goal
A persistent hash table can be used to provide O(1) access for exact lookups and O(log N) for prefix matching. This is a possible alternative to tries or transactional B-trees.
High-level concept
A perfect hash function (PHF) is derived from a list of non-repeating keys. An order-preserving PHF maps an arbitrary string to an index into that list such that every key is mapped to its own position. A non-order-preserving PHF allows the construction algorithm to shuffle the list; this degree of freedom allows for significantly smaller PHFs. A PHF is minimal if it creates a 1:1 mapping of the keys onto the output range, and non-minimal if the output range is larger than the key set (the construction may add artificial entries). The constructed PHF does not store the keys themselves, so if non-key inputs are possible, it is the responsibility of the user to verify that the entry found actually matches the input. For the nodemap case, the PHF yields a candidate revision and that revision has to be looked up in the revlog index to compare against the stored node id.
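As an illustration, the lookup-plus-verification step could look like the following sketch. Both callables are hypothetical stand-ins for the real interfaces, not the D9878 API:

```python
def node_to_rev(phf_lookup, node_of_rev, num_revs, node):
    """Map a full binary node id to a revision number, or None.

    phf_lookup(node) returns *some* candidate revision for any input,
    including inputs that are not keys, so the candidate has to be
    verified against the node id stored in the revlog index.
    Both callables are hypothetical stand-ins for illustration only.
    """
    candidate = phf_lookup(node)
    if candidate is None or candidate >= num_revs:
        return None
    # the PHF stores no keys: confirm the candidate really matches
    return candidate if node_of_rev(candidate) == node else None
```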
Modern construction techniques are probabilistic algorithms based on certain properties of hypergraphs. The input is normally hashed with a universal hash function into two or three hash values, and those are combined with a pre-computed table to derive the actual index. For the nodemap, the input is already a cryptographic hash and picking disjoint parts of it is normally good enough to qualify; a fallback that hashes the key again can be used to deal with "tricky" cases. The hypergraph properties guarantee that the PHF can be constructed in O(n) time with an expected O(1) number of attempts, and with a few harmless restrictions on the hash functions it will essentially succeed on the first attempt.
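Evaluating such a PHF is only a handful of table lookups. Below is a minimal sketch of a CHM-style three-hash evaluation, assuming an illustrative displacement table `g`; the actual on-disk format is whatever nbperf/D9878 defines:

```python
import struct

def phf_index(key, g, nkeys):
    """Evaluate a CHM-style order-preserving PHF over a 3-hypergraph.

    `g` is the pre-computed per-vertex table (roughly 1.24 * nkeys
    entries for the 3-graph construction). For nodemap keys the input
    is already a cryptographic hash, so three disjoint slices of it
    are good enough as the three vertex hashes.
    Illustrative sketch only; not the real nbperf/D9878 layout.
    """
    nverts = len(g)
    third = nverts // 3
    h0 = struct.unpack(">I", key[0:4])[0] % third
    h1 = third + struct.unpack(">I", key[4:8])[0] % third
    h2 = 2 * third + struct.unpack(">I", key[8:12])[0] % third
    return (g[h0] + g[h1] + g[h2]) % nkeys
```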
Detailed description
For the nodemap problem, a minimal order-preserving PHF can be used to map a (potential) node id to a 32-bit revision number using 5 bytes per revision. A lookup requires three accesses into the lookup data structure.
For the prefix search, a non-order-preserving PHF can be used to map a (prefix, prefix length) pair to a (revision, prefix length, is unique) entry. For every revision, the number of entries needed can be up to the length of its unique prefix, where entries for shorter prefixes are shared between revisions with a common prefix of that length. Each entry needs 5 bytes, the PHF construction adds 25% artificial entries, and an additional 3 bits have to be stored per entry. Assuming that almost all revisions require 6 or fewer hex digits, that is on average 3 entries per revision, or about 20 bytes of storage per revision. A common approach limits the initial prefix length tested to 6 (even if the input prefix is longer). If there is no matching entry in the table, a shorter prefix (e.g. 4) is tried. If there is an entry and it is unique, the input prefix is matched against the full node to decide between a match and a non-match. If the entry is not unique and the input prefix is longer, the table is probed again with a prefix length of 7, and so on. Based on the distribution of the NetBSD src tree and the Mozilla autoland repository, two steps are enough for >95% of all prefixes and the worst case is 5 steps. More elaborate schemes could be used, but are rarely worth it.
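A sketch of the probing loop described above; the `probe` and `full_node_of` callables are hypothetical stand-ins for whatever interface the real index would expose:

```python
def prefix_lookup(probe, full_node_of, hexprefix):
    """Resolve a hex node prefix via the (prefix, length) PHF table.

    probe(prefix, length) -> (revision, is_unique) or None, and
    full_node_of(rev) -> full hex node id, are hypothetical stand-ins.
    """
    length = min(len(hexprefix), 6)          # initial probe length
    fell_back = False
    while True:
        entry = probe(hexprefix[:length], length)
        if entry is None:
            if fell_back or length <= 4:
                return None                  # no revision has this prefix
            length, fell_back = 4, True      # retry once with a shorter prefix
            continue
        rev, is_unique = entry
        if is_unique:
            # guard against false positives by checking the full node id
            return rev if full_node_of(rev).startswith(hexprefix) else None
        if length >= len(hexprefix):
            raise LookupError("ambiguous prefix: %s" % hexprefix)
        length += 1                          # not unique yet, extend the probe
```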
A sample implementation for the nodemap problem can be found in D9878. It includes a Python version of the node lookup, a Python version of the PHF creation logic, and a C version of the node lookup. It uses a generational approach for updating the on-disk table: most entries are kept in one large table, while the last up to 1000 revisions are kept in a smaller current generation that is rewritten on every commit. Performance testing is a bit challenging:

1. The Python revlog implementation reads the full index and builds a Python dictionary for the nodes. The native dictionary getitem is about 20 times faster than the PHF-based logic, but this does not account for the time needed to build the dictionary in the first place, nor for the option of caching the result.

2. The C implementation currently wraps the index object to intercept rev/get_rev/has_node. The overhead of the Python function call is comparable to or higher than the total runtime of the C backend. This matters when comparing the absolute times with the Rust nodemap implementation.

3. The creation time of the Python version is, experimentally, about two orders of magnitude slower than what optimized C (or Rust) code would achieve. The initial prototype using a hacked version of NetBSD's nbperf required 0.4s, including pipe overhead, to compute the PHF for a roughly comparable input where the Python code needs 40s.
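The generational scheme could be sketched as follows, with plain dictionaries standing in for the on-disk PHF tables (names and the threshold are illustrative, not the D9878 implementation):

```python
class GenerationalNodemap:
    """Sketch of the two-generation scheme: a large main table that is
    rarely rebuilt plus a small current generation (up to ~1000 entries)
    that is rewritten on every commit.  Plain dicts stand in for the
    on-disk PHF tables; names and the threshold are illustrative.
    """

    GENERATION_LIMIT = 1000

    def __init__(self):
        self.main = {}     # node -> rev, rebuilt only when merging
        self.recent = {}   # node -> rev, rewritten on every commit

    def get_rev(self, node):
        # recent revisions are the most common lookup targets
        rev = self.recent.get(node)
        return rev if rev is not None else self.main.get(node)

    def add_revision(self, node, rev):
        self.recent[node] = rev
        if len(self.recent) > self.GENERATION_LIMIT:
            # fold the small generation into the main table and start over
            self.main.update(self.recent)
            self.recent.clear()
```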
The PHF index files in the prototype are accessed through memory mapping; rewrites are done with a new-file-plus-rename approach.
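A minimal sketch of that access and update strategy (path handling and function names are illustrative):

```python
import mmap
import os

def rewrite_index(path, data):
    """Atomically replace the on-disk PHF table: write to a temporary
    file and rename it over the old one, so readers see either the old
    or the new table but never a partial write.  Illustrative only.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as fp:
        fp.write(data)
        fp.flush()
        os.fsync(fp.fileno())
    os.rename(tmp, path)  # atomic on POSIX filesystems

def map_index(path):
    """Map the table read-only; lookups then touch only a few pages."""
    with open(path, "rb") as fp:
        return mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
```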
Performance of initial hack
The initial prototype uses a Python glue module to dump all nodes via a pipe to a custom version of nbperf. For the nodemap use case, the hash can be truncated and is expected to be well distributed enough to be directly usable; a fallback case with another hash function on top can follow later. The resulting perfect index is 22% of the size of the nodemap trie, but only covers exact matches. The creation takes ~0.4s for 946196 nodes with 105MB peak RSS. The access cost includes the verification of the entry.
Policy        | New repository and lookup of rev 5000 | Existing repository and lookup of rev 5000
py-baseline   | 1495ms                                 | 0.34μs
py+phash      | 17.34ms                                | 5.3μs
c-baseline    | 23.0ms                                 | 0.36μs
c+phash       | 16.9ms                                 | 0.28μs
rust-baseline | 17.0ms                                 | 0.52μs
"py-baseline" is the current pure implementation. It keeps a dictionary of nodes, so the repeated lookup is essentially the Python dictionary performance. "py+phash" uses the on-disk data structure, cost is a mix of general Python execution and struct performance. Given all that, a factor of 10 compared to the raw dictionary is still very good and quite competitive. The "c-baseline" includes the single scan of the C revlog module and the in-memory trie. This is logically quite near to the rust-implementation's on-disk format. The "c+phash" case is consistently faster, the cost of doing two table checks still needs to be verified. "rust-baseline" is the current permanent nodemap. It is quite a bit slower than than the phash case. No testing was done for rust+phash for technical reasons.