SHA-1 is cryptographically weakened. Mercurial needs to switch to a strong hash function.
Goals
- New hash algorithm should be cryptographically secure.
- New hash algorithm should be fast, if possible (SHA-1 hashing is already a bottleneck in some operations).
- Mercurial should support N hash algorithms without requiring invasive changes to storage data structures, wire protocol communication, etc. (This is because whatever we replace SHA-1 with will presumably be broken in several years anyway and we shouldn't need to retool everything to roll out a new hash algorithm.)
- The transition plan will be up to the repository owner; it will not be a strict requirement of a specific version of Mercurial.
- Repos and servers will be able to have a flag day after which all new commits use a specific hash.
Non-Goals
Commit signing implications. Commit signing and cryptographic chain of custody are an independent topic (though related to repo security). See CommitSigningPlan for more.
Goals Not Yet Classified
- Do we support a repo owner deciding to rehash to a new algorithm? If so, how do we allow old hashes to be used for lookups (i.e. links to hgweb to old hashes can't stop working)? Also, how do we mitigate downgrade attacks in this scenario?
Selection of a Hash Algorithm
Mostly TODO. BLAKE2b at 30 or 31 bytes currently has the inside track.
Storage / Requirements Changes
A new repository requirement will need to be created to specify support for non-SHA-1 hashes.
There may need to be a repository requirement to specify the *primary* hash for new commits.
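The two requirements described above could be checked roughly as follows. This is only a sketch: the requirement names `exp-hash-blake2b` and `exp-primary-hash=` are hypothetical and do not exist in Mercurial, and the parsing assumes the usual one-entry-per-line layout of `.hg/requires`.

```python
# Hypothetical requirement names, invented for illustration only.
HASH_REQUIREMENT = "exp-hash-blake2b"
PRIMARY_HASH_PREFIX = "exp-primary-hash="


def parse_requires(text):
    """Return the set of non-empty lines of a requires-style file."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def primary_hash(requirements, default="sha1"):
    """Return the hash to use for new commits, defaulting to SHA-1."""
    for req in sorted(requirements):
        if req.startswith(PRIMARY_HASH_PREFIX):
            return req[len(PRIMARY_HASH_PREFIX):]
    return default


def check_openable(requirements, known):
    """Raise if the repository needs a feature this client does not know."""
    missing = requirements - known
    if missing:
        raise RuntimeError(
            "repository requires unknown features: %s"
            % ", ".join(sorted(missing))
        )
```

An old client that does not list the new requirement among its known features would refuse to open the repository, which is the behaviour a new requirement is meant to provide.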
Revlogs already support 32 bytes for hash storage but only use 20 bytes for SHA-1. Assuming we use the existing revlog for storage, we'll reserve 1 or 2 bytes in the hash field to record the hash type then use the remaining bytes for hash storage. This allows multiple hash formats to be stored in the hash entry.
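A minimal sketch of how a typed hash could share the 32-byte field, under several assumptions that are not part of the plan above: the type tag occupies the last byte, tag 0 means SHA-1 (so an existing 20-byte entry padded with zero bytes still decodes as SHA-1), and tag 1 means a 31-byte BLAKE2b digest. The tag values and function names are hypothetical.

```python
import hashlib

FIELD_SIZE = 32  # bytes reserved per hash in the existing revlog entry

# Hypothetical type tags. Tag 0 is chosen so that a zero-padded SHA-1
# entry keeps its meaning without being rewritten.
HASH_SHA1 = 0
HASH_BLAKE2B_31 = 1
DIGEST_SIZES = {HASH_SHA1: 20, HASH_BLAKE2B_31: 31}


def pack_node(hash_type, digest):
    """Store a digest plus a one-byte type tag in a 32-byte field."""
    if len(digest) != DIGEST_SIZES[hash_type]:
        raise ValueError("digest length does not match hash type")
    # Pad the digest to 31 bytes, then append the tag as the final byte.
    return digest.ljust(FIELD_SIZE - 1, b"\0") + bytes([hash_type])


def unpack_node(field):
    """Return (hash_type, digest) from a 32-byte field."""
    if len(field) != FIELD_SIZE:
        raise ValueError("hash field must be exactly 32 bytes")
    hash_type = field[-1]
    return hash_type, field[:DIGEST_SIZES[hash_type]]


def blake2b_31(data):
    """Hash data with BLAKE2b truncated to 31 bytes via digest_size."""
    return hashlib.blake2b(data, digest_size=31).digest()
```

With one tag byte, 31 bytes remain for the digest; reserving two bytes would leave 30, which matches the "30 or 31 bytes" sizes mentioned for BLAKE2b.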
Future: in next revlog design, hash field should be variable width per revlog. This will allow using full 32 byte hashes and allow >32 byte hashes in the future. The revlog/store will need to be rewritten/upgraded to support wider hashes. But this one-time operation is acceptable because hash transitions should be rare.
Future: consider something like https://github.com/multiformats/multihash for declaring which hash is used. This will likely require a new revlog with >32 bytes for hash storage.
Wire Protocol Transition
Capabilities negotiation will need to exchange hash information and support.
Servers that have transitioned to a new hash will need to reject clients not supporting that hash and tell them to upgrade. The rejection should ideally be fast. This may be difficult in some cases because clients don't expose their features until bundle request time. We may have to error during discovery when SHA-1 hashes are used to request data stored under <HGHASH>.
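One way the negotiation and fast rejection might look, as a sketch only. The `hashes=` capability token is hypothetical, and the code assumes that a peer advertising nothing understands SHA-1 only.

```python
class UnsupportedHashError(Exception):
    """Raised when client and server share no acceptable hash."""


def parse_hash_capability(capabilities):
    """Extract hash names from a hypothetical 'hashes=a,b' token.

    `capabilities` is a space-separated capability string.
    """
    for token in capabilities.split():
        if token.startswith("hashes="):
            return token[len("hashes="):].split(",")
    # Assumption: peers that advertise nothing only know SHA-1.
    return ["sha1"]


def negotiate_hash(server_accepted, client_supported):
    """Return the first server-preferred hash the client supports."""
    for name in server_accepted:
        if name in client_supported:
            return name
    raise UnsupportedHashError(
        "server requires one of: %s; please upgrade your client"
        % ", ".join(server_accepted)
    )
```

A server that has fully transitioned would pass a `server_accepted` list without `sha1`, so an old client is rejected as soon as its capabilities are known rather than partway through an exchange.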
TODO audit wire protocol and figure out how to do this.
Feedback from Git People
> * Did you encounter any unexpected issues that you wished you had thought about beforehand?
The main issues in the Git codebase were some coding practices which didn't anticipate changing the hash function. For example, there were a lot of "unsigned char sha1[20]" declarations in the code, as well as magic numbers like 48 ("shallow " plus a hex SHA-1 value), which all had to be identified and converted.
There was also some reticence at first on the part of the community. People didn't think it was that important, so I started by introducing a set of #define constants and a structure for object IDs and pitched it as a code cleanup with the vague possibility of a hash function transition in the future.
I often had multiple series of work that hadn't been sent upstream and found that other topics had conflicted with my changes. I probably should have been better about sending out a lot of these patches sooner, which would have decreased the number of conflicts.
There are also people who expected us to have completed this work already and who questioned the decisions we have made, including why we did not pick their preferred hash algorithm. This being the Internet, this is not entirely unexpected, but it is something to be aware of. I recommend easily accessible pointers to documentation you can provide.
> * How much time did you spend on that sha256 conversion already, and how much more do you expect to spend?
I've sent 17 sets of patches that converted all the uses of "unsigned char sha1[20]" into a C structure (so we could extend it in the future), there are 9 sets of patches which update the testsuite to make it work with SHA-256, and then three sets of patches that actually implement SHA-256, and that's just to get us to the point where a repository can be either entirely SHA-1 or entirely SHA-256. Interoperability and transition (storing in SHA-256 but allowing input or output in SHA-1) will require more patches, most of which haven't yet been written.
I can't estimate how many hours I've spent on this, but it started in 2015 and has been going on during my free time for years. If you consider that there are about 20-30 patches in each set of patches, then that gives you a rough idea of the scope. I anticipate writing at least ten more series of patches before the entire thing is done. This is our equivalent of your Python 3 work.
If y'all already have a structure or data type for the hash, or some sort of abstraction for it, then I expect you'll spend a lot less time, especially since Python (and now Rust, AIUI) are a little more object oriented. I highly recommend starting there with some abstractions, switching everything to use them, and then seeing what works and doesn't. If your test suite has any hard-coded hash values, prepare to spend a good amount of time fixing assumptions there.
> * Do you have any advice for other people trying the same endeavor in Mercurial?
It's been my view that moving away from SHA-1 is essential to the viability of Git as a project. If you can't store arbitrary data in your repository, you're going to have a problem, and any signatures you make are going to be meaningless if the hash is weak. So my suggestion is to consider it as important, reasonably urgent work, not to the point of panicking, but something to prioritize.
I also think it's helpful to have a plan. We have a transition plan and added documentation (in Documentation/technical/hash-function-transition.txt) and are implementing it reasonably well. Some things haven't gone exactly according to the plan, but it's helpful that everyone is on the same page. We also planned for interoperability between the old and new so people can switch over one repo at a time, which I think is enormously important (but is going to be a lot of work).
My approach after making all of the struct object_id conversions was to compile a binary that switched the hash wholesale (without any config options) and then find what broke.
I fixed the most basic things that prevented repository creation from working and then went from there, fixing tests as I went. I also made our testsuite care less about hash values by computing them in a lot of tests, since tests about, say, the diff format care about the format, not the specific values involved. Of course, there may be other approaches that work as well, but that one worked for me.
> * What motivated the choice of sha256 as a replacement? Have other hash functions been considered? And if so, what made you discard them?
When I started the work, I started with BLAKE2b-256. I wanted a 256-bit hash because it fits on an 80-column terminal. I started with BLAKE2b because it's fast, and I wanted to give people a reason to switch. A lot of people don't know or care about why SHA-1 is weak, and saying, "You should switch because it's much faster _and_ more secure," is a compelling argument.
We discussed several alternatives: BLAKE2b-256, SHA-256, SHA3-256, SHAKE256, SHA-512/256, K12 (a Keccak-based hash), and others. We settled on SHA-256 because it's ubiquitous and we depend on platform crypto libraries for fast implementations. Windows and macOS have a tiny number of hash algorithms implemented, and SHA-256 is really the only 256-bit option. The fact that it is vulnerable to length-extension attacks is irrelevant to us because we hash the type and length as a prefix to the object, so we aren't vulnerable to it. SHA-256 also has hardware acceleration on newer Intel and AMD processors, as well as on ARM, which was a compelling reason.
My advice is to pick a SHA-2 or SHA-3 algorithm (including SHAKE256) and, if your object format is not immune to length-extension attacks, to not pick SHA-256. The reason is that you have government agencies and contractors (all over the world) who are legally required to pick and use only approved algorithms, and you don't want people to not pick Mercurial because of some silly policy reason.
I love BLAKE2b, and I certainly don't love those policies, but that's the world we live in.
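The "type and length as a prefix" construction mentioned in the feedback can be sketched as follows. The header layout (type, space, decimal length, NUL byte) is Git's loose-object header; the function name is made up for this example, and nothing here describes how Mercurial computes its own hashes.

```python
import hashlib


def git_object_id(obj_type, content, algo="sha1"):
    """Hash content with a Git-style '<type> <length>\\0' header.

    Because the declared length is part of the hashed input, data
    appended by a length-extension attack no longer matches the header.
    """
    header = b"%s %d\0" % (obj_type, len(content))
    return hashlib.new(algo, header + content).hexdigest()


# The well-known ID of the empty blob under SHA-1.
empty_blob = git_object_id(b"blob", b"")
assert empty_blob == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Passing `algo="sha256"` gives the same construction over SHA-256, which illustrates the feedback's point that the object format, not the hash alone, determines exposure to length extension.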
Related work
- Git's migration plan
- Fossil's approach