Commit Signing Plan

1. Problem Statement

Mercurial should provide stronger guarantees about the authenticity of commits, including who made them and optionally who "signed off" on them.

2. Background and Issues

Mercurial has only a single field for author information. It captures a name and email for the person or entity making the commit. However, any user can set any value for the author field, which opens up the possibility of spoofing. A nefarious person could create a commit that appears to come from a well-known and trusted individual. This form of "social engineering attack" could result in a reviewer letting their guard down ("person X writes good code: I don't have to pay too much attention") and malicious code being inserted into a repository unnoticed. The viability of this attack varies, as many workflows use tools with accounts, and these tools almost certainly expose account info that could be used to reinforce the identity of the patch author/submitter. However, Mercurial arguably should not have to rely on supplemental tools for commit verification.

Mercurial currently doesn't formally record who "signed off" on a commit. Many projects have adopted a "two person rule" where any new commit requires at least two people: an author and a separate (trusted) person to sign off on it. Organizations like Mozilla have resorted to annotating commit messages with this metadata, e.g. "r=indygreg" (meaning "positively reviewed by indygreg"). Of course, anybody who can edit a commit message can add this metadata and create falsified entries. A nefarious individual could construct a commit that appears to have sign-off and then convince someone to land it.

A similar issue is one where a code author changes a commit before landing it. Someone giving sign-off may find themselves in a position where they gave sign-off, but the landed commit changed in a way that would have invalidated their sign-off. A more formalized method for verifying changes landed exactly as intended could prevent this.

In the same vein is the issue of recording who pushed what where. If a falsified commit gets introduced to the repository, it isn't always clear how it got there because the Mercurial server does not keep a formal log of pushes. This problem has more or less been solved by the pushlog extension. However, this data only establishes a paper trail: it doesn't prevent falsified entries from being introduced.

3. GPG Extension and Its Limitations

Mercurial ships with a "gpg" extension that allows commits to be signed with GPG. Signing works as follows:

  1. Find the SHA-1 of a changeset to be signed
  2. Produce a GPG signature of that SHA-1
  3. Append the signature to the .hgsigs file
  4. Commit the result

See http://selenic.com/repo/hg/rev/b09e5150bf8f for an example commit.
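
The steps above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: sign_bytes is a hypothetical stand-in for a GPG detached-signature call, the "node version signature" line layout is an assumption made for the example, and the node value is made up.

```python
import base64
import hashlib


def sign_bytes(data: bytes) -> bytes:
    """Hypothetical placeholder for a GPG detached signature.

    A real implementation would call out to GPG. A SHA-256 digest is used
    here only so the sketch runs without a keyring; it provides no security.
    """
    return hashlib.sha256(b"demo-key:" + data).digest()


def hgsigs_line(hex_node: str, version: str = "0") -> str:
    """Build one line to append to .hgsigs (assumed layout: node version sig)."""
    signature = sign_bytes(hex_node.encode("ascii") + b"\n")
    encoded = base64.b64encode(signature).decode("ascii")
    return f"{hex_node} {version} {encoded}\n"


# Made-up 40-character node, used purely for illustration.
example_node = "0123456789abcdef0123456789abcdef01234567"
line = hgsigs_line(example_node)
```

The resulting line would then be appended to .hgsigs and committed, which is where the extra commit per signature comes from.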

This is the only mechanism currently built into Mercurial to establish a chain of trust for a commit.

There are some limitations with the existing extension.

First, it isn't practical to sign every commit to the repository. This is because every signing operation requires a new commit to record the added signature(s). This effectively means one extra commit per push operation. In practice, nobody takes this approach. Instead, only a small number of commits are signed. Commonly, it's only release commits or tag commits that are signed.

Second, commit signing isn't scalable for high commit volume workloads. Organizations like Facebook and Mozilla commit to repositories so frequently that there are "push races" to repositories and pushers typically need to rebase before pushing. Since a rebase would rewrite the commit's SHA-1, it would invalidate the GPG signature and require re-signing. This would require one of the following to overcome:

  1. Signers would need to take responsibility for pushing commits they sign off on.
  2. A different entity would have to re-sign the commits.

#1 may be unacceptable to some organizations and workflows, as it effectively requires that the person doing sign-off is also the person pushing changesets. This adds overhead for reviewers.

#2 would break the chain of trust from the original signer. It undermines the purpose of signed commits in the first place and is thus an inadequate solution for people wanting signed commits for trust chain verification.
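
To illustrate why a rebase invalidates a signature over the commit's SHA-1, here is a small sketch modeled on how Mercurial derives a node: a SHA-1 over the two parent nodes (sorted) followed by the revision text. The changeset text and parent values below are invented for the example. In a real rebase the manifest node inside the text usually changes as well, which only makes the problem worse.

```python
import hashlib

NULL_ID = b"\0" * 20  # 20 zero bytes stand in for a missing parent


def node_hash(text: bytes, p1: bytes, p2: bytes = NULL_ID) -> bytes:
    """SHA-1 over the two parent nodes (sorted) followed by the revision text."""
    first, second = sorted((p1, p2))
    return hashlib.sha1(first + second + text).digest()


# Invented changeset text and parents, for illustration only.
changeset_text = b"<manifest node>\nAlice <alice@example.com>\n0 0\nfile.py\n\nFix bug"
old_parent = hashlib.sha1(b"old tip").digest()
new_parent = hashlib.sha1(b"new tip after someone else pushed").digest()

before_rebase = node_hash(changeset_text, old_parent)
after_rebase = node_hash(changeset_text, new_parent)

# Identical text, different parent: the node changes, so a signature
# made over the old node no longer matches the rebased commit.
node_changed = before_rebase != after_rebase
```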

4. Proposed Solution: In-Commit Signing

The issues of extra commits and rebasing losing signatures can be worked around by introducing a new method of commit signing.

The SHA-1 of a commit is derived from the manifest (the content of all files in the repository at the time of the commit), all ancestor commits, and fields from the commit itself such as date, author, and commit message. Instead of signing that SHA-1, we will sign a hash covering just the changes in the commit. This signature will be added to the commit itself so that signing doesn't require additional commits.

Generically, the process for signing a changeset is thus:

  1. Build a representation of the changes made in a changeset
  2. Hash that representation
  3. Sign the hash
  4. Add hash and signature to an extra field in the changeset and commit the amended result

The process for verifying a changeset is thus:

  1. Build a representation of the changes made by a changeset
  2. Hash that representation
  3. Verify that hash matches what was signed
  4. Verify the signature of the hash is valid
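
The signing and verification steps above might look like the following minimal sketch. Several things here are assumptions and not part of the plan: an HMAC with a shared demo key stands in for a real GPG signature, SHA-1 is used as the hash, and the "change-hash" and "author-signature" key names are hypothetical.

```python
import hashlib
import hmac

SIGNER_KEY = b"demo-secret"  # hypothetical; a real design would use a GPG key pair


def sign(digest: bytes, key: bytes = SIGNER_KEY) -> bytes:
    """Stand-in for a GPG signature: an HMAC over the digest."""
    return hmac.new(key, digest, hashlib.sha256).digest()


def sign_changeset(representation: bytes, extra: dict) -> dict:
    """Hash the representation, sign the hash, and store both in a copy of extra."""
    digest = hashlib.sha1(representation).digest()
    signed = dict(extra)
    signed["change-hash"] = digest.hex()  # hypothetical field name
    signed["author-signature"] = sign(digest).hex()  # hypothetical field name
    return signed


def verify_changeset(representation: bytes, extra: dict) -> bool:
    """Rehash the representation, compare hashes, then check the signature."""
    digest = hashlib.sha1(representation).digest()
    if digest.hex() != extra.get("change-hash"):
        return False
    expected = sign(digest).hex()
    return hmac.compare_digest(expected, extra.get("author-signature", ""))


rep = (
    b"Alice <alice@example.com>\n0 0\n"
    b"mercurial/hg.py 23cc12f225f1b42f32dc0d897a4f95a38ddc8f4a\n\nFix bug"
)
extra = sign_changeset(rep, {"branch": "default"})
ok = verify_changeset(rep, extra)
tampered_ok = verify_changeset(rep + b"!", extra)
```

With a real public-key signature, verification would use the signer's public key and never need the private key; the shared-key HMAC here only keeps the sketch runnable.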

4.1. Creating Representations of Commits

The method for creating the representation of commits for hashing/signing will be based on the built-in changelog text generation and changelog hashing but with the following changes:

  1. Full manifest node will be omitted
  2. Extra fields used to hold signatures will be omitted
  3. Parent changesets will not be included

In the absence of the full manifest node (which is a representation of the state of every file in the commit), we will construct a partial manifest consisting of just the files changed by the commit. We will need to include an explicit list of deleted files, since these aren't explicitly captured by manifests. For example:

mercurial/hg.py 23cc12f225f1b42f32dc0d897a4f95a38ddc8f4a
mercurial/deleted.py 217bc3fde6d82c0210cf56aeae11d05a03f35b2b d
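
A partial manifest like the one above could be rendered with a helper along these lines. This is a sketch under assumptions: the function name is hypothetical, the ordering (changed files first, then deleted files, each sorted by path) is chosen only to match the example, and the plan does not say which node is recorded for a deleted file.

```python
from typing import Dict


def partial_manifest(changed: Dict[str, str], deleted: Dict[str, str]) -> str:
    """Render 'path node' lines for changed files and 'path node d' for deleted ones."""
    lines = [f"{path} {changed[path]}" for path in sorted(changed)]
    lines += [f"{path} {deleted[path]} d" for path in sorted(deleted)]
    return "\n".join(lines) + "\n"


text = partial_manifest(
    changed={"mercurial/hg.py": "23cc12f225f1b42f32dc0d897a4f95a38ddc8f4a"},
    deleted={"mercurial/deleted.py": "217bc3fde6d82c0210cf56aeae11d05a03f35b2b"},
)
```

A fixed, documented ordering matters here: the same set of changes must always serialize to the same bytes, or the hash will not be reproducible.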

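The representation and hash of a commit are thus stable as long as the following conditions are met:

  1. The commit metadata (author, date, commit message, and non-signature extra fields) does not change
  2. The set of files changed or deleted by the commit does not change
  3. The end state of each changed file does not change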

The representation and hash of a commit thus convey the commit metadata and the end state of the files changed by the commit (as opposed to commit data, all parent commits, and the end state of all files in the repo at the time of the commit). The produced hash and signature sacrifice some details to achieve flexibility and usability.
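
As a rough sketch of the representation described in this section, the following builds a byte string from the commit metadata and a partial manifest, leaving out the full manifest node, the parents, and any signature fields. The exact field layout, the SHA-1 choice, and the function names are assumptions made for illustration; the plan only says the format is based on changelog text generation.

```python
import hashlib
from typing import Dict

# Extra keys that hold signatures and are therefore excluded from the hash.
SIGNATURE_PREFIXES = ("Author-Signature", "Sign-Off-Signature-")


def representation(user: str, date: str, extra: Dict[str, str],
                   partial_manifest: str, description: str) -> bytes:
    """Serialize metadata plus partial manifest; no manifest node, no parents."""
    kept = {k: v for k, v in extra.items() if not k.startswith(SIGNATURE_PREFIXES)}
    extra_text = "\0".join(f"{k}:{v}" for k, v in sorted(kept.items()))
    parts = [
        user,
        f"{date} {extra_text}".rstrip(),
        partial_manifest.rstrip("\n"),
        "",
        description,
    ]
    return "\n".join(parts).encode("utf-8")


def change_hash(**fields) -> str:
    """Hash of the representation (SHA-1 assumed here)."""
    return hashlib.sha1(representation(**fields)).hexdigest()


base = dict(
    user="Alice <alice@example.com>",
    date="1400000000 0",
    partial_manifest="mercurial/hg.py 23cc12f225f1b42f32dc0d897a4f95a38ddc8f4a\n",
    description="Fix bug",
)
unsigned = change_hash(extra={"branch": "default"}, **base)
signed = change_hash(extra={"branch": "default", "Sign-Off-Signature-0": "abc"}, **base)
reworded = change_hash(extra={"branch": "default"}, **{**base, "description": "Fix the bug"})
```

In this sketch, adding a signature field leaves the hash unchanged, while editing the commit message produces a different hash.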

4.2. Storing Signatures

Signatures will be stored in "extra" fields as part of the changeset. The following fields will be added:

The Author-Signature field will capture a signature from the author of the commit. Mercurial will verify that the key used to produce this signature matches the author field in the commit. This field and signature can be used to verify that commits are coming from the person the author field says they are coming from, thus preventing spoofing in the author field. This field is arguably not as important as establishing trust for sign-off.

The Sign-Off-Signature-N fields (where N is an integer) will hold signatures from people signing off on the commit. These signatures can be used to verify that a trusted person reviewed the change and that the change landed exactly as the reviewer intended. We'll start at count 0 or 1 and append new signatures to the end as they arrive. All Sign-Off-Signature-* fields will be ignored when computing the representation of a commit for in-commit signing. This allows signatures to be added or removed without requiring re-signing.
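
One way to append sign-off signatures to the extra fields is sketched below. The helper name is hypothetical, the starting index is a parameter because the plan leaves the choice between 0 and 1 open, and signatures are shown as opaque strings.

```python
import re
from typing import Dict

SIGN_OFF = re.compile(r"^Sign-Off-Signature-(\d+)$")


def add_sign_off(extra: Dict[str, str], signature: str, start: int = 0) -> Dict[str, str]:
    """Return a copy of extra with the signature stored under the next free index."""
    used = [int(m.group(1)) for m in map(SIGN_OFF.match, extra) if m]
    index = max(used) + 1 if used else start
    updated = dict(extra)
    updated[f"Sign-Off-Signature-{index}"] = signature
    return updated


extra = add_sign_off({"branch": "default"}, "sig-from-reviewer-a")
extra = add_sign_off(extra, "sig-from-reviewer-b")
```

Because these fields are ignored when computing the representation, a second reviewer can add a signature without invalidating the first.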

4.3. Concerns

Like existing full-commit signatures, in-commit signatures could still get invalidated in a lot of workflows. If a file-level merge occurs on rebase, the signature becomes invalid. If the commit message changes, the signature becomes invalid. This may cause excessive churn and require re-signs. Security or convenience: pick one.