Merge Driver Plan
Contents
- The Problem
- The Solution
- The Implementation
- The merge driver
- Why is the hook in-process?
- Why are these top-level functions?
- But I really need to share state.
- Why can't multiple merge drivers be defined?
- Why store the merge driver in the merge state?
- Why is the merge driver record advisory?
- Why do we need to be so paranoid about the merge driver's value?
- What can go wrong?
- Driver-resolved files
- Changes to operations
- The merge driver
1. The Problem
It is somewhat common in the real world to have generated files alongside source files in the working copy. When a merge happens, generated files that are modified on both sides are likely to cause merge conflicts. The best way to resolve these conflicts is usually to regenerate the files, and that's what developers typically have to do by hand. This can be completely automated in principle, and doing it by hand sucks.
Mercurial should be able to automatically resolve generated files.
1.1. But isn't checking in generated files bad?
There are a lot of ways checking in generated files is bad, but it can make sense if:
- These files change relatively rarely for individual developers but often enough in the aggregate to be a problem.
- These files take a long time to generate but the resultant artifacts are small.
- These files capture the state of the world they were created in (e.g. databases) in important ways. That state of the world can change such that the files can no longer be generated again.
- Serving these files via an out-of-band mechanism like an artifact server is not feasible, or much more work than just serving them via Mercurial.
- While the files could be generated by a build system, the project really has no need for a build system outside of these generated files, and would like to keep fast iteration cycles by avoiding build steps.
Each of the above points has been true for at least one repository in at least one large organization.
Ultimately, software engineering is often about tradeoffs, and in some cases checking in generated files is the right tradeoff to make. This feature will make working with such files less painful.
1.2. Doesn't the merge tool support already in Mercurial solve this problem?
Mercurial does support custom merge tools for arbitrary globs of files, but the current merge tool support lacks some important features:
- It only works when each generated file has a separate command you need to run. In some cases, however, multiple files can be regenerated with a single command.
- It is only suitable when the set of generated files is statically known: in some cases this configuration will be part of the repository itself, in e.g. a JSON file.
- Most importantly, there's no way to define an ordering for file resolutions. Generated files form a dependency graph -- they might depend on source files, other generated files, and so on. Resolutions need to be performed in topological order (source files first, then the generated files that depend on source code alone, then further generated files, and so on).
There's no way we can reasonably bake all of the above into configuration -- it is incredibly specific to the codebase.
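As an illustration of the ordering requirement, here is a minimal sketch using Python's standard graphlib module. The file names and the dependency graph are hypothetical; a real project would derive them from its own configuration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each file maps to the files it depends on.
deps = {
    "schema.json": set(),                       # source file
    "models.py": {"schema.json"},               # generated from source alone
    "api_docs.md": {"models.py"},               # generated from a generated file
    "client.js": {"schema.json", "models.py"},  # depends on both kinds
}

def resolution_order(graph):
    """Return files so that each one comes after everything it depends on."""
    return list(TopologicalSorter(graph).static_order())

order = resolution_order(deps)
```

Source files come out first, followed by generated files in an order that respects their dependencies. A cycle in the graph would raise graphlib.CycleError.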
2. The Solution
Add first-class support to Mercurial for generated files and generation steps.
Add support to Mercurial for custom merge drivers. A merge driver is a piece of code that controls the overall merge process.
Add support to Mercurial for driver-resolved files. A driver-resolved file is a file that will be handled by the merge driver outside of the usual resolve mechanism.
Have the merge driver expose two top-level operations: preprocess and conclude.
preprocess runs right before files are resolved. Typically, this is where files will be marked as driver-resolved.
conclude runs right after all source files have been resolved by the user. Typically, this is where driver-resolved files will be regenerated.
3. The Implementation
3.1. The merge driver
A merge driver is a Python (in-process) hook that has the ability to control the overall merge process. It implements preprocess and conclude as top-level functions.
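A rough sketch of what such a module might look like follows. The argument list is an assumption modelled on Mercurial's in-process hook conventions, and GENERATED, regenerate, and StandInMergeState (with its unresolved, driverresolved, and mark methods) are hypothetical stand-ins so the example runs on its own.

```python
GENERATED = {"build/parser.py"}  # hypothetical set of generated files

def regenerate(path):
    """Hypothetical placeholder for the project's real generation step."""
    return "regenerated %s" % path

def preprocess(ui, repo, hooktype, mergestate, wctx, labels=None):
    """Runs right before files are resolved: mark generated files."""
    for path in mergestate.unresolved():
        if path in GENERATED:
            mergestate.mark(path, "d")  # hand the file over to the driver

def conclude(ui, repo, hooktype, mergestate, wctx, labels=None):
    """Runs after source files are resolved: regenerate marked files."""
    for path in mergestate.driverresolved():
        regenerate(path)
        mergestate.mark(path, "r")  # the driver has resolved this file

class StandInMergeState:
    """Hypothetical stand-in for the merge state, for illustration only."""
    def __init__(self, states):
        self._states = dict(states)
    def unresolved(self):
        return [f for f, s in self._states.items() if s == "u"]
    def driverresolved(self):
        return [f for f, s in self._states.items() if s == "d"]
    def mark(self, path, state):
        self._states[path] = state
```

Note that neither function keeps anything on the module between calls; conclude works purely from what the merge state reports.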
The hook is configured with
[ui]
mergedriver = python:path/to/hook
The merge driver has its own states:
- unmarked (u): preprocess still needs to be run.
- marked (m): conclude still needs to be run.
- success/skipped (s): the merge driver no longer needs to be run.
The state transitions look like (this is currently broken):
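One reading of the state definitions above can be sketched as a small transition table. This covers only the transitions those definitions directly imply (a successful preprocess, then a successful conclude); the names TRANSITIONS and next_state are hypothetical, and the full set of transitions may well differ.

```python
UNMARKED, MARKED, SUCCESS = "u", "m", "s"

# Only the transitions implied by the state definitions; an assumption.
TRANSITIONS = {
    (UNMARKED, "preprocess"): MARKED,  # preprocess ran, conclude still pending
    (MARKED, "conclude"): SUCCESS,     # conclude ran, driver is done
}

def next_state(state, step):
    """Return the driver state after `step` completes successfully."""
    try:
        return TRANSITIONS[(state, step)]
    except KeyError:
        raise ValueError("cannot run %s in state %r" % (step, state)) from None
```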
For a paused merge, the merge driver and its state are stored in the merge state. The merge state gets a new entry, with a lowercase m. The lowercase indicates that this record is advisory and that older versions of Mercurial can ignore it.
This bit is under consideration. Whenever the merge driver is read back from disk, its current value in the configuration is compared with the old value stored on disk. If the values differ we abort, with the only way out being to abort the merge and redo it from the beginning.
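The comparison described here could look roughly like the following. MergeDriverChanged and check_merge_driver are hypothetical names, and since this behaviour is still under consideration the sketch shows only the strict variant.

```python
class MergeDriverChanged(Exception):
    """Hypothetical error: the configured driver differs from the stored one."""

def check_merge_driver(stored, configured):
    """Compare the driver recorded in the merge state with the current config."""
    if stored != configured:
        raise MergeDriverChanged(
            "merge driver changed since merge started (was %r, now %r); "
            "abort the merge and redo it" % (stored, configured)
        )
    return configured
```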
3.1.1. Why is the hook in-process?
Unlike with other kinds of hooks, the wlock must be held while this hook is called. This raises a bunch of issues with subprocess-based hooks, especially since many Mercurial operations the subprocess might want to do will require that the wlock be taken. Mercurial currently has no notion of locks being inherited by subprocesses. Trying to add one raises a lot of concerns, including:
- How does the parent process invalidate internal data structures after child processes are complete?
- How is mutual exclusion enforced between child processes that could be started up in parallel?
- How are child processes prevented from outlasting the hook?
- If the parent process crashes or is killed, how do child processes get to know?
- How would this interact with the command server?
These problems are all solvable, but at least for the first iteration it is simpler to avoid all these problems by staying in-process.
3.1.2. Why are these top-level functions?
The API is specifically designed to discourage sharing state between the preprocess and conclude functions. Any such storage of state is almost certainly a bug, because conclude can be called without calling preprocess.
3.1.3. But I really need to share state.
Your options are:
- recompute whatever state you originally wanted to share -- simplest, and great if recomputing is fast
- in conclude, compute what generation steps you need to perform given the list of driver-resolved files -- great, if possible
- cache state in a global with the cache key being (repo, mergestate) -- ensure that your assumptions don't change because of user resolutions
- persist state to disk
All the options have tradeoffs -- the API is designed to make implementers think about this problem rather than just exposing an object and having them get it wrong.
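As a sketch of the caching option: the cache below always falls back to recomputing on a miss, so conclude still works in a fresh process where preprocess never ran. The key here is the repository root plus the set of driver-resolved files, used as a simplified stand-in for (repo, mergestate); generation_plan and compute are hypothetical names.

```python
_plan_cache = {}  # module-level cache; empty again in every new process

def generation_plan(repo_root, driver_resolved, compute):
    """Return the cached plan for this merge, computing it on a miss."""
    key = (repo_root, frozenset(driver_resolved))
    if key not in _plan_cache:
        _plan_cache[key] = compute(sorted(driver_resolved))
    return _plan_cache[key]

# Usage: the second call hits the cache and does not recompute.
calls = []

def compute(files):
    calls.append(files)
    return ["regen " + f for f in files]

plan1 = generation_plan("/repo", {"b.gen", "a.gen"}, compute)
plan2 = generation_plan("/repo", {"a.gen", "b.gen"}, compute)
```

Because the key includes the set of driver-resolved files, a plan computed before the user changed that set is not reused.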
3.1.4. Why can't multiple merge drivers be defined?
Unlike with other kinds of hooks, there is in general no reasonable way to compose merge drivers. In particular -- what if different merge drivers disagree on how to generate a particular file? The semantics of multiple merge drivers get confusing very quickly.
If a repository really has independent merge drivers, it should be straightforward to write a wrapper merge driver that composes them.
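Such a wrapper might be as small as the sketch below. It assumes each driver is an object or module exposing preprocess and conclude with the same (assumed) argument list, and that the drivers handle disjoint sets of files; compose and make_driver are hypothetical names.

```python
from types import SimpleNamespace

def compose(*drivers):
    """Build preprocess/conclude functions that run each driver in order."""
    def preprocess(ui, repo, hooktype, mergestate, wctx, labels=None):
        for driver in drivers:
            driver.preprocess(ui, repo, hooktype, mergestate, wctx, labels)

    def conclude(ui, repo, hooktype, mergestate, wctx, labels=None):
        for driver in drivers:
            driver.conclude(ui, repo, hooktype, mergestate, wctx, labels)

    return preprocess, conclude

# Usage with two recording stand-in drivers.
calls = []

def make_driver(name):
    def step(kind):
        def run(ui, repo, hooktype, mergestate, wctx, labels=None):
            calls.append((name, kind))
        return run
    return SimpleNamespace(preprocess=step("preprocess"), conclude=step("conclude"))

preprocess, conclude = compose(make_driver("protos"), make_driver("docs"))
```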
3.1.5. Why store the merge driver in the merge state?
Storing the fact that files are driver-resolved without storing how to resolve them is not really helpful. (Counterpoint: we don't store merge tool configuration in the merge state, so maybe we shouldn't bother with this either.)
3.1.6. Why is the merge driver record advisory?
We will store the merge driver in the merge state whenever it is configured. In a lot of cases running the merge driver will not be necessary -- in those cases there's no point in aborting. It only makes sense to abort when the merge driver is "active" -- when there are files that need to be resolved by the driver. That case will be handled below.
3.1.7. Why do we need to be so paranoid about the merge driver's value?
Mostly for security reasons. Consider the following case:
- A configures a malicious merge driver in their ~/.hgrc, then pauses the merge.
- A gives a copy of their entire repo, including .hg (but not ~/.hgrc), to B.
- B inspects .hg/hgrc and finds it to be clean.
- B then continues the merge, and the malicious merge driver gets invoked.
Aborting when the merge driver has changed is one way to deal with this. Exactly how this should be handled is still under discussion.
3.1.8. What can go wrong?
The merge driver could:
- raise an exception: we catch all exceptions raised by the merge driver as failures (state u) and then pause the merge immediately afterwards to give the user a chance to fix whatever might be breaking the merge state.
- crash the entire Mercurial process without giving the chance for the merge state to be written out: this is treated as an interrupted update.
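The exception case can be sketched as a small wrapper around each driver step. run_driver_step is a hypothetical name, and the (state, paused) return shape is an assumption made for illustration.

```python
def run_driver_step(step, success_state):
    """Run one driver step and return (driver_state, paused)."""
    try:
        step()
    except Exception:
        # Any exception counts as a failure: back to 'u', and pause the merge.
        return "u", True
    return success_state, False

def ok():
    pass

def broken():
    raise RuntimeError("generator not installed")

result_ok = run_driver_step(ok, "m")
result_broken = run_driver_step(broken, "m")
```

A hard crash of the process never reaches the except clause, which is why that case has to be handled as an interrupted update instead.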
3.2. Driver-resolved files
Driver-resolved files are marked with a brand new state -- not u or r, but d (shown as D in hg resolve --list).
Since old versions of Mercurial will not be able to understand what these files mean, they're stored as a separate type of record: D (rather than the standard F). The contents of D records are the same as those of F records. D is uppercase so that old versions of Mercurial abort when they see such files.
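The case convention for record types can be illustrated with a sketch of how a reader might treat records it does not understand. The record layout, read_records, and UnsupportedMergeRecords are simplified, hypothetical stand-ins rather than the actual on-disk format.

```python
class UnsupportedMergeRecords(Exception):
    """Hypothetical error for mandatory records this version cannot handle."""

def read_records(records, known_types):
    """Keep known records, skip advisory ones, abort on unknown mandatory ones."""
    kept = []
    for rtype, data in records:
        if rtype in known_types:
            kept.append((rtype, data))
        elif rtype.islower():
            continue  # lowercase: advisory, safe to ignore
        else:
            raise UnsupportedMergeRecords(rtype)  # uppercase: must understand
    return kept

# An old version that only knows F records:
old_known = {"F"}
kept = read_records([("F", "a.c"), ("m", "python:path/to/hook")], old_known)
```

With these rules an old version silently skips the m record but refuses to continue a merge that contains D records.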
3.3. Changes to operations
3.3.1. merge.applyupdates
This function is what actually changes the working copy whenever a merge (or update, or graft, or rebase...) happens.
- Move merge operations to being last.
- Call preprocess after all non-merge operations have happened, but before any merge operations have happened.
- After all merge operations have happened, call conclude if:
  - the merge driver state is 'm', and
  - there are no more unresolved files left. This concludes the merge.
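The ordering above can be sketched as follows. This is not the real merge.applyupdates; the action tuples and the callbacks (apply_action, merge_file, preprocess returning the new driver state, conclude) are hypothetical simplifications.

```python
def apply_updates(actions, apply_action, merge_file, preprocess, conclude):
    """Apply (kind, path) actions with merges last; return (state, unresolved).

    merge_file(path) returns True if the file was resolved cleanly.
    preprocess() returns the new driver state ('m' or 's').
    """
    # 1. All non-merge operations first.
    for kind, path in actions:
        if kind != "merge":
            apply_action(kind, path)

    # 2. preprocess runs before any merge operation.
    driver_state = preprocess()

    # 3. Merge operations last; collect files that still need the user.
    unresolved = [
        path for kind, path in actions
        if kind == "merge" and not merge_file(path)
    ]

    # 4. conclude only if the driver is marked and nothing is unresolved.
    if driver_state == "m" and not unresolved:
        conclude()
        driver_state = "s"
    return driver_state, unresolved
```

If any file is left unresolved, the driver stays in state 'm' and conclude is deferred until a later hg resolve.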
3.3.2. resolve
Make hg resolve call conclude, if:
- a driver-resolved file was requested to be resolved (implicitly covers hg resolve --all), and
- there are no unresolved files left at the end of the resolve.
If hg resolve --mark --all is run, do not mark driver-resolved files as resolved. (Do we need to add a new --force option to override this?)
Hint that hg resolve --all needs to be run if all unresolved files are resolved but no driver-resolved files are requested to be resolved.
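These rules can be expressed as a few small predicates. The function names and the dictionary of per-file states are hypothetical; the sketch only mirrors the conditions listed above.

```python
def should_conclude(requested, driver_resolved, unresolved_after):
    """conclude runs if a driver-resolved file was requested and nothing is left."""
    return bool(set(requested) & set(driver_resolved)) and not unresolved_after

def should_hint_resolve_all(requested, driver_resolved, unresolved_after):
    """Hint when everything else is resolved but no driver file was requested."""
    return (bool(driver_resolved) and not unresolved_after
            and not set(requested) & set(driver_resolved))

def mark_all_resolved(states):
    """Model hg resolve --mark --all: driver-resolved ('d') files are left alone."""
    return {path: ("r" if state == "u" else state) for path, state in states.items()}
```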
3.3.3. commit
If a merge driver is configured and the merge driver state is not 's', do not proceed with the commit -- instead, tell the user to run hg resolve --all. This is to give users a chance to test their code before committing.