EOL Translation Plan
Status: draft
Some people wish to have end-of-line characters translated into the native form for their operating system. This means CRLF (\r\n, carriage return followed by line feed) on Windows and LF (\n) on Unix. Older versions of Mac OS used CR (\r), but Mac OS X and later use LF.
Problem
The existing solution is the Win32TextExtension, but it has a number of shortcomings:
- the settings are not part of the repository and so new developers must set them up the first time. This is the biggest problem.
- the names of the filters provided by the extension are not intuitive (cleverencode: and cleverdecode:). Instead of configuring a given file to be of a given EOL type, one must "clever encode" it on commit and "clever decode" it on checkout.
- the name of the extension itself indicates that this extension is only for Windows. It is true that this is probably the platform with the weakest tool support for files in non-native EOL format. However, users on other platforms might benefit from an EOL translation extension as well.
Solution
We will keep the configuration in a version-controlled file called .hgeol in the repository root. It declares which files the extension should convert. The file looks like this:
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF
Files with a declared format are always checked out in that format, files declared as "native" are converted to the operating system native format, and files not mentioned (or declared as binary) receive no treatment. The first pattern wins, so more specific patterns should be put first. In the above example, the test/mixed.txt file is not converted to native encoding since the test/mixed.txt rule matches before the **.txt rule.
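The first-match rule can be illustrated with a minimal sketch. The PATTERNS list and the eol_style function are hypothetical names used only for this illustration, and fnmatchcase from the Python standard library stands in for Mercurial's own pattern matching, which it only approximates.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory form of the [patterns] section, kept in file order.
PATTERNS = [
    ("Windows.txt", "CRLF"),
    ("Unix.txt", "LF"),
    ("test/mixed.txt", "BIN"),
    ("**.txt", "native"),
    ("**.py", "native"),
    ("**.proj", "CRLF"),
]


def eol_style(path, patterns=PATTERNS):
    """Return the style of the first pattern matching path, or None."""
    for pattern, style in patterns:
        # Assumption: fnmatchcase approximates the real matcher. Its "*"
        # also matches "/", so "**.txt" matches at any directory depth.
        if fnmatchcase(path, pattern):
            return style
    return None  # not mentioned in .hgeol: the file receives no treatment
```

With these rules, eol_style("test/mixed.txt") gives "BIN" because that rule is listed before **.txt, while a file such as docs/readme.txt falls through to the **.txt rule and is treated as native.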
Repository format
Files declared as LF, CRLF, or BIN are stored as-is in the repository. Files declared as native are stored in a configurable repository-native format which defaults to LF. The repository-native format is configured in .hgeol with
[repository]
native = CRLF
This solution is minimally invasive. Even people who don't use the extension will get a checkout where LF, CRLF, and BIN files have the right format.
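One way to read the storage rules is as a pair of conversion functions. The following is a sketch under assumptions, not the extension's actual code: the names encode, decode, to_lf and to_crlf are hypothetical, and converting LF and CRLF files on checkout is one interpretation of "always checked out in that format".

```python
import os


def to_lf(data):
    """Convert CRLF line endings to LF (lone CR bytes are left alone)."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data):
    """Convert all line endings to CRLF without doubling existing CRs."""
    return to_lf(data).replace(b"\n", b"\r\n")


# Assumption: the operating system's native style is derived from os.linesep.
OS_NATIVE = "CRLF" if os.linesep == "\r\n" else "LF"


def encode(data, style, repo_native="LF"):
    """Working copy form -> repository form."""
    if style == "native":
        return to_crlf(data) if repo_native == "CRLF" else to_lf(data)
    return data  # LF, CRLF, BIN and undeclared files are stored as-is


def decode(data, style, os_native=OS_NATIVE):
    """Repository form -> working copy form."""
    if style == "native":
        style = os_native
    if style == "CRLF":
        return to_crlf(data)
    if style == "LF":
        return to_lf(data)
    return data  # BIN and undeclared files are never touched
```

Only files declared as native change on their way into the repository, which is why a checkout made without the extension still sees LF, CRLF, and BIN files in the right format.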
Operations
The extension will change the behavior of core commands as follows.
Checkout and update
The extension should read the .hgeol file from the target revision. TODO: it currently always uses tip.
So
hg update -r 100
will read .hgeol from revision 100 and ensure that files have EOLs according to the rules from that revision after the update.
If the working copy is dirty, the following should happen:
- the files are converted back to their repository form (using the encode filter)
- the update is made as normal and files are merged into the target revision
- the files are converted back to their working copy form (using the decode filter)
The idea is that this should let one move changes around "like normal" by basically ignoring the EOL rules while doing so.
Commit
Files are checked to ensure correct EOLs. If .hgeol is changed, the EOLs in the working directory can get out of sync with the .hgeol file. This should make the commit abort with a message:
abort: EOL mis-match in Windows.txt: has LF, but should have CRLF (run "hg eolupdate" to update files)
The hg eolupdate command will rewrite files in the working directory to match the .hgeol file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to .hgeol are made in lock-step with the corresponding file changes. That way things are kept nicely synchronized in the repository.
The eolupdate command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit, by letting hg eolupdate fail if the repository has uncommitted changes, to specifically avoid updating EOLs in a "content" changeset.
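The commit-time check could look roughly like the sketch below. The names detect_eol, check_eol and EolMismatch are hypothetical; the sketch assumes the caller has already resolved "native" to a concrete style and skips BIN and undeclared files.

```python
class EolMismatch(Exception):
    """Hypothetical error type standing in for the commit abort."""


def detect_eol(data):
    """Return "LF", "CRLF", "mixed", or None if data has no line endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF bytes that are not part of a CRLF
    if crlf and lf:
        return "mixed"
    if crlf:
        return "CRLF"
    if lf:
        return "LF"
    return None


def check_eol(name, data, expected):
    """Raise EolMismatch if data does not use the expected EOL style."""
    found = detect_eol(data)
    if found is not None and found != expected:
        raise EolMismatch(
            "EOL mis-match in %s: has %s, but should have %s "
            '(run "hg eolupdate" to update files)' % (name, found, expected)
        )
```

Calling check_eol("Windows.txt", b"a\nb\n", "CRLF") raises an error carrying the message shown above, while a file with no line endings at all passes the check.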
How Subversion does it
We have a test case that documents how Subversion handles some cases of inconsistencies. It turns out that:
- Subversion will silently recode a file to the specified EOL style if the file has a consistent style. So accidentally changing all EOLs from LF to CRLF won't matter for Subversion, which will rewrite the file in the working copy upon commit.
- Subversion will also silently update files if the svn:eol-style property is set on them. Until the commit, svn diff is empty, but the file is changed in the working copy on commit.
- Subversion aborts on commit if a file has inconsistent EOL style.
Add
No checking is done when files are added; the check is done when the files are committed.
Diff
File content is normalized to the repository form before the diff is computed. The diff is then presented using the working copy form. TODO: the diff is currently shown based on the repository form of files.
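A minimal sketch of the intended diff behavior, using difflib from the Python standard library. The function name eol_diff is hypothetical, LF is assumed as the repository form, and "presented using the working copy form" is interpreted here as joining the diff lines with the working copy EOL.

```python
import difflib


def eol_diff(old, new, name, working_eol="\n"):
    """Diff two texts after normalizing both to LF line endings."""
    # Normalize to the (assumed) repository form so that a pure EOL
    # change does not show up as a difference.
    old_lines = old.replace("\r\n", "\n").splitlines()
    new_lines = new.replace("\r\n", "\n").splitlines()
    diff = difflib.unified_diff(
        old_lines, new_lines, "a/" + name, "b/" + name, lineterm=""
    )
    # Present the result using the working copy line ending.
    return working_eol.join(diff)
```

Under this sketch, comparing an LF file with its CRLF twin yields an empty diff, and real content changes are reported with the line endings the user sees in the working copy.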
Semantics
Regardless of the implementation details, we are aware that we will need to pick unambiguous names for our various components. For some, native is not a self-explanatory name, but it does make sense to those exposed to Subversion's svn:eol-style property setting, which inspired this mechanism in the first place.
A naming policy centered on storage might be clearer to end users: storeasis and storeaslf, for example, already describe the behavior on commit. Depending on the implementation, it might be interesting to specify the behaviors on commit and on update separately: "storeasis, getaslf", "storeaslf, getasis", or "storeascrlf, converttolocal" are too long, but they are self-explanatory.
Some suggested that Mercurial should not use crlf or lf in these names and should instead use 'windows' and 'unix', respectively. One convention can be chosen, or aliases can be provided.
Implementation
An implementation has been started here: http://bitbucket.org/mg/hg-eol/. It uses a mixed format for the repository and converts all files (with no questions asked) into the target format.
TODO
- The extension needs testing on Windows.
- Needs testing with merges of files with different EOL styles.
- There is currently no hg eolupdate command.
- The .hgeol file is read from the repository tip revision instead of the working copy parent revision.