EOL Translation Plan

Status: draft

Some people want end-of-line characters translated into the native form for their operating system. This means CRLF (\r\n, carriage return followed by line feed) on Windows and LF (\n) on Unix. Older versions of Mac OS used CR (\r), but Mac OS X and later use LF.

Problem

The existing solution is the Win32TextExtension, but it has a number of shortcomings:

Solution

We will keep the configuration in a version-controlled file called .hgeol in the repository root. It declares which files the extension should convert. The file looks like this:

[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

Files with a declared format (LF or CRLF) are always checked out in that format, files declared as "native" are converted to the operating system's native format, and files that are not mentioned or are declared as BIN are left untouched. The first matching pattern wins, so more specific patterns should be put first. In the above example, test/mixed.txt is not converted to the native EOL format, since the test/mixed.txt rule matches before the **.txt rule.
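The first-match rule can be illustrated with a minimal Python sketch. This is not the extension's code: it assumes the file can be read with Python's configparser, and it assumes simplified glob semantics in which ** matches any characters including / while * stops at /. The names glob_to_regex, load_patterns, and eol_style are hypothetical.

```python
import configparser
import re

HGEOL = """\
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF
"""


def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regex.

    Assumed semantics: '**' matches anything, including '/';
    '*' matches anything except '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def load_patterns(text):
    """Return the [patterns] section as an ordered list of (regex, style)."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep file names case-sensitive
    parser.read_string(text)
    return [(glob_to_regex(pat), style)
            for pat, style in parser.items("patterns")]


def eol_style(path, patterns):
    """Return the style of the first matching pattern, or None if unlisted."""
    for regex, style in patterns:
        if regex.match(path):
            return style
    return None


PATTERNS = load_patterns(HGEOL)
```

Because the list keeps the order of the file, eol_style("test/mixed.txt", PATTERNS) stops at the BIN rule and never reaches **.txt, while a file that matches no rule yields None and is left alone.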

Repository format

Files declared as LF, CRLF, or BIN are stored as-is in the repository. Files declared as native are stored in a configurable repository-native format, which defaults to LF. The repository-native format is configured in .hgeol with:

[repository]
native = CRLF

This solution is minimally invasive. Even people who don't use the extension will get a checkout where LF, CRLF, and BIN files have the right format.
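As a rough sketch of the storage rules above, the two directions of conversion might look as follows in Python. The function names and the os_native parameter are assumptions made for illustration, and a lone CR is deliberately left untouched.

```python
import os


def convert_eol(data, target):
    """Rewrite every LF or CRLF line ending in data (bytes) to target."""
    normalized = data.replace(b"\r\n", b"\n")
    if target == "CRLF":
        return normalized.replace(b"\n", b"\r\n")
    return normalized


def to_repository(data, style, repo_native="LF"):
    """Data as stored in the repository for a file with the given style."""
    if style == "native":
        return convert_eol(data, repo_native)
    # LF, CRLF, BIN, and unlisted files are stored as-is.
    return data


def to_working(data, style, os_native=None):
    """Data as written to the working directory on checkout."""
    if os_native is None:
        os_native = "CRLF" if os.linesep == "\r\n" else "LF"
    if style == "native":
        return convert_eol(data, os_native)
    if style in ("LF", "CRLF"):
        # Declared formats are always checked out in that format.
        return convert_eol(data, style)
    # BIN and unlisted files receive no treatment.
    return data
```

Only native files are ever rewritten on the way into the repository, which is why a user without the extension still gets LF, CRLF, and BIN files in the declared format.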

Operations

The extension will change the behavior of core commands as follows.

Checkout and update

The extension should read the .hgeol file from the target revision. TODO: it currently always uses tip.

So

hg update -r 100

will read .hgeol from revision 100 and ensure that files have EOLs according to the rules from that revision after the update.

If the working copy is dirty, the following should happen:

The idea is that this should let one move changes around "like normal" by basically ignoring the EOL rules while doing so.

Commit

On commit, files are checked to ensure they have the correct EOLs. If .hgeol is changed, the EOLs in the working directory can get out of sync with the .hgeol file. This should make the commit abort with a message:

abort: EOL mis-match in Windows.txt: has LF, but should have CRLF
(run "hg eolupdate" to update files)

The hg eolupdate command will rewrite files in the working directory to match the .hgeol file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to .hgeol are made in lock-step with the corresponding file changes. That way things are kept nicely synchronized in the repository.
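The commit-time check could be sketched roughly as below. This is an illustration under assumptions, not the extension's implementation: detect_eols and check_eol are hypothetical names, and the caller is assumed to have already resolved "native" to a concrete LF or CRLF.

```python
import re


def detect_eols(data):
    """Return the set of EOL styles ("LF", "CRLF") present in data (bytes)."""
    found = set()
    if b"\r\n" in data:
        found.add("CRLF")
    # A bare LF is a newline that is not preceded by a carriage return.
    if re.search(rb"(?<!\r)\n", data):
        found.add("LF")
    return found


def check_eol(path, data, expected):
    """Return an abort message if data violates expected, else None."""
    if expected not in ("LF", "CRLF"):
        return None  # BIN and unlisted files are never checked
    wrong = detect_eols(data) - {expected}
    if not wrong:
        return None
    has = sorted(wrong)[0]
    return ("abort: EOL mis-match in %s: has %s, but should have %s\n"
            '(run "hg eolupdate" to update files)' % (path, has, expected))
```

Calling check_eol("Windows.txt", b"one\ntwo\n", "CRLF") produces the abort message shown above, and the same call on data that already uses CRLF throughout returns None.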

The eolupdate command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit, by letting hg eolupdate fail if the repository has uncommitted changes, to specifically avoid updating EOLs in a "content" changeset.

How Subversion does it

We have a test case that documents how Subversion handles some cases of inconsistencies. It turns out that

Add

No checking is done when files are added; the check is done when the files are committed.

Diff

File content is normalized to the repository form before the diff is computed. The diff is then presented using the working copy form. TODO: the diff is currently shown based on the repository form of files.
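One possible reading of the intended behavior, sketched with Python's difflib: both sides are normalized to LF before comparison, and the resulting diff lines are joined with the working copy EOL. The function names and the working_eol parameter are assumptions for illustration only.

```python
import difflib


def normalized_lines(text):
    """Split text into lines after normalizing CRLF to LF."""
    return text.replace("\r\n", "\n").split("\n")


def eol_aware_diff(repo_text, working_text, working_eol="\r\n"):
    """Diff on normalized content, presented with the working copy EOL."""
    diff = difflib.unified_diff(
        normalized_lines(repo_text),
        normalized_lines(working_text),
        "a/file", "b/file",
        lineterm="",
    )
    return working_eol.join(diff)
```

With this approach a file that differs from the repository only in its line endings produces an empty diff, so EOL conversion never shows up as a content change.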

Semantics

Regardless of the implementation details, we will need to pick unambiguous names for the various components. To some, native is not a self-explanatory name, but it does make sense to those familiar with Subversion's svn:eol-style property, which inspired this mechanism in the first place.

A naming policy centered on storage might be clearer to end users: names such as storeasis and storeaslf already describe the behavior on commit, for example. Depending on the implementation, it might be useful to specify the behavior on commit and on update separately: "storeasis, getaslf", "storeaslf, getasis", or "storeascrlf, converttolocal" are too long, but they are self-explanatory.

Some have suggested that Mercurial should avoid crlf and lf in these names and use 'windows' and 'unix' instead. One convention can be chosen, or aliases can be provided.

Implementation

An implementation has been started here: http://bitbucket.org/mg/hg-eol/. It uses a mixed format for the repository and converts all files (with no questions asked) into the target format.

TODO