EOL Translation Plan

Status: draft

Some people wish to have end-of-line characters translated into the native form for their operating system. This means CRLF (\r\n, carriage-return followed by line-feed) on Windows and LF (\n) on Unix. Older versions of Mac OS used CR (\r), but Mac OS X and later use LF.

Problem

The existing solution is the Win32TextExtension, but it has a number of shortcomings:

Solution

We will keep configuration in a version controlled file called .hgeol in the repository root. It declares which files the extension should convert. The file looks like this:

[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

Files with a declared format are always checked out in that format, files declared as "native" are converted to the native format of the operating system, and files not mentioned (or declared as binary) are left untouched. The first matching pattern wins, so more specific patterns should be listed first. In the above example, the test/mixed.txt file is not converted to the native format, since the test/mixed.txt rule matches before the **.txt rule.
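The first-match rule can be illustrated with a small Python sketch. This is not the extension's code: the parser, the simplified glob handling (patterns anchored at the repository root, "**" spanning directories, "*" staying within one path component) and the names parse_patterns, glob_to_regex and eol_style are all assumptions made for illustration.

```python
import re

HGEOL = """\
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF
"""

def parse_patterns(text):
    """Return the (pattern, style) pairs of the [patterns] section, in file order."""
    rules, section = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "patterns" and "=" in line:
            pattern, style = (part.strip() for part in line.split("=", 1))
            rules.append((pattern, style))
    return rules

def glob_to_regex(pattern):
    """Simplified glob translation (an assumption, not Mercurial's matcher)."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # '**' may cross directory separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # '*' stays inside one path component
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

def eol_style(path, rules):
    """Return the style of the first pattern matching path, or None."""
    for pattern, style in rules:
        if glob_to_regex(pattern).match(path):
            return style
    return None

rules = parse_patterns(HGEOL)
print(eol_style("test/mixed.txt", rules))   # the specific rule is hit before **.txt
print(eol_style("docs/readme.txt", rules))
print(eol_style("logo.png", rules))         # not mentioned, so no treatment
```

Because the rules are kept in file order and the loop returns on the first hit, moving the test/mixed.txt line below **.txt would change the result for that file.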

Repository format

Files declared as LF or CRLF format are stored as-is in the repository. Files declared as native are stored in a configurable repository-native format which defaults to LF.
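A minimal sketch of what this storage rule means at checkout time, assuming a hypothetical helper to_working and an LF repository-native format. Only "native" files are rewritten. Files declared LF or CRLF come out exactly as stored, which is why a checkout without the extension still gets those files right.

```python
import os

def to_working(stored, style, os_native=os.linesep.encode()):
    """Return the bytes to write to the working directory.

    stored:    file content as kept in the repository
    style:     "LF", "CRLF", "native", "BIN" or None (file not mentioned)
    os_native: EOL of the checkout platform (hypothetical parameter)
    """
    if style != "native":
        # LF and CRLF files are stored as-is; BIN and unlisted files are untouched.
        return stored
    # Native files are assumed to be stored with LF. Normalize first so that a
    # stray CRLF in the repository does not turn into \r\r\n.
    unified = stored.replace(b"\r\n", b"\n")
    if os_native == b"\n":
        return unified
    return unified.replace(b"\n", os_native)

print(to_working(b"one\ntwo\n", "native", os_native=b"\r\n"))
print(to_working(b"one\r\ntwo\r\n", "CRLF", os_native=b"\n"))
```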

This solution is minimally invasive. Even people who don't use the extension will get a checkout where LF and CRLF files have the right format.

Update repository on setting changes

When .hgeol is changed, the EOLs in the working directory can get out of sync with the .hgeol file. This should make the next commit abort with a message:

abort: EOL mis-match in Windows.txt: has LF, but should have CRLF
(run "hg eolupdate" to update files)

The hg eolupdate command will rewrite files in the working directory to match the .hgeol file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to .hgeol are made in lock-step with the corresponding file changes, hopefully keeping things nicely synchronized in the repository.
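The pre-commit check could look roughly like the sketch below. The function names detect_eols and check_file are hypothetical, and the sketch assumes the caller has already resolved "native" to the concrete style (LF or CRLF) expected in the working directory.

```python
def detect_eols(data):
    """Return the set of EOL styles present in data: "CRLF" and/or "LF"."""
    found = set()
    crlf = data.count(b"\r\n")
    if crlf:
        found.add("CRLF")
    # Every CRLF also contains an LF, so bare LFs are the surplus.
    if data.count(b"\n") > crlf:
        found.add("LF")
    return found

def check_file(path, data, expected):
    """Return an abort message if data violates the expected style, else None."""
    if expected not in ("LF", "CRLF"):
        return None  # BIN or unlisted files are never checked
    wrong = detect_eols(data) - {expected}
    if not wrong:
        return None
    return "abort: EOL mis-match in %s: has %s, but should have %s" % (
        path, "/".join(sorted(wrong)), expected)

print(check_file("Windows.txt", b"first\nsecond\n", "CRLF"))
print(check_file("Windows.txt", b"first\r\nsecond\r\n", "CRLF"))
```

A real hook would also print the hint to run hg eolupdate. That part is omitted here.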

The eolupdate command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit, by letting hg eolupdate fail if the repository has uncommitted changes, to specifically avoid updating EOLs in a "content" changeset.

Unexpected format changes

What should we do if a file has an unexpected format when doing hg commit or hg diff? The easiest solution is to make an encode filter which translates both LF and CRLF into the target format, regardless of what the expected source format is.
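Such a tolerant encode filter is short to write. The sketch below assumes a hypothetical function encode that takes the file content and the target style. It first collapses every CRLF to LF and then expands to the target, so the result does not depend on which format the input was in, and mixed input is handled as well.

```python
EOLS = {"LF": b"\n", "CRLF": b"\r\n"}

def encode(data, target):
    """Translate both LF and CRLF line endings in data into the target style."""
    eol = EOLS[target]
    # Collapse CRLF first so that existing CRLFs are not doubled into \r\r\n.
    unified = data.replace(b"\r\n", b"\n")
    if eol == b"\n":
        return unified
    return unified.replace(b"\n", eol)

mixed = b"one\r\ntwo\nthree\r\n"
print(encode(mixed, "LF"))
print(encode(mixed, "CRLF"))
```

Applying the filter twice gives the same result as applying it once, which makes it safe to run on files that are already in the target format.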

How Subversion does it

We have a test case that documents how Subversion handles some cases of inconsistencies. It turns out that

Semantics

Regardless of the implementation details, we will need to pick unambiguous names for our various components. For some, "native" is not a self-explanatory name, but it does make sense to those familiar with Subversion's svn:eol-style property, which inspired this mechanism in the first place.

A naming policy centered on storage might be clearer to end users: names such as storeasis and storeaslf already describe the behavior on commit. Depending on the implementation, it might be useful to specify the behavior on commit and on update separately: "storeasis, getaslf", "storeaslf, getasis" or "storeascrlf, converttolocal" are too long, but they are self-explanatory.

Some have suggested that Mercurial should not use crlf or lf in its names, and should use 'windows' and 'unix' instead. Either one convention can be chosen, or both can be supported through aliases.

Implementation

An implementation has been started here: http://bitbucket.org/mg/hg-eol/. It uses a mixed format for the repository and converts all files (with no questions asked) into the target format.

TODO