Note:
This page is no longer relevant but is kept for historical purposes.
Note:
This page is primarily intended for developers of Mercurial.
EOL Translation Plan
This page is historical, see the EolExtension.
Status: draft
A plan for improved handling of EOL translation.
The problem
Different platforms have different conventions for representation of end-of-line in text files. A common feature request for Mercurial is good support for getting native line endings in text files when changesets are shared between different platforms. The internal storage format for Mercurial is bytes, and core Mercurial doesn't care much about what is encoded in the bytes.
Windows traditionally uses CRLF ('\r\n', carriage-return followed by line-feed). The default text editor on Windows, Notepad, only understands CRLF. Command line tools and redirection also use CRLF.
Unix and Linux traditionally use LF ('\n'). Many tools can handle CRLF, but sometimes the native format is essential.
Older versions of Mac OS used CR ('\r'), but Mac OS X and later are Unix-based and use LF.
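To make the three conventions concrete, here is a minimal Python sketch that counts how often each one occurs in a byte string. The function name count_eols is made up for this illustration and is not part of Mercurial.

```python
def count_eols(data: bytes) -> dict:
    """Count each line-ending convention in a byte string."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the number of bare LF and bare CR bytes.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}
```

A file with more than one non-zero count has inconsistent line endings, which is the case the requirements below have to deal with.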
Requirements
The main requirement is simply to somehow get
- Good support for native line endings for text files
Here follows an attempt to describe some requests, requirements, and challenges to be considered. Some are obligatory to some users, some are considered nice-to-have, and some users would consider the implementation of some of them misfeatures. What is listed here should be essential to the problem, not tied to a specific solution, and should not just enumerate examples of bad solutions to avoid:
- The line ending functionality must be enabled automatically in all working directories in all clones without any (or minimal) setup.
- It must be guaranteed that no changesets with invalid line-endings get committed (anywhere or centrally) - the history must be clean without unnecessary changes and the invariant of correct handling of line-endings must never be violated.
- Violations of invariants must be handled properly anyway. Inconsistencies should be detected and be fixed in a non-obtrusive way.
- Line-ending fix-ups should not be mixed with ordinary changes.
- Line-ending fix-up must be configurable so it only happens for some files. Patterns (and meta-data flags if there were any) are hard to maintain and will be wrong. Auto-detection isn't sufficiently exact and might fail in both directions.
- The handling of line-endings might change over time and that must be handled properly.
- Patch handling, merges and mq must work properly in all cases.
- The "native" encoding must be configurable, for example in order to prepare a zip-file for a Windows user on a Unix machine.
The win32text extension
Mercurial already comes with the Win32TextExtension, but it has a number of shortcomings:
- The settings are not part of the repository so all users must configure it for each clone. This is the biggest problem.
- The general encode/decode filters are used, and they can thus by design not be used for anything else.
- The names of the filters provided by the extension are not intuitive (cleverencode: and cleverdecode:). Instead of configuring a given file to be of a given EOL type, one must "clever encode" it on commit and "clever decode" it on checkout. The terminology is imperative rather than declarative.
- The name and the behavior of the extension itself indicates that it solves a Windows problem. That is discriminating and doesn't acknowledge that the real problem is about interoperability between different conventions.
- Bad interaction with the mq extension.
Subversion
Subversion has built-in support for line ending conversion. We have a test case that documents how Subversion handles some cases of inconsistencies. It turns out that
- Subversion will silently recode a file to the specified EOL style if the file has a consistent style. So accidentally changing all EOLs from LF to CRLF won't matter for Subversion -- it will rewrite the file in the working copy upon commit.
- Subversion will also silently update files if the svn:eol-style property is set on them. Until the commit, svn diff is empty, but the file is changed in the working copy on commit.
- Subversion aborts on commit if a file has inconsistent EOL style.
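The behavior observed above can be summarized in a few lines of Python. This is only a sketch of the observed rules, not Subversion's actual code; the function name recode_or_abort and the choice of ValueError are assumptions made for the example.

```python
import re

# Matches any of the three line-ending conventions; CRLF is tried first
# so that it is not split into a separate CR and LF.
_EOL = re.compile(rb"\r\n|\r|\n")

def recode_or_abort(data: bytes, target: bytes) -> bytes:
    """Recode a consistently styled file to `target`, else abort."""
    styles = set(_EOL.findall(data))
    if len(styles) > 1:
        # Mirrors the observed "abort on inconsistent EOL style".
        raise ValueError("inconsistent line endings")
    # A callable replacement avoids any escape processing of `target`.
    return _EOL.sub(lambda match: target, data)
```

For example, recode_or_abort(b"a\nb\n", b"\r\n") silently rewrites the whole file, while a file mixing LF and CRLF raises the error.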
Git
Git also has a lot of options for crlf and whitespace.
Design of the eol extension
Note: This design is work-in-progress together with the corresponding implementation at http://bitbucket.org/mg/hg-eol/.
We will keep configuration in a version controlled file called .hgeol in the repository root. It declares which files the extension should convert. The file could look like this:
    [patterns]
    Windows.txt = CRLF
    Unix.txt = LF
    test/mixed.txt = BIN
    **.txt = native
    **.py = native
    **.proj = CRLF

    [repository]
    native = CRLF
The [patterns] section defines glob patterns for conversions - see hg help patterns. The first pattern wins, so more specific patterns should be put first. In the above example, the test/mixed.txt file is considered BIN since the test/mixed.txt rule matches before the **.txt rule.
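The "first pattern wins" rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the extension's implementation: it parses the file with the standard configparser module and approximates Mercurial's glob patterns with fnmatch, whose semantics are close but not identical. The names load_patterns and declared_style are invented for the example.

```python
import configparser
from fnmatch import fnmatchcase

HGEOL = """\
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

[repository]
native = CRLF
"""

def load_patterns(text):
    """Return the [patterns] entries as (pattern, style) pairs, in file order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of patterns such as Windows.txt
    parser.read_string(text)
    return list(parser["patterns"].items())

def declared_style(path, patterns):
    """Return the style of the first matching pattern, or None if none match."""
    for pattern, style in patterns:
        # Assumption: fnmatch stands in for Mercurial's glob matching.
        if fnmatchcase(path, pattern):
            return style
    return None
```

With these definitions, declared_style("test/mixed.txt", load_patterns(HGEOL)) yields "BIN" because that rule is listed before the **.txt rule.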
This solution has been designed to be minimally invasive, so that repositories without the extension will behave as correctly as possible.
Working directory format
Files with a declared format as CRLF or LF are always checked out in that format, files declared as native are converted to the operating system native format, and files not mentioned (or declared as binary) receive no treatment.
Repository format
Files declared as LF, CRLF, or BIN are stored as-is in the repository. Files declared as native are stored in a configurable repository-native format which defaults to LF.
The repository-native format can optionally be configured in .hgeol in the [repository] section.
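The working directory and repository rules can be sketched as two small Python functions. This is a minimal sketch of the rules as described here, not the extension's code. The helper names are hypothetical, and the caller is assumed to pass the operating system's native format explicitly as "LF" or "CRLF".

```python
def convert(data: bytes, eol: str) -> bytes:
    """Rewrite all CRLF and LF line endings to `eol` ("LF" or "CRLF")."""
    lf = data.replace(b"\r\n", b"\n")  # first normalize everything to LF
    return lf if eol == "LF" else lf.replace(b"\n", b"\r\n")

def to_working(data: bytes, style, os_native: str) -> bytes:
    """Working directory form of a file with the declared `style`."""
    if style is None or style == "BIN":
        return data  # not mentioned or binary: no treatment
    return convert(data, os_native if style == "native" else style)

def to_repository(data: bytes, style, repo_native: str = "LF") -> bytes:
    """Repository form of a file with the declared `style`."""
    if style == "native":
        return convert(data, repo_native)
    return data  # LF, CRLF, BIN and unmentioned files are stored as-is
```

For instance, a file declared native and stored with LF comes out as to_working(b"a\nb\n", "native", "CRLF") == b"a\r\nb\r\n" on a CRLF platform.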
Detailed behavior
The extension will change the behavior of core commands as follows.
hg update
The extension should read the .hgeol file from the target revision. So
hg update -r 100
will read .hgeol from revision 100 and ensure that files have EOLs according to the rules from that revision after the update.
If the working copy is dirty, the following should happen:
- the files are converted back to their repository form (using the encode filter)
- the update is made as normal and files are merged into the target revision
- the files are converted back to their working copy form (using the decode filter)
The idea is that this should let one move changes around "like normal" by basically ignoring the EOL rules while doing so.
hg commit
Files are checked to ensure correct EOLs. If .hgeol is changed, the EOLs in working directory can get out of sync with the .hgeol file. This should make the commit abort with a message:
abort: EOL mis-match in Windows.txt: has LF, but should have CRLF (run "hg eolupdate" to update files)
The hg eolupdate command will rewrite files in the working directory to match the .hgeol file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to .hgeol are made in lock-step with the corresponding file changes. That way things are kept nicely synchronized in the repository.
The eolupdate command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit, by letting hg eolupdate fail if the repository has uncommitted changes, to specifically avoid updating EOLs in a "content" changeset.
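The commit-time check could look roughly like the following Python sketch. The Abort class and the check_eol function are hypothetical stand-ins, and expected is assumed to be the already resolved working directory format ("LF" or "CRLF"); anything else is left unchecked.

```python
class Abort(Exception):
    """Hypothetical stand-in for the error that aborts a commit."""

def check_eol(path: str, data: bytes, expected: str) -> None:
    """Raise Abort if `data` contains line endings other than `expected`."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # LF bytes not preceded by CR
    if expected == "CRLF" and bare_lf:
        found = "LF"
    elif expected == "LF" and crlf:
        found = "CRLF"
    else:
        return  # consistent with the declaration, or not checked at all
    raise Abort(
        f"EOL mis-match in {path}: has {found}, but should have {expected} "
        '(run "hg eolupdate" to update files)'
    )
```

Calling check_eol("Windows.txt", b"a\nb\n", "CRLF") raises an Abort whose text matches the message shown above, without the "abort: " prefix.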
hg add
No checking is done at that time, the check will be done when the files are committed.
hg diff
File content is normalized to the repository form before the diff is computed. The diff is then presented using the working copy form. TODO: the diff is currently shown based on the repository form of files.
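A small Python sketch shows the effect of normalizing before diffing. It uses the standard difflib module rather than Mercurial's own diff code, assumes LF as the repository form, and presents the result in repository form (matching the TODO above). The name normalized_diff is made up for the example.

```python
import difflib

def to_lf(data: bytes) -> bytes:
    """Normalize CRLF to LF (the assumed repository form)."""
    return data.replace(b"\r\n", b"\n")

def normalized_diff(old: bytes, new: bytes, path: bytes) -> list:
    """Unified diff of two file versions after EOL normalization."""
    a = to_lf(old).splitlines(keepends=True)
    b = to_lf(new).splitlines(keepends=True)
    return list(
        difflib.diff_bytes(difflib.unified_diff, a, b, b"a/" + path, b"b/" + path)
    )
```

Two versions that differ only in their line endings produce an empty diff, while a real content change still shows up.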
Discussion of the eol extension
Format naming
Regardless of the implementation details, we are aware that we will need to pick unambiguous names for our various components. For some, native does not stand out as a self-explanatory name, but it does make sense to those exposed to Subversion's svn:eol-style property setting, which inspired this mechanism in the first place.
A naming policy centered on storage might be clearer to end-users: storeasis and storeaslf already describe the behavior on commit, for example. Depending on the implementation, it might be interesting to specify the behaviors on commit and on update separately: "storeasis, getaslf" or "storeaslf, getasis", or "storeascrlf, converttolocal" are too long, but are self-explanatory.
Some suggested that Mercurial should not use CRLF or LF in its names, and should instead use 'Windows' and 'Unix', respectively. One convention can be chosen, or aliases can be used.
Instead of defining the repository format for native in a separate section, it could be part of the format specification, such as native/CRLF, native/LF, or native/auto. The native/auto setting means that files get native EOLs in the working copy, but are otherwise preserved in the repository. So a CRLF file will remain in CRLF format in the repository, but be checked out in LF format on Unix and in CRLF format on Windows.
RAW could perhaps be a better name than BIN.
Content filtering hooks
The eol extension utilizes the generic encode/decode filters, just like win32text does. The filters can thus not be used for anything else, and eol gets some extra complexity in order to work with that interface. Perhaps the extension should build on something other than the current encode/decode filters.
The keyword extension solves a similar problem - perhaps some code can be reused, or perhaps there is a common need for better hooks for content filtering?
Another but very similar problem is conversion between character encodings - for example between UTF-8 with or without BOM, UTF-16 and other multi-byte formats, and 8-bit encodings such as the ISO 8859 variants and the most common Windows code pages. Yet another example could be automatic coding style conversions. Perhaps all these problems have so much in common that they all could be solved at once?
Mercurial should be careful not to lose any information, so it would be nice if a warning was given before any lossy filter was applied. For example, a pure conversion from CRLF to LF isn't lossy, but normalization of a file with inconsistent line-endings is. Perhaps core Mercurial could recognize a .hgfilter file which could look like this:
    [eol]
    **.py = native/lf

    [keywords]
    src/**.py =

    [encoding]
    **.txt = native/utf-8
Extensions could handle a section, and core Mercurial could warn about any unhandled sections. That would help ensure that users had the right extensions enabled. This functionality could also be used to ensure that certain (commit) hooks are enabled in all working clones.
We note that some kinds of filters just ensure an invariant (CRLF or LF (or RAW)) and thus can be applied several times, for example both on checkout and on commit, and as a possible extra fix-up step to ensure the invariant both in the working directory and in the repository. Inconsistency is thus easily fixed. Other kinds of filters convert between different formats without reaching a fix-point, so the filters must be each other's inverses (and probably only partial) and be applied exactly once. Inconsistencies with this kind of filter are hard to clean up.
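The distinction between the two kinds of filters, and the notion of a lossy conversion, can be illustrated with a short Python sketch. All function names are invented for the example, and "lossy" is taken to mean that the input mixes CRLF and LF, so the original per-line endings cannot be recovered afterwards.

```python
import re

def to_lf(data: bytes) -> bytes:
    """Invariant-ensuring filter: the result always uses LF only."""
    return data.replace(b"\r\n", b"\n")

def naive_lf_to_crlf(data: bytes) -> bytes:
    """Format-converting filter: only correct when applied exactly once."""
    return data.replace(b"\n", b"\r\n")

def is_lossy(data: bytes) -> bool:
    """True if normalizing `data` would discard mixed line-ending information."""
    return len(set(re.findall(rb"\r\n|\n", data))) > 1
```

Applying to_lf twice gives the same result as applying it once, so it is safe on both checkout and commit. Applying naive_lf_to_crlf twice turns b"a\n" into b"a\r\r\n", which is the kind of inconsistency that is hard to clean up.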
TODO
- The extension needs testing on Windows (but note a complication: the test suite for Mercurial relies on shell scripts and is therefore somewhat hard to run on Windows; see WindowsTestingPlan).
- Needs testing with merges of files with different EOL styles.
- There is currently no hg eolupdate command.
- Hooks must be implemented.
- Consider: Should we also ensure trailing LF? Common and sometimes needed on Unix, but not common on Windows.
- Should diff and patch work in working directory format or repository format? Neither of them seems obviously right in all cases - perhaps there is a need for both?
- Should working directory or repository format be used for hg archive? And for mq patches? And hgweb?
- What to do with the working directory when .hgeol is changed locally or in an updated/merged changeset, and how to merge a changeset with a new .hgeol with a changeset with new files it matches?
Extension Help Text
This is the module help text. It has been put here for easy editing and to collect all information on this page:
This extension allows you to manage what kind of line endings (CRLF or LF) are used in the repository and in the local working directory.

The extension reads its configuration from a versioned ``.hgeol`` configuration file every time you run an ``hg`` command. ``.hgeol`` has similar syntax to regular Mercurial configuration files. It uses two sections, ``[patterns]`` and ``[repository]``.

Use ``[patterns]`` to specify the encodings to use by file pattern in the working directory. The available encodings are ``LF``, ``CRLF``, and ``BIN``. Additionally, ``native`` is an alias for the platform's default encoding: ``LF`` on Unix (including Mac OS X) and ``CRLF`` on Windows. Note that ``BIN`` (do nothing to line endings) is Mercurial's default behaviour; it's only needed so that later, more specific patterns can override earlier, more general patterns.

You can override the default interpretation of ``native`` by configuring ``eol.native``. Set it to ``LF`` or ``CRLF``.

The repository representation of newlines in files configured as ``native`` can be specified in the ``[repository]`` section in ``.hgeol``. The default is LF, meaning that on Windows, files configured as ``native`` (CRLF) will be converted to LF on commit.

Example versioned ``.hgeol`` file::

    [patterns]
    **.py = native
    **.vcproj = CRLF
    **.txt = native
    Makefile = LF
    **.jpg = BIN

    [repository]
    native = LF

Example ``.hgrc`` (or ``Mercurial.ini``) section::

    [eol]
    native = CRLF

See 'hg help patterns' for more information about the glob patterns used.