= EOL Translation Plan =

This page is historical, see the EolExtension.

'''Status: draft'''

A plan for improved handling of EOL translation.

== The problem ==

Different platforms have different conventions for representing end-of-line in text files. A common feature request for Mercurial is good support for getting native line endings in text files when changesets are shared between different platforms. The internal storage format for Mercurial is bytes, and core Mercurial doesn't care much about what is encoded in the bytes.

 * Windows traditionally uses CRLF (`'\r\n'`, carriage-return followed by line-feed). The default text editor on Windows, Notepad, only understands CRLF. Command line tools and redirection also use CRLF.
 * Unix and Linux traditionally use LF (`'\n'`). Many tools can handle CRLF, but sometimes the native format is essential.
 * Older versions of Mac OS used CR (`'\r'`), but Mac OS X and later are Unix-based and use LF.

=== Requirements ===

The main requirement is simply to somehow get

 * Good support for native line endings for text files

What follows is an attempt to describe some requests, requirements, and challenges to be considered. Some are obligatory to some users, some consider some of them nice-to-have, and some would consider implementing some of them misfeatures. What is listed here should be essential to the problem, not tied to a specific solution, and not just an enumeration of bad solutions to avoid:

 * The line ending functionality must be enabled automatically in all working directories in all clones with no (or minimal) setup.
 * It must be guaranteed that no changesets with invalid line endings get committed (anywhere or centrally) - the history must be clean without unnecessary changes, and the invariant of correct handling of line endings must never be violated.
 * Violations of invariants must be handled properly anyway. Inconsistencies should be detected and fixed in a non-obtrusive way.
 * Line-ending fix-ups should not be mixed with ordinary changes.
 * Line-ending fix-up must be configurable so it only happens for some files. Patterns (and meta-data flags if there were any) are hard to maintain and will be wrong. Auto-detection isn't sufficiently exact and can err in either direction.
 * The handling of line endings might change over time, and that must be handled properly.
 * Patch handling, merges and `mq` must work properly in all cases.
 * The "native" encoding must be configurable, for example in order to prepare a zip-file for a Windows user on a Unix machine.

=== The win32text extension ===

Mercurial already comes with the Win32TextExtension, but it has a number of shortcomings:

 * The settings are not part of the repository, so all users must configure it for each clone. This is the biggest problem.
 * The general encode/decode filters are used, and by design they can thus not be used for anything else.
 * The names of the filters provided by the extension are not intuitive (`cleverencode:` and `cleverdecode:`). Instead of configuring a given file to be of a given EOL type, one must "clever encode" it on commit and "clever decode" it on checkout. The terminology is imperative rather than declarative.
 * The name and the behavior of the extension itself indicate that it solves a Windows problem. That is discriminating and doesn't acknowledge that the real problem is about interoperability between different conventions.
 * Bad interaction with the `mq` extension.

=== Subversion ===

Subversion has built-in support for line ending conversion. We have a [[http://bitbucket.org/mg/hg-eol/src/tip/tests/test-svn|test case]] that documents how Subversion handles some cases of inconsistencies. It turns out that

 * Subversion will '''silently recode''' a file to the specified EOL style if the file has a consistent style. So accidentally changing all EOLs from LF to CRLF won't matter for Subversion: it will rewrite the file in the working copy upon commit.
 * Subversion will also '''silently update''' files if the `svn:eol-style` property is set on them. Until the commit, `svn diff` is empty, but the file is changed in the working copy on commit.
 * Subversion '''aborts''' on commit if a file has inconsistent EOL style.

=== Git ===

Git also has a lot of [[http://www.kernel.org/pub/software/scm/git/docs/git-config.html|options]] for crlf and whitespace.

= Design of the eol extension =

Note: This design is work-in-progress together with the corresponding implementation at http://bitbucket.org/mg/hg-eol/.

We will keep configuration in a version controlled file called `.hgeol` in the repository root. It declares which files the extension should convert. The file could look like this:

{{{
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

[repository]
native = CRLF
}}}

The `[patterns]` section defines glob patterns for conversions - see `hg help patterns`. The first matching pattern wins, so more specific patterns should be put first. In the above example, the `test/mixed.txt` file is considered `BIN` since the `test/mixed.txt` rule matches before the `**.txt` rule.

This solution has been designed to be minimally invasive, so that repositories without the extension will behave as correctly as possible.

== Working directory format ==

Files declared as `CRLF` or `LF` are always checked out in that format, files declared as `native` are converted to the operating system's native format, and files not mentioned (or declared as binary) receive no treatment.

== Repository format ==

Files declared as `LF`, `CRLF`, or `BIN` are stored as-is in the repository. Files declared as `native` are stored in a configurable repository-native format which defaults to `LF`. The repository-native format can optionally be configured in the `[repository]` section of `.hgeol`.

== Detailed behavior ==

The extension will change the behavior of core commands as follows.
=== hg update ===

The extension should read the `.hgeol` file from the target revision. So

{{{
hg update -r 100
}}}

will read `.hgeol` from revision 100 and ensure that files have EOLs according to the rules from that revision after the update.

If the working copy is dirty, the following should happen:

 * the files are converted back to their repository form (using the encode filter)
 * the update is made as normal and files are merged into the target revision
 * the files are converted back to their working copy form (using the decode filter)

The idea is that this should let one move changes around "like normal" by basically ignoring the EOL rules while doing so.

=== hg commit ===

Files are checked to ensure correct EOLs. If `.hgeol` is changed, the EOLs in the working directory can get out of sync with the `.hgeol` file. This should make the commit abort with a message:

{{{
abort: EOL mis-match in Windows.txt: has LF, but should have CRLF
(run "hg eolupdate" to update files)
}}}

The `hg eolupdate` command will rewrite files in the working directory to match the `.hgeol` file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to `.hgeol` are made in lock-step with the corresponding file changes. That way things are kept nicely synchronized in the repository.

The `eolupdate` command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit by letting `hg eolupdate` fail if the repository has uncommitted changes, specifically to avoid updating EOLs in a "content" changeset.

=== hg add ===

No checking is done at that time; the check will be done when the files are committed.

=== hg diff ===

File content is normalized to the repository form before the diff is computed. The diff is then presented using the working copy form.

'''TODO''': the diff is currently shown based on the repository form of files.
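As a rough illustration of the design described in this section, the first-match pattern lookup, the two conversion directions, and the commit check could be sketched as follows. This is not the extension's code: the names `style_for`, `to_working`, `to_repo` and `eol_mismatch` are hypothetical, the rule table is copied from the `.hgeol` example, and Python's `fnmatch` is used as a simplified stand-in for Mercurial's own glob matching.

```python
import fnmatch

# Hypothetical rule table mirroring the [patterns] example; order matters.
PATTERNS = [
    ("Windows.txt", "CRLF"),
    ("Unix.txt", "LF"),
    ("test/mixed.txt", "BIN"),
    ("**.txt", "native"),
    ("**.py", "native"),
    ("**.proj", "CRLF"),
]
EOL_BYTES = {"LF": b"\n", "CRLF": b"\r\n"}


def style_for(path, patterns=PATTERNS):
    """Return the declared style for path; the first matching pattern wins."""
    for pattern, style in patterns:
        # Simplification (assumption): fnmatch's '*' already crosses '/',
        # so '**' is collapsed to '*' instead of using Mercurial's matcher.
        if fnmatch.fnmatchcase(path, pattern.replace("**", "*")):
            return style
    return "BIN"  # files not mentioned receive no treatment


def convert(data, eol):
    """Rewrite every CRLF or LF line ending in data (bytes) to eol."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", eol)


def to_working(path, data, os_native="LF"):
    """Decode direction: repository form -> working directory form."""
    style = style_for(path)
    if style == "BIN":
        return data
    if style == "native":
        style = os_native
    return convert(data, EOL_BYTES[style])


def to_repo(path, data, repo_native="LF"):
    """Encode direction: only 'native' files are converted on the way in."""
    if style_for(path) != "native":
        return data  # LF, CRLF and BIN files are stored as-is
    return convert(data, EOL_BYTES[repo_native])


def eol_mismatch(path, data):
    """Return an abort-style message if data violates a fixed style."""
    style = style_for(path)
    bare_lf = b"\n" in data.replace(b"\r\n", b"")
    if style == "CRLF" and bare_lf:
        return "EOL mis-match in %s: has LF, but should have CRLF" % path
    if style == "LF" and b"\r\n" in data:
        return "EOL mis-match in %s: has CRLF, but should have LF" % path
    return None
```

In this sketch `os_native` and `repo_native` are passed in explicitly and default to `LF`, which matches the default repository format stated above; a real implementation would take them from the platform and from the `[repository]` section of `.hgeol`.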
= Discussion of the eol extension =

== Format naming ==

Regardless of the implementation details, we are aware that we will need to pick unambiguous names for our various components. For some, `native` does not stand out as a self-explanatory name, but it does make sense to those exposed to Subversion's `svn:eol-style` property, which inspired this mechanism in the first place.

A naming policy centered on storage might be clearer to end-users: `storeasis` and `storeaslf`, for example, already describe the behavior on commit. Depending on the implementation, it might be interesting to specify the behaviors on commit and on update separately: "`storeasis, getaslf`", "`storeaslf, getasis`", or "`storeascrlf, converttolocal`" are too long, but are self-explanatory.

Some suggested that Mercurial should not use `CRLF` or `LF` in its names, and should instead use 'Windows' and 'Unix', respectively. One convention can be chosen, or aliases can be used.

Instead of defining the repository format for `native` in a separate section, it could be a part of the format specification, such as `native/CRLF`, `native/LF`, or `native/auto`. The `native/auto` setting means that files are stored with native EOLs in the working copy, but otherwise preserved in the repository. So a CRLF file will remain in CRLF format in the repository, but be checked out in LF format on Unix and in CRLF format on Windows.

`RAW` could perhaps be a better name than `BIN`.

== Content filtering hooks ==

The eol extension utilizes the generic encode/decode filters, just like win32text does. The filters can thus not be used for anything else, and eol gets some extra complexity in order to work with that interface.

Perhaps the extension should build on something other than the current encode/decode filters. The keyword extension solves a similar problem - perhaps some code can be reused, or perhaps there is a common need for better hooks for content filtering?

Another, very similar problem is conversion between character encodings - for example between UTF-8 with or without BOM, UTF-16 and other multi-byte formats, and 8-bit encodings such as the ISO 8859 variants and the most common Windows code pages. Yet another example could be automatic coding style conversions. Perhaps all these problems have so much in common that they could all be solved at once?

Mercurial should be careful not to lose any information, so it would be nice if a warning was given before any lossy filter was applied. For example, a pure conversion from CRLF to LF isn't lossy, but normalization of a file with inconsistent line endings is.

Perhaps core Mercurial could recognize a `.hgfilter` file which could look like this:

{{{
[eol]
**.py = native/lf

[keywords]
src/**.py =

[encoding]
**.txt = native/utf-8
}}}

Extensions could handle a section, and core Mercurial could warn about any unhandled sections. That would help ensure that users had the right extensions enabled. This functionality could also be used to ensure that certain (commit) hooks are enabled in all working clones.

We note that some kinds of filters just ensure an invariant (CRLF or LF (or RAW)) and thus can be applied several times, for example both on checkout and on commit, and as a possible extra fix-up step to ensure the invariant both in the working directory and in the repository. Inconsistency is thus easily fixed. Other kinds of filters convert between different formats without reaching one fix-point, so the filters must be each other's inverses (and probably only partial) and applied exactly once. Inconsistencies with this kind of filter are hard to clean up.

== TODO ==

 * The extension needs testing on Windows (but note a complication: the test suite for Mercurial relies on shell scripts and is therefore somewhat hard to run on Windows; see WindowsTestingPlan)
 * Needs testing with merges of files with different EOL styles.
 * There is currently no `hg eolupdate` command.
 * Hooks must be implemented.
 * Consider: Should we also ensure a trailing LF? Common and sometimes needed on Unix, but not common on Windows.
 * Should diff and patch work in working directory format or repository format? Neither of them seems obviously right in all cases - perhaps there is a need for both?
 * Should working directory or repository format be used for `hg archive`? And for `mq` patches? And hgweb?
 * What to do with the working directory when `.hgeol` is changed locally or in an updated/merged changeset, and how to merge a changeset with a new `.hgeol` with a changeset with new files it matches?

== Extension Help Text ==

This is the module help text. It has been put here for easy editing and to collect all information on this page:

{{{
This extension allows you to manage what kind of line endings (CRLF or
LF) are used in the repository and in the local working directory.

The extension reads its configuration from a versioned ``.hgeol``
configuration file every time you run an ``hg`` command. ``.hgeol`` has
similar syntax to regular Mercurial configuration files. It uses two
sections, ``[patterns]`` and ``[repository]``.

Use ``[patterns]`` to specify the encodings to use by file pattern in
the working directory. The available encodings are ``LF``, ``CRLF``,
and ``BIN``. Additionally, ``native`` is an alias for the platform's
default encoding: ``LF`` on Unix (including Mac OS X) and ``CRLF`` on
Windows. Note that ``BIN`` (do nothing to line endings) is Mercurial's
default behaviour; it's only needed so that later, more specific
patterns can override earlier, more general patterns.

You can override the default interpretation of ``native`` by
configuring ``eol.native``. Set it to ``LF`` or ``CRLF``.

The repository representation of newlines in files configured as
``native`` can be specified in the ``[repository]`` section in
``.hgeol``. The default is LF, meaning that on Windows, files
configured as ``native`` (CRLF) will be converted to LF on commit.

Example versioned ``.hgeol`` file::

  [patterns]
  **.py = native
  **.vcproj = CRLF
  **.txt = native
  Makefile = LF
  **.jpg = BIN

  [repository]
  native = LF

Example ``.hgrc`` (or ``Mercurial.ini``) section::

  [eol]
  native = CRLF

See 'hg help patterns' for more information about the glob patterns
used.
}}}

----
CategoryDeveloper