EOL Translation Plan
Status: draft
The problem
Different platforms have different conventions for representation of end-of-line in text files. A common feature request for Mercurial is good support for getting native line endings in text files when changesets are shared between different platforms. The internal storage format for Mercurial is bytes, and core Mercurial doesn't care much about what is encoded in the bytes.
Windows traditionally uses CRLF ('\r\n', carriage-return followed by line-feed). The default text editor on Windows, Notepad, only understands CRLF. Command line tools and redirection also use CRLF.
Unix and Linux traditionally use LF ('\n'). Many tools can handle CRLF, but sometimes the native format is essential.
Older versions of Mac OS used CR ('\r'), but Mac OS X and later are Unix-based and use LF.
Line endings are only an issue for text files - not for binary files (whatever counts as binary).
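As an illustration of the three conventions above, the following minimal sketch classifies the line endings found in a byte string. The function names `eol_counts` and `eol_style` are hypothetical and not part of Mercurial.

```python
def eol_counts(data: bytes) -> dict:
    """Count how often each line-ending convention occurs in data."""
    crlf = data.count(b"\r\n")
    # A bare LF or CR is one that is not part of a CRLF pair.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def eol_style(data: bytes) -> str:
    """Return 'CRLF', 'LF', 'CR', 'mixed', or 'none' for data."""
    used = [name for name, n in eol_counts(data).items() if n]
    if not used:
        return "none"
    if len(used) == 1:
        return used[0]
    return "mixed"
```

Because Mercurial stores bytes, a check like this has to work on bytes as well; it says nothing about whether the file is text in the first place.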
Many cross-platform projects are using Mercurial, apparently without major issues. Line-ending handling seems however to be a major issue for the Python project which is migrating to Mercurial, and that is currently the primary driver for improvements in this area.
Requirements
The main requirement is simply to somehow get
- Good support for native line endings for text files
but:
- The devil is in the details - especially when it comes to integration with Mercurial's existing design. (DVCS. hgrc is insecure and cannot be shared. Files are byte sequences.)
- It is tempting to see other VCSs' solutions to this problem as a specification of what Mercurial should do. (Mercurial != not Mercurial).
- It is tempting to let previous bad experience (with Mercurial with or without win32text) lead directly to specific requirements for a new solution.
Here is an attempt to describe some requests / requirements / challenges-to-be-considered. Some are obligatory to some people, some consider some of them nice-to-have, and some would consider implementing some of them a misfeature. What is listed here should be essential to the problem, not tied to a specific solution, and should not merely enumerate examples of bad solutions to avoid:
- The line ending functionality must be enabled automatically in all working directories in all clones without any (or with minimal) setup.
- It must be guaranteed that no changesets with invalid line-endings get committed (anywhere or centrally) - the history must be clean without unnecessary changes and the invariant of correct handling of line-endings must never be violated.
- Violations of invariants must be handled properly anyway. Inconsistencies should be detected and be fixed in a non-obtrusive way.
- Line-ending fix-ups should not be mixed with ordinary changes.
- Line-ending fix-up must be configurable so it only happens for some files. Patterns (and meta-data flags if there were any) are hard to maintain and will be wrong. Auto-detection isn't sufficiently exact and might fail in both directions.
- The handling of line-endings might change over time and that must be handled properly.
- Patch handling, merges and mq must work properly in all cases.
- The "native" encoding must be configurable, for example in order to prepare a zip-file for a Windows user on a Unix machine.
The win32text extension
Mercurial already comes with the win32text extension, but it has a number of shortcomings:
- The settings are not part of the repository so all users must configure it for each clone. This is the biggest problem.
- The general encode/decode filters are used, and they can thus by design not be used for anything else.
- The names of the filters provided by the extension are not intuitive (cleverencode: and cleverdecode:). Instead of configuring a given file to be of a given EOL type, one must "clever encode" it on commit and "clever decode" it on checkout. The terminology is imperative rather than declarative.
- The name and the behavior of the extension itself indicate that it solves a Windows problem. That is discriminatory and doesn't acknowledge that the real problem is about interoperability between different conventions.
- Sometimes innocent users get spurious diffs, apparently without any changes, just to fix the line ending. This is often caused by others not using the extension properly.
- Bad interaction with the mq extension.
Subversion
Subversion has built-in support for line ending conversion.
How Subversion commits
We have a test case that documents how Subversion handles some cases of inconsistencies. It turns out that:
- Subversion will silently recode a file to the specified EOL style if the file has a consistent style. So accidentally changing all EOLs from LF to CRLF won't matter for Subversion -- it will rewrite the file in the working copy upon commit.
- Subversion will also silently update files if the svn:eol-style property is set on them. Until the commit, svn diff is empty, but the file is changed in the working copy on commit.
- Subversion aborts on commit if a file has inconsistent EOL style.
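The observed commit behavior can be modelled roughly as follows. This is only a sketch of the rules listed above, not Subversion's actual code; `svn_like_commit` and `InconsistentEOL` are hypothetical names.

```python
class InconsistentEOL(Exception):
    """Raised when a file mixes several line-ending styles."""


def svn_like_commit(data: bytes, target_eol: bytes) -> bytes:
    """Recode data to target_eol if its EOL style is consistent, else abort."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # bare LF only
    cr = data.count(b"\r") - crlf   # bare CR only
    if sum(1 for n in (crlf, lf, cr) if n) > 1:
        raise InconsistentEOL("file has inconsistent EOL style")
    # Consistent (or no EOLs at all): normalize to LF, then to the target.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", target_eol)
```

For example, `svn_like_commit(b"a\r\nb\r\n", b"\n")` returns `b"a\nb\n"`, while a file containing both CRLF and bare LF raises `InconsistentEOL`.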
Design of hg-eol
Note: This design is work-in-progress together with the corresponding implementation at http://bitbucket.org/mg/hg-eol/.
We will keep configuration in a version controlled file called .hgeol in the repository root. It declares which files the extension should convert. The file could look like this:
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

[repository]
native = CRLF
The [patterns] section defines conversions for patterns - see hg help patterns (glob format is default). The first pattern wins, so more specific patterns should be put first. In the above example, the test/mixed.txt file is considered BIN since the test/mixed.txt rule matches before the **.txt rule.
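The "first pattern wins" rule can be sketched as below. This is an illustration only: `glob_to_regex`, `load_rules` and `style_for` are hypothetical helpers, and the glob translation is a simplified approximation of Mercurial's pattern syntax (only `**`, `*` and `?` are handled), not the matcher the extension uses.

```python
import configparser
import re

HGEOL = """\
[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

[repository]
native = CRLF
"""


def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # '**' may cross directory separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # '*' stays within one path component
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def load_rules(text):
    """Return the [patterns] entries as (regex, style) pairs in file order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str        # keep file names case-sensitive
    parser.read_string(text)
    return [(glob_to_regex(p), style) for p, style in parser.items("patterns")]


def style_for(path, rules):
    """Return the style of the first matching pattern, or None if none match."""
    for regex, style in rules:
        if regex.fullmatch(path):
            return style
    return None
```

With these rules, `style_for("test/mixed.txt", load_rules(HGEOL))` yields `"BIN"` because the specific rule is listed before `**.txt`; swapping the two lines in the file would make it `"native"`.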
This solution has been designed to be minimally invasive, so that repositories without the extension will behave as correctly as possible.
Working directory format
Files with a declared format as CRLF or LF are always checked out in that format, files declared as native are converted to the operating system native format, and files not mentioned (or declared as binary) receive no treatment.
Repository format
Files declared as LF, CRLF, or BIN are stored as-is in the repository. Files declared as native are stored in a configurable repository-native format which defaults to LF.
The repository-native format can be configured in .hgeol in the [repository] section.
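The two formats can be summarized with a pair of conversion functions. This is a minimal sketch of the rules as stated here, not the extension's code; `to_working`, `to_repo` and the `os_native` / `repo_native` parameters are hypothetical names.

```python
import os

EOL_BYTES = {"LF": b"\n", "CRLF": b"\r\n"}


def _convert(data, eol):
    """Rewrite all CRLF and LF line endings in data to eol."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", eol)


def to_working(data, style, os_native=None):
    """Convert repository content to its working directory form."""
    if os_native is None:
        # Assumption: derive the platform convention from os.linesep.
        os_native = "CRLF" if os.linesep == "\r\n" else "LF"
    if style == "native":
        return _convert(data, EOL_BYTES[os_native])
    if style in EOL_BYTES:
        return _convert(data, EOL_BYTES[style])
    return data  # BIN or not mentioned: no treatment


def to_repo(data, style, repo_native="LF"):
    """Convert working directory content to its repository form."""
    if style == "native":
        return _convert(data, EOL_BYTES[repo_native])
    return data  # LF, CRLF and BIN are stored as-is
```

The `os_native` argument is exposed so that, for instance, a Unix machine can produce CRLF files for a Windows user, as asked for in the requirements.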
Detailed behavior
The extension will change the behavior of core commands as follows.
hg update
The extension should read the .hgeol file from the target revision. TODO: it currently always uses tip.
So
hg update -r 100
will read .hgeol from revision 100 and ensure that files have EOLs according to the rules from that revision after the update.
If the working copy is dirty, the following should happen:
- the files are converted back to their repository form (using the encode filter)
- the update is made as normal and files are merged into the target revision
- the files are converted back to their working copy form (using the decode filter)
The idea is that this should let one move changes around "like normal" by basically ignoring the EOL rules while doing so.
hg commit
Files are checked to ensure correct EOLs. If .hgeol is changed, the EOLs in the working directory can get out of sync with the .hgeol file. This should make the commit abort with a message:
abort: EOL mis-match in Windows.txt: has LF, but should have CRLF (run "hg eolupdate" to update files)
The hg eolupdate command will rewrite files in the working directory to match the .hgeol file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to .hgeol are made in lock-step with the corresponding file changes. That way things are kept nicely synchronized in the repository.
The eolupdate command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit, by letting hg eolupdate fail if the repository has uncommitted changes, to specifically avoid updating EOLs in a "content" changeset.
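The commit-time check might look roughly like this. It is a sketch under the assumption that only CRLF and LF need to be told apart; `check_eol` and this `Abort` class are hypothetical stand-ins, not Mercurial's own API.

```python
class Abort(Exception):
    """Stand-in for the error that makes a commit abort."""


def check_eol(name, data, expected):
    """Raise Abort if data contains line endings other than expected."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # bare LF only
    if expected == "CRLF" and lf:
        found = "LF"
    elif expected == "LF" and crlf:
        found = "CRLF"
    else:
        return  # nothing to complain about
    raise Abort(
        'EOL mis-match in %s: has %s, but should have %s '
        '(run "hg eolupdate" to update files)' % (name, found, expected)
    )
```

Calling `check_eol("Windows.txt", b"a\nb\n", "CRLF")` raises `Abort` with the message shown above, while a file that already matches its declared style passes silently.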
hg add
No checking is done at that time; the check is done when the files are committed.
hg diff
File content is normalized to the repository form before the diff is computed. The diff is then presented using the working copy form. TODO: the diff is currently shown based on the repository form of files.
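The normalization step can be illustrated with Python's difflib. The sketch assumes UTF-8 text and LF as the repository form, and only covers the normalization, not the presentation in working copy form; `eol_insensitive_diff` is a hypothetical name.

```python
import difflib


def to_repo_form(data):
    """Assumed repository form: LF line endings."""
    return data.replace(b"\r\n", b"\n")


def eol_insensitive_diff(old, new):
    """Return unified diff lines computed on the repository form of both sides."""
    a = to_repo_form(old).decode("utf-8").splitlines(keepends=True)
    b = to_repo_form(new).decode("utf-8").splitlines(keepends=True)
    return list(difflib.unified_diff(a, b, "a/file", "b/file"))
```

A file whose only difference is LF versus CRLF produces an empty diff, and a real content change shows up without every line being reported as modified.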
Discussion of hg-eol
Format naming
Regardless of the implementation details, we are aware that we will need to pick unambiguous names for our various components. For some, native does not stand out as a name that is self-explanatory, but it does make sense to those exposed to Subversion's svn:eol-style property setting which inspired this mechanism in the first place.
A naming policy centered on storage might be clearer to end-users: storeasis and storeaslf already depict the behavior on commit, for example. Depending on the implementation, it might be interesting to specify the behaviors on commit and on update separately: "storeasis, getaslf" or "storeaslf, getasis", or "storeascrlf, converttolocal" are too long, but are self-explanatory.
Some have suggested that Mercurial should not use crlf or lf in the names, and should instead use 'windows' and 'unix', respectively. One convention can be chosen, or aliases can be used.
Instead of defining the repository format for native in a separate section, it could be a part of the format specification, such as NATIVE/CRLF or NATIVE/LF (or even NATIVE/LS).
The name LS could be used for a format using the Unicode Line Separator, U+2028, especially in the repository. That would ensure that all users of a repository use the extension and that no platform gets second-class treatment.
RAW could perhaps be a better name than BIN.
Content filtering hooks
The hg-eol extension uses the generic encode/decode filters, just like win32text does. The filters can thus not be used for anything else, and hg-eol gets some extra complexity in order to work with that interface. Perhaps the extension should build on something other than the current encode/decode filters.
The keyword extension solves a similar problem - perhaps some code can be reused, or perhaps there is a common need for better hooks for content filtering?
Another but very similar problem is conversion between character encodings - for example between UTF-8 with or without BOM, UTF-16 and other multi-byte formats, and 8-bit encodings such as the ISO 8859 variants and the most common Windows code pages. Yet another example could be automatic coding style conversions. Perhaps all these problems have so much in common that they all could be solved at once?
TODO
- The extension needs testing on Windows.
- Needs testing with merges of files with different EOL styles.
- There is currently no hg eolupdate command.
- The right .hgeol file should be used (from working directory or parent revision, not tip revision).
- Hooks must be implemented.
- Consider: Should we also ensure trailing LF? Common and sometimes needed on Unix, but not common on Windows.
- Should diff and patch work in working directory format or repository format? Neither of them seems obviously right in all cases - perhaps there is a need for both?
- Should wd or repo format be used for hg archive? And for mq patches? And hgweb?
- What to do with the working directory when .hgeol is changed locally or in an updated/merged changeset, and how to merge a changeset with a new .hgeol with a changeset with new files it matches?