Tracking File Encoding in Mercurial

Clients that display text files do not always have enough information to decode file contents to Unicode reliably, which leads to incorrectly decoded and rendered text in some circumstances.

This proposal describes a mechanism for users to explicitly declare file encoding at the individual file level as an aid to the client. Encoding metadata is versioned and stored alongside the repository's contents.

Problem Statement

For files that are neither plain ASCII/latin-1 nor self-identifying (for example through a Unicode byte order mark), the client is left guessing how to interpret the contents. Different clients apply different approaches with varying levels of success.

As an example, Bitbucket assumes that all files in all repositories use UTF-8. If decoding fails, it runs the first kilobyte through an encoding detection library that applies heuristics to guess the encoding. If decoding fails again, it switches back to UTF-8, this time forcing the decode and replacing invalid byte sequences, which visibly mutilates the contents.
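The three-step fallback described above can be sketched roughly as follows. This is an illustration only, not Bitbucket's actual code: the function name decode_for_display and the Detector callable are hypothetical, and the detection library is deliberately left as a pluggable parameter.

```python
from typing import Callable, Optional

# Hypothetical detector signature: takes a byte sample and returns an
# encoding name, or None when it cannot make a guess. A real client would
# plug an encoding detection library in here.
Detector = Callable[[bytes], Optional[str]]


def decode_for_display(data: bytes, detect: Detector) -> str:
    # Step 1: assume UTF-8 and decode strictly.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: let the heuristic inspect the first kilobyte only.
    guess = detect(data[:1024])
    if guess is not None:
        try:
            return data.decode(guess)
        except (UnicodeDecodeError, LookupError):
            pass

    # Step 3: force UTF-8, replacing undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")
```

Note that step 2 can succeed with the wrong encoding, and step 3 never fails but loses information. Both are cases that an explicit declaration is meant to avoid.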

Even when detection succeeds, there is no guarantee that the detected encoding is the one the user actually used. Encoding schemes overlap, and a particular byte sequence can be valid in multiple schemes while mapping to distinct Unicode code points.
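A small illustration of this overlap, using Python's built-in codecs: the single byte 0xE9 decodes without error under several 8-bit encodings, each time to a different character. The choice of the three encodings is arbitrary.

```python
raw = b"\xe9"

# The same byte is valid in each of these encodings, but means something
# different every time.
decoded = {name: raw.decode(name) for name in ("cp1252", "cp1251", "iso8859_7")}

for name, text in decoded.items():
    print(f"{name}: U+{ord(text):04X}")
```

Under cp1252 the byte is a Latin "é" (U+00E9), under cp1251 a Cyrillic "й" (U+0439), and under ISO 8859-7 a Greek "ι" (U+03B9). None of the three decodes raises an error, so a failure-based check cannot tell them apart.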

Current Situation

Mercurial currently offers a global way to provide an encoding hint through $HGENCODING. This has limitations, however:

It is not versioned or stored alongside the repository's contents. Forks and upstreams therefore do not inherit this metadata, each service must be configured individually, and knowledge must be exchanged through separate means.
A single encoding is inappropriate for repositories that use multiple file encodings. Examples include a project that switched encodings at some point in its history, or a large "mono-repo" containing diverse, independent sub-projects.

File Format and Syntax

Similar to .hgignore, .hgencoding contains one file matching pattern per line, but unlike .hgignore, each pattern is followed by an encoding, e.g.:

    syntax: glob
    **.cs = windows-1252

Empty lines are skipped. The # character is treated as a comment character, and the \ character is treated as an escape character.

Mercurial supports several pattern syntaxes. The default syntax used is Python/Perl-style regular expressions.

To change the syntax used, use a line of the following form:

    syntax: NAME

where NAME is one of the following:

regexp: Regular expression, Python/Perl syntax.
glob: Shell-style glob.

The chosen syntax stays in effect when parsing all patterns that follow, until another syntax is selected.
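The line-level rules above could be parsed along these lines. This is a minimal sketch, not part of the proposal: the names Rule and parse_hgencoding are hypothetical, only \# is treated as an escape, rules are split at the last "=" (an assumption, since regular expressions may themselves contain "="), and malformed lines or unknown syntax names are silently skipped.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    syntax: str    # "regexp" or "glob"
    pattern: str
    encoding: str


def _strip_comment(line: str) -> str:
    """Drop an unescaped # comment and turn an escaped one into a literal #."""
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] == "#":
            out.append("#")
            i += 2
            continue
        if ch == "#":
            break
        out.append(ch)
        i += 1
    return "".join(out)


def parse_hgencoding(text: str) -> List[Rule]:
    syntax = "regexp"  # the default syntax
    rules = []
    for raw in text.splitlines():
        line = _strip_comment(raw).strip()
        if not line:
            continue  # empty lines are skipped
        if line.startswith("syntax:"):
            name = line[len("syntax:"):].strip()
            if name in ("regexp", "glob"):
                syntax = name  # stays in effect until changed again
            continue
        # Assumption: split at the LAST "=", as regexps may contain "=".
        pattern, sep, encoding = line.rpartition("=")
        if not sep or not pattern.strip() or not encoding.strip():
            continue  # malformed line; ignored in this sketch
        rules.append(Rule(syntax, pattern.strip(), encoding.strip()))
    return rules
```

Each parsed rule carries the syntax that was active when it was read, so a later matcher does not need to re-scan the file to know how to interpret a pattern.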

Neither glob nor regexp patterns are rooted. A glob-syntax pattern of the form *.c will match a file ending in .c in any directory, and a regexp pattern of the form \.c$ will do the same. To root a regexp pattern, start it with ^.
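To illustrate what "not rooted" means in practice, here is a simplified matcher. It is a sketch under stated assumptions: glob_to_regex and matches are hypothetical helpers, the glob translation covers only *, ** and ?, and it matches file paths only. Mercurial's real pattern handling is considerably richer.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a small glob subset: ** crosses "/", while * and ? do not."""
    out = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            out.append(".*")
            i += 2
        elif glob[i] == "*":
            out.append("[^/]*")
            i += 1
        elif glob[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(glob[i]))
            i += 1
    # Unrooted: the pattern may start at the repository root or after any "/".
    return r"(?:^|/)" + "".join(out) + r"$"


def matches(syntax: str, pattern: str, path: str) -> bool:
    if syntax == "glob":
        return re.search(glob_to_regex(pattern), path) is not None
    # Regexp patterns are searched anywhere in the path unless they start with ^.
    return re.search(pattern, path) is not None
```

With these helpers, matches("glob", "*.c", "src/lib/util.c") and matches("regexp", r"\.c$", "src/lib/util.c") are both true, while the rooted pattern ^docs/ matches "docs/a.txt" but not "src/docs/a.txt".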

Subdirectories can have their own .hgencoding settings by adding subinclude:path/to/subdir/.hgencoding to the root .hgencoding. See hg help patterns for details on subinclude: and include:.

Contract

The rules established in .hgencoding are meant to provide decoding hints to clients that opt in to this system. However, neither Mercurial nor any part of its ecosystem is required to strictly enforce them.

The expectation is that graphical display clients that currently rely on decoding file contents to Unicode, like web apps and desktop clients, will take advantage of the additional metadata, while core Mercurial utilities including hg merge and hg diff will not.

In the event that a rule's encoding fails to decode a file's contents, clients are free to fall back to current methods of interpretation, including heuristic-based guessing, forced decoding, or treating the file as binary. No logging or prompting is required.

The same is true for values that are not valid encoding names, or encodings whose definition is unknown to the client.
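A client honoring this contract might treat the hint roughly as follows. This is a sketch only: decode_with_hint is a hypothetical name, and the fallback parameter stands for whatever interpretation the client already uses today.

```python
import codecs
from typing import Callable, Optional


def decode_with_hint(
    data: bytes,
    hinted: Optional[str],
    fallback: Callable[[bytes], str],
) -> str:
    if hinted is not None:
        try:
            codecs.lookup(hinted)       # LookupError for unknown encoding names
            return data.decode(hinted)  # UnicodeDecodeError if the bytes do not fit
        except (LookupError, UnicodeDecodeError):
            pass  # the hint is advisory: fall through without logging or prompting
    # No usable hint: behave exactly as the client did before.
    return fallback(data)
```

Invalid names, unknown encodings and decode failures all take the same path, so a bad rule in .hgencoding can never make a file less readable than it is without the rule.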

File Name Encoding

Decoding ambiguities apply to file contents as well as to file names in the bytes-based manifest. This spec applies to the former only and does not address manifest parsing.

Precedents From Other SCMs

While the problem of tracking encoding metadata for repository contents is not unique to Mercurial, the proposal set forth herein is. The situation in Git in particular is very similar, yet we choose not to adopt its approach verbatim.

Git offers a fairly generic system for tracking file metadata through its .gitattributes files. These files are also checked into the repository, apply glob-based path matching and allow hierarchical overrides. Their contents are meant as hints rather than strict decoding requirements, and many tools, including all of Git's own non-graphical tools, ignore them.

Yet our approach differs subtly in several respects:

It uses a dedicated file, not a generic one that collects metadata for various independent purposes.
Its pattern syntax follows .hgignore's distinction between globs and regexps.
Its rule syntax follows Mercurial's merge-patterns style of separating pattern and value with =.
While nested file overrides are provided, they only take effect when declared explicitly from files higher up the directory structure, following .hgignore's (sub)include system.

These deviations are deliberate, as following Git directly would cause inconsistencies with existing mechanisms like .hgignore and .hgeol. It could also implicitly set unrealistic expectations for users with regard to other things managed by .gitattributes that Mercurial already handles differently (e.g. line endings).
