(part of InternationalizationPlan)
To allow interoperability between users with different character encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.
Elements that need to be transcoded
- Usernames
- Commit messages
- Tags
- Branch names
Files and encodings
- Changelog - UTF-8 (globally distributed)
- .hgtags - UTF-8 (globally distributed, mostly managed by hg)
- .hgrc - local (locally managed)
- .hg/localtags - local (locally managed)
- .hg/branch - local (no special reason)
- .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate when we changed locales)
Things that need to be done
- add local encoding detection in util, with environment override
- add transcoding functions to util
  - tolocal - decode stored data from UTF-8 robustly, falling back to Latin-1 on failure
  - fromlocal - transcode local strings to UTF-8 with "strict" by default
- transcode usernames and commit messages
- transcode tags
- transcode branch data
- use UTF-8 in hgweb
- add --encoding and --encodingmode global options
- add a test
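As a rough illustration of the detection and fromlocal items above, here is a minimal Python sketch. The HGENCODING variable name, the detect_encoding helper, and both signatures are assumptions made for this example; they are not a description of the actual util code.

```python
import locale
import os


def detect_encoding(environ=None):
    """Return the local charset, letting an environment variable override it."""
    environ = os.environ if environ is None else environ
    # HGENCODING is an assumed name for the override variable in this sketch.
    override = environ.get("HGENCODING")
    if override:
        return override
    # Otherwise ask the locale; fall back to ASCII if nothing is reported.
    return locale.getpreferredencoding(False) or "ascii"


def fromlocal(data, encoding, mode="strict"):
    """Transcode bytes in the local encoding to UTF-8 bytes.

    With mode="strict" (the default), input that is invalid in the local
    encoding raises UnicodeDecodeError instead of being silently altered.
    """
    return data.decode(encoding, mode).encode("utf-8")
```

For example, fromlocal(b"\xe9", "latin-1") yields the two-byte UTF-8 form of "é", while fromlocal(b"\xff", "ascii") raises an error. Strict by default means bad input is rejected before it can reach globally distributed data such as the changelog.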
Legacy repositories
Legacy repositories may contain non-UTF-8 data because UTF-8 wasn't enforced. To continue to operate robustly, we try the following in order and use the first that succeeds:
- attempt to decode with UTF-8, strict
- attempt to decode with Latin-1, strict
- attempt to decode with UTF-8, replacing unknown characters
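The decode chain above can be sketched in Python as follows. The decode_legacy name and its fallback parameter are hypothetical, chosen only for this example. Note that Python's Latin-1 codec accepts every byte value, so with the default fallback the final "replace" step is never reached; it only comes into play if a stricter fallback encoding is substituted.

```python
def decode_legacy(stored, fallback="latin-1"):
    """Decode stored bytes to text, tolerating non-UTF-8 legacy data.

    decode_legacy and its fallback parameter are illustrative names,
    not part of any existing API.
    """
    # Steps 1 and 2: strict UTF-8, then the strict fallback encoding.
    for codec in ("utf-8", fallback):
        try:
            return stored.decode(codec, "strict")
        except UnicodeDecodeError:
            continue
    # Step 3: UTF-8 again, substituting U+FFFD for undecodable bytes.
    return stored.decode("utf-8", "replace")
```

Valid UTF-8 is returned unchanged by the first step, so repositories that already store UTF-8 are unaffected by the fallbacks.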
Windows charset weirdness
See ["Character Encoding On Windows"] for a discussion of dealing with Windows charset braindamage.