Note:
This page is primarily intended for developers of Mercurial.
Note:
This page is no longer relevant but is kept for historical purposes.
(part of InternationalizationPlan)
To allow for interoperability between users with different charset encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.
Elements that need to be transcoded
- Usernames
- Commit messages
- Tags
- Branch names
Files and encodings
- Changelog - UTF-8 (globally distributed)
- .hgtags - UTF-8 (globally distributed, mostly managed by hg)
- .hgrc - local (locally managed)
- .hg/localtags - local (locally managed)
- .hg/branch - local (no special reason)
- .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate the cache whenever the locale changed)
Things that need to be done
- add local encoding detection in util, with environment override
- add transcoding functions to util
  - tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
  - fromlocal - transcode local strings to UTF-8 with "strict" by default
- transcode usernames and commit messages
- transcode tags
- transcode branch data
- properly report encoding in hgweb (see hgweb encoding)
- add --encoding and --encodingmode global options
- add a test
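As a rough illustration of the first two items, here is a minimal Python sketch of local encoding detection with an environment override, plus a strict-by-default fromlocal. It is not the code in util: the function name detect_encoding, the environ parameter, and the use of the HGENCODING variable as the override are assumptions made for this example, since the plan above only says "environment override".

```python
import locale
import os


def detect_encoding(environ=None):
    """Pick the local encoding, letting an environment variable override it.

    Assumption: HGENCODING is used as the override variable here; the plan
    itself does not name one.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("HGENCODING")
    if override:
        return override
    # No override: use the locale's preferred encoding, or ASCII as a last resort.
    return locale.getpreferredencoding(False) or "ascii"


def fromlocal(data, encoding, mode="strict"):
    """Transcode bytes in the local encoding to UTF-8 bytes.

    With mode="strict" (the default), input that is not valid in the local
    encoding raises UnicodeDecodeError instead of being silently altered.
    """
    return data.decode(encoding, mode).encode("utf-8")
```

Defaulting to "strict" means a mistyped or misdetected local encoding is reported at commit time rather than stored as damaged UTF-8. A mode such as "replace" trades that safety for convenience, which is presumably what a separate --encodingmode option is meant to control.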
Legacy repositories
Legacy repositories may contain non-UTF-8 data because UTF-8 wasn't enforced. To continue to operate robustly, we try the following in order, stopping at the first attempt that succeeds:
- attempt to decode with UTF-8, strict
- attempt to decode with Latin-1, strict
- attempt to decode with UTF-8, replacing unknown characters
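The fallback chain above could be sketched in Python roughly as follows. This is an illustrative sketch only, not Mercurial's actual tolocal: the name decode_legacy and the fallback parameter are hypothetical, introduced so the second step can be varied.

```python
def decode_legacy(data, fallback="latin-1"):
    """Decode stored bytes, trying progressively more forgiving strategies.

    'fallback' is a hypothetical parameter for this sketch; the page itself
    only names Latin-1 as the second attempt.
    """
    attempts = (
        ("utf-8", "strict"),    # well-formed modern data
        (fallback, "strict"),   # legacy data written before UTF-8 was used
        ("utf-8", "replace"),   # last resort: mark undecodable bytes
    )
    for encoding, errors in attempts:
        try:
            return data.decode(encoding, errors)
        except UnicodeDecodeError:
            continue
    # Not reachable: decoding with errors="replace" never raises.
    raise AssertionError("unreachable")
```

Note that Python's latin-1 codec maps every byte value to a character, so with the default fallback the second attempt always succeeds and the third attempt only matters if a stricter fallback encoding is substituted.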
Windows and OS X charset weirdness
See CharacterEncodingOnWindows for a discussion of dealing with Windows charset braindamage and Character_Encoding_On_OSX for a similar form of braindamage on OS X.