
(part of InternationalizationPlan)

To allow for interoperability between users with different charset encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.

Elements that need to be transcoded

  • Usernames
  • Commit messages
  • Tags
  • Branch names

Files and encodings

  • Changelog - UTF-8 (globally distributed)
  • .hgtags - UTF-8 (globally distributed, mostly managed by hg)
  • .hgrc - local (locally managed)
  • .hg/localtags - local (locally managed)
  • .hg/branch - local (no special reason)
  • .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate when we changed locales)
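
The list above can be read as a lookup table. The following is only an illustrative sketch: `STORAGE_ENCODING` and `encoding_for` are hypothetical names, not part of Mercurial, and `None` is used here to mean "whatever the local encoding is".

```python
# Hypothetical table mirroring the list above.
# None means "stored in the user's local encoding".
STORAGE_ENCODING = {
    "changelog": "utf-8",        # globally distributed
    ".hgtags": "utf-8",          # globally distributed
    ".hgrc": None,               # locally managed
    ".hg/localtags": None,       # locally managed
    ".hg/branch": None,          # local
    ".hg/branchcache": "utf-8",  # avoids invalidation on locale change
}


def encoding_for(name: str, local_encoding: str) -> str:
    """Return the encoding a given file is assumed to be stored in."""
    stored = STORAGE_ENCODING[name]
    return local_encoding if stored is None else stored
```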

Things that need to be done

  • add local encoding detection in util, with environment override (done)
  • add transcoding functions to util (done)
    • tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
    • fromlocal - transcode local strings to UTF-8 with "strict" by default
  • transcode usernames and commit messages (done)
  • transcode tags (done)
  • transcode branch data (done)
  • use UTF-8 in hgweb
  • add --encoding and --encodingmode global options (done)
  • add a test (done)
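
As a rough illustration of the first two items, here is a minimal Python 3 sketch of local encoding detection with an environment override, plus a `fromlocal`-style transcoder that is strict by default. It is not the actual util code: the function signatures are assumptions, and so is the variable name `HGENCODING`, since this page does not name the override.

```python
import locale
import os


def detect_local_encoding(environ=os.environ) -> str:
    """Pick the local encoding, letting the environment override the locale.

    HGENCODING is an assumed variable name for the override.
    """
    override = environ.get("HGENCODING")
    if override:
        return override
    # Fall back to what the platform locale reports, or ASCII if unknown.
    return locale.getpreferredencoding(False) or "ascii"


def fromlocal(data: bytes, encoding: str, mode: str = "strict") -> bytes:
    """Transcode bytes in the local encoding to UTF-8.

    With mode="strict", undecodable input raises UnicodeDecodeError
    instead of being silently altered before it is stored.
    """
    return data.decode(encoding, mode).encode("utf-8")
```

Being strict on the way in means bad input is rejected before it reaches globally distributed data; a looser mode such as "replace" would presumably be something the user opts into.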

Legacy repositories

Legacy repositories may contain non-UTF-8 data because UTF-8 was not enforced in the past. To keep operating robustly, stored data is decoded by trying the following in order, stopping at the first success:

  • attempt to decode with UTF-8, strict
  • attempt to decode with Latin-1, strict
  • attempt to decode with UTF-8, replacing unknown characters
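
The fallback chain above can be sketched in Python 3 as follows. This is a simplified illustration, not the real `tolocal`: the name `decode_stored` and the `fallback` parameter are hypothetical, and the sketch returns a Unicode string rather than re-encoding to the local charset.

```python
def decode_stored(data: bytes, fallback: str = "latin-1") -> str:
    """Decode stored bytes, tolerating legacy non-UTF-8 data."""
    # Steps 1 and 2: strict UTF-8, then the strict fallback encoding.
    for encoding in ("utf-8", fallback):
        try:
            return data.decode(encoding, "strict")
        except UnicodeDecodeError:
            continue
    # Step 3: last resort, substitute U+FFFD for undecodable bytes.
    # Note: Python's latin-1 codec accepts every byte value, so this
    # line is only reached when a different fallback is configured.
    return data.decode("utf-8", "replace")
```

For example, `decode_stored(b"\xe9")` is not valid UTF-8, so it falls through to Latin-1 and yields "é".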

Windows charset weirdness

See ["Character Encoding On Windows"] for a discussion of dealing with Windows charset braindamage.


CategoryNewFeatures

ChangelogEncodingPlan (last edited 2012-10-25 21:04:47 by mpm)