Differences between revisions 5 and 11 (spanning 6 versions)
Revision 5 as of 2006-12-03 18:13:11
Size: 1741
Editor: mpm
Comment:
Revision 11 as of 2009-05-20 09:19:58
Size: 1899
Comment: Fix wiki link.

(part of InternationalizationPlan)

To allow for interoperability between users with different charset encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.

Elements that need to be transcoded

  • Usernames
  • Commit messages
  • Tags
  • Branch names

Files and encodings

  • Changelog - UTF-8 (globally distributed)
  • .hgtags - UTF-8 (globally distributed, mostly managed by hg)
  • .hgrc - local (locally managed)
  • .hg/localtags - local (locally managed)
  • .hg/branch - local (no special reason)
  • .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate when we changed locales)
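The list above amounts to a lookup from file to storage encoding. A minimal sketch, assuming hypothetical names (`STORED_AS_UTF8`, `encoding_for`) that are illustrations only and not part of Mercurial:

```python
# Hypothetical table mirroring the list above: True means the file is
# stored as UTF-8, False means it is kept in the user's local encoding.
STORED_AS_UTF8 = {
    "changelog": True,
    ".hgtags": True,
    ".hg/branchcache": True,
    ".hgrc": False,
    ".hg/localtags": False,
    ".hg/branch": False,
}


def encoding_for(name, local_encoding):
    """Return the encoding a given file is expected to be stored in."""
    return "utf-8" if STORED_AS_UTF8[name] else local_encoding
```

For example, `encoding_for(".hgtags", "cp1252")` gives `"utf-8"`, while `encoding_for(".hg/branch", "cp1252")` gives `"cp1252"`.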

Things that need to be done

  • add local encoding detection in util, with environment override (./)

  • add transcoding functions to util (./)

    • tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
    • fromlocal - transcode local strings to UTF-8 with "strict" by default
  • transcode usernames and commit messages (./)

  • transcode tags (./)

  • transcode branch data (./)

  • properly report encoding in hgweb (see hgweb encoding) (./)

  • add --encoding and --encodingmode global options (./)

  • add a test (./)
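The first two items (encoding detection with an environment override, and fromlocal) might look roughly like the sketch below. It is written for Python 3 and is not the code in util: the variable name `HGENCODING`, the function `detect_encoding`, and the `mode` parameter are assumptions made for illustration, since the plan does not spell them out.

```python
import locale
import os


def detect_encoding(environ=None):
    """Pick the local encoding, letting the environment override the locale."""
    environ = os.environ if environ is None else environ
    override = environ.get("HGENCODING")  # assumed variable name
    if override:
        return override
    # Fall back to whatever the locale reports, or ASCII if it reports nothing.
    return locale.getpreferredencoding(False) or "ascii"


def fromlocal(data, encoding, mode="strict"):
    """Transcode bytes in the local encoding to UTF-8.

    With mode="strict" (the default) undecodable input raises
    UnicodeDecodeError; mode="replace" substitutes U+FFFD instead.
    """
    return data.decode(encoding, mode).encode("utf-8")
```

Strict is a reasonable default on the way in: rejecting a commit message that cannot be decoded is safer than silently storing replacement characters in the changelog.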

Legacy repositories

Legacy repositories may contain non-UTF-8 data, because UTF-8 was not enforced when they were written. To keep operating robustly on such data, we try the following in order and stop at the first step that succeeds:

  • attempt to decode with UTF-8, strict
  • attempt to decode with Latin-1, strict
  • attempt to decode with UTF-8, replacing unknown characters
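The three steps above can be sketched as a fallback chain. This is an illustrative Python 3 version, not Mercurial's own code; the name `decode_legacy` and the `fallback` parameter are hypothetical.

```python
def decode_legacy(data, fallback="latin-1"):
    """Decode stored bytes: UTF-8 strict, then the fallback strict, then UTF-8 lossy."""
    for codec in ("utf-8", fallback):
        try:
            return data.decode(codec, "strict")
        except UnicodeDecodeError:
            continue  # try the next codec in the chain
    # Last resort: keep what decodes and replace the rest with U+FFFD.
    return data.decode("utf-8", "replace")
```

Note that Python's Latin-1 codec maps every byte value to a character, so with the default fallback the second step always succeeds and the third step is reached only when a stricter fallback (for example `"ascii"`) is substituted.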

Windows and OS X charset weirdness

See CharacterEncodingOnWindows for a discussion of dealing with Windows charset braindamage and Character_Encoding_On_OSX for a similar form of braindamage on OS X.


CategoryNewFeatures

ChangelogEncodingPlan (last edited 2012-10-25 21:04:47 by mpm)