Differences between revisions 1 and 9 (spanning 8 versions)
Revision 1 as of 2006-11-11 18:01:43
Size: 355
Editor: grooz
Comment:
Revision 9 as of 2007-01-12 16:24:15
Size: 1906
Editor: mpm
Comment:

(part of InternationalizationPlan)

To allow for interoperability between users with different charset encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.

Elements that need to be transcoded

  • Usernames
  • Commit messages
  • Tags
  • Branch names

Files and encodings

  • Changelog - UTF-8 (globally distributed)
  • .hgtags - UTF-8 (globally distributed, mostly managed by hg)
  • .hgrc - local (locally managed)
  • .hg/localtags - local (locally managed)
  • .hg/branch - local (no special reason)
  • .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate when we changed locales)

Things that need to be done

  • add local encoding detection in util, with environment override (./)

  • add transcoding functions to util (./)

    • tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
    • fromlocal - transcode local strings to UTF-8 with "strict" by default
  • transcode usernames and commit messages (./)

  • transcode tags (./)

  • transcode branch data (./)

  • properly report encoding in hgweb (see ["hgweb encoding"]) (./)

  • add --encoding and --encodingmode global options (./)

  • add a test (./)
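
The first items on this list (encoding detection with an environment override, and strict transcoding of local strings to UTF-8) could be sketched roughly as below. This is a minimal Python 3 illustration, not Mercurial's actual code: the function name detect_encoding, the override variable name HGENCODING, and the signature of fromlocal are assumptions made for the example. The mode parameter stands in for whatever --encodingmode would select.

```python
import locale
import os


def detect_encoding(environ=None):
    """Return the local encoding, letting an environment variable override it."""
    environ = os.environ if environ is None else environ
    # HGENCODING is an assumed name for the override variable.
    override = environ.get("HGENCODING")
    if override:
        return override
    # Otherwise use the locale's preferred encoding, or ASCII if none is reported.
    return locale.getpreferredencoding(False) or "ascii"


def fromlocal(data, encoding, mode="strict"):
    """Transcode a locally encoded byte string to UTF-8 for storage.

    With mode="strict", bytes that are invalid in the local encoding raise
    UnicodeDecodeError instead of being stored in a damaged form.
    """
    return data.decode(encoding, mode).encode("utf-8")
```

With strict as the default, a commit message that does not match the detected encoding is rejected rather than silently corrupted; a caller that prefers lossy storage could pass mode="replace".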

Legacy repositories

Legacy repositories may contain non-UTF-8 data because UTF-8 was not enforced when that data was written. To keep operating robustly, we try the following in order:

  • attempt to decode with UTF-8, strict
  • attempt to decode with Latin-1, strict
  • attempt to decode with UTF-8, replacing unknown characters
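
The three attempts above could be chained as in the following sketch. It is written in Python 3 for illustration only; the function name decode_stored and its fallback parameter are hypothetical. Note that in Python, Latin-1 maps every possible byte to a character, so the final step is only reached when a fallback other than Latin-1 is configured.

```python
def decode_stored(data, fallback="latin-1"):
    """Decode stored bytes to text, tolerating legacy non-UTF-8 data."""
    # Steps 1 and 2: strict UTF-8, then the strict fallback encoding.
    for encoding in ("utf-8", fallback):
        try:
            return data.decode(encoding, "strict")
        except UnicodeDecodeError:
            continue
    # Step 3: UTF-8 again, substituting U+FFFD for undecodable bytes.
    return data.decode("utf-8", "replace")
```

Trying strict UTF-8 first means correctly stored data is never reinterpreted; only data that fails that check is treated as legacy.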

Windows and OS X charset weirdness

See ["Character Encoding On Windows"] for a discussion of dealing with Windows charset braindamage and ["Character Encoding On OSX"] for a similar form of braindamage on OS X.


CategoryNewFeatures

ChangelogEncodingPlan (last edited 2012-10-25 21:04:47 by mpm)