Differences between revisions 4 and 7 (spanning 3 versions)
Revision 4 as of 2006-11-12 21:21:14
Size: 899
Editor: grooz
Comment:
Revision 7 as of 2006-12-04 00:02:08
Size: 1816
Editor: mpm
Comment:
Line 1: Line 1:
(part of InternationalizationPlan)

To allow for interoperability between users with different charset encodings, Mercurial will transcode
certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.

== Elements that need to be transcoded ==

 * Usernames
 * Commit messages
 * Tags
 * Branch names

== Files and encodings ==

 * Changelog - UTF-8 (globally distributed)
 * .hgtags - UTF-8 (globally distributed, mostly managed by hg)
 * .hgrc - local (locally managed)
 * .hg/localtags - local (locally managed)
 * .hg/branch - local (no special reason)
 * .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate when we changed locales)
Line 2: Line 23:
 * decode user provided log messages from locale encoded byte strings to unicode strings
   * decode logfile if --logfile option used
   * decode the message specified in command line if --message option used
   * decode edited file otherwise
 * use terminal encoding to display those messages (provided as unicode strings)
 * encode log messages (provided as unicode string) in UTF-8 when storing
 * decode log messages from UTF-8 to unicode string when retrieving
 * add local encoding detection in util, with environment override (./)
 * add transcoding functions to util (./)
   * tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
   * fromlocal - transcode local strings to UTF-8 with "strict" by default
 * transcode usernames and commit messages (./)
 * transcode tags (./)
 * transcode branch data (./)
 * properly report encoding in hgweb (see ["hgweb encoding"])
 * add --encoding and --encodingmode global options (./)
 * add a test (./)
Line 11: Line 35:
Something has to be done with repositories having changelog messages encoded in Latin-1 or other encodings. The [http://www.kernel.org/hg/linux-2.6/ Linux kernel tree] is an example. The options are:
 * allow users to specify repository changelog encoding in hgrc
 * provide a tool to convert repositories from legacy encoding to UTF-8

Legacy repositories may contain non-UTF-8 data as UTF-8 wasn't enforced. To continue to operate robustly, we do the following:

 * attempt to decode with UTF-8, strict
 * attempt to decode with Latin-1, strict
 * attempt to decode with UTF-8, replacing unknown characters

== Windows charset weirdness ==

See ["Character Encoding On Windows"] for a discussion of dealing with Windows charset braindamage.

(part of InternationalizationPlan)

To allow for interoperability between users with different charset encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.

Elements that need to be transcoded

  • Usernames
  • Commit messages
  • Tags
  • Branch names

Files and encodings

  • Changelog - UTF-8 (globally distributed)
  • .hgtags - UTF-8 (globally distributed, mostly managed by hg)
  • .hgrc - local (locally managed)
  • .hg/localtags - local (locally managed)
  • .hg/branch - local (no special reason)
  • .hg/branchcache - UTF-8 (otherwise, we'd need to invalidate when we changed locales)
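
The list above can be read as a small lookup table. The following is only an illustrative sketch: the names STORAGE_ENCODING and storage_encoding are hypothetical and are not taken from Mercurial's source.

```python
# Hypothetical table mirroring the "Files and encodings" list.
# "local" means: use whatever the user's local encoding is.
STORAGE_ENCODING = {
    "changelog": "UTF-8",
    ".hgtags": "UTF-8",
    ".hgrc": "local",
    ".hg/localtags": "local",
    ".hg/branch": "local",
    ".hg/branchcache": "UTF-8",
}


def storage_encoding(name, local_encoding):
    """Return the codec name assumed for the named file."""
    policy = STORAGE_ENCODING[name]
    return local_encoding if policy == "local" else policy
```

With this reading, globally distributed files always resolve to UTF-8, while locally managed files follow the user's locale.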

Things that need to be done

  • add local encoding detection in util, with environment override (./)

  • add transcoding functions to util (./)

    • tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
    • fromlocal - transcode local strings to UTF-8 with "strict" by default
  • transcode usernames and commit messages (./)

  • transcode tags (./)

  • transcode branch data (./)

  • properly report encoding in hgweb (see ["hgweb encoding"])
  • add --encoding and --encodingmode global options (./)

  • add a test (./)
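
As a rough sketch of the first two items, the helpers might look like the following in modern Python, where byte strings and unicode strings are distinct types. This is not the code in util: the function detect_encoding, the parameter names, and the use of "replace" when encoding back to the local charset are assumptions made for illustration. HGENCODING is the environment variable Mercurial uses for the override.

```python
import locale
import os


def detect_encoding(environ=os.environ):
    """Pick the local encoding: environment override first, then the locale."""
    override = environ.get("HGENCODING")
    if override:
        return override
    return locale.getpreferredencoding() or "ascii"


def fromlocal(text, encoding, mode="strict"):
    """Transcode a locally encoded byte string to UTF-8 for storage.

    With mode="strict", undecodable input raises UnicodeDecodeError.
    """
    return text.decode(encoding, mode).encode("utf-8")


def tolocal(data, encoding, fallback="latin-1"):
    """Transcode stored bytes to the local encoding.

    Stored data is expected to be UTF-8; if it is not, fall back to
    decoding it with the fallback charset (an assumed parameter).
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode(fallback)
    # Assumption: characters the local charset lacks are replaced,
    # not rejected, so display never fails.
    return text.encode(encoding, "replace")
```

For example, fromlocal(b"caf\xe9", "latin-1") yields the UTF-8 bytes b"caf\xc3\xa9", and tolocal reverses that for a Latin-1 user.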

Legacy repositories

Legacy repositories may contain non-UTF-8 data as UTF-8 wasn't enforced. To continue to operate robustly, we do the following:

  • attempt to decode with UTF-8, strict
  • attempt to decode with Latin-1, strict
  • attempt to decode with UTF-8, replacing unknown characters
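
The three steps above can be sketched as a simple fallback chain. The function name decode_legacy and the fallback parameter are hypothetical. Note that Python's latin-1 codec accepts every byte value, so with the default fallback the third step is never reached; it only matters if a stricter fallback charset is chosen.

```python
def decode_legacy(data, fallback="latin-1"):
    """Decode stored bytes that may predate UTF-8 enforcement."""
    attempts = (
        ("utf-8", "strict"),   # step 1: well-formed UTF-8
        (fallback, "strict"),  # step 2: legacy charset
    )
    for encoding, errors in attempts:
        try:
            return data.decode(encoding, errors)
        except UnicodeDecodeError:
            continue
    # Step 3: cannot fail; bad bytes become U+FFFD.
    return data.decode("utf-8", "replace")
```

Trying strict UTF-8 first keeps correctly stored data untouched, and the final "replace" step guarantees that some string is always returned.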

Windows charset weirdness

See ["Character Encoding On Windows"] for a discussion of dealing with Windows charset braindamage.


CategoryNewFeatures

ChangelogEncodingPlan (last edited 2012-10-25 21:04:47 by mpm)