Note:
This page appears to contain material that is no longer relevant. Please help improve this page by updating its content.
Note:
This page is primarily intended for developers of Mercurial.
(part of InternationalizationPlan)
To allow for interoperability between users with different charset encodings, Mercurial will transcode certain elements of the data it manages to UTF-8. Mercurial intentionally makes no assumptions about the charset of any data it manages except the elements described below.
Elements that need to be transcoded
- Usernames
- Commit messages
- Tags
- Branch names
Files and encodings
- Changelog - UTF-8 (globally distributed)
- .hgtags - UTF-8 (globally distributed, mostly managed by hg)
- .hgrc - local (locally managed)
- .hg/localtags - local (locally managed)
- .hg/branch - local (no special reason)
- .hg/branchcache - UTF-8 (otherwise, the cache would have to be invalidated whenever the locale changed)
Things that need to be done
- add local encoding detection in util, with environment override (done)
- add transcoding functions to util (done)
  - tolocal - decode stored data from UTF-8 robustly, falling back to latin-1 on failure
  - fromlocal - transcode local strings to UTF-8 with "strict" by default
- transcode usernames and commit messages (done)
- transcode tags (done)
- transcode branch data (done)
- properly report encoding in hgweb (see hgweb encoding) (done)
- add --encoding and --encodingmode global options (done)
- add a test (done)
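
The first two items can be illustrated with a minimal Python sketch. It is not Mercurial's actual util code: the function name detect_encoding, its environ parameter, and the choice of HGENCODING as the override variable are assumptions made for this example. fromlocal follows the description above (local bytes to UTF-8, "strict" by default).

```python
import locale
import os


def detect_encoding(environ=os.environ):
    """Return the local charset, letting an environment variable override it.

    Assumption: the override variable is called HGENCODING here.
    """
    override = environ.get("HGENCODING")
    if override:
        return override
    # Fall back to the locale's preferred encoding, or ASCII if unknown.
    return locale.getpreferredencoding(False) or "ascii"


def fromlocal(data, encoding, mode="strict"):
    """Transcode bytes in the local charset to UTF-8 for storage.

    With mode="strict", input that is invalid in the local charset raises
    UnicodeDecodeError instead of being stored in a damaged form.
    """
    return data.decode(encoding, mode).encode("utf-8")
```

Keeping "strict" as the default means a mismatch between the declared local encoding and the actual input is reported at commit time, not discovered later as corrupted history.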
Legacy repositories
Legacy repositories may contain non-UTF-8 data because UTF-8 was not enforced in the past. To continue to operate robustly, we try the following in order when reading stored data:
- attempt to decode with UTF-8, strict
- attempt to decode with Latin-1, strict
- attempt to decode with UTF-8, replacing unknown characters
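
The fallback order above can be sketched as follows. This is an illustration under stated assumptions, not Mercurial's code: the signature of tolocal is invented for the example, and each attempt is taken to include re-encoding into the local charset. That matters because decoding bytes as Latin-1 never fails on its own in Python, so in this sketch the last step is reached only when the re-encoding fails.

```python
def tolocal(stored, encoding):
    """Convert stored bytes to the local charset, tolerating legacy data.

    Tries UTF-8 strictly, then Latin-1 strictly, and finally UTF-8 with
    replacement of anything that cannot be decoded or represented.
    """
    for candidate in ("utf-8", "latin-1"):
        try:
            return stored.decode(candidate, "strict").encode(encoding, "strict")
        except UnicodeError:
            # Covers both decode and encode failures; try the next candidate.
            continue
    # Last resort: never fail, substitute replacement characters instead.
    return stored.decode("utf-8", "replace").encode(encoding, "replace")
```

For example, a Latin-1 encoded commit message from a legacy repository fails the strict UTF-8 attempt and is picked up by the second step, while a byte that cannot be shown in an ASCII locale ends up as a replacement character.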
Windows and OS X charset weirdness
See CharacterEncodingOnWindows for a discussion of dealing with Windows charset braindamage and Character_Encoding_On_OSX for a similar form of braindamage on OS X.