hgweb presents an interesting problem for charset conversion and localization. It displays data from three sources:
- changelog, which is UTF-8 internally
- .hgrc, which is assumed to be in the default locale charset
- repo files, for which we make no assumptions about encoding
Charset-unaware versions of Mercurial have been presenting everything in the default HTML character set (Latin-1) or claiming UTF-8 with the gitweb style. With charset detection, we have several possibilities:
- coerce the encoding to UTF-8 everywhere
  - This will cause Mercurial to assume files like .hgrc, localtags, etc. are in UTF-8. It will also cause all file data to be reported as UTF-8. Data that is actively managed as UTF-8, like changeset comments and usernames, will be transcoded correctly.
- set page encoding to UTF-8, actively transcode output from default locale to UTF-8
  - Simple, but loses characters due to the round-trip through less expressive encodings. Likely to mangle repo file contents for display and corrupt raw downloads.
- set page encoding to UTF-8, actively transcode .hgrc from default locale and use UTF-8 internally
  - This is complicated, as the rest of Mercurial assumes all processing is done in the current locale. Some hgrc data may not be transcodable. File contents may be displayed incorrectly, especially if they contain "binary" data.
- as above, but transcode files as well
  - Even more complicated, and may also break raw downloads due to transcoding errors. This may even happen silently.
- report default locale encoding as web page encoding
  - Very simple and consistent with command-line behavior. Options to override behavior exist. Properly interprets .hgrc files and displays repo files as they would be displayed on the server console.
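The two failure modes mentioned above (characters lost in a round-trip through a less expressive encoding, and raw downloads corrupted by transcoding) can be illustrated with a small sketch. This is not Mercurial code; the helper names `round_trip` and `transcode_bytes` are hypothetical and exist only for this illustration.

```python
def round_trip(text, locale_encoding):
    """Push text through a locale charset and back, replacing what does not fit."""
    return text.encode(locale_encoding, errors="replace").decode(locale_encoding)


def transcode_bytes(data, source_encoding, target_encoding="utf-8"):
    """Blindly reinterpret raw bytes as text in one charset and re-encode in another."""
    return data.decode(source_encoding).encode(target_encoding)


# A UTF-8 changeset comment pushed through a Latin-1 locale:
# characters outside Latin-1 are replaced and cannot be recovered.
comment = "Grüße, 日本語"
lossy = round_trip(comment, "latin-1")

# Binary file contents treated as Latin-1 text and transcoded to UTF-8:
# the bytes sent to the client no longer match the bytes in the repository.
raw = b"\x89PNG\xff"
mangled = transcode_bytes(raw, "latin-1")

print(lossy)
print(raw == mangled)
```

The second case produces no error at all, which is why such corruption can go unnoticed.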
The last option is the most straightforward, so hgweb will serve pages in the default system character set. This can be overridden by a line in the CGI script, an environment variable, or a command-line option to hg serve.
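As a rough sketch of that precedence, the page encoding could be resolved as follows. This is an illustration under stated assumptions, not hgweb's actual implementation: the function names are hypothetical, and `HGENCODING` is assumed here as the environment variable name because the text above does not name one.

```python
import locale
import os


def page_encoding(override=None, environ=None):
    """Pick the charset to report: explicit override, then environment, then locale."""
    if environ is None:
        environ = os.environ
    if override:
        # e.g. set by a line in the CGI script or a command-line option
        return override
    # Assumed variable name; see the note above.
    from_env = environ.get("HGENCODING")
    if from_env:
        return from_env
    # Fall back to the default locale charset of the server process.
    return locale.getpreferredencoding(False)


def content_type_header(encoding):
    """Build the header that tells the browser which charset the page uses."""
    return "Content-Type: text/html; charset=" + encoding


print(content_type_header(page_encoding()))
```

Because the page bytes are passed through untouched and only the declared charset changes, .hgrc values and repo file contents appear in the browser as they would on the server console.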