Typical Windows systems operate with two character sets. On typical US systems, for example, most applications use a character set called cp1252 that is close to Latin-1. At the same time, the console ("DOS box") uses the legacy PC character set called cp437.
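To make the mismatch concrete, here is a minimal Python sketch showing how one non-ASCII character gets different byte values under the two code pages, and what each side sees when it reads the other's bytes. The variable names are purely illustrative; only the standard cp1252 and cp437 codecs are assumed.

```python
# One character, two code pages. "\u00e9" is e-acute, as might appear
# in a user name or commit message.
text = "\u00e9"

as_system = text.encode("cp1252")    # bytes a windowed app such as Notepad would write
as_console = text.encode("cp437")    # bytes a console-based editor would write

# A console (cp437) displaying bytes written in cp1252:
shown_on_console = as_system.decode("cp437")
# A windowed app (cp1252) opening bytes written in cp437:
shown_in_notepad = as_console.decode("cp1252")

# ascii() keeps the output printable on any console code page.
print(as_system, as_console)
print(ascii(shown_on_console), ascii(shown_in_notepad))
```

Neither misread raises an error; each simply yields a different character, which is why the problem tends to show up as garbled text rather than as a failure.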
This makes things rather complicated for a command-line application like Mercurial. Mercurial performs locale conversion on data such as user names and commit messages. This data can come from the command line, from files like .hgrc, or from local editors, and it can be written to files or displayed on the console.
In the case of data taken from files, we should generally assume that the contents are in the cp1252 charset. But this may not always be correct because a user may have used a native console-based editor to create the file.
The command line presents a similar problem. While the command line is typically typed in a console and is thus in cp437, it may actually have come from a batch file written in Notepad using cp1252. Or it may come from another program spawning hg to do its work, such as a graphical IDE or an importer where cp1252 is native.
Even environment variables like HGUSER are problematic, as they may have been set in the registry editor or with Notepad rather than on the command line.
Finally, consider output. When output goes directly to the console, it's usually possible to determine the character set to use (though in some situations, there will be no codepage associated with a console!). However, redirection makes things confusing again. Consider "hg log | more" or "hg log > file". These cases are indistinguishable from Mercurial's point of view. For the "| more" case, we'd like to have cp437 so that non-ASCII characters are displayed correctly. But for the "> file" case, we'd probably want cp1252 so that tools like Notepad will get the right results.
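The ambiguity described above can be sketched in a few lines of Python. This is not Mercurial's code: the function name and its default code pages are hypothetical, and the only real API used is the standard `isatty()` method on stream objects. The check reports "not a console" for both the pipe and the file redirect, so a single choice has to serve both.

```python
import sys

def choose_output_encoding(stream, console_cp="cp437", system_cp="cp1252"):
    """Hypothetical helper: use the console code page only for direct console output.

    Both "hg log | more" and "hg log > file" make isatty() return False,
    so both fall through to system_cp, even though "| more" would
    display better with console_cp.
    """
    is_console = hasattr(stream, "isatty") and stream.isatty()
    return console_cp if is_console else system_cp

if __name__ == "__main__":
    # The result depends on how the script is run: directly, piped, or redirected.
    print(choose_output_encoding(sys.stdout))
```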
So what should a program like Mercurial do? The best options are:
- Assume the console codepage everywhere (e.g. cp437)
  - This approach works nicely, so long as all your applications are legacy console apps. But that's rarely the case these days.
- Assume the system codepage for files, the console codepage for the command line, and the console codepage for output if output is not redirected
  - As we've seen, this can get fooled in all of these cases. Crucially, it can get fooled in the very common "hg log | more" case. It will also likely get fooled when called from an IDE, which will eventually be a common scenario. It also has a huge downside: we must add input and output conversion everywhere localized data is read or printed. This adds a large complexity burden throughout the code for a problem that only affects Windows users, and it is likely to result in bugs as developers on the UNIX side add or change features. Given that the large added complexity doesn't actually solve the problem, it doesn't appear worth it.
- Assume the system codepage everywhere (e.g. cp1252)
  - Like the first possibility, this one is extremely straightforward and consistent. While it will do the wrong thing when outputting non-ASCII characters on the console, it will correctly handle I/O to files used by windowed applications and control by IDEs. It also incurs no extra complexity penalty.
This leaves us with the question of how to deal with character set compatibility problems. Here are a few possibilities:
- use ASCII
  - This will always "just work".
- set your console code page to match your system code page
  - This will also eliminate all the problems, at the expense of getting some garbled output from legacy applications. To query the current console code page within cmd.exe on Windows NT, issue the command:
    - chcp
  - To set the console code page (for example, to 1251), issue the command:
    - chcp 1251
  - You may wish to enable Win32ChCpExtension, which does this codepage switching for you each time you run Mercurial.
- override Mercurial's encoding with an environment variable
  - Setting HGENCODING will override the detected system character set.
- override Mercurial's encoding with a command-line option
  - Using the global --encoding option will allow you to set your preferred encoding on each command.
- use GUI-based tools to interact with Mercurial
  - This also eliminates the problem, by eliminating that pesky console entirely.
- use Linux/UNIX and UTF-8
  - This makes Bill Gates cry.
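The two override mechanisms listed above, HGENCODING and --encoding, can be pictured with the following Python sketch. It is an illustration only, not Mercurial's implementation: the function name is hypothetical, and the precedence shown (command-line option first, then environment variable, then the detected system encoding) is an assumption based on the descriptions above.

```python
import locale
import os

def effective_encoding(argv, environ=None):
    """Hypothetical helper: resolve an encoding from overrides, most specific first."""
    environ = os.environ if environ is None else environ

    # 1. A per-command "--encoding NAME" or "--encoding=NAME" option.
    for i, arg in enumerate(argv):
        if arg == "--encoding" and i + 1 < len(argv):
            return argv[i + 1]
        if arg.startswith("--encoding="):
            return arg.split("=", 1)[1]

    # 2. The HGENCODING environment variable, if set and non-empty.
    if environ.get("HGENCODING"):
        return environ["HGENCODING"]

    # 3. Otherwise, whatever the system reports as its preferred encoding.
    return locale.getpreferredencoding(False) or "ascii"

# Example: the option wins over the environment variable.
print(effective_encoding(["log", "--encoding", "cp437"], {"HGENCODING": "cp1252"}))
```

The environment variable suits a setting you want for every command in a session, while the option is handy for a one-off command whose output is headed somewhere with a different code page.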