Note:
This page is primarily intended for developers of Mercurial.
Windows UTF-8 Plan
A plan to make Mercurial on Windows interoperate with UTF-8 elsewhere.
1. Overview
According to EncodingStrategy, Mercurial generally avoids managing the encoding of filenames. This works well on Linux and Mac, where UTF-8 is now a well-supported default, but less well on Windows.
To maximize interoperability while preserving as much backwards compatibility as possible, we should recognize manifests that are in UTF-8 and switch to a Unicode filesystem mode on Windows. This is referred to as the "hybrid strategy" in EncodingStrategy. In this mode, all internal filename handling is done in UTF-8, and names are converted to/from UTF-16 at a VFS abstraction layer.
2. Definitions
- UTF-8 changeset: a changeset where every filename in its manifest is a valid UTF-8 string (ASCII is a subset)
- legacy changeset: a changeset that contains one or more non-UTF-8-compatible filenames
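The two definitions above can be sketched as a small check over a manifest's filenames. This is only an illustration: the name isutf8 comes from the steps on this page, but its signature and the classify_changeset helper are assumptions, and filenames are taken to be byte strings.

```python
def isutf8(name: bytes) -> bool:
    """Return True if `name` decodes cleanly as UTF-8 (ASCII included)."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify_changeset(filenames) -> str:
    """Hypothetical helper: label a manifest's filenames per the definitions."""
    # A single non-UTF-8 name is enough to make the changeset "legacy".
    return "utf-8" if all(isutf8(f) for f in filenames) else "legacy"
```

Note that a purely ASCII manifest is classified as UTF-8, which matches the "ASCII is a subset" remark in the definition.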
3. Steps
- Rename opener to vfs
- Add methods for all basic filesystem operations to vfs object
- osutil.listdir
- os.lstat/unlink/getcwd
- os.path.join/lexists/islink/isfile?
- etc.
- Replace usage of non-basic methods:
- os.path.exists -> os.path.lexists
- os.path.getmtime -> os.lstat
- shutil.rmtree / os.removedirs
- Update all users to use vfs methods
- Add an isutf8 helper method
- On Windows:
- derive a u16vfs class from vfs that uses wide APIs internally and gives UTF-8 results
- Detect when updating to a UTF-8 changeset and use u16vfs
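The vfs/u16vfs split in the steps above might look roughly like the following. This is a sketch under stated assumptions, not Mercurial's actual code: the method set, the _native/_internal hooks, and the write convenience method are hypothetical, and it is written for Python 3, where passing str paths makes Python use the wide APIs on Windows.

```python
import os


class vfs:
    """Byte-oriented filesystem wrapper rooted at `base` (hypothetical API)."""

    def __init__(self, base: bytes):
        self.base = base

    def join(self, path: bytes = b"") -> bytes:
        return os.path.join(self.base, path)

    def _native(self, path: bytes):
        # The form handed to the OS; subclasses may transcode here.
        return self.join(path)

    def _internal(self, name) -> bytes:
        # Convert a name returned by the OS back to the internal form.
        return name

    def listdir(self, path: bytes = b""):
        return sorted(self._internal(n) for n in os.listdir(self._native(path)))

    def lexists(self, path: bytes) -> bool:
        return os.path.lexists(self._native(path))

    def lstat(self, path: bytes):
        return os.lstat(self._native(path))

    def unlink(self, path: bytes) -> None:
        os.unlink(self._native(path))

    def write(self, path: bytes, data: bytes) -> None:
        with open(self._native(path), "wb") as fp:
            fp.write(data)


class u16vfs(vfs):
    """Same interface, but talks to the OS using Unicode (str) paths."""

    def _native(self, path: bytes) -> str:
        # Internal names are UTF-8 bytes; decode them at the boundary.
        return self.join(path).decode("utf-8")

    def _internal(self, name: str) -> bytes:
        # Results from the OS are re-encoded to UTF-8 for callers.
        return name.encode("utf-8")
```

Callers see UTF-8 byte strings from either class, so the choice between vfs and u16vfs can be made once, at the point where the changeset type is detected.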
4. Python interface
Most of Python's APIs accept Unicode objects and will then use Windows' wide (UTF-16) APIs and return Unicode results. The biggest exception is os.getcwd(), which takes no arguments and therefore cannot signal which form is wanted; it needs to be replaced by os.getcwdu().
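The "Unicode in, Unicode out" behaviour can be illustrated as follows. As an assumption, this sketch targets Python 3 rather than the Python 2 API named above: there, os.getcwd() already returns a Unicode str (os.getcwdb() is the bytes variant), and os.listdir() returns str names when given a str path. The helper names getcwd_utf8 and listdir_utf8 are hypothetical.

```python
import os


def getcwd_utf8() -> bytes:
    # Ask the OS for a Unicode result, then re-encode for internal use.
    return os.getcwd().encode("utf-8")


def listdir_utf8(path: bytes):
    # A str argument yields str results; encode them back to UTF-8 bytes.
    return [name.encode("utf-8") for name in os.listdir(path.decode("utf-8"))]
```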
5. Issues
5.1. Upgrading to UTF-8
Repositories that have all-ASCII filenames will work without change.
Repositories with legacy filenames can be converted by renaming files to UTF-8 and committing. This may require a Linux machine or a clever utility.
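The core of such a utility could be a transcoding step that maps each legacy name to its UTF-8 form. The following is a minimal sketch: the function names are hypothetical, and cp1252 is only an assumed example of a legacy encoding; the real one depends on the machine that created the files.

```python
def to_utf8_name(name: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return `name` re-encoded as UTF-8, leaving valid UTF-8 names alone."""
    try:
        name.decode("utf-8")
        return name
    except UnicodeDecodeError:
        # May itself raise UnicodeDecodeError if the guess is wrong.
        return name.decode(legacy_encoding).encode("utf-8")


def plan_renames(filenames, legacy_encoding: str = "cp1252"):
    """List (old, new) pairs for the names that need renaming."""
    plan = []
    for old in filenames:
        new = to_utf8_name(old, legacy_encoding)
        if new != old:
            plan.append((old, new))
    return plan
```

One caveat of this approach: a legacy name whose bytes happen to form valid UTF-8 is left untouched, so the resulting plan should be reviewed before renaming and committing.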
5.2. Console will still be legacy
The console is still restricted to a legacy charset and Mercurial will continue to avoid transcoding when dealing with the console. Thus, UTF-8 names will be output as UTF-8 byte strings and result in mojibake unless cp65001 is used. This is identical to the current situation when working with UTF-8 changesets, except the filenames will be readable on disk.
Applications like TortoiseHg will be able to deal with this issue.
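The mojibake described in this section can be reproduced by decoding UTF-8 bytes with a legacy code page, which is effectively what the console does. In this sketch the function name is hypothetical and cp1252 is only an assumed stand-in for whatever legacy code page the console uses.

```python
def console_rendering(name: bytes, console_encoding: str = "cp1252") -> str:
    """Show how a console using `console_encoding` would display `name`."""
    # Bytes the code page cannot map are shown as replacement characters.
    return name.decode(console_encoding, errors="replace")
```

For example, the UTF-8 bytes for "café.txt" come out as "cafÃ©.txt" under cp1252, while passing "utf-8" (the encoding behind code page 65001) gives back the original name.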
5.3. Merge between UTF-8 and non-UTF-8 commits
This could create problems. We probably don't want to make merge aware of this issue.