Note:

This page is primarily intended for developers of Mercurial.

Windows UTF-8 Plan

A plan to make Mercurial on Windows interoperate with UTF-8 elsewhere.

1. Overview

According to EncodingStrategy, Mercurial generally avoids managing the encoding of filenames. This works well on Linux and Mac, where UTF-8 is now a well-supported default, but less well on Windows.

To maximize interoperability while preserving as much backwards compatibility as possible, we should recognize manifests that are in UTF-8 and switch to a Unicode filesystem mode on Windows. This is referred to as the "hybrid strategy" in EncodingStrategy. In this mode, all internal filename handling is done in UTF-8 and converted to/from UTF-16 at a VFS abstraction layer.
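As a rough sketch of the hybrid strategy, the following Python shows one way to recognize a UTF-8 manifest and convert names at the filesystem boundary. The function names (is_utf8, manifest_is_utf8, to_fs, from_fs) are hypothetical and not part of Mercurial. The sketch also assumes that the final UTF-16 conversion happens inside Python when a Unicode object is passed to a file API on Windows.

```python
def is_utf8(name):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def manifest_is_utf8(names):
    """Hypothetical check: treat a manifest as UTF-8 only if every name is."""
    return all(is_utf8(n) for n in names)


def to_fs(name):
    """Internal UTF-8 bytes -> Unicode object for the wide file APIs."""
    return name.decode("utf-8")


def from_fs(name):
    """Unicode object from the filesystem -> internal UTF-8 bytes."""
    return name.encode("utf-8")
```

Under this check an all-ASCII manifest counts as UTF-8, since ASCII is a subset of UTF-8.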

2. Definitions

3. Steps

4. Python interface

Most of Python's APIs accept Unicode objects and will then use Windows' wide (UTF-16) APIs and return Unicode results. The biggest exception is os.getcwd(), which takes no arguments and therefore needs to be replaced by os.getcwdu().
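A minimal sketch of both cases follows. The helper name getcwd_text is hypothetical. Note that os.getcwdu() exists only on Python 2; the fallback to os.getcwd() is an assumption added so the sketch also runs on Python 3, where os.getcwd() already returns text.

```python
import os


def getcwd_text():
    """Return the current directory as a Unicode object."""
    # os.getcwdu() exists only on Python 2; on Python 3 os.getcwd()
    # already returns text, so fall back to it there.
    return getattr(os, "getcwdu", os.getcwd)()


# Passing a Unicode path makes os.listdir() return Unicode names.
names = os.listdir(u".")
```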

5. Issues

5.1. Upgrading to UTF-8

Repositories that have all-ASCII filenames will work without change.

Repositories with legacy filenames can be converted by renaming files to UTF-8 and committing. This may require a Linux machine or a clever utility.
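The core of such a utility might look like the sketch below, which only computes the renames and leaves the actual renaming and committing to the caller. The function name utf8_rename_plan is hypothetical, and the sketch assumes that every name in the repository is stored in one known legacy encoding, which the user must supply.

```python
def utf8_rename_plan(names, legacy_encoding):
    """Return (old, new) byte-string pairs for names that need renaming."""
    plan = []
    for old in names:
        # Assumption: every name is valid in the given legacy encoding.
        new = old.decode(legacy_encoding).encode("utf-8")
        if new != old:  # all-ASCII names come out unchanged
            plan.append((old, new))
    return plan
```

For example, utf8_rename_plan([b"README", b"caf\xe9.txt"], "cp1252") would plan a single rename, for the second name only.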

5.2. Console will still be legacy

The console is still restricted to a legacy charset and Mercurial will continue to avoid transcoding when dealing with the console. Thus, UTF-8 names will be output as UTF-8 byte strings and result in mojibake unless cp65001 is used. This is identical to the current situation when working with UTF-8 changesets, except the filenames will be readable on disk.

Applications like TortoiseHg will be able to deal with this issue.
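The mojibake described in this section can be reproduced in a few lines. This is only an illustration: the helper name as_seen_on_console is hypothetical, cp1252 stands in for whatever legacy code page the console uses, and "utf-8" stands in for cp65001.

```python
def as_seen_on_console(name, console_encoding):
    """Show how untranscoded bytes render in a given console code page."""
    return name.decode(console_encoding, "replace")


utf8_name = u"caf\u00e9.txt".encode("utf-8")

# In a legacy code page, the two UTF-8 bytes of the e-acute each
# render as a separate, unrelated character.
garbled = as_seen_on_console(utf8_name, "cp1252")

# In a UTF-8 console the name is decoded as intended.
readable = as_seen_on_console(utf8_name, "utf-8")
```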

5.3. Merge between UTF-8 and non-UTF-8 commits

Merging a UTF-8 commit with a non-UTF-8 commit could create problems. Even so, we probably don't want to make merge aware of this issue.

6. See also


CategoryNewFeatures