Note:
This page is primarily intended for developers of Mercurial.
Windows UTF-8 Plan
A plan to make Mercurial on Windows interoperate with UTF-8 elsewhere.
1. Overview
As described in EncodingStrategy, Mercurial generally avoids managing the encoding of filenames. This works well on Linux and Mac, where UTF-8 is now a well-supported default, but less well on Windows.
To maximize interoperability while preserving backwards compatibility as far as possible, we should recognize manifests that are in UTF-8 and switch to a Unicode filesystem mode on Windows. This is referred to as the "hybrid strategy" in EncodingStrategy. In this mode, all internal filename handling is done in UTF-8, and names are converted to/from UTF-16 at a VFS abstraction layer.
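The conversion layer described above might look roughly like the following sketch. It is not Mercurial's actual vfs: the toy vfs class, its methods, and the write() helper are simplified assumptions made for illustration, and the u16vfs name is borrowed from the Steps section. Callers pass and receive UTF-8 byte strings, while the operating system is addressed with Unicode strings (which Python maps to the wide APIs on Windows).

```python
import os


class vfs(object):
    """Toy stand-in for a vfs: byte paths relative to a base directory."""

    def __init__(self, base):
        self.base = base  # byte string

    def join(self, path):
        return os.path.join(self.base, path)

    def lexists(self, path):
        return os.path.lexists(self.join(path))

    def listdir(self, path=b''):
        return sorted(os.listdir(self.join(path)))

    def write(self, path, data):
        with open(self.join(path), 'wb') as fp:
            fp.write(data)


class u16vfs(vfs):
    """Variant that talks to the OS in Unicode and to callers in UTF-8."""

    def join(self, path):
        # UTF-8 bytes in, Unicode string out. Inherited methods call
        # join(), so they reach the OS with Unicode paths as well.
        return os.path.join(self.base.decode('utf-8'), path.decode('utf-8'))

    def listdir(self, path=b''):
        # The OS returns Unicode names; re-encode them as UTF-8.
        return sorted(n.encode('utf-8') for n in os.listdir(self.join(path)))
```

With this shape, only join() and the methods that return names need overriding; everything else is inherited unchanged.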
2. Definitions
- UTF-8 changeset: a changeset where every filename in its manifest is a valid UTF-8 string (ASCII is a subset)
- legacy changeset: a changeset that contains one or more non-UTF-8-compatible filenames
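These two definitions can be expressed as a strict decode check. The sketch below is only an illustration: isutf8 is the helper name mentioned in the plan, while isutf8changeset and the idea of passing the manifest as a plain list of byte strings are assumptions made here. Note that a legacy-encoded name can occasionally be valid UTF-8 by coincidence, so the check is a heuristic for legacy data.

```python
def isutf8(name):
    """Return True if the byte string `name` decodes as strict UTF-8."""
    try:
        name.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


def isutf8changeset(filenames):
    """Return True if every filename of a manifest is valid UTF-8.

    `filenames` is assumed to be an iterable of byte strings.
    """
    return all(isutf8(f) for f in filenames)
```

Pure ASCII names pass the check, as the definition requires, because ASCII is a subset of UTF-8.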
3. Steps
- Rename opener to vfs
- Add methods for all basic filesystem operations to vfs object
  - osutil.listdir
  - os.lstat/unlink/getcwd
  - os.path.join/lexists/islink/isfile?
  - etc.
- Replace usage of non-basic methods:
  - os.path.exists -> os.path.lexists
  - os.path.getmtime -> os.lstat
  - shutil.rmtree / os.removedirs
- Update all users to use vfs methods
- Add an isutf8 helper method
- On Windows:
  - derive a u16vfs class from vfs that uses wide APIs internally and gives UTF-8 results
  - detect when updating to a UTF-8 changeset, and use u16vfs
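As an illustration of the "Replace usage of non-basic methods" step, the sketch below shows the lstat-based replacements for os.path.exists and os.path.getmtime. The function name lexistsandmtime is hypothetical; only standard library calls are used. The difference matters for symlinks: os.path.exists() follows the link and reports a dangling symlink as missing, whereas os.path.lexists() and os.lstat() examine the link itself.

```python
import os


def lexistsandmtime(path):
    """Return (exists, mtime) for `path` without following symlinks."""
    if not os.path.lexists(path):
        return False, None
    # os.lstat() describes the link itself, so this also works for a
    # dangling symlink, where os.path.getmtime() would raise OSError.
    return True, os.lstat(path).st_mtime
```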
4. Python interface
Most of Python's APIs accept Unicode objects and will use Windows' wide APIs (aka UTF-16) to give Unicode results. The biggest exception is os.getcwd(): because it takes no arguments, there is no Unicode argument to select the wide API, so it needs to be replaced by os.getcwdu().
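A minimal sketch of both cases follows. The helper names getcwdutf8 and listdirutf8 are hypothetical. os.getcwdu() exists only in Python 2; in Python 3 os.getcwd() already returns a Unicode string, so the sketch falls back to it there.

```python
import os


def getcwdutf8():
    """Return the current directory as a UTF-8 byte string."""
    # Python 2: os.getcwdu() returns Unicode. Python 3 has no getcwdu()
    # because os.getcwd() itself returns Unicode.
    getcwdu = getattr(os, 'getcwdu', os.getcwd)
    return getcwdu().encode('utf-8')


def listdirutf8(path):
    """List `path` (UTF-8 bytes) via the Unicode API, returning UTF-8."""
    # A Unicode argument yields Unicode results, which are re-encoded.
    return [n.encode('utf-8') for n in os.listdir(path.decode('utf-8'))]
```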
5. Issues
5.1. Upgrading to UTF-8
Repositories that have all-ASCII filenames will work without change.
Repositories with legacy filenames can be converted by renaming the files to their UTF-8 names and committing. This may require a Linux machine or a clever utility.
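The core of such a utility could be a transcoding step like the sketch below. The function names toutf8 and planrenames are hypothetical, and the sketch assumes the legacy encoding of the repository is known (cp1252 is used in the usage note purely as an example). It only computes the old and new names; performing the renames and committing is left out.

```python
def toutf8(name, legacy):
    """Transcode a legacy-encoded filename (bytes) to UTF-8 bytes."""
    return name.decode(legacy).encode('utf-8')


def planrenames(filenames, legacy):
    """Return (old, new) pairs for names that are not valid UTF-8.

    `legacy` is the assumed legacy encoding of the repository.
    """
    renames = []
    for name in filenames:
        try:
            name.decode('utf-8')
        except UnicodeDecodeError:
            renames.append((name, toutf8(name, legacy)))
    return renames
```

For example, planrenames([b'README', b'caf\xe9.txt'], 'cp1252') leaves README alone and maps b'caf\xe9.txt' to b'caf\xc3\xa9.txt'.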
5.2. Console will still be legacy
The console is still restricted to a legacy charset and Mercurial will continue to avoid transcoding when dealing with the console. Thus, UTF-8 names will be output as UTF-8 byte strings and result in mojibake unless cp65001 is used. This is identical to the current situation when working with UTF-8 changesets, except the filenames will be readable on disk.
Applications like TortoiseHg will be able to deal with this issue.
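The mojibake effect can be reproduced by decoding UTF-8 bytes with a legacy code page. In this sketch, render is a hypothetical helper that simulates how a console would interpret the raw bytes, and cp1252 stands in for whatever legacy code page the console uses; 'utf-8' plays the role of cp65001, which is the Windows code page identifier for UTF-8.

```python
def render(raw, codepage):
    """Simulate a console interpreting raw output bytes in `codepage`."""
    return raw.decode(codepage, 'replace')


name = u'caf\u00e9.txt'
raw = name.encode('utf-8')       # the bytes written to the console

legacy = render(raw, 'cp1252')   # u'caf\xc3\xa9.txt', shown as "cafÃ©.txt"
utf8 = render(raw, 'utf-8')      # the original name, as with cp65001
```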
5.3. Merge between UTF-8 and non-UTF-8 commits
This could create problems. We probably don't want to make merge aware of this issue.
6. Progress status
6.1. Current step
The "Add methods for all basic filesystem operations to vfs object" and "Update all users to use vfs methods" steps are currently being carried out incrementally (see also Matt's most recent comment on this).
- Replace the file APIs below where they are used on repository-relative paths:
  - builtin open() and file()
  - file API via os.* and os.path.*
  - file API via osutil.* (e.g. osutil.listdir)
  - file API via shutil.* (e.g. shutil.rmtree)
  - file API via util.* (e.g. util.unlink)
- Replace "os.path.join" on repository-relative paths
The "Replace usage of non-basic methods" step should be done first, where needed, before each piece of "Update all users to use vfs methods" work.
6.2. Status of each file
finished (or not used)
finished but some are still used for ABS paths
not yet finished
filename | file API | os.path.join |
mercurial/bookmarks.py | | |
mercurial/bundlerepo.py | | |
mercurial/changegroup.py | | |
mercurial/changelog.py | | |
mercurial/context.py | | *1 |
mercurial/hg.py | | |
mercurial/localrepo.py | | *2 |
mercurial/lock.py | | |
mercurial/patch.py | | |
mercurial/repair.py | | |
mercurial/statichttprepo.py | | |
mercurial/store.py | | |
mercurial/transaction.py | *3 | |
- *1 using "os.path.join" to add subrepo prefix to target filename
- *2 using "os.path.join" to implement self join/wjoin
- *3 using "util.copyfile" to backup target files
Some other files have also been changed for WindowsUTF8Plan, but only partially (e.g. hgext/shelve.py, mercurial/commands.py, and so on).
6.3. Current API of vfs
LEGACY function | vfs function | note |
builtin open()/file() | open() | "vfs.open(name)" should be used in newly added code instead of "vfs(name)" |
util.posixfile() | open() | |
os.chmod() | chmod() | |
os.path.exists() | exists() | this shouldn't be used for files in working directory |
util.fstat() | fstat() | this shouldn't be used for files in working directory, because this implies os.stat() |
os.path.isdir() | isdir() | lstat() should be used for multiple examinations |
os.path.isfile() | isfile() | lstat() should be used for multiple examinations |
os.path.islink() | islink() | lstat() should be used for multiple examinations |
os.path.lexists() | lexists() | |
os.lstat() | lstat() | |
util.makedir() | makedir() | this can take "notindexed" argument |
util.makedirs() | makedirs() | this can create directory recursively |
util.makelock() | makelock() | |
os.mkdir() | mkdir() | |
tempfile.mkstemp() | mkstemp() | |
osutil.listdir() | readdir() | API for os.listdir() is not yet provided |
util.readlock() | readlock() | |
util.rename() | rename() | |
os.readlink() | readlink() | |
util.setflags() | setflags() | |
os.stat() | stat() | lstat() should be used for files in working directory |
os.symlink() | symlink() | |
util.unlink() | unlink() | |
util.unlinkpath() | unlinkpath() | |
os.utime() | utime() | |
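The note "lstat() should be used for multiple examinations" can be illustrated as follows: rather than calling isdir(), isfile() and islink() one after another (each of which stats the file again), a single lstat() result is examined with the stat module. The sketch uses plain os.lstat() in place of a vfs object, and the classify function is a hypothetical name. Because lstat() does not follow symlinks, a link to a directory is reported as a link here, which differs from os.path.isdir().

```python
import os
import stat


def classify(path):
    """Return 'link', 'dir', 'file', 'other' or None from one lstat()."""
    try:
        st = os.lstat(path)
    except OSError:
        return None  # path does not exist or is not accessible
    mode = st.st_mode
    if stat.S_ISLNK(mode):
        return 'link'
    if stat.S_ISDIR(mode):
        return 'dir'
    if stat.S_ISREG(mode):
        return 'file'
    return 'other'
```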