Differences between revisions 8 and 9
Revision 8 as of 2014-04-27 10:52:41
Size: 6091
Comment: add link to earlier discussion about working for vfs
Revision 9 as of 2014-06-22 17:19:45
Size: 6202
Editor: ChinmayJoshi
Comment: Updated status as per recent updates

Note:

This page is primarily intended for developers of Mercurial.

Windows UTF-8 Plan

A plan to make Mercurial on Windows interoperate with UTF-8 elsewhere.

1. Overview

According to EncodingStrategy, Mercurial generally avoids managing the encoding of filenames. This works well on Linux and Mac, where UTF-8 is now a well-supported default, but less well on Windows.

To maximize interoperability while preserving as much backwards compatibility as possible, we should recognize manifests that are in UTF-8 and switch to a Unicode filesystem mode on Windows. This is referred to as the "hybrid strategy" in EncodingStrategy. All internal filename handling is done in UTF-8, and names are converted to/from UTF-16 at a VFS abstraction layer.

2. Definitions

  • UTF-8 changeset: a changeset where every filename in its manifest is a valid UTF-8 string (ASCII is a subset)
  • legacy changeset: a changeset that contains one or more non-UTF-8-compatible filenames
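The two definitions can be written down as a small check. This is only a sketch: isutf8 is the helper name used in the Steps section of this page, but its body here is an assumption, and classify_changeset is a hypothetical name that takes the manifest's filenames as a plain list of byte strings.

```python
def isutf8(name: bytes) -> bool:
    """Return True if name is a valid UTF-8 byte string (ASCII included)."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify_changeset(manifest_names) -> str:
    """Classify a changeset from the filenames in its manifest.

    Hypothetical helper: "utf-8" if every name is valid UTF-8,
    "legacy" if at least one name is not.
    """
    if all(isutf8(name) for name in manifest_names):
        return "utf-8"
    return "legacy"
```

Note that a legacy-encoded name can, by coincidence, also be valid UTF-8; a byte-level check like this cannot tell the two cases apart.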

3. Steps

  • Rename opener to vfs
  • Add methods for all basic filesystem operations to vfs object
    • osutil.listdir
    • os.lstat/unlink/getcwd
    • os.path.join/lexists/islink/isfile?
    • etc.
  • Replace usage of non-basic methods:
    • os.path.exists -> os.path.lexists
    • os.path.getmtime -> os.lstat
    • shutil.rmtree / os.removedirs
  • Update all users to use vfs methods
  • Add an isutf8 helper method
  • On Windows:
    • derive a u16vfs class from vfs that uses wide APIs internally and gives UTF-8 results
    • Detect when updating to a UTF-8 changeset and use u16vfs
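One way to picture the last steps is a vfs whose filesystem-facing conversions are isolated in two hooks, so that the Windows subclass only has to override those. The following is a minimal sketch, not Mercurial's actual implementation: the hook names _tofs and _fromfs, the write helper and the demo function are hypothetical, and it is written for Python 3, where passing str paths to os functions uses the wide APIs on Windows.

```python
import os
import tempfile


class vfs:
    """Minimal vfs sketch: callers pass and receive bytes paths relative to base."""

    def __init__(self, base: bytes):
        self.base = base

    def join(self, path: bytes = b"") -> bytes:
        return os.path.join(self.base, path)

    # Conversion hooks (hypothetical names). The base class hands bytes
    # straight to the OS and returns names unchanged.
    def _tofs(self, path: bytes):
        return path

    def _fromfs(self, name) -> bytes:
        return name

    def write(self, path: bytes, data: bytes) -> None:
        with open(self._tofs(self.join(path)), "wb") as fp:
            fp.write(data)

    def lexists(self, path: bytes) -> bool:
        return os.path.lexists(self._tofs(self.join(path)))

    def lstat(self, path: bytes):
        return os.lstat(self._tofs(self.join(path)))

    def unlink(self, path: bytes) -> None:
        os.unlink(self._tofs(self.join(path)))

    def listdir(self, path: bytes = b""):
        names = os.listdir(self._tofs(self.join(path)))
        return sorted(self._fromfs(name) for name in names)


class u16vfs(vfs):
    """Same interface, but talks to the OS in Unicode and returns UTF-8 bytes."""

    def _tofs(self, path: bytes) -> str:
        return path.decode("utf-8")

    def _fromfs(self, name: str) -> bytes:
        return name.encode("utf-8")


def demo():
    """Round-trip one file through u16vfs in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        fs = u16vfs(tmp.encode("utf-8"))
        fs.write(b"hello.txt", b"hi")
        listed = fs.listdir()
        size = fs.lstat(b"hello.txt").st_size
        fs.unlink(b"hello.txt")
        return listed, size, fs.lexists(b"hello.txt")
```

Because callers only ever see UTF-8 byte strings, code that has been moved to vfs methods would not need to know which of the two classes it is talking to.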

4. Python interface

Most of Python's APIs accept Unicode objects and will then use Windows' wide APIs (aka UTF-16) to give Unicode results. The biggest exception is os.getcwd(), which takes no arguments and therefore needs to be replaced by os.getcwdu().
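The "type in, same type out" behaviour can be observed directly. The sketch below is written for Python 3 rather than the Python 2 this page targets: there, os.getcwd() returns text and os.getcwdb() returns bytes, which mirrors the Python 2 pair os.getcwdu() / os.getcwd(). The function names are hypothetical.

```python
import os
import tempfile


def listing_types(directory: str):
    """Return the result types of os.listdir() for a str and a bytes argument."""
    text_names = os.listdir(directory)               # str in, str out
    byte_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    return {type(n) for n in text_names}, {type(n) for n in byte_names}


def cwd_pair():
    """getcwd takes no argument, so the result type is selected by function name."""
    return os.getcwd(), os.getcwdb()


def demo():
    """List a scratch directory containing one file, both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "a.txt"), "w") as fp:
            fp.write("x")
        return listing_types(tmp)
```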

5. Issues

5.1. Upgrading to UTF-8

Repositories that have all-ASCII filenames will work without change.

Repositories with legacy filenames can be converted by renaming files to UTF-8 and committing. This may require a Linux machine or a clever utility.
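The renaming itself is a plain transcoding of each filename. As a rough sketch of what such a utility might compute (both function names are hypothetical, and the legacy encoding has to be supplied by the user because it is not recorded anywhere):

```python
def legacy_to_utf8(name: bytes, legacy_encoding: str) -> bytes:
    """Transcode one filename from the given legacy encoding to UTF-8."""
    return name.decode(legacy_encoding).encode("utf-8")


def plan_renames(names, legacy_encoding: str):
    """Return (old, new) pairs for the names that are not valid UTF-8."""
    renames = []
    for name in names:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            # Not UTF-8, so treat it as a legacy name that needs renaming.
            renames.append((name, legacy_to_utf8(name, legacy_encoding)))
    return renames
```

Names that are already valid UTF-8, including all-ASCII names, are left alone, which matches the statement above that all-ASCII repositories work without change.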

5.2. Console will still be legacy

The console is still restricted to a legacy charset and Mercurial will continue to avoid transcoding when dealing with the console. Thus, UTF-8 names will be output as UTF-8 byte strings and result in mojibake unless cp65001 is used. This is identical to the current situation when working with UTF-8 changesets, except the filenames will be readable on disk.
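The mojibake is easy to reproduce outside Mercurial. In this sketch, cp1252 stands in for "a legacy charset" (an assumption; the real console code page depends on the system), Python's utf-8 codec stands in for cp65001, and console_view is a hypothetical name.

```python
def console_view(name_utf8: bytes, console_codepage: str) -> str:
    """Show how a console using console_codepage would render UTF-8 bytes."""
    return name_utf8.decode(console_codepage, errors="replace")


# The filename "café.txt", stored as UTF-8 bytes.
name = "caf\u00e9.txt".encode("utf-8")

garbled = console_view(name, "cp1252")  # legacy code page: mojibake
readable = console_view(name, "utf-8")  # UTF-8 code page: correct
```

Here garbled is "cafÃ©.txt", because the two bytes of the UTF-8 sequence for "é" are shown as two separate characters, while readable is "café.txt".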

Applications like TortoiseHg will be able to deal with this issue.

5.3. Merge between UTF-8 and non-UTF-8 commits

This could create problems. We probably don't want to make merge aware of this issue.

6. Status of progress

6.1. Current step

The "Add methods for all basic filesystem operations to vfs object" and "Update all users to use vfs methods" steps are currently in progress and are being done incrementally; see also Matt's most recent mention of this work.

  • replace the file APIs below where they are used on repository-relative paths
    • builtin open() and file()
    • file API via os.* and os.path.*
    • file API via osutil.* (e.g. osutil.listdir)
    • file API via shutil.* (e.g. shutil.rmtree)
    • file API via util.* (e.g. util.unlink)
  • replace "os.path.join" where it is used on repository-relative paths

Where needed, the "Replace usage of non-basic methods" step should be done before each piece of "Update all users to use vfs methods" work.
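The replacements listed under Steps (os.path.exists -> os.path.lexists, os.path.getmtime -> os.lstat) are not purely mechanical: the two sides differ for symbolic links. The sketch below probes a dangling symlink to show the difference; probe_dangling_symlink is a hypothetical name, and it returns None on systems where symlinks cannot be created.

```python
import os
import stat
import tempfile


def probe_dangling_symlink():
    """Compare link-following and non-following calls on a dangling symlink."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "link")
        try:
            os.symlink(os.path.join(tmp, "missing-target"), link)
        except (OSError, NotImplementedError, AttributeError):
            return None  # symlinks are not available here

        try:
            os.path.getmtime(link)  # follows the link, like os.stat()
            getmtime_ok = True
        except OSError:
            getmtime_ok = False

        st = os.lstat(link)  # examines the link itself
        return {
            "exists": os.path.exists(link),    # follows the link
            "lexists": os.path.lexists(link),  # does not follow the link
            "getmtime_ok": getmtime_ok,
            "lstat_is_link": stat.S_ISLNK(st.st_mode),
        }
```

Where symlinks are supported, the link-following calls report the file as absent or fail, while lexists() and lstat() still see the link.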

6.2. Status of each file

  • (./) finished (or not used)
  • <!> finished, but some are still used for absolute paths
  • {X} not yet finished

||filename ||file API ||os.path.join ||
||mercurial/bookmarks.py ||(./) ||(./) ||
||mercurial/bundlerepo.py ||<!> ||(./) ||
||mercurial/changegroup.py ||<!> ||(./) ||
||mercurial/changelog.py ||(./) ||(./) ||
||mercurial/context.py ||(./) ||(./) *1 ||
||mercurial/hg.py ||{X} ||{X} ||
||mercurial/localrepo.py ||(./) ||{X} *2 ||
||mercurial/lock.py ||(./) ||(./) ||
||mercurial/patch.py ||{X} ||{X} ||
||mercurial/repair.py ||(./) ||(./) ||
||mercurial/statichttprepo.py ||(./) ||(./) ||
||mercurial/store.py ||(./) ||(./) ||
||mercurial/transaction.py ||{X} *3 ||(./) ||

  • *1 using "os.path.join" to add subrepo prefix to target filename
  • *2 using "os.path.join" to implement self join/wjoin
  • *3 using "util.copyfile" to backup target files

Some other files have also been changed for WindowsUTF8Plan, but only partially (e.g. hgext/shelve.py and mercurial/commands.py).

6.3. Current API of vfs

||LEGACY function ||vfs function ||note ||
||builtin open()/file() ||open() ||"vfs.open(name)" should be used in newly added code instead of "vfs(name)" ||
||util.posixfile() ||open() || ||
||os.chmod() ||chmod() || ||
||os.path.exists() ||exists() ||this shouldn't be used for files in working directory ||
||util.fstat() ||fstat() ||this shouldn't be used for files in working directory, because this implies os.stat() ||
||os.path.isdir() ||isdir() ||lstat() should be used for multiple examinations ||
||os.path.isfile() ||isfile() ||lstat() should be used for multiple examinations ||
||os.path.islink() ||islink() ||lstat() should be used for multiple examinations ||
||os.path.lexists() ||lexists() || ||
||os.lstat() ||lstat() || ||
||util.makedir() ||makedir() ||this can take "notindexed" argument ||
||util.makedirs() ||makedirs() ||this can create directory recursively ||
||util.makelock() ||makelock() || ||
||os.mkdir() ||mkdir() || ||
||tempfile.mkstemp() ||mkstemp() || ||
||osutil.listdir() ||readdir() ||API for os.listdir() is not yet provided ||
||util.readlock() ||readlock() || ||
||util.rename() ||rename() || ||
||os.readlink() ||readlink() || ||
||util.setflags() ||setflags() || ||
||os.stat() ||stat() ||lstat() should be used for files in working directory ||
||os.symlink() ||symlink() || ||
||util.unlink() ||unlink() || ||
||util.unlinkpath() ||unlinkpath() || ||
||os.utime() ||utime() || ||
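The note "lstat() should be used for multiple examinations" means that code asking several questions about one path should make a single lstat() call and inspect the result, rather than calling isdir(), isfile() and islink() in turn. A minimal sketch using the plain os module (classify is a hypothetical name; a vfs-based version would call the vfs lstat() method instead):

```python
import os
import stat


def classify(path) -> str:
    """Classify a path with one lstat() call instead of several is*() calls."""
    try:
        st = os.lstat(path)
    except OSError:
        return "missing"
    mode = st.st_mode
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "file"
    return "other"
```

Unlike os.path.isdir() and os.path.isfile(), which follow symbolic links, this classification describes the link itself, in line with the advice to use lstat() for files in the working directory.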

7. See also


CategoryNewFeatures

WindowsUTF8Plan (last edited 2014-06-22 17:19:45 by ChinmayJoshi)