#pragma section-numbers 2

= Windows UTF-8 Plan =

A plan to make Mercurial on Windows interoperate with UTF-8 elsewhere.

== Overview ==

According to EncodingStrategy, Mercurial generally avoids managing the encoding of filenames. This works well on Linux and Mac, where UTF-8 is now a well-supported default, but less well on Windows.

To maximize interoperability while preserving backwards compatibility as far as possible, we should recognize manifests that are in UTF-8 and switch to a Unicode filesystem mode on Windows. This is referred to as the "hybrid strategy" in EncodingStrategy. All internal filename handling is done in UTF-8 and converted to/from UTF-16 at a VFS abstraction layer.

== Definitions ==

 * UTF-8 changeset: a changeset where every filename in its manifest is a valid UTF-8 string (ASCII is a subset)
 * legacy changeset: a changeset that contains one or more non-UTF-8-compatible filenames

== Steps ==

 * Rename opener to vfs
 * Add methods for all basic filesystem operations to vfs object
  * osutil.listdir
  * os.lstat/unlink/getcwd
  * os.path.join/lexists/islink/isfile?
  * etc.
 * Replace usage of non-basic methods:
  * os.path.exists -> os.path.lexists
  * os.path.getmtime -> os.lstat
  * shutil.rmtree / os.removedirs
 * Update all users to use vfs methods
 * Add an isutf8 helper method
 * On Windows:
  * derive a u16vfs class from vfs that uses wide APIs internally and gives UTF-8 results
  * detect when updating to a UTF-8 changeset, and use u16vfs

== Python interface ==

Most of Python's APIs accept Unicode objects and will use Windows' wide APIs (aka UTF-16) to give Unicode results. The biggest exception here is os.getcwd(), which takes no args and needs to be replaced by os.getcwdu().

== Issues ==

=== Upgrading to UTF-8 ===

Repositories that have all-ASCII filenames will work without change. Repositories with legacy filenames can be converted by renaming files to UTF-8 and committing. This may require a Linux machine or a clever utility.
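
To find out whether a repository still has legacy filenames, each name in the manifest can be tested against the Definitions above. The following is a minimal sketch, not Mercurial's implementation: `isutf8` is the helper named in the Steps list, but its signature here is an assumption, and `legacynames` and the sample `manifest` list are hypothetical names used only for illustration.

```python
def isutf8(name):
    """Return True if the byte string `name` is valid UTF-8 (ASCII included)."""
    try:
        name.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


def legacynames(filenames):
    """Return the filenames (byte strings) that are not valid UTF-8.

    An empty result means the list qualifies as a "UTF-8 changeset".
    """
    return [f for f in filenames if not isutf8(f)]


# Hypothetical manifest: one ASCII name, one UTF-8 name, one Latin-1 name.
manifest = [
    b'README',
    u'caf\u00e9.txt'.encode('utf-8'),
    u'caf\u00e9.txt'.encode('latin-1'),
]
print(legacynames(manifest))
```

Only the Latin-1 encoded name is reported, so it is the one that would need to be renamed and committed. Note that a byte-level check like this cannot tell which legacy charset a name was written in; it only says that the name is not UTF-8.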
=== Console will still be legacy ===

The console is still restricted to a legacy charset, and Mercurial will continue to avoid transcoding when dealing with the console. Thus, UTF-8 names will be output as UTF-8 byte strings and result in mojibake unless cp65001 is used. This is identical to the current situation when working with UTF-8 changesets, except that the filenames will be readable on disk. Applications like TortoiseHg will be able to deal with this issue.

=== Merge between UTF-8 and non-UTF-8 commits ===

This could create problems. We probably don't want to make merge aware of this issue.

== Status of progress ==

=== Current step ===

The ''"Add methods for all basic filesystem operations to vfs object"'' and ''"Update all users to use vfs methods"'' steps are currently in progress, incrementally; see also the end of [[http://selenic.com/pipermail/mercurial-devel/2012-June/041526.html|Matt's mention]].

 * replace the file APIs below on repository-relative paths
  * builtin open() and file()
  * file API via os.* and os.path.*
  * file API via osutil.* (e.g. osutil.listdir)
  * file API via shutil.* (e.g. shutil.rmtree)
  * file API via util.* (e.g. util.unlink)
 * replace "os.path.join" on repository-relative paths

If needed, the ''"Replace usage of non-basic methods"'' step should be done before each piece of ''"Update all users to use vfs methods"'' work.
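
The replacements listed above can be sketched as follows. This is a toy stand-in written for illustration, not Mercurial's actual vfs class: the class body, the `repovfs` variable and the temporary directory are assumptions, and only a few of the methods are shown.

```python
import os
import tempfile


class vfs(object):
    """Toy stand-in for a vfs object: all names are relative to `base`."""

    def __init__(self, base):
        self.base = base

    def join(self, name):
        # The only place where os.path.join touches repository-relative names.
        return os.path.join(self.base, name)

    def open(self, name, mode='rb'):
        return open(self.join(name), mode)

    def lexists(self, name):
        return os.path.lexists(self.join(name))

    def lstat(self, name):
        return os.lstat(self.join(name))

    def unlink(self, name):
        os.unlink(self.join(name))


# Before: callers build paths and call the os module themselves, e.g.
#   fp = open(os.path.join(root, 'bookmarks'), 'wb')
# After: callers pass the repository-relative name to the vfs object.
root = tempfile.mkdtemp()
repovfs = vfs(root)
with repovfs.open('bookmarks', 'wb') as fp:
    fp.write(b'example\n')
existed = repovfs.lexists('bookmarks')
size = repovfs.lstat('bookmarks').st_size
repovfs.unlink('bookmarks')
os.rmdir(root)
```

Because every filesystem call goes through the object, a Windows-specific subclass such as the planned u16vfs would presumably only need to override these methods to convert names between UTF-8 and the wide APIs, without touching the callers.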
=== Status of each file ===

 * (./) finished (or not used)
 * finished but some are still used for absolute paths
 * {X} not yet finished

||'''filename''' ||'''file API''' ||'''os.path.join''' ||
||mercurial/bookmarks.py || (./) || (./) ||
||mercurial/bundlerepo.py || || (./) ||
||mercurial/changegroup.py || || (./) ||
||mercurial/changelog.py || (./) || (./) ||
||mercurial/context.py || (./) || (./) *1 ||
||mercurial/hg.py || {X} || {X} ||
||mercurial/localrepo.py || (./) || {X} *2 ||
||mercurial/lock.py || (./) || (./) ||
||mercurial/patch.py || {X} || {X} ||
||mercurial/repair.py || (./) || (./) ||
||mercurial/statichttprepo.py || (./) || (./) ||
||mercurial/store.py || (./) || (./) ||
||mercurial/transaction.py || {X} *3 || (./) ||

 * *1 using "os.path.join" to add the subrepo prefix to the target filename
 * *2 using "os.path.join" to implement self join/wjoin
 * *3 using "util.copyfile" to back up target files

Some other files have also been changed for WindowsUTF8Plan, but only partially (e.g. hgext/shelve.py, mercurial/commands.py, and so on).

=== Current API of vfs ===

||'''LEGACY function''' ||'''vfs function''' ||'''note''' ||
||builtin open()/file() ||open() ||"vfs.open(name)" should be used in newly added code instead of "vfs(name)" ||
||util.posixfile() ||open() || ||
||os.chmod() ||chmod() || ||
||os.path.exists() ||exists() ||this shouldn't be used for files in the working directory ||
||util.fstat() ||fstat() ||this shouldn't be used for files in the working directory, because it implies os.stat() ||
||os.path.isdir() ||isdir() ||lstat() should be used for multiple examinations ||
||os.path.isfile() ||isfile() ||lstat() should be used for multiple examinations ||
||os.path.islink() ||islink() ||lstat() should be used for multiple examinations ||
||os.path.lexists() ||lexists() || ||
||os.lstat() ||lstat() || ||
||util.makedir() ||makedir() ||this can take a "notindexed" argument ||
||util.makedirs() ||makedirs() ||this can create directories recursively ||
||util.makelock() ||makelock() || ||
||os.mkdir() ||mkdir() || ||
||tempfile.mkstemp() ||mkstemp() || ||
||osutil.listdir() ||readdir() ||an API for os.listdir() is not yet provided ||
||util.readlock() ||readlock() || ||
||util.rename() ||rename() || ||
||os.readlink() ||readlink() || ||
||util.setflags() ||setflags() || ||
||os.stat() ||stat() ||lstat() should be used for files in the working directory ||
||os.symlink() ||symlink() || ||
||util.unlink() ||unlink() || ||
||util.unlinkpath() ||unlinkpath() || ||
||os.utime() ||utime() || ||

== See also ==

 * EncodingStrategy
 * http://www.selenic.com/pipermail/mercurial-devel/2011-December/036385.html

----
CategoryNewFeatures