Note:
This page is primarily intended for developers of Mercurial.
Note:
This page is no longer relevant but is kept for historical purposes.
Case-folding Plan
To deal with CaseFolding on the repo side, we need to:
- escape uppercase ASCII characters in filenames
- escape high ASCII
- Unicode and other characters may be case-folded as well
- Filesystems and operating systems may do other unfortunate things to filenames which will cause interoperability trouble
- use the same scheme by default on all systems to avoid backup and media sharing issues
A simple escaping scheme is as follows:
- replace _ with __
- replace A-Z with _a, etc.
- replace characters 126-255 and '\:*?"<>|' with ~7e to ~ff (note this escapes tilde as well)
Note that we rarely need to
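The escaping rules above can be sketched in a few lines of Python. This is only an illustration of the scheme as listed, not the code Mercurial uses; the function names `escape_filename` and `unescape_filename` are made up for this example, and it assumes filenames are handled as byte strings and that the reserved characters are escaped with their own two-digit hex codes.

```python
# Hypothetical helpers illustrating the escaping scheme; not Mercurial's API.
RESERVED = b'\\:*?"<>|'  # characters singled out by the scheme


def escape_filename(name: bytes) -> bytes:
    """Escape a store path so it contains no uppercase or high bytes."""
    out = bytearray()
    for b in name:
        if b == ord('_'):
            out += b'__'                      # _ becomes __
        elif ord('A') <= b <= ord('Z'):
            out += b'_' + bytes([b + 32])     # A-Z becomes _a .. _z
        elif 126 <= b <= 255 or b in RESERVED:
            out += b'~%02x' % b               # e.g. ~ becomes ~7e
        else:
            out.append(b)
    return bytes(out)


def unescape_filename(name: bytes) -> bytes:
    """Invert escape_filename (assumes well-formed escaped input)."""
    out = bytearray()
    i = 0
    while i < len(name):
        b = name[i]
        if b == ord('_'):
            nxt = name[i + 1]
            out.append(ord('_') if nxt == ord('_') else nxt - 32)
            i += 2
        elif b == ord('~'):
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

Because the path separator / is left alone, the same function can be applied to a whole store path: `escape_filename(b'data/Readme.i')` gives `b'data/_readme.i'`.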
Implementation plan:
- add separate localrepo access methods for all store data (changelog, manifest, data/*, journal, lock)
- if .hg/data exists at localrepo init time, use old access scheme
- if not, access all store data with escaped paths inside .hg/store/ (e.g. .hg/store/00changelog.i or .hg/store/data/_readme.i)
This scheme will automatically escape all paths on newly cloned or created repos.
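A rough sketch of the layout decision described in the plan, assuming a single helper that maps a store-relative name to a path on disk. The names `store_join` and `escape_upper` are hypothetical, and `escape_upper` only handles _ and A-Z to keep the example short.

```python
import os


def escape_upper(name):
    """Toy escaper for this sketch: handles only '_' and A-Z."""
    out = []
    for ch in name:
        if ch == '_':
            out.append('__')
        elif 'A' <= ch <= 'Z':
            out.append('_' + ch.lower())
        else:
            out.append(ch)
    return ''.join(out)


def store_join(hg_dir, name, escape=escape_upper):
    """Return the on-disk path for a piece of store data.

    If .hg/data exists the repo uses the old layout, so the name is
    used unescaped directly under .hg. Otherwise the escaped name is
    placed under .hg/store/.
    """
    if os.path.isdir(os.path.join(hg_dir, 'data')):
        return os.path.join(hg_dir, name)
    return os.path.join(hg_dir, 'store', escape(name))
```

With this shape, callers never build store paths themselves, so an existing repo keeps working unchanged while a fresh clone gets escaped names throughout.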
On the working directory side, the best we can do is detect collisions. A simple scheme might look something like this:
- detect whether the filesystem is case-insensitive at checkout/update time
- scan manifest for case-folding collisions and issue a warning
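The two steps could look roughly like the following. Both function names are invented for this sketch, the filesystem probe simply creates a temporary file and looks for it under a case-swapped name, and `str.lower()` is only an approximation of what a real case-folding filesystem does.

```python
import os
import tempfile
from collections import defaultdict


def fs_is_case_sensitive(directory):
    """Probe a directory: does a case-swapped name refer to another file?"""
    with tempfile.NamedTemporaryFile(prefix='hgcase', dir=directory) as f:
        head, tail = os.path.split(f.name)
        swapped = os.path.join(head, tail.swapcase())
        # On a case-folding filesystem the swapped name finds the same file.
        return not os.path.exists(swapped)


def find_case_collisions(paths):
    """Group manifest paths that differ only by case."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

For example, `find_case_collisions(['README', 'readme', 'src/a.c'])` returns `[['README', 'readme']]`, which is the kind of group a warning would be issued for.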
There are some further issues on the working directory and user interface side:
- renames which only change case (e.g. foo -> Foo) will not be properly detected in the filesystem
- user-supplied filenames may differ in case from the actual file on disk (Question: is it reasonable to require the user to specify the correct case? Probably not, see Issue646).
Also, the filesystem on OS X performs Unicode normalisation, meaning that two filenames with differing normal forms may in fact refer to the same file.
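To make the normalisation problem concrete, here is a small illustration using Python's standard `unicodedata` module. The helper name and the choice of NFD as the comparison form are assumptions for the example only.

```python
import unicodedata

composed = 'caf\u00e9'      # "e with acute" as one code point
decomposed = 'cafe\u0301'   # "e" followed by a combining acute accent


def same_after_normalisation(a, b, form='NFD'):
    """Compare two names after converting both to one normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

The two strings compare unequal as they stand, yet `same_after_normalisation(composed, decomposed)` is true, so a normalising filesystem would treat them as one file while the manifest treats them as two.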
Finally, there are some filename identity issues even on Unix - the files foo/bar and baz/../foo/bar are the same. These are (presumably) solved on Unix, so looking at how the solution works may offer some advice on how to deal with user input issues.
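As a point of comparison, the purely lexical part of the Unix case can be handled with the standard `posixpath.normpath`. The helper below is a hypothetical sketch; it does not consult the filesystem, so it can give the wrong answer when symlinks are involved.

```python
import posixpath


def same_lexical_path(a, b):
    """True if two '/'-separated paths collapse to the same string."""
    return posixpath.normpath(a) == posixpath.normpath(b)
```

Here `same_lexical_path('foo/bar', 'baz/../foo/bar')` is true, while names differing only in case still compare as different.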
Proposal:
- Classify file names into different types:
- Manifest internal (case sensitive always)
- OS Native (possibly case or normalisation insensitive)
- Identify which type of file name is involved in the various API calls
- Determine the correct behaviour whenever the two types come into contact
There are some other differences between manifest internal and OS native pathnames (the former always use / as the path separator, whereas the latter use os.sep), as well as differences between absolute and relative pathnames. In reviewing API calls, these differences should be noted as well.
In some cases, this may require carrying around additional data, to preserve both the user-supplied name and the actual filesystem canonical name.
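One way to picture the proposal is to give each kind of name its own type, so that mixing them up requires an explicit conversion. Everything below (class names, fields, conversion functions) is hypothetical and only meant to show the shape of the idea.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestPath:
    """Manifest internal name: always '/'-separated, always case sensitive."""
    value: str


@dataclass(frozen=True)
class NativePath:
    """OS native name, keeping both spellings of the same file."""
    user_supplied: str   # what the user typed
    canonical: str       # what the filesystem actually calls the file


def to_native(mpath, sep=os.sep):
    """Convert a manifest name to a native one (both spellings equal)."""
    native = mpath.value.replace('/', sep)
    return NativePath(user_supplied=native, canonical=native)


def to_manifest(npath, sep=os.sep):
    """Convert back, trusting the canonical spelling rather than the user's."""
    return ManifestPath(npath.canonical.replace(sep, '/'))
```

Keeping `user_supplied` next to `canonical` is what "carrying around additional data" amounts to here: messages can echo what the user typed, while lookups in the manifest use the canonical spelling.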
See also
Issue839 - "Hg local store creates paths too long for Windows"