Differences between revisions 10 and 11
Revision 10 as of 2008-03-11 08:48:44
Size: 1662
Comment:
Revision 11 as of 2008-04-14 08:19:59
Size: 3229
Editor: PaulMoore
Comment:
To deal with CaseFolding on the repo side, we need to:

  • escape uppercase ASCII characters in filenames
  • escape high ASCII
    • Unicode and other characters may be case-folded as well
    • Filesystems and operating systems may do other unfortunate things to filenames which will cause interoperability trouble
  • use the same scheme by default on all systems to avoid backup and media sharing issues

A simple escaping scheme is as follows:

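  • replace _ with __
  • replace A-Z with _a, etc. (an underscore followed by the lowercase letter)
  • replace characters 126-255 and each of \:*?"<>| with ~ followed by the character's two-digit hex code, so that 126-255 become ~7e to ~ff (note this escapes tilde as well)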
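
The three rules above can be sketched in Python as follows. This is only an illustration, not the actual Mercurial code: the function name `escape_filename` is hypothetical, names are assumed to be handled as UTF-8 bytes, and the `~7e` to `~ff` notation is assumed to mean a tilde followed by the two-digit hex value of the byte.

```python
# Characters named in the third rule (reserved on Windows filesystems).
RESERVED = set('\\:*?"<>|')


def escape_filename(name: str) -> str:
    """Escape a store filename using the three rules above (a sketch).

    Assumption: the name is processed byte by byte as UTF-8, and the
    path separator '/' is passed through unchanged.
    """
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        if ch == "_":
            out.append("__")              # rule 1: double the escape character
        elif "A" <= ch <= "Z":
            out.append("_" + ch.lower())  # rule 2: A-Z becomes _a .. _z
        elif byte >= 126 or ch in RESERVED:
            out.append("~%02x" % byte)    # rule 3: tilde plus hex byte value
        else:
            out.append(ch)
    return "".join(out)
```

Under these assumptions, `escape_filename("data/Readme.i")` gives `data/_readme.i`, and two names that differ only in case never map to the same escaped name.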

Note that we rarely need to

Implementation plan:

  • add separate localrepo access methods for all store data (changelog, manifest, data/*, journal, lock) (./)

  • if .hg/data exists at localrepo init time, use old access scheme (./)

  • if not, access all store data with escaped paths inside .hg/store/ (e.g. .hg/store/00changelog.i or .hg/store/data/_readme.i) (./)
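
A minimal sketch of the layout decision in the plan above. The helper name `store_path` and the `escape` parameter are hypothetical and are not taken from localrepo; the escaping function is passed in so that the sketch stays independent of any particular scheme.

```python
import os
from typing import Callable


def store_path(repo_root: str, name: str, escape: Callable[[str], str]) -> str:
    """Return the on-disk path for a store file such as 'data/README.i'.

    Hypothetical helper: 'name' uses '/' separators, and 'escape' is
    whatever escaping function the repository has chosen.
    """
    hg_dir = os.path.join(repo_root, ".hg")
    if os.path.isdir(os.path.join(hg_dir, "data")):
        # Old layout: .hg/data exists, so keep using unescaped paths in .hg/.
        return os.path.join(hg_dir, *name.split("/"))
    # New layout: everything lives under .hg/store/ with escaped names.
    return os.path.join(hg_dir, "store", *escape(name).split("/"))
```

Checking for `.hg/data` once, at repository initialisation, keeps existing repositories readable while newly created ones get the escaped layout.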

This scheme will automatically escape all paths on newly cloned or created repos.

On the working directory side, the best we can do is detect collisions. A simple scheme might look something like this:

  • detect whether the filesystem is case sensitive at checkout/update time (./)

  • scan manifest for case-folding collisions and issue a warning (./)
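
Both steps can be sketched as below. The function names are hypothetical, the probe assumes the directory is writable, and `str.lower()` is used as a rough stand-in for whatever folding the target filesystem really applies.

```python
import os
import tempfile
from collections import defaultdict


def is_case_sensitive(directory: str) -> bool:
    """Probe the filesystem by creating a file and looking it up in upper case."""
    with tempfile.NamedTemporaryFile(prefix="casetest", dir=directory) as probe:
        head, tail = os.path.split(probe.name)
        # The prefix is lower case, so tail.upper() is a different spelling.
        return not os.path.exists(os.path.join(head, tail.upper()))


def find_case_collisions(manifest_names):
    """Group manifest names that differ only by case (approximated by lower())."""
    by_folded = defaultdict(list)
    for name in manifest_names:
        by_folded[name.lower()].append(name)
    return [sorted(names) for names in by_folded.values() if len(names) > 1]
```

A checkout could then warn only when `is_case_sensitive()` returns False and `find_case_collisions()` returns a non-empty list.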

There are some further issues on the working directory and user interface side.

  • renames which only change case (e.g. foo -> Foo) will not be properly detected in the filesystem

  • user supplied filenames may differ in case from the actual file on disk (Question: is it reasonable to require the user to specify the correct case? Probably not, see Issue646, http://www.selenic.com/mercurial/bts/issue646).

Also, some filesystems, such as HFS+ on OS X, do Unicode normalisation, meaning that two filenames that differ only in their normal form may in fact name the same file.
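
To illustrate the normalisation problem, here is a small sketch using Python's standard `unicodedata` module. The helper name is hypothetical, and comparing in NFC is an assumption made for the example; a given filesystem may use a different form.

```python
import unicodedata


def same_after_normalisation(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9"     # 'e with acute' as a single code point
decomposed = "cafe\u0301"  # plain 'e' followed by a combining acute accent

# The two spellings differ as strings but agree once normalised.
assert composed != decomposed
assert same_after_normalisation(composed, decomposed)
```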

Finally, there are some filename identity issues even on Unix - the files foo/bar and baz/../foo/bar are the same. These are (presumably) solved on Unix, so looking at how the solution works may offer some advice on how to deal with user input issues.
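
As a point of comparison, Python's standard library handles the `foo/bar` versus `baz/../foo/bar` case with a purely lexical cleanup. The sketch below uses `posixpath` so that it behaves the same on every platform; the helper name is hypothetical.

```python
import posixpath


def same_lexical_path(a: str, b: str) -> bool:
    """Compare two '/'-separated paths after collapsing '.' and '..' parts."""
    return posixpath.normpath(a) == posixpath.normpath(b)


assert posixpath.normpath("baz/../foo/bar") == "foo/bar"
assert same_lexical_path("foo/bar", "baz/../foo/bar")
```

Note that `normpath` never touches the disk, so it can give the wrong answer when `baz` is a symbolic link; `os.path.realpath` resolves links, at the cost of filesystem access.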

Proposal:

  • Classify file names into different types:
    • Manifest internal (case sensitive always)
    • OS Native (possibly case or normalisation insensitive)
  • Identify which type of file name is involved in the various API calls
  • Determine the correct behaviour whenever the two types come into contact

There are some other differences between manifest internal and OS native pathnames (the former always use / as the path separator, whereas the latter use os.sep) as well as differences between absolute and relative pathnames - in reviewing API calls, these differences should be noted as well.
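
The separator difference can be sketched with two small conversion helpers. The names `manifest_to_native` and `native_to_manifest` are hypothetical and do not refer to existing Mercurial functions; the `sep` parameter exists only so the behaviour can be shown for a non-native separator.

```python
import os


def manifest_to_native(name: str, sep: str = os.sep) -> str:
    """Convert a manifest internal name ('/' separators) to an OS native one."""
    return name.replace("/", sep)


def native_to_manifest(path: str, sep: str = os.sep) -> str:
    """Convert a relative OS native path back to '/' separators."""
    return path.replace(sep, "/")
```

For example, `manifest_to_native("src/Foo.c", sep="\\")` yields `src\Foo.c`. Keeping the conversions in one place makes it easier to see which API calls receive which kind of name.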

In some cases, this may require carrying around additional data, to preserve both the user-supplied name and the actual canonical name on the filesystem.
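
One possible shape for that additional data is a small record holding both spellings. Everything here is hypothetical: the class `ResolvedName`, the function `resolve`, and the use of `str.lower()` to approximate the filesystem's folding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedName:
    user_supplied: str  # exactly what the user typed
    canonical: str      # the spelling actually present on disk


def resolve(user_name: str, directory_listing) -> ResolvedName:
    """Match a user-supplied name against on-disk names, ignoring case."""
    folded = {entry.lower(): entry for entry in directory_listing}
    # Fall back to the user's spelling when no existing file matches.
    canonical = folded.get(user_name.lower(), user_name)
    return ResolvedName(user_name, canonical)
```

With a listing of `["README.txt", "setup.py"]`, `resolve("readme.TXT", ...)` keeps the user's spelling for messages while exposing `README.txt` as the canonical name.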

See also


CategoryWindows CategoryNewFeatures

CaseFoldingPlan (last edited 2012-11-06 23:04:58 by abuehl)