Differences between revisions 1 and 11 (spanning 10 versions)
Revision 1 as of 2006-10-23 23:10:22
Size: 1300
Editor: mpm
Comment:
Revision 11 as of 2008-04-14 08:19:59
Size: 3229
Editor: PaulMoore
Comment:
Line 3: Line 3:
 - escape uppercase ASCII characters in filenames
 - escape high ASCII
   - Unicode and other characters may be case-folded as well
   - Filesystems and operating systems may do other unfortunate things to
 * escape uppercase ASCII characters in filenames
 * escape high ASCII
   * Unicode and other characters may be case-folded as well
   * Filesystems and operating systems may do other unfortunate things to
Line 8: Line 8:
 - use the same scheme by default on all systems to avoid backup and media sharing issues
 * use the same scheme by default on all systems to avoid backup and media sharing issues
Line 12: Line 12:
 - replace _ with __
 - replace A-Z with _a, etc.
 - replace characters 126-255 with ~7e to ~ff (note this escapes tilde as well
 * replace {{{_}}} with {{{__}}}
 * replace A-Z with _a, etc.
 * replace characters 126-255 and '\:*?"<>|' with ~7e to ~ff (note this escapes tilde as well
Line 20: Line 20:
 - add separate localrepo access methods for all store data (changelog, manifest, data/*, journal, lock) {*}
 - if .hg/data exists at localrepo __init__ time, use old access scheme
 - if not, access all store data with escaped paths inside .hg/store/ (eg .hg/store/00changelog.i or .hg/store/data/_readme.i)
 * add separate localrepo access methods for all store data (changelog, manifest, data/*, journal, lock) (./)
 * if .hg/data exists at localrepo __init__ time, use old access scheme (./)
 * if not, access all store data with escaped paths inside .hg/store/ (eg .hg/store/00changelog.i or .hg/store/data/_readme.i) (./)
Line 28: Line 28:
 - detect case sensitive filesystem at checkout/update time
 - scan manifest for case-folding collisions and issue a warning
 * detect case sensitive filesystem at checkout/update time (./)
 * scan manifest for case-folding collisions and issue a warning (./)

To deal with CaseFolding on the repo side, we need to:

  • escape uppercase ASCII characters in filenames
  • escape high ASCII
    • Unicode and other characters may be case-folded as well
    • Filesystems and operating systems may do other unfortunate things to filenames which will cause interoperability trouble
  • use the same scheme by default on all systems to avoid backup and media sharing issues

A simple escaping scheme is as follows:

  • replace _ with __

  • replace A-Z with _a to _z
  • replace characters 126-255 and '\:*?"<>|' with a tilde followed by the two-digit hex character code, giving ~7e to ~ff for the 126-255 range (note this escapes tilde as well)
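The escaping rules above can be sketched in a few lines of Python. This is only an illustration, not the code Mercurial uses: the function name `escape_name` is hypothetical, and it assumes that names are handled as bytes and that each escaped character is written as a tilde plus its two-digit lowercase hex code.

```python
# Characters singled out in the scheme above, in addition to 126-255.
RESERVED = '\\:*?"<>|'


def escape_name(name: bytes) -> str:
    """Escape a store filename per the simple scheme (hypothetical helper)."""
    out = []
    for byte in name:
        ch = chr(byte)
        if ch == "_":
            # Double the escape character itself so decoding stays unambiguous.
            out.append("__")
        elif "A" <= ch <= "Z":
            # Uppercase ASCII becomes underscore plus the lowercase letter.
            out.append("_" + ch.lower())
        elif byte >= 126 or ch in RESERVED:
            # High bytes, tilde and the reserved set become ~xx (hex).
            out.append("~%02x" % byte)
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, the name `Readme` becomes `_readme` and a colon becomes `~3a`. Path separators pass through unchanged, so directory structure is preserved.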

Note that we rarely need to

Implementation plan:

  • add separate localrepo access methods for all store data (changelog, manifest, data/*, journal, lock) (./)

  • if .hg/data exists at localrepo __init__ time, use old access scheme (./)

  • if not, access all store data with escaped paths inside .hg/store/ (eg .hg/store/00changelog.i or .hg/store/data/_readme.i) (./)

This scheme will automatically escape all paths on newly cloned or created repos.
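As a rough sketch of the layout decision in the implementation plan, the following function picks between the old and new schemes. The name `store_path` and the `escape` parameter are assumptions made for illustration; they are not part of the localrepo API.

```python
import os


def store_path(repo_root, name, escape=lambda n: n):
    """Return the on-disk path for a store file (hypothetical helper).

    `name` is a store-relative name such as "00changelog.i" or
    "data/Readme.i"; `escape` is whatever escaping function is in use.
    """
    hg = os.path.join(repo_root, ".hg")
    if os.path.isdir(os.path.join(hg, "data")):
        # Old repository: .hg/data exists, so keep the unescaped layout.
        return os.path.join(hg, name)
    # New repository: everything lives under .hg/store/ with escaped names.
    return os.path.join(hg, "store", escape(name))
```

Because the check is made against the existing directory, old repositories keep working unchanged and only newly cloned or created ones get the escaped layout.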

On the working directory side, the best we can do is detect collisions. A simple scheme might look something like this:

  • detect whether the filesystem is case sensitive at checkout/update time (./)

  • scan manifest for case-folding collisions and issue a warning (./)
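The two steps above might look roughly like this. Both function names are hypothetical, and `str.lower()` is used as a stand-in for the filesystem's real folding rules, which may differ (especially outside ASCII).

```python
import os
import tempfile
from collections import defaultdict


def is_case_sensitive(directory):
    """Probe a directory: create a lowercase-named file, look it up in uppercase."""
    with tempfile.NamedTemporaryFile(prefix="casetest", dir=directory) as probe:
        head, tail = os.path.split(probe.name)
        # On a case-insensitive filesystem the uppercase spelling also resolves.
        return not os.path.exists(os.path.join(head, tail.upper()))


def find_case_collisions(manifest_paths):
    """Group manifest paths that differ only by case (approximated with lower())."""
    groups = defaultdict(set)
    for path in manifest_paths:
        groups[path.lower()].add(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A caller could run `find_case_collisions` only when `is_case_sensitive` returns False for the working directory, and print one warning per returned group.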

There are some further issues on the working directory and user interface side.

  • renames which only change case (e.g. foo -> Foo) will not be properly detected in the filesystem

  • user supplied filenames may differ in case from the actual file on disk (Question: is it reasonable to require the user to specify the correct case? Probably not, see [http://www.selenic.com/mercurial/bts/issue646 Issue646]).

Also, filesystems such as those used by OS X perform Unicode normalisation, so two filenames that differ only in normal form may in fact name the same file.
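To make the normalisation problem concrete, here is a small example using Python's standard `unicodedata` module. The helper name `same_after_normalisation` is made up for this illustration, and the choice of NFC as the comparison form is an assumption; a given filesystem may use a different form.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent


def same_after_normalisation(a, b, form="NFC"):
    """Compare two names after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

The two strings compare unequal as plain Python strings, yet `same_after_normalisation(composed, decomposed)` is true, which is the situation a normalising filesystem creates for two manifest entries.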

Finally, there are some filename identity issues even on Unix - the files foo/bar and baz/../foo/bar are the same. These are (presumably) solved on Unix, so looking at how the solution works may offer some advice on how to deal with user input issues.
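For the Unix case, a purely lexical comparison can be sketched with the standard `posixpath` module. The helper name `same_lexical_path` is hypothetical, and the approach assumes no symbolic links are involved: if `baz` were a symlink, `baz/..` would not necessarily be the starting directory.

```python
import posixpath


def same_lexical_path(a, b):
    """Compare two '/'-separated paths after collapsing '.' and '..' parts."""
    return posixpath.normpath(a) == posixpath.normpath(b)
```

Here `same_lexical_path("foo/bar", "baz/../foo/bar")` is true, while names that differ only in case still compare as different.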

Proposal:

  • Classify file names into different types:
    • Manifest internal (case sensitive always)
    • OS Native (possibly case or normalisation insensitive)
  • Identify which type of file name is involved in the various API calls
  • Determine the correct behaviour whenever the 2 types come into contact

There are some other differences between manifest-internal and OS-native pathnames (the former always use / as the path separator, whereas the latter use os.sep), as well as differences between absolute and relative pathnames. These differences should also be noted when reviewing API calls.

In some cases, this may require carrying around additional data to preserve both the user-supplied name and the canonical name actually used by the filesystem.
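One possible shape for the proposal is to give each kind of name its own type, so that mixing them requires an explicit conversion. The class and function names below are invented for this sketch and do not correspond to anything in Mercurial.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestPath:
    """Manifest-internal name: always '/'-separated and case sensitive."""
    value: str


@dataclass(frozen=True)
class NativePath:
    """OS-native name: keeps the user's spelling and the on-disk spelling."""
    user_supplied: str
    canonical: str


def to_manifest(native: NativePath, sep: str = os.sep) -> ManifestPath:
    # Use the canonical on-disk spelling, with separators converted to '/'.
    return ManifestPath(native.canonical.replace(sep, "/"))


def to_native(path: ManifestPath, sep: str = os.sep) -> NativePath:
    # No user input is involved here, so both spellings start out identical.
    text = path.value.replace("/", sep)
    return NativePath(user_supplied=text, canonical=text)
```

Keeping `user_supplied` alongside `canonical` lets messages echo what the user typed while lookups use the name as it exists on disk.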

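See also

  • [http://hgbook.red-bean.com/hgbookch7.html#x11-1530007.7 "Case sensitivity"] in [http://hgbook.red-bean.com/hgbook.html hgbook]
  • [http://www.selenic.com/mercurial/bts/issue839 Issue839] - "Hg local store creates paths too long for Windows"
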
CategoryWindows CategoryNewFeatures

CaseFoldingPlan (last edited 2012-11-06 23:04:58 by abuehl)