Note:

This page is primarily intended for developers of Mercurial.

Better Matcher API and File Patterns Plan

Status: Project

Main proponents: KatsunoriFujiwara, RodrigoDamazio

/!\ This is a speculative project and does not represent any firm decisions on future behavior.

{X} Add a short summary of the idea here.

1. Goal

  • Short term: add non-recursive globs ?
  • Long term: extensible matcher API ?

2. Detailed description

2.1. Sprint Notes

Non-recursive globs (Rodrigo, spectral, Durham, :
    Issue is that * is sometimes recursive
    matcher API is a mess
    Should we re-write match.py or just add fileglob?
    Suggestion: add fileglob via a new, cleaner API, then migrate others over time
    Possible FB use case: pick parts of a tree to include and exclude (would add ordering dependency instead of excludes always trumping includes?)
    matcher API should be extensible
    matcher composition: anyof, allof, negate, per-file-type, etc.
    Inconsistencies in pattern behavior between hgignore, --include/--exclude, etc.
    FB: conversion between matchers and watchman expressions
    Proposal: wiki page, first group to have a use case proposes the initial API

2.2. Current Status

2.2.1. Summary of mode, relative-to, and recursion of each type

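||'''mode''' ||'''root-ed''' ||'''cwd-ed''' ||'''any-of-path''' ||'''control recursion by pattern''' ||'''context depend recursion''' ||
||wildcard ||- ||`glob:` ||`relglob:` ||by ** ||o ||
||regexp ||`re:` ||- ||`relre:` ||by $ ||x (*A) ||
||raw string ||`path:` ||`relpath:` ||- ||(always) ||x ||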

  • (*A) An ignore pattern in "regexp" mode matches recursively (e.g. "re:^foo$" ignores file foo/bar). Details are explained later.

2.2.2. The list of contexts, in which pattern is specified

||'''pattern for''' ||'''default type''' ||'''recursion of wildcard''' ||'''related API''' ||
||fileset ||`glob:` ||x ||ctx.match() ||
||files() template function ||`glob:` ||x ||ctx.match() ||
||diff() template function ||`glob:` ||o (*1) ||ctx.match() ||
||file() revset predicate ||`glob:` ||x ||match.match() ||
||follow() revset predicate ||`path:` (*3) ||x ||match.match() ||
||--include/--exclude ||`glob:` ||o (*1) ||match.match() ||
||hgignore ||`relre:` ||o (*1) ||match.match() ||
||`archive` web command ||`path:` ||- (*2) ||scmutil.match() ||
||`hg locate` ||`relglob:` ||x ||scmutil.match() ||
||`hg log` ||`relpath:` ||x ||scmutil.matchandpats() ||
||others (e.g. `hg files`) ||`relpath:` ||x ||scmutil.match() ||

  • (*1) treated as include/exclude of match.match() (otherwise, treated as pats of match.match())

  • (*2) no wildcard pattern matching occurs for the archive web command, because path: is forcibly added to the specified pattern in this case

  • (*3) can't be relpath: because of backward compatibility (5618858dce26)

For "recursion of wildcard":

  • if "recursion of wildcard" is "o", pattern glob:foo/bar matches file foo/bar/baz, for example

  • if multiple contexts are combined, the innermost context decides "recursion of wildcard"

For example, file foo/bar/baz is:

  • not matched at: hg files glob:foo/bar

  • not matched at: hg files -I "set:'glob:foo/bar'"

  • but matched at: hg files -I glob:foo/bar

The last case seems to cause the issue mentioned by Rodrigo in "match: adding non-recursive directory matching". The second case can be used as an immediate workaround for that issue.
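The three cases listed above can be reproduced with plain regular expressions. What follows is a minimal sketch, not Mercurial's implementation: it assumes a glob body with no wildcard characters and a working directory at the repository root, and it uses the "$" and "(?:/|$)" suffixes that section 2.3.2 lists for glob:. The helper name glob_regex is hypothetical.

```python
import re

def glob_regex(literal, as_include):
    # Hypothetical helper: only handles a wildcard-free glob body.
    # "$" is the suffix for a plain PATTERN, "(?:/|$)" the one for -I/-X.
    suffix = r"(?:/|$)" if as_include else r"$"
    return re.compile(re.escape(literal) + suffix)

target = "foo/bar/baz"

# hg files glob:foo/bar  (pattern context, non-recursive)
print(bool(glob_regex("foo/bar", as_include=False).match(target)))  # False

# hg files -I glob:foo/bar  (include context, recursive)
print(bool(glob_regex("foo/bar", as_include=True).match(target)))   # True
```

The only difference between the two contexts is the suffix, which is why the same glob: text behaves differently as PATTERN and as -I/-X.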

The table below summarizes recursion (= matching against an intermediate directory) for each mode.

||'''mode''' ||'''-I/-X''' ||'''in "set:"''' ||'''-I/-X with "set:"''' ||
||wildcard ||always ||endswith("**") ||endswith("**") ||
||regexp ||not endswith("$") ||not endswith("$") ||not endswith("$") ||
||raw string ||always ||always ||always ||

"Recursion of wildcard" for a pattern read from a file follows the context that reads that file in. For example:

  • wildcard pattern read in by "-I listfile:FILE" matches recursively, but

  • one read in by "hg status listfile:FILE" doesn't

2.2.3. Reading patterns from file

||'''read in by''' ||'''type substitution''' ||'''default type for hgignore''' ||'''default type for otherwise''' ||
||include:FILE ||o ||`relre:` ||`relre:` ||
||listfile:FILE ||x ||(*X) ||(*Y) ||

  • (*X) this is prohibited by match.readpatternfile()
  • (*Y) decision about "default type" depends on the context, in which listfile:FILE is used (e.g. relglob: for "hg locate", but relpath: for "hg files").

If "type substitution" is "o", the substitutions below always occur when patterns are read from a file. This is mentioned in "hg help patterns" and "hg help hgignore", but the types relglob: and relre: themselves aren't explained.

  • glob: => relglob:

  • re: => relre:

Reading from .hgignore and "[ui] ignore" is treated as a variant of include: internally (e.g. include:$REPOROOT/.hgignore)
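The substitution rule for include:FILE can be sketched as a small function. This is an illustration under assumptions, not the code of match.readpatternfile(): the function name substitute_type and the KNOWN_TYPES set are hypothetical, and only "type:pattern" lines are modelled (no "syntax:" lines, comments, or sub-includes).

```python
# Legacy type names that may prefix a pattern line (assumed list).
KNOWN_TYPES = {"glob", "relglob", "re", "relre", "path", "relpath"}

# Substitutions applied when a pattern is read in by include:FILE.
SUBSTITUTIONS = {"glob": "relglob", "re": "relre"}

def substitute_type(line, default_type="relre"):
    """Return (type, pattern) for one line read in by include:FILE."""
    kind, sep, body = line.partition(":")
    if not sep or kind not in KNOWN_TYPES:
        # No recognised type prefix: the whole line is the pattern.
        return default_type, line
    return SUBSTITUTIONS.get(kind, kind), body

print(substitute_type("glob:*.o"))   # ('relglob', '*.o')
print(substitute_type("re:^foo$"))   # ('relre', '^foo$')
print(substitute_type("path:foo"))   # ('path', 'foo')
print(substitute_type("build"))      # ('relre', 'build')
```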

2.2.4. Recursion of ignore patterns

As ignore patterns, "wildcard" and "raw string" modes are obviously recursive, because:

  • being treated the same as "--include PATTERN" makes "wildcard" mode recursive

  • "raw string" mode is always recursive, regardless of context

On the other hand, "regexp" mode itself is non-recursive. For example, with "re:^foo$" in .hgignore, "hg debugignore" shows a regexp that doesn't match file foo/bar.

In practice, however, "re:^foo$" in .hgignore does ignore file foo/bar, because dirstate (and "hg debugignore") examines whether the specified file:

  • matches the specified ignore patterns, or
  • exists under a directory that matches the specified ignore patterns

and the file is ignored if either condition is true.

Therefore, a "regexp" ignore pattern is recursive, even if it ends with "$".

In conclusion, all ignore patterns are treated as recursive, regardless of pattern type.

This special recursion of "regexp" mode is specific to ignore patterns. In other cases, a "regexp" mode pattern isn't recursive if it ends with "$".
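The two-condition check described above can be sketched as follows. This is a simplified model, not the dirstate code: the function name is_ignored is hypothetical, and paths are assumed to be "/"-separated and relative to the repository root.

```python
import re

def is_ignored(path, ignore_regex):
    # Candidates are the file itself followed by each parent directory,
    # e.g. "foo/bar" -> ["foo/bar", "foo"].
    parts = path.split("/")
    candidates = ["/".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return any(ignore_regex.match(c) for c in candidates)

regex = re.compile(r"^foo$")

# The regexp alone does not match the file ...
print(bool(regex.match("foo/bar")))   # False
# ... but the file is ignored because its parent directory matches.
print(is_ignored("foo/bar", regex))   # True
print(is_ignored("food/bar", regex))  # False
```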

2.3. Proposal by foozy

By introducing systematic new pattern types, both the "start point" and the "recursion" of matching can be fully controlled in any context (as PATTERN, -I/-X, and so on).

2.3.1. Control start point of matching arbitrarily

New types and their start point.

||'''mode''' ||'''root-ed''' ||'''cwd-ed''' ||'''any-of-path''' ||
||wildcard ||`rootglob:` ||`cwdglob:` ||`anyglob:` ||
||regexp ||`rootre:` ||`cwdre:` ||`anyre:` ||
||raw string ||`rootpath:` ||`cwdpath:` ||`anypath:` ||

New "wildcard" and "regexp" types other than anyglob: recurse only as the specified pattern dictates, as shown below. An anyglob: pattern should always be recursive, because this type is "any-of-path" matching.

||'''type''' ||'''recursive''' ||
||`rootglob:` ||endswith("**") ||
||`cwdglob:` ||endswith("**") ||
||`anyglob:` ||always ||
||`rootre:` ||not endswith("$") ||
||`cwdre:` ||not endswith("$") ||
||`anyre:` ||not endswith("$") ||
||`rootpath:` ||always ||
||`cwdpath:` ||always ||
||`anypath:` ||always ||
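The recursion rules in the table above can be written down as a small predicate. This is only a transcription of the table for the proposed (not yet implemented) types. The function name is_recursive is hypothetical, and the endswith() checks are the page's own shorthand rather than a full analysis of the pattern.

```python
def is_recursive(kind, pattern):
    """Return True if kind:pattern also matches files under a matched directory."""
    if kind in ("rootglob", "cwdglob"):
        return pattern.endswith("**")
    if kind in ("rootre", "cwdre", "anyre"):
        return not pattern.endswith("$")
    if kind in ("anyglob", "rootpath", "cwdpath", "anypath"):
        # "any-of-path" globs and all raw string types always recurse.
        return True
    raise ValueError("unknown pattern type: %s" % kind)

print(is_recursive("rootglob", "foo/*"))   # False
print(is_recursive("rootglob", "foo/**"))  # True
print(is_recursive("rootre", "^foo$"))     # False
print(is_recursive("anyglob", "foo/*"))    # True
```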

2.3.2. Emulate legacy types as an alias of new types

The current match.py implementation internally adds the prefix/suffix regexps below to the specified pattern, depending on what the pattern is used for. See the implementation of _regex() and match._normalize(), and the _buildmatch() invocations in match.__init__() in match.py, for details.

||'''type''' ||'''used for''' ||'''prefix''' ||'''suffix''' ||'''recursive''' ||
||`glob:` ||pattern ||"`$CWD/`" ||"`$`" ||endswith("**") ||
|| ||include/exclude ||"`$CWD/`" ||"`(?:/|$)`" ||always ||
||`relglob:` ||pattern ||"`(?:|.*/)`" ||"`$`" ||endswith("**") ||
|| ||include/exclude ||"`(?:|.*/)`" ||"`(?:/|$)`" ||always ||
||`re:` ||(always) ||(none) ||(none) ||not endswith("$") ||
||`relre:` ||(always) ||"`.*`" (*1) ||(none) ||not endswith("$") ||
||`path:` ||(always) ||"`^`" (*2) ||"`(?:/|$)`" ||always ||
||`relpath:` ||(always) ||"`$CWD/`" ||"`(?:/|$)`" ||always ||

  • (*1) this prefix is added only if the pattern doesn't start with "^"

  • (*2) (just nit picking) this may be redundant, because patterns are examined by re.match(), which requires matching from the beginning of a target string.
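The prefix/suffix rules for the non-glob legacy types can be sketched as below. This is an illustration built only from the table above, not the code of _regex(): the function name legacy_regex is hypothetical, the cwd argument stands for "$CWD" as a path relative to the repository root (empty at the root), and glob: and relglob: are left out because they also need glob-to-regexp translation.

```python
import re

def legacy_regex(kind, body, cwd=""):
    # "$CWD/" prefix; assumed to be empty when cwd is the repository root.
    cwd_prefix = (re.escape(cwd) + "/") if cwd else ""
    if kind == "re":
        return body
    if kind == "relre":
        # (*1): ".*" is added only if the pattern doesn't start with "^".
        return body if body.startswith("^") else ".*" + body
    if kind == "path":
        return "^" + re.escape(body) + "(?:/|$)"
    if kind == "relpath":
        return cwd_prefix + re.escape(body) + "(?:/|$)"
    raise ValueError("unsupported type in this sketch: %s" % kind)

# re.match() anchors at the start of the target, as noted in (*2).
print(bool(re.match(legacy_regex("path", "foo"), "foo/bar")))   # True
print(bool(re.match(legacy_regex("path", "foo"), "foobar")))    # False
print(bool(re.match(legacy_regex("re", "bar"), "foo/bar")))     # False
print(bool(re.match(legacy_regex("relre", "bar"), "foo/bar")))  # True
```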

So, first, let us assume that the newly introduced types use the additional prefix/suffix regexps below BY DEFAULT (controlling recursion in "wildcard" and "regexp" mode is now the user's responsibility).

||'''type''' ||'''prefix''' ||'''suffix''' ||'''recursive''' ||
||`rootglob:` ||(none) ||"`$`" ||endswith("**") ||
||`cwdglob:` ||"`$CWD/`" ||"`$`" ||endswith("**") ||
||`anyglob:` ||"`(?:|.*/)`" ||"`(?:/|$)`" ||always ||
||`rootre:` ||(none) ||(none) ||not endswith("$") ||
||`cwdre:` ||"`$CWD/`" ||(none) ||not endswith("$") ||
||`anyre:` ||"`.*`" ||(none) ||not endswith("$") ||
||`rootpath:` ||(none) ||"`(?:/|$)`" ||always ||
||`cwdpath:` ||"`$CWD/`" ||"`(?:/|$)`" ||always ||
||`anypath:` ||"`(?:|.*/)`" ||"`(?:/|$)`" ||always ||

Then, legacy types can be emulated as aliases of the newly introduced types, as below:

||'''type''' ||'''used as''' ||'''alias of''' ||'''needed suffix''' ||
||`glob:` ||pattern ||`cwdglob:` ||"`$`" (= default of `cwdglob:`) ||
|| ||include/exclude ||`cwdglob:` ||"`(?:/|$)`" ||
||`relglob:` ||pattern ||`anyglob:` ||"`$`" ||
|| ||include/exclude ||`anyglob:` ||"`(?:/|$)`" (= default of `anyglob:`) ||
||`re:` ||(always) ||`rootre:` ||(none) (= default of `rootre:`) ||
||`relre:` ||(always) ||`anyre:` ||(none) (= default of `anyre:`) ||
||`path:` ||(always) ||`rootpath:` ||"`(?:/|$)`" (= default of `rootpath:`) ||
||`relpath:` ||(always) ||`cwdpath:` ||"`(?:/|$)`" (= default of `cwdpath:`) ||

At this point, forcing the suffixes below for legacy glob: and relglob: is the same as the current match.py implementation.

  • "$" for pattern

  • "(?:/|$)" for include/exclude

Therefore, aliasing should be easy to emulate.
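One way to picture the aliasing is as a pair of lookup tables. This is a hedged sketch of the proposal, not existing Mercurial code: the names ALIAS_OF, DEFAULT_SUFFIX, FORCED_GLOB_SUFFIX, and resolve_alias are all hypothetical, and the values are copied from the tables in this section.

```python
# Legacy type -> proposed new type.
ALIAS_OF = {
    "glob": "cwdglob", "relglob": "anyglob",
    "re": "rootre", "relre": "anyre",
    "path": "rootpath", "relpath": "cwdpath",
}

# Default suffix of each new type used as an alias target.
DEFAULT_SUFFIX = {
    "cwdglob": "$", "anyglob": "(?:/|$)",
    "rootre": "", "anyre": "",
    "rootpath": "(?:/|$)", "cwdpath": "(?:/|$)",
}

# Legacy glob:/relglob: force their suffix according to the context.
FORCED_GLOB_SUFFIX = {"pattern": "$", "include/exclude": "(?:/|$)"}

def resolve_alias(kind, used_as):
    """Return (new type, suffix) for a legacy type in a given context."""
    new_kind = ALIAS_OF[kind]
    if kind in ("glob", "relglob"):
        return new_kind, FORCED_GLOB_SUFFIX[used_as]
    return new_kind, DEFAULT_SUFFIX[new_kind]

print(resolve_alias("glob", "include/exclude"))  # ('cwdglob', '(?:/|$)')
print(resolve_alias("relglob", "pattern"))       # ('anyglob', '$')
print(resolve_alias("path", "pattern"))          # ('rootpath', '(?:/|$)')
```

Only the two legacy glob types need a context-dependent override. Every other legacy type maps straight onto the default suffix of its new type.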

2.3.3. Control recursion of matching arbitrarily

With current Mercurial (at least, 4.0 or earlier), recursion of each pattern type is controlled by:

||'''mode''' ||'''recursive''' ||
||wildcard ||endswith("**") ||
||regexp ||not endswith("$") ||
||raw string ||always ||

Users can't control recursion of matching with a "raw string" pattern (it matches both directories and files).

Therefore, how about introducing two additional modes, "raw file" and "raw dir"? An additional suffix regexp can control their matching.

||'''mode''' ||'''recursive''' ||'''suffix''' ||
||raw file name ||never ||"`$`" ||
||raw dir name ||always, but matches against only directory ||"`/`" ||

After adding these modes, there are 5 (modes) x 3 (start points) = 15 types

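||'''mode''' ||'''root-ed''' ||'''cwd-ed''' ||'''any-of-path''' ||
||wildcard ||`rootglob:` ||`cwdglob:` ||`anyglob:` ||
||regexp ||`rootre:` ||`cwdre:` ||`anyre:` ||
||raw string ||`rootpath:` ||`cwdpath:` ||`anypath:` ||
||raw file name ||`rootfile:` ||`cwdfile:` ||`anyfile:` ||
||raw dir name ||`rootdir:` ||`cwddir:` ||`anydir:` ||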
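The 15 names are the product of 5 modes and 3 start points, and the two new raw modes differ only in their suffix. The sketch below illustrates both points under assumptions: none of these types exist in Mercurial 4.0, the helper name raw_regex is hypothetical, and only the root-ed case (no prefix) is modelled.

```python
import re

MODES = ["glob", "re", "path", "file", "dir"]
START_POINTS = ["root", "cwd", "any"]

# One row per mode, one column per start point, as in the table above.
TYPES = [[start + mode + ":" for start in START_POINTS] for mode in MODES]
print(sum(len(row) for row in TYPES))  # 15

# Suffixes proposed for the two new raw modes.
RAW_SUFFIX = {"file": "$", "dir": "/"}

def raw_regex(mode, name):
    # Root-ed variant only: no prefix, literal name, mode-specific suffix.
    return re.compile(re.escape(name) + RAW_SUFFIX[mode])

file_re = raw_regex("file", "foo")
dir_re = raw_regex("dir", "foo")
print(bool(file_re.match("foo")), bool(file_re.match("foo/bar")))  # True False
print(bool(dir_re.match("foo")), bool(dir_re.match("foo/bar")))    # False True
```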

2.4. Proposal by Rodrigo

Add rootglob: to work around the issue with -I/-X patterns.

https://patchwork.mercurial-scm.org/patch/17311/

3. Roadmap

{X}

4. See Also


CategoryDeveloper CategoryNewFeatures

FileNamePatternsPlan (last edited 2016-12-05 13:08:40 by YuyaNishihara)