Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File.match? has an issue with paths #15319

Open
wolfgang371 opened this issue Jan 3, 2025 · 14 comments
Open

File.match? has an issue with paths #15319

wolfgang371 opened this issue Jan 3, 2025 · 14 comments
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:text
Milestone

Comments

@wolfgang371
Copy link

This is how Crystal behaves:

# Crystal 1.14.0 [dacd97bcc] (2024-10-09), LLVM: 18.1.6, Default target: x86_64-unknown-linux-gnu
# https://crystal-lang.org/api/1.14.0/File.html#match%3F%28pattern%3AString%2Cpath%3APath%7CString%29%3ABool-class-method
p File.match?("*.x", "a.x") # true
p File.match?("*.x", "a/b/c.x") # false, seems odd
p File.match?("**/*.x", "a/b/c.x") # false, seems odd
p File.match?("**.x", "a/b/c.x") # true, seems odd wrt. docs

In comparison to Ruby:

# ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu] - old, I know...
# https://ruby-doc.org/3.3.5/File.html#method-c-fnmatch
p File.fnmatch?("*.x", "a.x") # true
p File.fnmatch?("*.x", "a/b/c.x") # true
p File.fnmatch?("**/*.x", "a/b/c.x") # true
p File.fnmatch?("**.x", "a/b/c.x") # true
@wolfgang371 wolfgang371 added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Jan 3, 2025
@straight-shoota
Copy link
Member

straight-shoota commented Jan 4, 2025

The behaviour of the patterns with double wildcard is very clearly an error: **/ is supposed to match directories recursively (such as a/b/), and *.x is supposed to match the last path segment. The pattern **/c.x matches correctly, but **/*.x does not. This definitely needs to be fixed.

On the other hand **. is not supposed to match directories recursively. The double wildcard should only match separators when it's the entire path segment.

There could be some argument that *.x rightfully does not match a/b/c.x because * excludes /. This depends on whether the pattern is required to match the full search string or only a substring (c.x).

Implementations such as Ruby and Python seem to allow substring matching, while Golang behaves similar as Crystal:

import fnmatch
fnmatch.fnmatch("a/b/c.x", "*.x")    # => True
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	fmt.Println(filepath.Match("*.x", "a/b/c.x"))     // => false
}

Note: filepath.Match in Golang doesn't support double wildcard (**) so the other examples there rightfully return false for these pattern rules.

@straight-shoota
Copy link
Member

And fnmatch:

#include <stdio.h>
#include <stdlib.h>
#include <fnmatch.h>

int main() {
    printf("*.x: %d\n", fnmatch("*.x", "a/b/c.x", FNM_PATHNAME));       // => 1
    printf("**/*.x: %d\n", fnmatch("**/*.x", "a/b/c.x", FNM_PATHNAME)); // => 1
    printf("**.x: %d\n", fnmatch("**.x", "a/b/c.x", FNM_PATHNAME));     // => 1
}

@wolfgang371
Copy link
Author

Looks like everybody is doing something different..
In Ruby, with this option the results are different:

p File.fnmatch?("*.x", "a.x", File::FNM_PATHNAME) # true
p File.fnmatch?("*.x", "a/b/c.x", File::FNM_PATHNAME) # false
p File.fnmatch?("**/*.x", "a/b/c.x", File::FNM_PATHNAME) # true
p File.fnmatch?("**.x", "a/b/c.x", File::FNM_PATHNAME) # false

I find this (standardized...) option overly complex. Regexp matches don't care about extra pre- or postfixes, this is what I was expecting somehow from this method as well.

On the other hand **. is not supposed to match directories recursively. The double wildcard should only match separators when it's the entire path segment.

Could you give me an example? I struggled on this explanation in the docs already. Thanks!

@oprypin
Copy link
Member

oprypin commented Jan 5, 2025

@straight-shoota Python's fnmatch is implemented in a way that '/' is a totally ordinary character like 'a'. You're supposed to pass only single path segments to it

@oprypin
Copy link
Member

oprypin commented Jan 5, 2025

The only comparable method in Python is https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.full_match

@oprypin
Copy link
Member

oprypin commented Jan 5, 2025

I've made a script to compare all programming languages I could think of :D

The script
#!/bin/bash

data='
"*.x", "a.x"
"*.x", "a/b/c.x"
"**/*.x", "a/b/c.x"
"**.x", "a/b/c.x"
'

pr --merge --omit-header -s'|' <(echo) <(
  echo
  echo "---"
  while IFS= read -r line; do
    test -n "$line" && echo "\`$line\`"
  done <<< "$data"
) <(

echo "Crystal"
echo "---"
(
  while IFS= read -r line; do
    test -n "$line" && echo "p File.match?($line)"
  done <<< "$data"
) | crystal eval

) <(

echo "Python"
echo "---"
(
  echo '
import pathlib

def matches(pattern, path):
  return pathlib.PurePath(path).full_match(pattern)
'
  while IFS= read -r line; do
    test -n "$line" && echo "print('true' if matches($line) else 'false')"
  done <<< "$data"
) | python

) <(

echo "C"
echo "---"
(
  echo '
#include <stdio.h>
#include <stdlib.h>
#include <fnmatch.h>
int main() {
'
  while IFS= read -r line; do
    test -n "$line" && echo "printf(\"%s\n\", fnmatch($line, FNM_PATHNAME) ? \"true\" : \"false\");"
  done <<< "$data"
  echo '}'
) > test.c && gcc test.c -o test_c && ./test_c

) <(

echo "Rust"
echo "---"
echo '
[package]
name = "test"
edition = "2021"
[dependencies]
glob = "0.3.2"
[[bin]]
name = "test"
' > Cargo.toml
mkdir -p src/bin/test
(
  echo '
extern crate glob;
use glob::Pattern;

fn matches(pattern: &str, path: &str) -> Option<bool> {
  return Some(Pattern::new(pattern).ok()?.matches(path));
}

fn main() {
'
  while IFS= read -r line; do
    test -n "$line" && echo "println!(\"{}\", match matches($line) { Some(true) => \"true\", Some(false) => \"false\", None => \"error\" });"
  done <<< "$data"
  echo '}'
) > src/bin/test/main.rs
cargo run --quiet

) <(

echo "Ruby"
echo "---"
(
  while IFS= read -r line; do
    test -n "$line" && echo "p File.fnmatch?($line)"
  done <<< "$data"
) | ruby
) <(

echo "Go"
echo "---"
(
echo '
package main

import (
  "fmt"
  "path/filepath"
)

func matches(pattern, path string) string {
  if v, e := filepath.Match(pattern, path); e != nil {
    return "error"
  } else {
    return fmt.Sprintf("%v", v)
  }
}

func main() {
'
  while IFS= read -r line; do
    test -n "$line" && echo "fmt.Println(matches($line))"
  done <<< "$data"
echo '}'
) > test.go && go run test.go

) <(echo)
Crystal Python C Rust Ruby Go
"*.x", "a.x" true true false true true true
"*.x", "a/b/c.x" false false true true true false
"**/*.x", "a/b/c.x" false true true true true false
"**.x", "a/b/c.x" true false true error true false

I think the Python example is the only one that makes sense

@straight-shoota
Copy link
Member

straight-shoota commented Jan 5, 2025

Thanks for putting together that comparison.
What's the difference between example 2 and 5 though? They appear to be identical.

I agree that Python's behaviour seems to be the most reasonable. And I believe that Crystal is actually intended to work the same but the implementation is buggy.

@oprypin
Copy link
Member

oprypin commented Jan 5, 2025

Oops, deleted that example

@straight-shoota
Copy link
Member

@wolfgang371

On the other hand **. is not supposed to match directories recursively. The double wildcard should only match separators when it's the entire path segment.

Could you give me an example? I struggled on this explanation in the docs already. Thanks!

Essentially, ** can only appear between directory separators (/), or at the begin or end of the pattern.
If it's preceded or followed by any other character than /, it's just two single wildcards, without recursive properties.

@wolfgang371
Copy link
Author

thanks, so like this

  • /**/ somewhere in the string -> valid
  • **/ at the beginning -> valid
  • /** at the end -> valid
  • ** all other cases -> invalid, treated as *

@oprypin
Copy link
Member

oprypin commented Jan 6, 2025

I have found more reasonable implementations for Go, Rust (external libs) and Ruby (pass FNM_PATHNAME) and updated the comparison.

The script
#!/bin/bash

lines=(
  '"*.x", "a.x"'
  '"a/b/*.x", "a/b/c.x"'
  '"a/b/**", "a/b/c.x"'
  '"a/**", "a/b/c.x"'
  '"a/**/*", "a/b/c/d.x"'
  '"a/**/d.x", "a/b/c/d.x"'
  '"**/*.x", "a/b/c.x"'
  '"a/b**/d.x", "a/bb/c/d.x"'
  '"a/**b/d.x", "a/bb/c/d.x"'
  '"a/b**/*", "a/bb/c/d.x"'
  '"**.x", "a/b/c.x"'
  '"*.x", "a/b/c.x"'
  '"c.x", "a/b/c.x"'
  '"b/*.x", "a/b/c.x"'
)

pr --merge --omit-header -s'|' <(echo) <(
  echo 'Pattern|Path'
  echo '-------|----'
  for line in "${lines[@]}"; do
    (IFS=', '; printf '`%s`|`%s`\n' $line)
  done
) <(

echo 'Crystal'
echo '-----'
(
  for line in "${lines[@]}"; do
    echo "p File.match?($line)"
  done
) | crystal eval

) <(

echo 'Go[^1]'
echo '-----'
(
touch go.mod
go mod edit -module=test
go get github.com/gobwas/glob
echo '
package main

import (
  "fmt"
  "github.com/gobwas/glob"
)

func matches(pattern, path string) string {
  if g, e := glob.Compile(pattern, '"'/'"'); e != nil {
    return "error"
  } else {
    return fmt.Sprintf("%v", g.Match(path))
  }
}

func main() {
'
  for line in "${lines[@]}"; do
    echo "fmt.Println(matches($line))"
  done
echo '}'
) > test.go && go run test.go

) <(

echo 'Python'
echo '-----'
(
  echo '
import pathlib

def matches(pattern, path):
  return pathlib.PurePath(path).full_match(pattern)
'
  for line in "${lines[@]}"; do
    echo "print(matches($line))"
  done
) | python

) <(

echo 'Rust[^2]'
echo '-----'
echo '
[package]
name = "test"
edition = "2021"
[dependencies]
glob-match = "0.2.1"
[[bin]]
name = "test"
' > Cargo.toml
mkdir -p src/bin/test
(
  echo '
extern crate glob_match;
use glob_match::glob_match;

fn main() {
'
  for line in "${lines[@]}"; do
    echo "println!(\"{}\", glob_match($line));"
  done
  echo '}'
) > src/bin/test/main.rs
cargo run --quiet

) <(

echo 'Ruby'
echo '-----'
(
  for line in "${lines[@]}"; do
    echo "p File.fnmatch?($line, File::FNM_PATHNAME | File::FNM_NOESCAPE)"
  done
) | ruby
) <(

echo 'C'
echo '-----'
(
  echo '
#include <stdio.h>
#include <stdlib.h>
#include <fnmatch.h>
int main() {
  int matched;
'
  for line in "${lines[@]}"; do
    echo "  matched = fnmatch($line, FNM_PATHNAME | FNM_NOESCAPE);"
    echo '  printf("%s\n", matched == 0 ? "true" : matched == FNM_NOMATCH ? "false" : "error");'
  done
  echo '}'
) > test.c && gcc test.c -o test_c && ./test_c

) <(echo) | sed -e 's/\btrue\b/✅/Ig' -e 's/\bfalse\b/❌/Ig' -e 's/\berror\b/🚨/Ig'

echo
echo '[^1]: https://github.com/gobwas/glob'
echo '[^2]: https://github.com/devongovett/glob-match'
Pattern Path Crystal Go1 Python Rust2 Ruby C
"*.x" "a.x"
"a/b/*.x" "a/b/c.x"
"a/b/**" "a/b/c.x"
"a/**" "a/b/c.x"
"a/**/*" "a/b/c/d.x"
"a/**/d.x" "a/b/c/d.x"
"**/*.x" "a/b/c.x"
"a/b**/d.x" "a/bb/c/d.x"
"a/**b/d.x" "a/bb/c/d.x"
"a/b**/*" "a/bb/c/d.x"
"**.x" "a/b/c.x"
"*.x" "a/b/c.x"
"c.x" "a/b/c.x"
"b/*.x" "a/b/c.x"

Footnotes

  1. https://github.com/gobwas/glob

  2. https://github.com/devongovett/glob-match

@straight-shoota
Copy link
Member

straight-shoota commented Jan 15, 2025

The algorithm for File.match? is based on https://research.swtch.com/glob
Unfortunately, this does not cover double wildcards. And those actually do add quite a bit of complexity. Multiple single wildcards in a path segment are only relevant for matches within that segment. But multiple double wildcards or combinations of double and single wildcards need to consider multiple paths.

This needs some more research to figure out how (or even if) we can implement double wildcards correctly with this algorithm, or if we need a different solution.
It might be an option to translate glob patterns to regular expressions, as apparently Python does (although I'm not sure if that applies to pathlib.PurePath).

@wolfgang371
Copy link
Author

how about just reusing (parts of) Dir.glob?

  # Returns an array of all files that match against any of *patterns*.
  #
  # ```
  # Dir.glob "path/to/folder/*.txt" # Returns all files in the target folder that end in ".txt".
  # Dir.glob "path/to/folder/**/*"  # Returns all files in the target folder and its subfolders.
  # ```
  # The pattern syntax is similar to shell filename globbing, see `File.match?` for details.

@RX14
Copy link
Contributor

RX14 commented Jan 21, 2025

I think I find the python behaviour the most reasonable in the comparison, or raising an error when ** would be treated as *. Modifying this behaviour has got to be done very carefully though as it risks creating some very subtle breaking changes in user programs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:text
Projects
None yet
Development

No branches or pull requests

4 participants