Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The library doesn't use Levenshtein at all, or at least it's inconsistent #53

Open
R-N opened this issue Jun 5, 2023 · 5 comments
Open

Comments

@R-N
Copy link

R-N commented Jun 5, 2023

This library is described as fuzzy string matching with Levenshtein distance. However, it doesn't seem to use Levenshtein at all?

fuzz.ratio("tide", "diet") returns:

  • 50 with python-Levenshtein installed
  • 25 without python-Levenshtein

This is because python-Levenshtein uses Indel distance for ratio instead of Levenshtein. Issue 47 in python-Levenshtein mentioned that "Levenshtein.ratio is based upon the Indel distance and not on the Levenshtein distance". (That's the correct repo I think). Indel distance is similar to Levenshtein, but they are not the same and may produce different results.

But it doesn't seem to just end there. difflib, the fallback for when you don't have python-Levenshtein installed, doesn't seem to use Levenshtein distance either. Issue 128 in the older repository/version of this package mentioned that difflib uses "Ratcliff-Obershelp algorithm, not the Levenshtein distance".

To conclude:

  • This library doesn't seem to use Levenshtein distance at all.
  • Installing python-Levenshtein doesn't just simply give speed up. It will produce different results because different algorithms are used.

Note: This library uses ratio function from Levenshtein package or difflib for all of its function with ratio in its name.

@maxbachmann
Copy link
Contributor

Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2. Still I agree this is misleading. I think to a part this stems from python-Levenshtein, where Levenshtein.distance calculates the Levenshtein distance, but Levenshtein.ratio calculates the normalized Indel similarity (In my defense this was implemented before I became the maintainer of python-Levenshtein 😅 )

@R-N
Copy link
Author

R-N commented Jun 5, 2023

Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2.

Yeah, I also mentioned they're similar. Still, they're not the same.

I think to a part this stems from python-Levenshtein, where Levenshtein.distance calculates the Levenshtein distance, but Levenshtein.ratio calculates the normalized Indel similarity

Yup. I mentioned that too. I can't help but wonder why. The library is literally named Levenshtein. It could've been an option instead of the default, having "Indel" in the function name like it does for Jaro, or as an optional parameter.

Well, I think the readme should at least be changed to tell people about this. It's where they'd read first and currently it says "It uses Levenshtein Distance" at the very beginning, while it doesn't.

@maxbachmann
Copy link
Contributor

Yes no clue why the original author wrote it like this 🤷‍♂️

@cornzz
Copy link

cornzz commented Jan 9, 2025

This is really confusing to me, I also stumbled upon the fact that different ratios are returned depending on whether python-Levenshtein is installed or not. What the hell, how is this acceptable? @maxbachmann

@maxbachmann
Copy link
Contributor

In thefuzz this is no longer the case for 1.5 years. Since #10 it always uses the same implementation. The basis for ratio is the Levenshtein distance with a substitution weight of 2. This is also known as InDel distance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants