replace or remove whoosh #1402

Open
sbrunato opened this issue Nov 15, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@sbrunato
Collaborator

sbrunato commented Nov 15, 2024

Whoosh is not actively maintained and sometimes raises errors when several processes try to write to the index at the same time.

A good option may be to work directly on the product-types configuration dictionary, or to try another library.

Check how TinyDB could replace whoosh in eodag. Usage example:

import re
from eodag import EODataAccessGateway
from tinydb import TinyDB, Query
from tinydb.storages import MemoryStorage


dag = EODataAccessGateway()
pts = dag.list_product_types()

db = TinyDB(storage=MemoryStorage)
db.insert_multiple(pts)

ProductType = Query()

# matches() applies the regex from the start of the string, so the glob "S2_*" is written "S2_.*"
filtered_s2 = db.search(ProductType.ID.matches(r'S2_.*'))
filtered_keywords = db.search(ProductType.keywords.matches(r'.*sentinel.*', flags=re.IGNORECASE))
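
For comparison, the other option mentioned above (working directly on the product-types dictionaries, without any extra library) could look roughly like the sketch below. The sample records are made up for illustration and only assume that each product type has an `ID` and a comma-separated `keywords` string; with eodag they would come from `dag.list_product_types()`.

```python
import re

# Hypothetical sample records standing in for dag.list_product_types();
# the values are illustrative, not copied from eodag's configuration.
product_types = [
    {"ID": "S2_MSI_L1C", "keywords": "MSI,SENTINEL,SENTINEL2,S2,L1C"},
    {"ID": "S2_MSI_L2A", "keywords": "MSI,SENTINEL,SENTINEL2,S2,L2A"},
    {"ID": "S1_SAR_GRD", "keywords": "SAR,SENTINEL,SENTINEL1,S1,GRD"},
    {"ID": "L8_OLI_TIRS_C1L1", "keywords": "OLI,TIRS,LANDSAT,LANDSAT8,L8"},
]

# Compile once, reuse for every record.
s2_pattern = re.compile(r"S2_.*")
sentinel_pattern = re.compile(r"sentinel", flags=re.IGNORECASE)

# Same two filters as the TinyDB example: IDs starting with "S2_",
# and records whose keywords mention "sentinel" (case-insensitive).
filtered_s2 = [pt["ID"] for pt in product_types if s2_pattern.match(pt["ID"])]
filtered_keywords = [
    pt["ID"]
    for pt in product_types
    if sentinel_pattern.search(pt.get("keywords") or "")
]
```

The `pt.get("keywords") or ""` guard is there because a record may lack keywords or carry `None`; whether that happens in the real configuration is an assumption.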
sbrunato added the enhancement (New feature or request) label on Nov 15, 2024
sbrunato changed the title from "replace whoosh with tinydb" to "replace or remove whoosh" on Dec 20, 2024
@sbrunato
Collaborator Author

sbrunato commented Dec 20, 2024

Timing a free-text search over 1676 product types with whoosh, tinydb, and a built-in Python list comprehension:

# whoosh: elapsed time 0.010s
dag.guess_product_type("(Sentinel OR Landsat) AND OPTICAL")

# tinydb: elapsed time 0.013s
[
    pt["ID"] for pt in db.search(
        lambda record: (
            any("Sentinel" in str(value) for value in record.values()) or 
            any("Landsat" in str(value) for value in record.values())
        ) and any("OPTICAL" in str(value) for value in record.values())
    )
]

# built-in python: elapsed time 0.013s
[
    k for k, v in dag.product_types_config.items()
    if (
        any("Sentinel" in str(x) for x in v.values()) or
        any("Landsat" in str(x) for x in v.values())
    ) and any("OPTICAL" in str(x) for x in v.values())
]

The built-in Python list comprehension performs on par with whoosh and tinydb (0.013s versus 0.010s and 0.013s). It avoids the extra dependencies and the whoosh index build.
However, it requires an efficient way to translate a free-text search query into the appropriate Python filtering.
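
One possible way to do that translation is a small recursive-descent parser that turns the query string into a predicate. The sketch below is only an illustration, not eodag's implementation: `compile_query` is a hypothetical helper, it supports only bare terms, `AND`, `OR` and parentheses, and it assumes case-insensitive substring matching on the stringified field values (whoosh's analyzers and implicit operators between adjacent terms are not reproduced).

```python
import re
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Predicate = Callable[[Record], bool]

# A token is a parenthesis or a run of non-space, non-parenthesis characters.
_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def compile_query(query: str) -> Predicate:
    """Hypothetical helper: compile '(a OR b) AND c' into a record predicate."""
    tokens: List[str] = _TOKEN.findall(query)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or() -> Predicate:
        # OR binds loosest: or_expr := and_expr ("OR" and_expr)*
        left = parse_and()
        while peek() == "OR":
            advance()
            right = parse_and()
            left = (lambda a, b: lambda rec: a(rec) or b(rec))(left, right)
        return left

    def parse_and() -> Predicate:
        # and_expr := atom ("AND" atom)*
        left = parse_atom()
        while peek() == "AND":
            advance()
            right = parse_atom()
            left = (lambda a, b: lambda rec: a(rec) and b(rec))(left, right)
        return left

    def parse_atom() -> Predicate:
        # atom := "(" or_expr ")" | TERM
        tok = peek()
        if tok is None or tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r} in query {query!r}")
        advance()
        if tok == "(":
            inner = parse_or()
            if peek() != ")":
                raise ValueError(f"missing ')' in query {query!r}")
            advance()
            return inner
        needle = tok.lower()
        # Assumption: a term matches if it is a substring of any field value.
        return lambda rec: any(needle in str(v).lower() for v in rec.values())

    predicate = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r} in query {query!r}")
    return predicate


# Hypothetical stand-in for dag.product_types_config (values are made up).
product_types_config = {
    "S2_MSI_L1C": {"platform": "SENTINEL2", "sensorType": "OPTICAL"},
    "S1_SAR_GRD": {"platform": "SENTINEL1", "sensorType": "RADAR"},
    "L8_OLI_TIRS_C1L1": {"platform": "LANDSAT8", "sensorType": "OPTICAL"},
    "ERA5_SL": {"platform": "ERA5", "sensorType": "ATMOSPHERIC"},
}

matches = compile_query("(Sentinel OR Landsat) AND OPTICAL")
guessed = [k for k, v in product_types_config.items() if matches(v)]
```

The query is parsed once and the resulting predicate is reused for every record, so the per-record cost stays close to the hand-written comprehension in the benchmark above. Whether substring matching is an acceptable replacement for whoosh's tokenized matching is an open question this sketch does not settle.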
