-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ux): include basename of path in generated table names in read_*() #10522
base: main
Are you sure you want to change the base?
Conversation
3919f2f
to
54d6794
Compare
23b7669
to
1fc10ca
Compare
1fc10ca
to
6f67bd8
Compare
6f67bd8
to
edee1ea
Compare
basename = re.sub(r"[^a-zA-Z0-9_]", "_", basename) | ||
# MySQL has a limit of 64 characters for table names. | ||
# Let's not give users runtime errors because of this. | ||
basename = basename[-25:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to push this logic down into gen_name()
so we never create a too-long name?
This is an unstable API and shouldn't be relied on.
Failing tests are revealing that our old tests were buggy: we currently verify that register() worked by doing a I think this means we should be more careful about cleaning up after ourselves. I started adding a helper class/context manager for this, but it was getting too complex, and wasn't related to this PR, so I think it should happen in a followup PR. So it's somewhere, here is an idea on what that helper class could look like: class BackendJanitor:
"""Cleans up tables created during a test.
Examples
--------
>>> con = ibis.sqlite.connect(":memory:")
>>> con.create_table("foo", schema=dict(x="int64"))
>>> with BackendJanitor(con) as janitor:
... con.create_table("bar", schema=dict(y="int64"))
>>> con.list_tables()
['foo']
>>> janitor.tables_final
{'bar', 'foo'}
"""
def __init__(self, con: ibis.BaseBackend):
self.con = con
self.tables_initial = set(con.list_tables())
@property
def tables_now(self) -> set[str]:
return set(self.con.list_tables())
@property
def tables_final(self) -> set[str]:
"""The set of tables that existed at the point of the last call to `clean_up`."""
if not hasattr(self, "_tables_final"):
raise ValueError("need to call `clean_up` first")
return self._tables_final
def clean_up(self):
"""Remove all tables created since initialization."""
self._tables_final = self.tables_now
for table_name in self.tables_final - self.tables_initial:
self.con.drop_table(table_name)
def __enter__(self):
return self
def __exit__(self, *args):
self.clean_up() |
ehh, man, I really don't love how I implemented the ensured-cleanup of the created tables. I will do that properly in another PR before/after this if you want. |
Yeah, I would suggest just fixing the tests manually for now and then we can implement something more automatic later. To some degree, any solution here requires humans to do the right thing -- I think this is probably a good job for a context manager that generates table names, so you have something like: with test_table(con) as table_name:
con.create_table(table_name, some_obj) and then we clean up at ah, similar but slightly different: @contextlib.contextmanager
def create_and_destroy_db(con):
con.create_database(dbname := gen_name("db"))
try:
yield dbname
finally:
con.drop_database(dbname) |
backend = table._find_backend() | ||
name = table.get_name() | ||
try: | ||
backend.drop_table(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one example use case that may inform the API we design in #8382. For example, a singular con.drop(table_or_view)
would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not like this idea.
@@ -522,6 +522,26 @@ def gen_name(namespace: str) -> str: | |||
return f"ibis_{namespace}_{uid}" | |||
|
|||
|
|||
def gen_name_from_path(path: str | Path) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea of baking any part of the path into an identifier is a very bad idea.
Lots of software has weird limitations around paths, that we should not bring into Ibis, like the thing you've already discovered about MySQL.
In particular, paths can be arbitrary byte sequences, they're not required to be UTF-8, Latin-1, anything in between, or anything in particular except a sequence of bytes.
Is this implementation prepared to handle that? If not, then we shouldn't do it, because it's adding a new set of unnecessary assumptions. If you need to keep track of what tables map to what files, DuckDB has an argument for that, or if you're not using DuckDB then it's better done upstream in the system that does the reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can write some code to handle the "random sequence of bytes". I think it would just replace those with the empty string, or an underscore, or something. I don't think the behavior here matters that much, as long as it doesn't error, because this should be so rare. I think this means that the implementation should be able to be fairly simple and maintainable.
If I come up with something not terrible would you consider it @cpcloud? I strongly still like the main idea of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like the idea of replacing sequences of bytes with anything.
What are we gaining by doing any of this, that makes it a worthwhile maintenance tradeoff? In other words, why should we bake in new assumptions about paths?
Right now, I don't see an improvement from all the effort here that justifies the maintenance burden.
I often read in many files, and then when I do
con.tables
, it is not obvious which is which.This should make it a little more user-friendly.
I don't think think we need to be perfect here, I was just aiming for slightly better. For example, if you pass a collection of paths, then we arbitrarily pick the first one to be representative. Same thing with globs, the generated name isn't always the prettiest. If the user wants more precise semantics, they can name the table themselves.
I also simplified the polars.read_parquet() logic while I was in there, it was needed in order to make this change.
I considered a different API of
def gen_name_from_paths(*paths: str | Path, file_type: str) -> str:
, but decided that the choosing of the first file should be the caller's responsibility. IDK, I can switch to this if you like.