Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(steps): add unit test and fix unique col scaling #158

Merged

Conversation

jitingxu1
Copy link
Collaborator

changes:

  1. fix unique col standardization
  2. add unit test

resolve #119

@jitingxu1 jitingxu1 force-pushed the fix-all-unique-value-when-scale branch from cd44247 to d6446c7 Compare September 13, 2024 07:00
@codecov-commenter
Copy link

codecov-commenter commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.73%. Comparing base (6dce35e) to head (b49543b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #158      +/-   ##
==========================================
+ Coverage   88.70%   89.73%   +1.03%     
==========================================
  Files          27       28       +1     
  Lines        2009     2045      +36     
==========================================
+ Hits         1782     1835      +53     
+ Misses        227      210      -17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@deepyaman
Copy link
Collaborator

This doesn't quite fix it; it throws an error. I'm not too keen on it.

Why? Preprocessing can be like a recipe you apply, and you may always want to scale your numeric columns. I don't want to throw an error just because one of those columns is not unique; should figure out how to handle it instead.

See how scikit-learn handles it, for example: https://github.com/betatim/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L82 (huh, not sure how I got to somebody's fork, but whatever... get the point)

If you want to refactor this PR to just add the unit tests, and create an issue to handle scaling without throwing an error, we can do that (but doesn't need to go in today).

(Or you can push off adding the unit tests until address the scaling in a future PR)

@jitingxu1
Copy link
Collaborator Author

This doesn't quite fix it; it throws an error. I'm not too keen on it.

Why? Preprocessing can be like a recipe you apply, and you may always want to scale your numeric columns. I don't want to throw an error just because one of those columns is not unique; should figure out how to handle it instead.

See how scikit-learn handles it, for example: https://github.com/betatim/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L82 (huh, not sure how I got to somebody's fork, but whatever... get the point)

If you want to refactor this PR to just add the unit tests, and create an issue to handle scaling without throwing an error, we can do that (but doesn't need to go in today).

(Or you can push off adding the unit tests until address the scaling in a future PR)

make sense, I will update this PR soon.

@@ -11,6 +11,8 @@
from collections.abc import Iterable

_DOCS_PAGE_NAME = "standardization"
# a small epsilon value to handle near-constant columns during normalization
_APPROX_EPS = 10e-7
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy could get the machine precision for different float precision. I took the float16 precision as all the approximate precision for all float types to bypass the use of numpy.

@jitingxu1 jitingxu1 requested a review from deepyaman September 17, 2024 21:51
@jitingxu1 jitingxu1 force-pushed the fix-all-unique-value-when-scale branch from ac7e48d to b49543b Compare September 17, 2024 21:52
@deepyaman deepyaman merged commit c32d604 into ibis-project:main Sep 25, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(steps): handle col with one unique value in Scale* step
3 participants