Cleaning data before BPE #35

Open
12 tasks done
marianelamin opened this issue Sep 19, 2020 · 2 comments · Fixed by #37 · May be fixed by #70
Labels
enhancement New feature or request

Comments

@marianelamin
Collaborator

marianelamin commented Sep 19, 2020

  • Create a set of data cleaning methods

    • Set to lowercase
    • Replace á é í ó ú with a e i o u, and ñ with gn
    • Remove Emojis
    • Remove mentions
    • Remove hashtags
    • Remove links
    • Remove punctuation: . - : , ?
    • Remove extra spaces
    • Strip leading and trailing whitespace
    • Stemming?
  • Create the Cleaning class. The idea is that each method above belongs to the cleaning class. This can be part of the c4v nlp cleaning library.
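
A minimal sketch of how the checklist above could map onto a single class. The class name `Cleaner`, the method names, the regular expressions, and the order of the steps are all assumptions for illustration, not the project's actual API. The emoji pattern covers only a few common Unicode ranges, and stemming is left out because it is still an open question.

```python
import re


class Cleaner:
    """Illustrative sketch only; every name here is hypothetical."""

    URL_RE = re.compile(r"(?:https?://|www\.)\S+")
    MENTION_RE = re.compile(r"@\w+")
    HASHTAG_RE = re.compile(r"#\w+")
    # Rough, non-exhaustive emoji ranges (pictographs, dingbats, flags, joiners).
    EMOJI_RE = re.compile(
        "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
    )
    # Only the punctuation listed in the issue: . - : , ?
    PUNCT_RE = re.compile(r"[.\-:,?]")
    # Lowercase mappings only, so lowercase() is assumed to run first.
    ACCENTS = str.maketrans(
        {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ñ": "gn"}
    )

    def lowercase(self, text):
        return text.lower()

    def remove_links(self, text):
        return self.URL_RE.sub(" ", text)

    def remove_mentions(self, text):
        return self.MENTION_RE.sub(" ", text)

    def remove_hashtags(self, text):
        return self.HASHTAG_RE.sub(" ", text)

    def remove_emojis(self, text):
        return self.EMOJI_RE.sub(" ", text)

    def replace_accents(self, text):
        return text.translate(self.ACCENTS)

    def remove_punctuation(self, text):
        # Replace with a space so "hola,mundo" does not become "holamundo".
        return self.PUNCT_RE.sub(" ", text)

    def normalize_spaces(self, text):
        # Collapse runs of whitespace and strip both ends.
        return re.sub(r"\s+", " ", text).strip()

    def clean(self, text):
        # Links, mentions and hashtags go before punctuation removal,
        # otherwise stripping "." and ":" would break the URL pattern.
        steps = [
            self.lowercase,
            self.remove_links,
            self.remove_mentions,
            self.remove_hashtags,
            self.remove_emojis,
            self.replace_accents,
            self.remove_punctuation,
            self.normalize_spaces,
        ]
        for step in steps:
            text = step(text)
        return text


if __name__ == "__main__":
    raw = "  Hola @juan, mañana no hay AGUA en Petare 😢 #SinAgua https://t.co/abc123 .  "
    print(Cleaner().clean(raw))
```

Keeping each step as its own method means a step can be dropped or reordered without touching the others, for example if emojis end up being kept for BPE.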

@dieko95
Member

dieko95 commented Sep 19, 2020

@marianelamin We should probably decide how we are going to treat emojis. I'm not sure it makes sense to include them in the byte pair encoding.

Also, I would remove URLs.

Finally, I'm not 100% sure if we should use stemming. CC @Edilmo

@marianelamin marianelamin added the enhancement New feature or request label Sep 19, 2020
@marianelamin marianelamin added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels Sep 19, 2020
@marianelamin
Collaborator Author

marianelamin commented Sep 19, 2020

When removing punctuation, our current method only removes . - : , ?. Should we also remove ! and " or '?
@Edilmo @dieko95
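
One way to keep this question open is to make the punctuation set a parameter. The sketch below assumes a standalone helper; `remove_punctuation`, `DEFAULT_PUNCTUATION`, and `EXTENDED_PUNCTUATION` are hypothetical names, and replacing each character with a space rather than deleting it is a design choice, not something the thread decided.

```python
# The characters the current method removes, per the issue: . - : , ?
DEFAULT_PUNCTUATION = ".-:,?"
# Hypothetical extended set that also covers ! " and '
EXTENDED_PUNCTUATION = DEFAULT_PUNCTUATION + "!\"'"


def remove_punctuation(text, chars=DEFAULT_PUNCTUATION):
    """Replace every character in `chars` with a single space."""
    table = str.maketrans({ch: " " for ch in chars})
    return text.translate(table)


if __name__ == "__main__":
    sample = 'dijo: "no hay agua!"'
    print(remove_punctuation(sample))
    print(remove_punctuation(sample, EXTENDED_PUNCTUATION))
```

With the set passed as an argument, switching between the two behaviours is a one-line change at the call site, and the extra spaces are handled by whichever step collapses whitespace afterwards.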

@marianelamin marianelamin linked a pull request Sep 22, 2020 that will close this issue
@marianelamin marianelamin linked a pull request Apr 22, 2021 that will close this issue
@marianelamin marianelamin linked a pull request Apr 23, 2021 that will close this issue
3 participants