Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Fix String.split() implementations #3528

Draft
wants to merge 138 commits into
base: nightly
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Sep 22, 2024

Main issue

Fix String.split() implementations to use a generic implementation and without assuming that indexing is by byte offset. Added all methods to StringLiteral and StringSlice. Some important optimizations were added by parametrizing and avoiding slicing with numeric tricks.

Changes in behavior

This PR changes split("") behavior to be non-raising and return the separated unicode characters analogous to when the whole string has the separator at start, end, and in between every character. Closes #3635

String, StringLiteral, and StringSlice .split() now return a List[StringSlice].

Benchmark results:

CPU: Intel® Core™ i7-7700HQ

improvement metric: markdown percentage improvement ((old_value - new_value) / old_value)

Average improvement for split with a sequence: 91.2486% . In orders of magnitude, this is a 11x improvement
Average improvement for split on any whitespace: 99.9975% . In orders of magnitude, this is a 40k x improvement

Name old_value (ms) new_value (ms) improvement
bench_string_split[1000000] 5.6970877871796 0.503624904955618 0.912486298354641
bench_string_split_none[1000000] 147157.6808882 3.78480079182398 0.999975287456096

@martinvuyk martinvuyk requested a review from a team as a code owner September 22, 2024 23:17
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk changed the title [stdlib] Fix split implementations [stdlib] Fix String.split() implementations Sep 23, 2024
Copy link
Contributor

@msaelices msaelices left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job. I've just add a NIT-pick suggestion.

Also, is it possible to add a unit test?

stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
Co-authored-by: Manuel Saelices <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

Great job. I've just add a NIT-pick suggestion.

Also, is it possible to add a unit test?

Hi, thanks for the review. Any type of test in mind that split tescases don't cover ?

.gitignore Outdated Show resolved Hide resolved
stdlib/src/collections/string.mojo Outdated Show resolved Hide resolved
stdlib/src/utils/string_slice.mojo Outdated Show resolved Hide resolved
@msaelices
Copy link
Contributor

msaelices commented Sep 23, 2024

Great job. I've just add a NIT-pick suggestion.
Also, is it possible to add a unit test?

Hi, thanks for the review. Any type of test in mind that split tescases don't cover ?

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

@martinvuyk
Copy link
Contributor Author

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

I'm not understanding, so the lines from var res7 ... onwards didn't get merged in another PR and you'd like me to add them here?

@msaelices
Copy link
Contributor

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

I'm not understanding, so the lines from var res7 ... onwards didn't get merged in another PR and you'd like me to add them here?

It's just a diff if you want to complete it with more test. LGTM anyways so don't worry. Thanks for that contribution 🥇

@martinvuyk martinvuyk marked this pull request as ready for review December 6, 2024 15:19
@martinvuyk
Copy link
Contributor Author

@JoeLoser I removed all uses of generic origin functions and went straight for unsafe pointer and byte length for the internal split implementation. Would you mind testing this so that we can merge it before the next stable release? (if you haven't already branched from nightly at this point)

@JoeLoser
Copy link
Collaborator

JoeLoser commented Dec 7, 2024

@JoeLoser I removed all uses of generic origin functions and went straight for unsafe pointer and byte length for the internal split implementation. Would you mind testing this so that we can merge it before the next stable release? (if you haven't already branched from nightly at this point)

Happy to see how this looks internally — do you mind rebasing to fix the conflicts so I can sync it? Otherwise the sync won't work.

Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

@JoeLoser fixed all conflicts.

@ConnorGray regarding commit 6acac62, Stringlike is intended for later building more generics on top of it. The uses that are currently in nightly are fine to lower to a concrete type, but once we have parametric traits I'd like to use this trait for encodings (an other things). There is also a solid use case that you can see in this PR which is a generic .find[T: Stringlike](self, sub: T) which allows some very ergonomic code on the user side without implicit casting. The other use case is that I'd like external libraries to be able to conform to this trait and let the stdlib integrate nicely with their custom string types. There are many other functions where I'd like to introduce this trait: .join[T: Stringlike](items: List[T]), strip, endswith, count, replace, and basically every string function which takes in another string. This will also allow us to avoid allocating for non-owning/immutable concrete types interacting with an owning String.

Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants