Speed up creating and extending packed arrays from iterators up to 63× #1023
Conversation
API docs are being generated and will shortly be available at: https://godot-rust.github.io/docs/gdext/pr-1023
This uses the iterator size hint to pre-allocate, which leads to a 63× speedup in the best case. If the hint is pessimistic, it reads into a buffer to avoid repeated push() calls, which is still 44× as fast as the previous implementation.
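To make the approach concrete, here is a rough sketch of the strategy (my own illustration, not the PR's code: it uses a plain `Vec<i64>` as a stand-in for the Godot packed array and a heap-allocated batch buffer instead of the PR's stack buffer; all names are made up):

```rust
fn extend_from_iter<I: Iterator<Item = i64>>(array: &mut Vec<i64>, mut iter: I) {
    // Fast part: pre-allocate based on the lower bound of the size hint and
    // write directly into the freshly resized tail.
    let (lower, _) = iter.size_hint();
    let old_len = array.len();
    array.resize(old_len + lower, 0);

    let mut written = 0;
    for slot in &mut array[old_len..] {
        match iter.next() {
            Some(item) => {
                *slot = item;
                written += 1;
            }
            None => break,
        }
    }
    // The hint may have been optimistic: shrink back to what was actually written.
    array.truncate(old_len + written);

    // Slow part: the hint was pessimistic and items remain. Batch them in a
    // fixed-size buffer so the array grows once per batch instead of per item.
    const BUF: usize = 512;
    let mut buf = Vec::with_capacity(BUF);
    loop {
        buf.clear();
        buf.extend((&mut iter).take(BUF));
        if buf.is_empty() {
            break;
        }
        array.extend_from_slice(&buf);
    }
}
```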
Thanks a lot, this sounds like a great improvement! 🚀
Could you elaborate on the role of the intermediate stack buffer? Since it's possible to resize the packed array based on size_hint(), why not do that and write directly from the iterator to self.as_mut_slice()?

Also, ParamType::owned_to_arg() no longer occurs in the resulting code; is that not necessary for genericity?
That's what the "Fast part" does. The buffer is only needed if there are more items after that. I guess there might be iterators whose The alternative (which I implemented initially) is to grow the array in increments of 32 elements, write to
Apparently not. We only implement |
If that's the slow part that only happens on "bad" implementations of size_hint(), do you know how often this occurs in practice?
There are at least two categories of iterators that are common in the wild, for which we'd want good performance:

1. Iterators that know their exact length up front and report it via size_hint(), e.g. when iterating over a Vec or slice.
2. Iterators that cannot know their length in advance and report a lower bound of 0, e.g. filter().

This PR is sufficient to handle them both efficiently. We could eliminate the fast part (case 1) and not lose a lot of performance (maybe incur some memory fragmentation), but that's actually the straightforward and obvious part, so the maintainability gain is small. This PR also happens to deal efficiently with anything in between, i.e. iterators that report a nonzero lower bound but may return more elements. One example of those would be a chain() of an exact-size iterator and a filter().
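For context (my addition, not from the thread), this is how size_hint() behaves for a few common iterator shapes in std, matching the categories above:

```rust
fn main() {
    // Exact-size iterator: the lower bound equals the true length.
    let exact = (0..100).map(|i| i * 2);
    assert_eq!(exact.size_hint(), (100, Some(100)));

    // Filtering can't know how many elements survive, so the lower bound is 0.
    let filtered = (0..100).filter(|i| i % 3 == 0);
    assert_eq!(filtered.size_hint(), (0, Some(100)));

    // A chain reports a nonzero lower bound, but the filtered half may yield more.
    let mixed = (0..10).chain((0..100).filter(|i| i % 3 == 0));
    assert_eq!(mixed.size_hint(), (10, Some(110)));
}
```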
Sounds good, thanks for elaborating! The 2 kB buffer (512 ints) is probably also not a big issue, even on mobile/Wasm?
A cursory search shows stack sizes of at least 1 MB on all platforms. If it becomes a problem after all, it's easy enough to adjust.
while let Some(item) = iter.next() {
    buf[0].write(item);
    let mut buf_len = 1;

    for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
If the buffer is full, the iterator is still advanced one more time, but that item is discarded.
Reference: https://doc.rust-lang.org/src/core/iter/adapters/zip.rs.html#165-170
- for (src, dst) in iter::zip(&mut iter, buf.iter_mut().skip(1)) {
+ for (dst, src) in iter::zip(buf.iter_mut().skip(1), &mut iter) {
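For illustration (my addition, not from the thread), a minimal reproduction of why the argument order matters with std's zip:

```rust
use std::iter;

fn main() {
    // `slots` plays the role of the remaining buffer capacity (3 free slots).
    let slots = [(); 3];

    // Data iterator first: zip pulls a 4th element from `items` before it
    // notices that `slots` is exhausted, and silently drops it.
    let mut items = (0..10).peekable();
    let taken: Vec<i32> = iter::zip(&mut items, slots.iter()).map(|(i, _)| i).collect();
    assert_eq!(taken, vec![0, 1, 2]);
    assert_eq!(items.peek(), Some(&4)); // element 3 was lost

    // Bounded side first: zip stops as soon as `slots` runs out,
    // without touching `items` again.
    let mut items = (0..10).peekable();
    let taken: Vec<i32> = iter::zip(slots.iter(), &mut items).map(|(_, i)| i).collect();
    assert_eq!(taken, vec![0, 1, 2]);
    assert_eq!(items.peek(), Some(&3)); // nothing lost
}
```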
😱 Yikes, great catch! Maybe this is why I intuitively wrote it in a more explicit way to begin with. iter::zip looks so symmetrical (which is why I prefer it over Iterator::zip), but in this case that's misleading.

I've updated the test to catch this and rewrote the loop to be more explicit. The new test also caught another bug that all three of us missed: len += buf_len; was missing at the end of the loop. But I'm confident that it is correct now.
Looks good to me. Small issue I noticed: if the iterator panics, the data in the buffer will not be dropped. It's not a safety issue, but it would be nice to drop the buffer properly.

use std::mem::MaybeUninit;
use std::ptr;

struct Buffer<const N: usize, T> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<const N: usize, T> Default for Buffer<N, T> {
    fn default() -> Self {
        Self {
            buf: [const { MaybeUninit::uninit() }; N],
            len: 0,
        }
    }
}

impl<const N: usize, T> Drop for Buffer<N, T> {
    fn drop(&mut self) {
        // Drop the initialized prefix of the buffer in place.
        assert!(self.len <= N);
        if N > 0 {
            unsafe {
                ptr::drop_in_place(ptr::slice_from_raw_parts_mut(
                    self.buf[0].as_mut_ptr(),
                    self.len,
                ));
            }
        }
    }
}
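Not part of the comment above, but for illustration: hypothetical push and is_full methods that would go with such a Buffer, matching how the buffer is used further down in the diff:

```rust
impl<const N: usize, T> Buffer<N, T> {
    /// Appends an item; must not be called when the buffer is full.
    fn push(&mut self, item: T) {
        assert!(self.len < N, "buffer is full");
        self.buf[self.len].write(item);
        self.len += 1;
    }

    fn is_full(&self) -> bool {
        self.len == N
    }
}
```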
Great catch yet again. This looks like a good opportunity to make that buffer a proper type.
This PR now adds a significant amount of complexity, so there's quite a chance that we introduce bugs. Given the performance gains, it's probably OK to iron those out over time, but maybe we should bookmark this in case regressions appear in the future 🙂

But I wonder -- are there no higher-level ways to achieve this with the standard library? It seems to be a pattern that occurs every now and then when implementing Extend... Does anyone know how std or other crates handle it -- also with low-level unsafe all the time?
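For reference, the straightforward safe pattern for a growable container looks roughly like the sketch below (my illustration, not std's actual code); the catch for packed arrays is that every push goes through Godot's FFI:

```rust
/// Reserve the size hint's lower bound up front, then push the remainder.
fn extend_reserve<T, I: Iterator<Item = T>>(vec: &mut Vec<T>, iter: I) {
    let (lower, _) = iter.size_hint();
    vec.reserve(lower);
    for item in iter {
        vec.push(item);
    }
}

fn main() {
    let mut v = vec![1, 2, 3];
    extend_reserve(&mut v, (4..=6).filter(|n| n % 2 == 0));
    assert_eq!(v, vec![1, 2, 3, 4, 6]);
}
```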
while !buf.is_full() {
    if let Some(item) = iter.next() {
        buf.push(item);
    } else {
        break;
    }
}
Could this be simplified with something like
for item in iter.take(N - buf.len()) { ... }
No, take consumes the iterator by value. Maybe something like (&mut iter).take(N - buf.len()) would work, but I think the version above expresses the intent more clearly and is less prone to bugs.
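A quick illustration of that point (my addition): take() by value moves the iterator, while a mutable borrow lets you keep using it afterwards:

```rust
fn main() {
    let mut iter = 0..10;

    // `&mut I` implements `Iterator` whenever `I` does, so this consumes
    // only the first three elements and leaves `iter` usable.
    let first: Vec<i32> = (&mut iter).take(3).collect();
    let rest: Vec<i32> = iter.collect();

    assert_eq!(first, vec![0, 1, 2]);
    assert_eq!(rest, vec![3, 4, 5, 6, 7, 8, 9]);
}
```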
It'll appear in the changelog, right?
Here's the … The problem is that such an implementation would make at least one Godot API call per element, so it wouldn't be any faster than what we currently have. The high-level way would be to collect the iterator into a Vec first.