utf8proc seems difficult to use efficiently on strings #101
Comments
(It is possible to use

As far as string length (in characters), you are right, it doesn't have a lot of basic functions like this (although you can easily compute string length by calling

Most of the current utf8proc developers are mainly using it for Julia, so we have only been motivated to implement functions needed for Julia: string normalizations, grapheme detection, some character-related queries. Julia already has a string-length function and UTF8 string iteration and many other functions, so we haven't been motivated to add those to utf8proc.

The "goal" of utf8proc, as far as we've been concerned, is mainly to implement functions that we need that require access to a Unicode character database, but we're open to contributions of other useful functions.

I agree that a fast case-folded/normalized comparison function that requires no buffers seems possible to write and could be useful, even for Julia; a PR would be welcome.
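For illustration, one common way to count code points without any library support is to count the bytes that are not UTF-8 continuation bytes. The sketch below is not part of the utf8proc API: `utf8_count_codepoints` is a hypothetical name, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, not a utf8proc function.
 * Counts code points by counting every byte that is NOT a continuation
 * byte (10xxxxxx). Assumes valid UTF-8; no validation is performed. */
size_t utf8_count_codepoints(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, precomposed "naïve" (6 bytes) counts as 5 code points, while the decomposed form using U+0308 (7 bytes) counts as 6, so the result is a code point count and not a count of user-perceived characters.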
Good point; I missed that.
Length computations are useful when computing the display size required, or maximum potential buffer sizes, or sorting by length, or similar things. I agree they're not nearly as useful as a straightforward strlen(), but they're still needed. Thanks for your thoughts @stevengj!
For display sizes, you definitely don't want the length in codepoints; e.g. combining characters are zero width. You maybe want the sum of the charwidths, but even this is somewhat ambiguous because the displayed charwidth also depends on the font and terminal. I'm also skeptical of sorting by length for much the same reasons.
Note that such a function was implemented in Julia, and could be ported to C: https://github.com/JuliaLang/julia/blob/0f6c72c71bc947282ae18715c09f93a22828aab7/stdlib/Unicode/src/Unicode.jl#L268-L340
Maybe this isn't an appropriate issue; if so, please feel free to close it. I have a string implementation and I need to do some basic UTF8 operations on it: I need to compute the length (in characters, not bytes), compare strings in a case-insensitive way (folding), and convert strings to upper or lowercase. I need these done as efficiently as possible, as this has a real impact on my system. Then there are a few other, more esoteric things I need, such as reversing a UTF8 string, but these don't need to be done super-efficiently.
I really would like something small and I only need UTF8, so ICU is too much.
utf8proc seems like a great per-character interface, but it seems difficult to use efficiently on entire strings. For example, there's no simple, fast string length function. Also, the way that the map functions always allocate new memory and can't be used on existing buffers is a major drawback: it necessitates a lot of extra copying in many situations. It seems like a folded comparison function could be written inside utf8proc a good bit more efficiently. Etc.
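To make the buffer-free comparison idea concrete, here is a rough sketch that decodes one code point at a time from each string, folds it, and compares, without allocating anything. None of these names come from utf8proc: `utf8_decode`, `toy_fold`, and `utf8_casecmp` are hypothetical, and `toy_fold` only handles ASCII and Latin-1 as a stand-in for a fold backed by a real Unicode database. It also assumes one-to-one folding, so expansions such as "ß" to "ss" and any normalization are out of scope here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: reads one code point from s (len >= 1).
 * Returns bytes consumed. Malformed input yields U+FFFD and consumes
 * one byte. Simplified: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    unsigned char b = s[0];
    size_t need;
    uint32_t c;
    if (b < 0x80) { *cp = b; return 1; }
    if ((b & 0xE0) == 0xC0)      { c = b & 0x1F; need = 1; }
    else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; need = 2; }
    else if ((b & 0xF8) == 0xF0) { c = b & 0x07; need = 3; }
    else { *cp = 0xFFFD; return 1; }
    if (len <= need) { *cp = 0xFFFD; return 1; }   /* truncated sequence */
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = c;
    return need + 1;
}

/* Toy one-to-one fold covering ASCII and Latin-1 only (an assumption
 * for this sketch; a real fold would consult Unicode case-folding data). */
uint32_t toy_fold(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 32;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 32;
    return cp;
}

typedef uint32_t (*fold_fn)(uint32_t);

/* Compares two UTF-8 strings by folded code point, streaming through
 * both inputs. Returns <0, 0, or >0 like strcmp. No heap allocation. */
int utf8_casecmp(const char *a, size_t alen,
                 const char *b, size_t blen, fold_fn fold)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    while (alen > 0 && blen > 0) {
        uint32_t ca, cb;
        size_t na = utf8_decode(pa, alen, &ca);
        size_t nb = utf8_decode(pb, blen, &cb);
        ca = fold(ca);
        cb = fold(cb);
        if (ca != cb) return ca < cb ? -1 : 1;
        pa += na; alen -= na;
        pb += nb; blen -= nb;
    }
    if (alen > 0) return 1;    /* a has code points left over */
    if (blen > 0) return -1;   /* b has code points left over */
    return 0;
}
```

Passing the fold as a function pointer keeps the streaming loop separate from the Unicode data, so the toy table could be swapped for a database-backed fold without touching the comparison itself.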
Maybe that's a goal of utf8proc: to provide a character-based interface and have users compose their own higher-level (string-based) algorithms using them: simplicity taking priority over efficiency? And/or perhaps the way Julia uses utf8proc just matches well with the current interface; it doesn't have a need for writing into existing buffers etc.?