Arc Forumnew | comments | leaders | submitlogin
3 points by olavk 6172 days ago | link | parent

Thank you very much for the clarification! People got riled up because it sounded like you didn't want Arc to support unicode ever (or that the current support would be removed). As long as the language is in flux its not a problem.

However, fundamental Unicode support probably has to be in place before release 1.0. It will be painful to add at a later time if backwards compatibility is an issue. For example a lot of string processing code might assume that accessing characters by index is constant time. If the internal representation is changed to eg. UTF-8 this might lead to performance issues. On the other hand, if code assumes that strings are equivalent to byte-arrays, it might lead to trouble if they are changed to arrays of 32bit-values.

I believe the simplest solution is to just have characters be 32bit integers. The internal representation of a string is just an array of 32bit characters. Sure this consumes more space, but who cares? As long as strings are a type seperate from byte-arrays, encoding/decoding issues and can be handled in libraries.



1 point by weeble 6172 days ago | link

I think the point is that, in the presence of combining diacritics, even 32 bits isn't enough. A character is (roughly) one "base" 32-bit code plus zero or more "combining" 32-bit codes. And equality between two characters isn't purely structural - you might re-order its combining codes or use a pre-combined code. (Not all combinations have pre-combined codes.)

I will point out that I know very little about Unicode, so I might be a bit off. I can't say that I'm even very interested in the whole Unicode debate, so long as it all gets sorted out at some point in the future.

-----

1 point by tree 6172 days ago | link

The only reason Unicode contains combined forms is for compatibility with existing standards: you cannot invent new code points representing a novel combination of base and combining characters. The Unicode normalization forms deal with these issues.

Unicode support is a complex issue: fundamentally there are the issues of low-level character representation (e.g., internal representation) followed by library support to handle normalization and higher-level text processing operations.

-----

1 point by olavk 6172 days ago | link

True, I should have said unicode code points rather than characters. I believe the fundamentals is that strings should always be sequences of unicode code points, and shouldn't be conflated with byte arrays. The thorny issues of normalization, comparing, sorting, rendering combined characters and so on could be handled with libraries at a later stage.

-----