If the character is sent in an iMessage, for example, the recipient's Messages app will crash when the conversation is opened. Likewise, if the character is pasted into the Safari or Chrome address bar on Mac, the browsers crash. This behavior extends to virtually any system text field on iOS and macOS, resulting in many third-party apps like WhatsApp and Facebook Messenger being affected as well.
Even worse, some users have found that if the character is displayed in an iOS notification, it can cause an entire iPhone or iPad to respring, and in worst-case scenarios, restoring in DFU mode is the only possible solution.
Previously, previously, previously, previously, previously, previously, previously, previously.
How on earth is any valid Unicode character crashing this sort of software in this day and age?!?! This is software that is sold around the world (including, inexplicably, Japan -- because seriously, Japanese smart phones are better for Japanese than iThings), and it can't handle Unicode?
Seriously, I thought that the sensible thing to do would be to have a string handling library that did Unicode natively, and then just hand off to the library anything string related.
This library should not be able to crash the system. It should, if given something invalid, just refuse to display it.
Yes, there is no excuse for this.
But on the other hand, the Unicode standard and the UTF-8 standard (which are not the same thing) are both mind-bogglingly complicated, and properly handling them takes a truly ridiculous amount of code, handling an absurd number of security-sensitive edge cases, which means -- since everything that matters is implemented in languages that don't have automatic bounds checking (thanks again Dennis) -- a vast number of opportunities exist for an overflow to slip in.
And like the, uh, "motivational poster" says,
That UTF-8 stress test document reminds me of the big list of naughty strings
UTF-8 really isn't "mind-bogglingly complicated". It's simpler than most of the half-arsed ideas that had been proposed before it, and it's certainly no more complicated than schemes like RLE. It's a joy to work with compared to UTF-16 or UTF-7, and who doesn't love bytes?
I will agree that Unicode is "mind-bogglingly complicated" but only because it's touching human writing systems, each of which is unnecessarily complicated in its own unique way, like people. Ordinary users of any particular system will often deny it is complicated, or, they'll defend this as necessary, e.g. you will see Latin / Cyrillic users saying case is obviously needed, or Han users claiming the enormous glyph vocabulary isn't a real problem even though they know they don't recognise all the characters and most don't have the education needed to look up an inexplicable new character in a dictionary (try to imagine that, you see a word, you know it's a word in your language, you have a dictionary, but you haven't the first idea how to find it in the dictionary... this really happens to users of Chinese languages).
This bug seems to have been in the layer above all this anyway: it's in the code that turns string data structures into pixel data structures. That's why you can sidestep problems by doing things like deleting a Tweet before you read it. It also looks like a null dereference, which is a goof that we can't blame on Dennis. Without Unicode, that code would have had to be even more complicated for Apple to ship this product around the world.
Ok, it's true, UTF-8 itself isn't so bad, it's only a hundred lines or so to encode and decode between that and Unicode wide characters. It does have some sneaky boundary cases that lead to attacks, but that's tractable.
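To make the "hundred lines or so" concrete, here is a rough sketch of a strict UTF-8 encoder and decoder in Python. The function names `encode_utf8` and `decode_utf8` are made up for illustration, and this is not how any particular library does it. The boundary cases it rejects (overlong encodings, surrogates, stray or missing continuation bytes, values past U+10FFFF) are the ones I'd assume are meant by "sneaky".

```python
def encode_utf8(code_points):
    """Encode an iterable of integer code points as UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points, rejecting malformed input."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # Pick sequence length, payload bits of the lead byte, and the
        # smallest code point that legitimately needs this many bytes.
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0
        elif 0xC0 <= b0 < 0xE0:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 < 0xF0:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 < 0xF8:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            # Covers stray continuation bytes (0x80-0xBF) and 0xF8-0xFF.
            raise ValueError(f"invalid lead byte {b0:#04x} at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for j in range(1, length):
            b = data[i + j]
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"encoded surrogate at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        out.append(cp)
        i += length
    return out
```

The overlong check is the classic attack case: `b"\xc0\xaf"` is a two-byte spelling of "/" that a sloppy decoder accepts and a naive path filter never sees. A strict decoder like the sketch above refuses it instead of guessing.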
And absolutely, Unicode is complicated because the problem is complicated.
But I do feel like there's a lot of stuff lurking in there like, "Hello, speaker of language A, here's a thing from language B that you've never heard of but that will totally fuck your shit up if you don't special-case it properly." My gut tells me that there could have been a bit more "fail safe" in the design, and thus less of that? But maybe not.
"Handling" Unicode in this case isn't so much manipulating the characters as text strings, as almost all people who use it will do, but using them as one of the inputs to a high-performance glyph rendering library (the other input being a font definition with chained tables of glyph dependencies and a bytecode language for hinting how to render at small resolutions), which has to render combinations of glyphs precisely, including special glyphs that affect one or more future glyphs.
The problem here is not that Unicode is complex - it is, but because of lots of cross-cutting concerns. The fact that Cyrillic has characters that look like, but aren't the same as, Latin characters, allowing for visually indistinguishable non-unique strings, is a security concern that has nothing to do with iMessage crashing. The fact that all Latin-alphabet languages in the world except Turkish lowercase "I" to "i" is a localisation concern that has nothing to do with iMessage crashing. That Unicode has both composable and precomposed diacritics and can be "normalised" to one or the other has to do with the Right Way (only one way to do it, which should be the composable way) being compromised by also promising bijective mappings between Unicode and all the world's existing character sets (some of which have precomposed characters); it's a headache, but not a headache that will ever crash iMessage.
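For anyone who hasn't bumped into these three headaches, here is a small Python sketch of each, using only the standard `unicodedata` module. The sample strings and variable names are my own picks for illustration, nothing more.

```python
import unicodedata

# 1. Homoglyphs: Cyrillic "а" (U+0430) renders like Latin "a" (U+0061),
#    so two strings can look identical and still compare unequal.
latin_word = "paypal"
mixed_word = "p\u0430yp\u0430l"
looks_same_but_differs = latin_word != mixed_word  # True

# 2. Case mapping: Python's str.lower() is locale-independent and maps
#    "I" to "i". Turkish expects dotless "ı" (U+0131) here instead, and
#    the Turkish dotted capital "İ" (U+0130) lowercases to "i" plus a
#    combining dot above (U+0307), i.e. two code points.
default_lower = "I".lower()
turkish_expected = "\u0131"
dotted_capital_lower = "\u0130".lower()

# 3. Normalisation: "é" exists precomposed (U+00E9) and as "e" followed
#    by a combining acute accent (U+0301). NFD decomposes, NFC recomposes.
precomposed = "\u00e9"
decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", decomposed)

print(looks_same_but_differs)
print(default_lower == turkish_expected)
print([hex(ord(c)) for c in dotted_capital_lower])
print([hex(ord(c)) for c in decomposed], recomposed == precomposed)
```

None of these involve drawing a single pixel, which is the point: they're string-level concerns and live a layer below the renderer.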
In general, glyph rendering in this day and age (with a single codebase that can render not just Latin script but also CJK languages, Arabic and Devanagari) is complex. Mind-bogglingly complex. Handling UTF-8 correctly, or correctly manipulating Unicode strings is child's play by comparison.
That said, Apple should have a test case that tries to render every possible combination of n glyphs in each of their system-provided fonts, even if it takes a long time to run.
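A toy sketch of what such a brute-force test might look like, in Python. Everything here is hypothetical: `render` stands in for whatever the real text-drawing entry point is, and `find_failing_sequences` and `sequence_count` are names I made up. A real crash would take the whole process down rather than raise an exception, so an actual harness would presumably run each render in a child process and watch the exit status.

```python
from itertools import product


def sequence_count(alphabet_size, n):
    """How many length-n sequences exist over an alphabet of the given size."""
    return alphabet_size ** n


def find_failing_sequences(alphabet, n, render):
    """Feed every length-n sequence to render() and collect the ones that fail.

    `render` is a stand-in callable that takes a string and raises on failure.
    """
    failures = []
    for combo in product(alphabet, repeat=n):
        text = "".join(combo)
        try:
            render(text)
        except Exception as exc:
            failures.append((text, type(exc).__name__))
    return failures


def fake_render(text):
    """Pretend renderer that chokes on one specific two-character sequence."""
    if text == "cb":
        raise ValueError("boom")


print(find_failing_sequences("abc", 2, fake_render))
print(sequence_count(1000, 3))
```

The "long time to run" part is no joke: the count grows as the alphabet size to the power n, so even a made-up font with 1000 glyphs gives a billion sequences at n = 3.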
(fwiw, this bug was fixed already)