I am currently angry over: Half-arsed character set conversions

Hello, and welcome to Matthew's list of things that have made him angry recently. Previous instances of things that have made Matthew angry are here.

Now, back to the most recent issue.

slrn is a nice terminal-based newsreader with lots of lovely scripting support that makes it wonderfully useful.

Usenet is an international medium in which people post using many different character sets.

Unicode is a character set that includes approximately every written character known to man.

UTF-8 is a standard for encoding Unicode in 8-bit streams.

In order to display text on a screen usefully, two things must be known:

Once these are both known, converting from one to the other is easy. In cases where no one-to-one mapping exists (say, converting ä to 7-bit ASCII), there are various conventional ways to produce a meaningful output (say, a").
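To make that concrete, here is a minimal Python sketch of the idea, assuming both the source and the target character set are already known. The names `convert`, `FALLBACKS` and the `fallback_table` error handler are made up for this illustration, and the one-entry table is just the a" example from above, not a real transliteration scheme.

```python
import codecs

# Hypothetical fallback table for characters the target set cannot represent.
# A real one would be much larger; this holds only the example from the text.
FALLBACKS = {"ä": 'a"'}


def _fallback_handler(err):
    """Replace unencodable characters with a table entry, or '?' if none."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in bad)
    # Return the replacement text and the position at which to resume encoding.
    return replacement, err.end


codecs.register_error("fallback_table", _fallback_handler)


def convert(raw: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source charset, then re-encode for the target."""
    text = raw.decode(source)
    return text.encode(target, errors="fallback_table")


# ä in Latin-1 is the single byte 0xE4.
print(convert(b"\xe4", "latin-1", "utf-8"))  # the two-byte UTF-8 form
print(convert(b"\xe4", "latin-1", "ascii"))  # the a" fallback
```

The point is only that the conversion is two steps, decode then encode, and that the awkward cases are handled by an explicit policy rather than by passing bytes through untouched.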

Now, 8-bit characters are not themselves displayable under UTF-8. A byte with the high bit set signals part of a multi-byte sequence for a non-ASCII Unicode character, and thus any Latin-1 character above 0x7F ends up as two bytes. Passing a raw high-bit Latin-1 byte to a UTF-8 terminal will result in undefined behaviour, as on its own it is not a valid UTF-8 string (if you're really unlucky, it and the following bytes will in fact form a valid UTF-8 string and things will be even more confused).
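A short Python sketch shows both failure modes. The helper `is_valid_utf8` is a name invented for this example, and Python's strict decoder stands in for whatever a real terminal does with bad input, which may well be different.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same character in the two encodings.
LATIN1_A_UMLAUT = "ä".encode("latin-1")  # one byte: 0xE4
UTF8_A_UMLAUT = "ä".encode("utf-8")      # two bytes: 0xC3 0xA4

# A lone Latin-1 ä is not a valid UTF-8 string.
print(is_valid_utf8(LATIN1_A_UMLAUT))

# The unlucky case: the Latin-1 text "Ã¤" is the bytes 0xC3 0xA4, which
# happen to be valid UTF-8, so a UTF-8 reader silently sees "ä" instead.
unlucky = "Ã¤".encode("latin-1")
print(is_valid_utf8(unlucky), unlucky.decode("utf-8"))
```

The second case is the nasty one, because nothing reports an error and the reader simply sees the wrong text.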

So, doing the following is almost certainly going to be wrong:

And yet slrn does all of these things.