I am currently angry over: Half-arsed character set conversions

This issue made me angry on Monday the 4th of September, 2006.

slrn is a nice terminal-based newsreader with lots of lovely scripting support that makes it wonderfully useful.

Usenet is an international medium in which people post using many different character sets.

Unicode is a character set that includes approximately every glyph known to man.

UTF-8 is a standard for encoding Unicode in 8-bit streams.

In order to display text on a screen usefully, two things must be known: the character set the text is encoded in, and the character set the display expects.

Once these are both known, converting from one to the other is easy. In cases where a one-to-one mapping is not available (say, conversion from ä to 7-bit ASCII), there are various conventional ways to produce a meaningful approximation (say, a").
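To make the "easy" part concrete, here is a minimal sketch in Python (not anything slrn itself does). The function name `convert` and the table `FALLBACK_MARKS` are made up for illustration, and the table only covers the umlaut case mentioned above. It assumes a stateless target encoding such as ASCII, Latin-1 or UTF-8.

```python
import unicodedata

# Hypothetical fallback table for this sketch: maps a combining accent to an
# ASCII stand-in. Only the diaeresis (umlaut) is covered here.
FALLBACK_MARKS = {"\u0308": '"'}


def convert(raw: bytes, source: str, target: str) -> bytes:
    """Convert raw bytes from the source character set to the target one.

    Characters the target cannot represent are decomposed into a base letter
    plus accent, and the accent is replaced from FALLBACK_MARKS. Anything
    still unrepresentable becomes '?'.
    """
    text = raw.decode(source)
    pieces = []
    for ch in text:
        try:
            pieces.append(ch.encode(target))
        except UnicodeEncodeError:
            decomposed = unicodedata.normalize("NFD", ch)
            approx = "".join(FALLBACK_MARKS.get(c, c) for c in decomposed)
            pieces.append(approx.encode(target, errors="replace"))
    return b"".join(pieces)


# Latin-1 'ä' to UTF-8 is a lossless conversion.
as_utf8 = convert(b"\xe4", "latin-1", "utf-8")
# Latin-1 'ä' to 7-bit ASCII needs the approximation.
as_ascii = convert(b"\xe4", "latin-1", "ascii")
```

The point is only that both character sets are explicit arguments: once they are known, everything else is mechanical.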

Now, 8-bit characters are not themselves displayable under UTF-8: a byte with the high bit set signals part of a multi-byte sequence encoding a non-ASCII Unicode character, and thus any high-bit Latin-1 character ends up as two bytes once converted. Passing a raw Latin-1 character to a UTF-8 terminal will result in undefined behaviour, as in itself it is not a valid UTF-8 string (if you're really unlucky, it and the following characters will in fact be a valid UTF-8 string and things will be even more confused).
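A quick way to see both failure modes, sketched in Python. The helper `is_valid_utf8` is a name made up for this example; the decoding calls are standard library behaviour.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same character, encoded two ways.
latin1_bytes = "\u00e4".encode("latin-1")  # 'ä' as the single byte 0xE4
utf8_bytes = "\u00e4".encode("utf-8")      # 'ä' as the two bytes 0xC3 0xA4

# The raw Latin-1 byte is not a valid UTF-8 string on its own.
raw_is_valid = is_valid_utf8(latin1_bytes)

# The unlucky case: the Latin-1 text 'Ã¤' happens to be the bytes 0xC3 0xA4,
# which a UTF-8 terminal will happily show as a single 'ä'.
unlucky = "\u00c3\u00a4".encode("latin-1")
unlucky_is_valid = is_valid_utf8(unlucky)
misread_as = unlucky.decode("utf-8")
```

In the first case the terminal gets garbage it can at least recognise as garbage; in the second it gets plausible text that is simply wrong.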

So, doing the following is almost certainly going to be wrong:

And yet slrn does all of these things.