Foreign Characters
Computers historically have used a rather limited character set defined so as to include American English and little else. Known as '7-bit ASCII', it contains 95 printable characters, with space for another hundred or so. The original 95 do not even include the pound sign, and there are many competing standards for what to do with the other hundred, depending on whether one wishes to add a full Greek alphabet, a full Cyrillic alphabet, or lots of accented Latin characters.
With Windows 2000, Microsoft endorsed an encoding called UTF-16. It was not popular. A single character takes either two or four bytes, and old 7-bit ASCII text is not valid UTF-16 text, despite being half the size of the corresponding UTF-16 text. UTF-16 decoders must also worry about byte order: as the encoding unit is two bytes, platforms differ over whether it is stored least significant byte first or most significant byte first.
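These properties of UTF-16 can be checked directly; a small sketch using Python's standard codec names ('utf-16-le' and 'utf-16-be'):

```python
# Plain ASCII doubles in size: each character becomes two bytes.
assert "cat".encode("utf-16-le") == b"c\x00a\x00t\x00"

# Characters outside the Basic Multilingual Plane take four bytes
# (a surrogate pair), so UTF-16 is variable-length too.
assert len("A".encode("utf-16-le")) == 2
assert len("\U0001F600".encode("utf-16-le")) == 4

# The same character, opposite byte orders:
assert "é".encode("utf-16-le") == b"\xe9\x00"
assert "é".encode("utf-16-be") == b"\x00\xe9"
```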
The UTF-8 standard offers the best of all worlds, for 7-bit ASCII text is valid UTF-8 text, and characters outside that range are represented by two-, three- or four-byte sequences. As its basic unit is the single byte, there are no ordering ambiguities. It supports just over a million different characters.
Font support is inevitably incomplete. A font containing glyphs for all UTF-8 characters would be rather large, and most of it would be little used. Any decent font will contain accented Latin characters and the Greek alphabet, although not all will contain Cyrillic, Arabic or CJK characters. The Euro sign, whose UTF-8 encoding is E2 82 AC, is a rare example of a common European character whose UTF-8 encoding is more than two bytes, and which fonts, particularly older ones, may lack.
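The varying sequence lengths are easy to demonstrate; a brief sketch in Python, including the Euro sign's three-byte encoding quoted above:

```python
assert "A".encode("utf-8") == b"A"             # 1 byte, identical to ASCII
assert "é".encode("utf-8") == b"\xc3\xa9"      # 2 bytes
assert "€".encode("utf-8") == b"\xe2\x82\xac"  # 3 bytes, E2 82 AC as above
assert "\U0001F600".encode("utf-8").hex() == "f09f9880"  # 4 bytes
```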
Many, but not all, modern text-based applications support UTF-8. The email client alpine does, though pine does not. The terminals xterm, gnome-terminal and konsole all do, though mrxvt does not. (Xterm may default to a non-UTF-8-supporting mode. TCM recently changed to make UTF-8 the default.) The word-counting program wc offers a choice of counting bytes (-c) or characters (-m), as the two differ in UTF-8.
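The byte/character distinction that wc has to deal with can be sketched in a couple of lines, mirroring what its -m and -c options count:

```python
s = "défi"  # four characters, one of them accented

assert len(s) == 4                  # characters, what 'wc -m' counts
assert len(s.encode("utf-8")) == 5  # bytes, what 'wc -c' counts
```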
Of course, in general supporting foreign characters is painful. The idea that a single character may be of variable length, and that sorting is really hard, is awkward for programmers. The French think that e and é sort equally, so that 101 (the ASCII value of e) is somehow equal to 50089 (the decimal representation of c3 a9, which is the UTF-8 representation of é). As for sorting when faced with the German ß ('double s' ligature)...
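The arithmetic above can be confirmed directly; a quick sketch:

```python
assert ord("e") == 101                             # ASCII value of e
assert "é".encode("utf-8") == bytes([0xC3, 0xA9])  # UTF-8 form of é

# c3 a9 read as a big-endian integer is indeed 50089:
assert int.from_bytes("é".encode("utf-8"), "big") == 50089
```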
Many basic C functions have no understanding of UTF-8. Exceptions include strcoll(), toupper() and isalpha(). Many programmers prefer anding with 0xdf to calling toupper(), but such tricks work with 7-bit ASCII only.
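Why the 0xdf trick works for ASCII, and how it fails beyond it, is easily shown; a sketch (in Python for brevity, though the bit manipulation is the same as in C):

```python
# Upper- and lower-case ASCII letters differ only in bit 5 (0x20),
# so clearing that bit by anding with 0xdf upper-cases an ASCII letter:
assert chr(ord("a") & 0xDF) == "A"

# Outside 7-bit ASCII the trick fails. Anding 'ÿ' (U+00FF) with 0xdf
# yields 'ß' (U+00DF), whereas its real upper-case form is 'Ÿ':
assert chr(ord("ÿ") & 0xDF) == "ß"
assert "ÿ".upper() == "Ÿ"

# It also mangles non-letters: a digit becomes a control character.
assert chr(ord("0") & 0xDF) == "\x10"
```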
Controlling Anti-Americanism
Various aspects of programs' behaviour can be deflected away from their default Americanisms by environment variables including LANG, LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC and LC_TIME. TCM, since April 2013, has set LC_CTYPE to en_GB.UTF-8 and left the others unset. Setting LC_COLLATE to such a locale results in directory listings sorting themselves with no regard to capitalisation, so files called "README" no longer appear first. The value "POSIX" (or sometimes "C") requests the traditional behaviour: ASCII sorting order, no thousands separator in numbers, and so on.
To find man pages on the subject, try 'man locale'.
UTF-8's Cunning Encoding Scheme
The encoding scheme UTF-8 uses is quite clever. If the first bit of a byte is zero, one has a single-byte character, just as with 7-bit ASCII. If it is one, the byte starts a multibyte sequence, and the number of bytes in the sequence is encoded by the number of ones before the first zero bit: two-byte sequences start 110, three-byte 1110, and four-byte 11110. The continuation bytes all start with their first two bits set to 10. This is easy to decode, although not optimally space-efficient. A two-byte UTF-8 sequence has bits set to 110xxxxx 10xxxxxx, so encodes just 11 bits of real data, and each extra byte adds just five bits to the amount of data encoded (six from the continuation byte, minus the one lost to the longer prefix in the first byte).
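The bit rules above translate almost directly into code. A minimal sketch of a decoder for a single character (a real decoder would also reject overlong, truncated and otherwise invalid sequences):

```python
def utf8_decode_one(data: bytes) -> tuple[int, int]:
    """Decode one character from the start of data using the bit rules
    described above; return (code point, bytes consumed)."""
    b = data[0]
    if b & 0x80 == 0:           # 0xxxxxxx: single byte, plain ASCII
        return b, 1
    elif b & 0xE0 == 0xC0:      # 110xxxxx: two-byte sequence, 5 data bits
        n, cp = 2, b & 0x1F
    elif b & 0xF0 == 0xE0:      # 1110xxxx: three-byte sequence, 4 data bits
        n, cp = 3, b & 0x0F
    elif b & 0xF8 == 0xF0:      # 11110xxx: four-byte sequence, 3 data bits
        n, cp = 4, b & 0x07
    else:
        raise ValueError("not the first byte of a UTF-8 sequence")
    for cont in data[1:n]:      # each continuation byte is 10xxxxxx
        assert cont & 0xC0 == 0x80
        cp = (cp << 6) | (cont & 0x3F)  # six more data bits per byte
    return cp, n

# The Euro sign's E2 82 AC from earlier decodes to U+20AC in three bytes:
assert utf8_decode_one(b"\xe2\x82\xac") == (0x20AC, 3)
assert utf8_decode_one("é".encode("utf-8")) == (0xE9, 2)
assert utf8_decode_one(b"A") == (ord("A"), 1)
```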