X-Nico

10 unusual facts about UTF-8


Climm

It is internationalized, with German, English, and other translations available. It supports sending and receiving acknowledged and non-acknowledged Unicode-encoded messages, and it even understands UTF-8 in message types for which the ICQ protocol does not use that encoding.

Extended SMTP

SMTPUTF8 — Allow UTF-8 encoding in mailbox names and header fields, RFC 6531

International email

International email (IDN email or Intl email) is email that contains international characters, encoded as UTF-8, in the email header; that is, characters that do not exist in the ASCII character set.

Luit

For example, instead of running "ssh legacy-machine", a user may have to run "LC_ALL=fr_FR luit ssh legacy-machine" to properly render French accented characters on a UTF-8 terminal.

The main purpose of luit is to allow "legacy" applications that use character sets other than UTF-8 to work with contemporary terminal emulators.
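The kind of conversion described above can be sketched in Python. This is not luit's implementation, only a minimal illustration of one direction of the transcoding. It assumes the legacy application emits ISO-8859-1 (a common charset for an fr_FR locale), and the helper name to_utf8 is hypothetical.

```python
# Bytes as a legacy application might emit them (assumed charset: ISO-8859-1).
legacy_output = "déjà vu".encode("iso-8859-1")


def to_utf8(raw: bytes, legacy_charset: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from a legacy charset as UTF-8 (hypothetical helper)."""
    return raw.decode(legacy_charset).encode("utf-8")


# What a UTF-8 terminal needs to receive in order to render the accents.
terminal_input = to_utf8(legacy_output)

print(legacy_output)   # single bytes 0xE9 and 0xE0 for the accented letters
print(terminal_input)  # two-byte UTF-8 sequences for the same letters
```

Without the conversion, a UTF-8 terminal would treat the lone bytes 0xE9 and 0xE0 as malformed input, which is why the accented characters fail to render.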

Multicast DNS

Each such string consists of a length byte followed by that many bytes of UTF-8-encoded text.
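A length-prefixed layout of this kind can be sketched as follows. The function names are hypothetical, and the sketch covers only the generic "length byte plus UTF-8 payload" framing; it does not model any other part of the DNS wire format. Note that the length byte counts encoded bytes, so a string with non-ASCII characters occupies more bytes than it has characters.

```python
def encode_length_prefixed(text: str) -> bytes:
    """Encode text as one length byte followed by its UTF-8 bytes."""
    payload = text.encode("utf-8")
    if len(payload) > 255:
        raise ValueError("payload does not fit in a single length byte")
    return bytes([len(payload)]) + payload


def decode_length_prefixed(data: bytes) -> list:
    """Decode a run of consecutive length-prefixed UTF-8 strings."""
    strings = []
    pos = 0
    while pos < len(data):
        length = data[pos]
        chunk = data[pos + 1 : pos + 1 + length]
        if len(chunk) != length:
            raise ValueError("truncated string")
        strings.append(chunk.decode("utf-8"))
        pos += 1 + length
    return strings


# "café" has 4 characters but 5 UTF-8 bytes, so its length byte is 5.
wire = encode_length_prefixed("café") + encode_length_prefixed("printer")
print(wire)
print(decode_length_prefixed(wire))
```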

OpenXDF

OpenXDF requires the use of an XML 1.0-compliant parser that supports UTF-8 and UTF-16.

Standard streams

As they are used for input and output devices, they generally contain text, a sequence of characters in a predetermined encoding, such as Latin-1 or UTF-8.

UTF-8

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties.

Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only bytes where the high bit was set.
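The property described here survives in UTF-8 as it is used today and can be checked with Python's built-in codec. The sample characters and the helper name high_bit_flags are chosen for illustration only.

```python
def high_bit_flags(text: str) -> list:
    """Return, for each UTF-8 byte of text, whether its high bit is set."""
    return [byte >= 0x80 for byte in text.encode("utf-8")]


# One sample per sequence length: 1, 2, 3 and 4 bytes.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), high_bit_flags(ch))

# Consequence: an ASCII byte such as "/" can never occur inside a
# multibyte sequence, so byte-oriented searches for it stay safe.
assert b"/" not in "é€😀".encode("utf-8")
```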


Similar

UTF-8 | UTF-16 | UTF-7

Bush hid the facts

While "Bush hid the facts" is the sentence most commonly presented on the Internet to induce the error, the bug can be triggered by many sentences with characters and spaces in a particular order so that the bytes match the UTF-16LE encoding of valid (if nonsensical) Chinese Unicode characters.
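The byte coincidence can be reproduced directly. The sketch below only reinterprets the ASCII bytes of the sentence as UTF-16LE. It does not reproduce the detection heuristic that made the mistaken guess in the first place.

```python
raw = "Bush hid the facts".encode("ascii")   # 18 bytes
misread = raw.decode("utf-16-le")            # 9 two-byte code units

print(misread)
print([f"U+{ord(ch):04X}" for ch in misread])

# Every resulting code point falls in the CJK Unified Ideographs
# block (U+4E00 to U+9FFF).
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread))
```

Each pair of ASCII bytes becomes one 16-bit code unit with the second byte as the high half, so "Bu" (0x42 0x75) is read as U+7542.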

Comparison of Unicode encodings

The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Tāna and N'Ko), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32.
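These sizes can be verified with Python's codecs. The sketch uses the little-endian codec variants so that no byte order mark is added to the measured length. The helper name bits_per_encoding is hypothetical.

```python
def bits_per_encoding(ch: str) -> dict:
    """Return the encoded size of one character, in bits, per encoding."""
    codecs = {"UTF-8": "utf-8", "UTF-16": "utf-16-le", "UTF-32": "utf-32-le"}
    return {name: 8 * len(ch.encode(codec)) for name, codec in codecs.items()}


# Size of the range U+0080 to U+07FF, inclusive.
print(0x07FF - 0x0080 + 1)

# First code point, last code point, and a few letters from the range.
for ch in [chr(0x0080), "é", "λ", "ж", "א", chr(0x07FF)]:
    print(f"U+{ord(ch):04X}", bits_per_encoding(ch))
```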

Content sniffing

For instance, Internet Explorer 7 may be tricked into running JScript in circumvention of its policy by allowing the browser to guess that an HTML file was encoded in UTF-7.
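Why a UTF-7 guess matters can be seen by decoding a short sample. The payload below is an illustrative string, not one taken from a real attack. It contains no literal angle brackets, so a filter that inspects the raw bytes for "<" finds nothing, yet a UTF-7 decoder turns it into markup.

```python
# Illustrative payload: "<" is written as +ADw- and ">" as +AD4- in UTF-7.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

print(b"<" in payload)          # the raw bytes contain no angle bracket
decoded = payload.decode("utf-7")
print(decoded)                  # the decoded text is a script element
```

Declaring the character encoding explicitly, so that the browser has nothing to guess, is the usual way to rule out this misinterpretation.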

Extended SMTP

Prior to the availability of 8BITMIME implementations, mail user agents employed several techniques to cope with the seven-bit limitation, such as binary-to-text encodings (including ones provided by MIME) and UTF-7.

Netatalk

In October 2004 Netatalk 2.0 was released, which brought major improvements, including: support for Apple Filing Protocol version 3.1 (providing long UTF-8 filenames, file sizes over 2 gigabytes, full Mac OS X compatibility), CUPS integration, Kerberos V support allowing true "single sign-on", reliable and persistent storage of file and directory IDs, and countless bug fixes compared to previous versions.

NewLISP

It also provides the functions expected of a modern scripting language, including support for regular expressions, XML, Unicode (UTF-8), TCP/IP and UDP networking, matrix and array processing, advanced math, statistics and Bayesian statistical analysis, financial mathematics, and distributed computing support.

UTF-16

UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; and the Qt cross-platform graphical widget toolkit.

IBM iSeries systems designate code page CCSID 13488 for UCS-2 character encoding, CCSID 1200 for UTF-16 encoding, and CCSID 1208 for UTF-8 encoding.

UTF-EBCDIC

IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support.