Have you ever encountered the problem that special characters such as the Euro symbol or an acute accent would not display correctly, and not known how to solve it? Don’t worry, you’re not alone. Character set encodings seem to be a poorly understood topic among Java software developers. Even after having done a lot of research on the subject myself, I cannot claim to fully understand everything there is to know about them. However, what I have learned I would like to share with others on my blog.
Given that there is a lot of ground to cover, I’m going to split this post up into a number of smaller posts. In this first post I will primarily discuss the history of, and theory behind, character set encodings. Later posts will focus more on common pitfalls and on introducing Unicode into your Java EE application.
So without further ado, let’s take a look at the problem at hand.
In the beginning there was ASCII
Conceptually, a character set encoding is extremely simple: it is just a formalized agreement on how to store characters as numbers. We might for instance agree that the lowercase letter a is stored as the number 1, the letter b as the number 2, and so on. If we get enough people to agree with our way of representing letters, we will have created our own encoding.
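As a toy illustration (the class name and the three-letter mapping here are made up for this post, not a real encoding), such an agreement is nothing more than a lookup table:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ToyEncoding {
    // Our made-up agreement: a is stored as 1, b as 2, c as 3
    static final Map<Character, Integer> AGREEMENT = Map.of('a', 1, 'b', 2, 'c', 3);

    public static void main(String[] args) {
        List<Integer> encoded = "abc".chars()
                .mapToObj(c -> AGREEMENT.get((char) c))
                .collect(Collectors.toList());
        System.out.println(encoded); // [1, 2, 3]
    }
}
```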
When the earliest 8-bit computers came about, the most popular encoding was ASCII. ASCII mapped 95 printable and 33 non-printable characters to numbers. You need not concern yourself with what those non-printable characters are for, as it is irrelevant to what I am trying to convey. To the left of this text you will find a four-column table showing, on the left of each column, a number and, on the right, the character it represents. If I wanted to encode the word Java in ASCII, it would become
74, 97, 118, 97.
Example 1: the method testEncodeJava() illustrates this.
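The exact body of testEncodeJava() is in the sample code, but a minimal sketch of the same idea in plain Java (class name mine) could look like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeJava {
    public static void main(String[] args) {
        // Each character maps to the number ASCII assigns to it
        byte[] ascii = "Java".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(ascii)); // [74, 97, 118, 97]
    }
}
```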
For the sake of simplicity you can think of computer memory and disk space as very large lists of numbers; on 8-bit computers the largest possible number you can store in each slot of such a list is 255.
You can see in the table that ASCII defined mappings for only 128 numbers (0–127). Initially everyone was free to create their own mappings for the remaining 128 numbers. Some companies used them to represent mathematical symbols or musical notes. IBM mostly used this “extended range” to encode characters that weren’t part of the Western alphabet, and therefore not included in ASCII, such as Greek and Hebrew characters. For instance, the computers IBM sold in Israel interpreted 130 as the Hebrew letter Gimel (ג), computers sold in America would render it as an e with an acute accent (é), and Greek computers would render it as the Greek capital letter Gamma (Γ). IBM referred to these different uses of the 128–255 range as code pages, and created and documented a great number of them.
You can imagine that having all these different, incompatible character encodings made it extremely difficult to transport text between systems. If someone used an American IBM PC to write his résumé and sent it to Israel, it would arrive as rגsumג: the é is not included in ASCII, and the default codepage on Israeli computers was different.
Example 2: the method testEncodeResume() illustrates this.
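Again, the actual testEncodeResume() method is in the sample code; a sketch of the scenario (class name mine), using the JDK’s IBM437 (US) and IBM862 (Hebrew) charset implementations, might look like this:

```java
import java.nio.charset.Charset;

public class EncodeResume {
    public static void main(String[] args) {
        // Write the text on an American PC using code page 437...
        byte[] bytes = "résumé".getBytes(Charset.forName("IBM437"));
        // ...and read it back on an Israeli PC using code page 862,
        // where byte 130 is the letter Gimel instead of é
        String received = new String(bytes, Charset.forName("IBM862"));
        System.out.println(received);
    }
}
```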
If the recipient had known which codepage was used to author the résumé, he might have been able to switch to that codepage to display the document correctly. But that would mean the program used to display the document would have to support every known codepage. With tens if not hundreds of different codepages already out there, and with new codepages popping up every few weeks, this just wasn’t an option.
To remedy this, the International Organization for Standardization (ISO) created the ISO 8859 standard, which defined a number of standardized code pages. The standard was widely adopted, and the ISO 8859-1 character encoding (also known as Latin-1) is probably the most frequently used character encoding on the internet today.
As an aside, Microsoft Windows never used ISO-8859-1 internally; instead it used a superset of ISO-8859-1 called Windows-1252. This has long been seen as an example of Microsoft’s embrace, extend and extinguish strategy.
More fundamental problems with codepages
However, even with the ISO 8859 standard in place, there are a number of fundamental problems with codepages.
• It’s impossible to combine, for instance, Greek and Hebrew characters in one file because the Hebrew and Greek characters live in different code pages and only one codepage can be active at any given time.
• Some languages, such as Chinese, actually have thousands if not tens of thousands of characters. You can never fit all of these in a 256-number character encoding, let alone in a 128-number extended range on top of ASCII.
To solve the above problems, Unicode was created. Unicode is a continuing effort to create a single character set that includes every possible character on the planet: Greek, Hebrew, Chinese and Latin are all included in one huge set.
Unicode is not all that different from the encodings we previously discussed. As a matter of fact, the first 128 numbers in the Unicode specification correspond to US-ASCII. The major way in which Unicode differs from codepage solutions is in how the numbers that characters map to are eventually stored on some medium.
I already mentioned that you can think of memory and disk space as large lists of numbers that can each hold a value from 0 up to 255. Unicode specifies mappings for many more characters, so how can you store these larger numbers? The initial approach was to store each character as a combination of two numbers. This gave room for 256 × 256 = 65536 characters. Effectively this meant that Unicode documents took up twice as much disk space as their ISO-8859 counterparts. This was seen as a huge drawback and held back the adoption of Unicode for a long time.
The following sequence of bytes illustrates what the UCS-2 encoded string “Java” would look like:
254, 255, 0, 74, 0, 97, 0, 118, 0, 97
The two-byte sequence 254, 255 at the beginning of the encoded string is known as the byte order mark or BOM. It indicates that the encoded characters that follow it use big-endian byte order. 255, 254 indicates little-endian order.
Example 3: the method testEndians() illustrates endianness and the byte order mark.
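The testEndians() method is in the sample code; here is a sketch of the same experiment (class name mine) using Java’s built-in UTF-16 charsets, which behave like UCS-2 for these characters. Java’s UTF-16 encoder writes a big-endian BOM by default:

```java
import java.nio.charset.StandardCharsets;

public class Endians {
    public static void main(String[] args) {
        // Big-endian with BOM: 254 255 0 74 0 97 0 118 0 97
        byte[] withBom = "Java".getBytes(StandardCharsets.UTF_16);
        for (byte b : withBom) {
            System.out.print((b & 0xFF) + " "); // mask to show unsigned values
        }
        System.out.println();

        // Little-endian, no BOM: 74 0 97 0 118 0 97 0
        byte[] littleEndian = "Java".getBytes(StandardCharsets.UTF_16LE);
        for (byte b : littleEndian) {
            System.out.print((b & 0xFF) + " ");
        }
        System.out.println();
    }
}
```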
A more recent solution is UTF-8. In most cases UTF-8 will use up considerably less space than UCS-2. It stores the 128 most frequently used (ASCII) characters in a single byte each; other characters are stored as two or more bytes. The mechanism works through the high bits of each byte: if the highest bit is 0, the byte stands on its own as an ASCII character. Otherwise, the number of leading 1 bits in the first byte indicates how many bytes make up the character, and every following byte of that character starts with the bits 10. This multi-byte scheme lets UTF-8 encode far more characters than the 65536 limit imposed by UCS-2. An additional benefit is that because the first 128 Unicode code points correspond to the ASCII encoding, any valid ASCII file is also a valid UTF-8 file. While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may still be used to mark text as UTF-8; it only identifies a file as UTF-8 and does not state anything about byte order.
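To make the bit patterns concrete, here is a small sketch (class name mine) that prints the two UTF-8 bytes of é. The first byte starts with 110, marking a two-byte sequence, and the second starts with 10, marking a continuation byte:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bits {
    public static void main(String[] args) {
        byte[] bytes = "é".getBytes(StandardCharsets.UTF_8); // code point U+00E9
        for (byte b : bytes) {
            // Left-pad the binary representation to a full 8 bits
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF))
                                .replace(' ', '0');
            System.out.println(bits); // 11000011 then 10101001
        }
    }
}
```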
If you live in a country that does not use a Western alphabet, UTF-8 might not be the best possible encoding, because text written in such countries rarely uses the single-byte ASCII characters.
Example 4: the method testCompareByteSizes() shows how much space various encodings use.
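The testCompareByteSizes() method is in the sample code; a sketch of the comparison (the class name and sample strings are my own) could look like this:

```java
import java.nio.charset.StandardCharsets;

public class CompareByteSizes {
    public static void main(String[] args) {
        String western = "Java";   // ASCII-only text: one byte per character in UTF-8
        String greek = "Ελλάδα";   // Greek text: two bytes per character in UTF-8

        System.out.println(western.getBytes(StandardCharsets.UTF_8).length);  // 4
        System.out.println(western.getBytes(StandardCharsets.UTF_16).length); // 10 (2-byte BOM + 8)
        System.out.println(greek.getBytes(StandardCharsets.UTF_8).length);    // 12
        System.out.println(greek.getBytes(StandardCharsets.UTF_16).length);   // 14 (2-byte BOM + 12)
    }
}
```

For Western text UTF-8 clearly wins; for the Greek string it is still slightly smaller than UTF-16 here, but only because of the BOM overhead on such a short string.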
So what have we established? Codepage-based character encodings such as ISO-8859-1 have a number of fundamental limitations that Unicode elegantly solves. If you live in a Western country, encoding your Unicode text as UTF-8 will probably be the most efficient option.
In my next post on this subject I will explain how you can introduce Unicode into your Java EE application.
You can download the sample code from this blog post.