Character encoding, character set, and more
December 21st 2009
These things come up in every project I work on, and I have to admit that I have never fully understood every detail of character sets and encodings. So I finally had to find out. In this article I will answer at least the following questions:
- What is character encoding?
- What is a character set?
- What is the difference between character set and character encoding?
- What is Unicode and how does it relate to UTF-8, UTF-16, and UTF-32?
- What is the difference between HTTP Content-Type header and HTML meta tag content-type, and why do we need both?
- What about MySQL connection string parameters, database character set and collation?
Let's get started!
We all use characters to form words. In the magic world of computers we need a system that converts characters into bits. Sure, it would be nice to have only one set of rules that could handle each and every case, but as we know, it is not that simple. While there are older encodings, we will start with the ASCII standard, which was introduced back in the 1960s. It defines 128 code-character pairs, which together form a coded character set that uses 7 bits to represent each character on a medium. The numeric code assigned to each character is called a code point. So, the ASCII standard defines a character repertoire of 128 characters, in which the first 32 code points (0-31) and the last one (127) are reserved for control characters.
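To make the idea of a code point concrete, here is a minimal Python sketch. The helper name `ascii_code_point` is made up for this illustration; it relies only on the built-in `ord()` and `format()` functions and prints the code point and 7-bit pattern for a few characters.

```python
def ascii_code_point(ch: str) -> int:
    """Return the ASCII code point of ch, or raise ValueError if ch is not ASCII."""
    code = ord(ch)
    if code > 127:
        # 7 bits can only hold the values 0..127
        raise ValueError(f"{ch!r} is outside the 128-character ASCII repertoire")
    return code


for ch in ["A", "a", "0", "\n"]:
    code = ascii_code_point(ch)
    kind = "control" if code < 32 or code == 127 else "printable"
    # format(code, "07b") shows the code point as a 7-bit binary pattern
    print(repr(ch), code, format(code, "07b"), kind)
```

A character outside the repertoire, such as "é", makes the helper raise an error, which is exactly the limitation that later character sets were created to remove.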
What are character set and character encoding?
A character set defines the characters available in the set and their code points, whereas a character encoding defines how code points are represented on a medium. In practice the terms are used interchangeably. Historically they were synonyms because the same standard used to define both the characters available and the actual encoding rules. Now things have changed, and one character set can be encoded with several different encodings.
ASCII, ISO-8859-1, and Unicode all use the value 65 for the character "A". However, ASCII uses 7 bits to encode a character whereas ISO-8859-1 uses 8 bits.
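The difference can be seen with a small Python sketch. The helper `try_encode` is a hypothetical name used only here; `str.encode()` and the codec names are standard Python. The character "é" has code point 233 in both ISO-8859-1 and Unicode, yet its bytes differ depending on the encoding.

```python
def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


for ch in ["A", "\u00e9"]:  # "\u00e9" is the character é
    print(
        repr(ch),
        "code point:", ord(ch),
        "ascii:", try_encode(ch, "ascii"),
        "iso-8859-1:", try_encode(ch, "iso-8859-1"),
        "utf-8:", try_encode(ch, "utf-8"),
    )
```

"A" comes out as the same single byte everywhere, while "é" cannot be encoded in ASCII at all, takes one byte in ISO-8859-1, and takes two bytes in UTF-8.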
Unicode, UTF-8, UTF-16 and UTF-32
<a title="Unicode" href="http://unicode.org/">Unicode </a>is a character set. It defines 17 <em>planes</em>, each of which can contain up to 65 536 code points. This gives room for 1 114 112 code points in total. Related characters are collected on the same plane, and most of the characters in the languages spoken today are in the first one, called the <em>Basic Multilingual Plane</em>. Each character in the Unicode character set can be represented using the UTF-8, UTF-16, or UTF-32 character encoding.
- UTF-8 encodes a character using 1-4 bytes. The length of the encoded sequence depends on the character. It is backwards compatible with ASCII, which often makes it the choice today.
- UTF-16 is also a variable-length encoding. It uses either 2 or 4 bytes to represent a character.
- UTF-32 is the most straightforward encoding scheme. It always uses 4 bytes to represent a character.
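The planes and the byte counts can be checked with a short Python sketch. The names `plane_of` and `encoded_lengths` are invented for this example. The "-be" codec variants are used so that Python does not prepend a byte order mark, which would inflate the UTF-16 and UTF-32 lengths.

```python
PLANE_SIZE = 0x10000  # 65 536 code points per plane
PLANE_COUNT = 17


def plane_of(ch: str) -> int:
    """Return the number of the Unicode plane that holds ch (0 is the BMP)."""
    return ord(ch) // PLANE_SIZE


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes ch needs in each UTF encoding."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


print("total code points:", PLANE_COUNT * PLANE_SIZE)

# A, é, the euro sign, and an emoji from outside the Basic Multilingual Plane
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", "plane", plane_of(ch), encoded_lengths(ch))
```

Characters in the Basic Multilingual Plane need 1 to 3 bytes in UTF-8 and 2 bytes in UTF-16, while the emoji from plane 1 needs 4 bytes in all three encodings.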
What is the difference between HTTP Content-Type header and HTML meta tag content-type, and why do we need both?
When HTML content is transferred over a network, both the server and the client need to know how to encode and decode the stream. In the initial HTTP request, the client can send an "Accept-Charset" header, which tells the server how it wants the result encoded. In the response, the server should include a "Content-Type" HTTP header as well as a content-type meta tag inside the HTML, and both should declare the same encoding.
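As a rough sketch of what the server side has to do, the following Python function builds a response whose HTTP header and meta tag agree. The function name `build_html_response` and the returned (headers, body) shape are assumptions made for this illustration, not part of any framework.

```python
import html


def build_html_response(text: str, encoding: str = "utf-8"):
    """Return (headers, body) where the header and the meta tag declare the same encoding."""
    page = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={encoding}">'
        f"</head><body>{html.escape(text)}</body></html>"
    )
    # The bytes on the wire must really be in the declared encoding
    body = page.encode(encoding)
    headers = {
        "Content-Type": f"text/html; charset={encoding}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_html_response("caf\u00e9")
print(headers)
print(body)
```

The important detail is that one `encoding` value drives all three places: the header, the meta tag, and the actual byte conversion.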
There is a <a title="w3tutorial" href="http://www.w3.org/International/tutorials/tutorial-char-enc/">nice article you can read</a>. Basically there are <a title="precedence" href="http://www.w3.org/International/tutorials/tutorial-char-enc/#Slide0400">precedence rules</a> which tell the client which declaration to use. In the case of an XHTML response, the precedence is
- HTTP header
- Meta tag
When the HTTP header is present, the browser will use it. But when the content is later saved to disk, either by a user or by a proxy, the Content-Type HTTP response header is lost. In that case, the character encoding can still be read from the meta tag in the saved file. That is why you need both.
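The precedence can be sketched in Python as follows. This is a simplified illustration, not what a real browser does: the function names are made up, the meta tag is found with a naive regular expression instead of a real HTML parser, and the ISO-8859-1 fallback is an assumption for the case where neither declaration exists.

```python
import re
from email.message import Message

META_CHARSET = re.compile(r"<meta[^>]*charset=[\"']?([\w-]+)", re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased charset name, or None


def charset_from_meta(body: bytes):
    """Look for a charset declaration in a meta tag of the raw HTML bytes."""
    match = META_CHARSET.search(body.decode("ascii", errors="replace"))
    return match.group(1).lower() if match else None


def pick_encoding(content_type, body: bytes, default: str = "iso-8859-1") -> str:
    """Apply the precedence: HTTP header first, then meta tag, then the assumed default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default


saved_page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"></head></html>'
print(pick_encoding("text/html; charset=UTF-8", saved_page))  # header wins
print(pick_encoding(None, saved_page))  # header lost, meta tag is used
```

The second call mimics a page opened from disk: there is no header any more, so the meta tag is the only remaining source of truth.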
In the Java world, the character encoding of the HTTP response can be set using <a title="CharEncoding" href="http://java.sun.com/products/servlet/2.5/docs/servlet-25-mr2/javax/servlet/ServletResponse.html#setCharacterEncoding(java.lang.String)">ServletResponse.setCharacterEncoding()</a> if the default ISO-8859-1 is not suitable. If you have Apache, you can use the <a title="AddCharset" href="http://httpd.apache.org/docs/2.0/mod/modmime.html#addcharset">AddCharset</a> directive.
How do the MySQL database character set and collation fit into this picture?
All text in a database is encoded in some format. MySQL encodes characters into the database using the character set defined for it. A collation is a set of rules for how characters are compared within a character set. The collation tells the sorting engine whether the character "A" comes before or after the character "B".
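The effect of a collation can be imitated in plain Python. This is only an analogy and does not touch MySQL: sorting by raw code points behaves roughly like a binary collation, and sorting with `str.casefold` as the key behaves roughly like a case-insensitive one. The word list is made up for the example.

```python
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Code point order: every uppercase letter sorts before every lowercase letter
binary_order = sorted(words)

# Case-insensitive order: "Apple" and "apple" compare as equal and keep their input order
case_insensitive_order = sorted(words, key=str.casefold)

print(binary_order)
print(case_insensitive_order)
print("a" == "A", "a".casefold() == "A".casefold())
```

The same set of strings ends up in a different order, and even equality changes, purely because of the comparison rules. That is the part of the picture a collation controls.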
MySQL connection string encodings
<a title="MysqlProperties" href="http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html">You can define many properties for a MySQL JDBC connection</a>. For example, you can use the "characterEncoding" property to tell the JDBC driver to encode queries using the given character encoding. You can change the result set encoding by using the "characterSetResults" property.
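As an illustration of where these properties go, the sketch below assembles a JDBC URL as a plain string. It is written in Python only to show the shape of the URL. The helper name `mysql_jdbc_url`, the host `localhost`, the port `3306`, and the database name `mydb` are placeholders. Only the two property names come from the text above.

```python
from urllib.parse import urlencode


def mysql_jdbc_url(host: str, port: int, database: str, **properties) -> str:
    """Build a MySQL JDBC URL with connection properties appended as a query string."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    if properties:
        url += "?" + urlencode(properties)
    return url


url = mysql_jdbc_url(
    "localhost", 3306, "mydb",      # placeholder connection details
    characterEncoding="UTF-8",      # encoding the driver uses for queries
    characterSetResults="UTF-8",    # encoding requested for result sets
)
print(url)
```

In a Java application the resulting string would be handed to the JDBC driver when opening the connection.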
Gimme some pictures!
In the picture below you can see character encoding at work: