A Quick Look at charset Usage

Posted: September 18, 2006

It is very important to write valid markup, but at the same time it’s just as important to make that markup accessible to as many people as possible. An aspect of validation and accessibility that isn’t often talked about is the requirement of a charset declaration.

I went through my list of bookmarks and randomly picked 100 URLs to analyze. I simply checked whether or not a charset was defined, and if there was, I noted which charset was used. The results are as follows:

utf-8 66%
iso-8859-1 19%
None 14%
utf-8 and iso-8859-1 1%

For the most part, these are pretty promising results. Overall, 86% of the sites declared a valid charset, leaving 14% without one. One site actually had two charset declarations, although I’m not exactly sure what would happen as a result of that (which one would be applied).
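A check like this is easy to automate. The sketch below is not how I gathered the numbers above; it is just one possible way to list every charset a page declares in its meta tags, using only Python's standard library. The names MetaCharsetFinder and find_meta_charsets are made up for illustration, and it only looks at the http-equiv form of the tag.

```python
import re
from html.parser import HTMLParser

# Pulls the value out of "...; charset=utf-8" style strings.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


class MetaCharsetFinder(HTMLParser):
    """Collects every charset declared via <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        match = CHARSET_RE.search(attrs.get("content") or "")
        if match:
            self.charsets.append(match.group(1).lower())


def find_meta_charsets(html):
    """Return all meta-declared charsets in document order (hypothetical helper)."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    finder.close()
    return finder.charsets
```

An empty list means the page falls into the "None" row, and a list with more than one entry is the odd case of a page declaring two charsets.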

What is a charset?

To put it simply, a charset is something you (the designer/developer) declare to tell a user agent how the characters in your document are encoded. It matters most for characters outside of the ASCII set, such as “é” (e with an acute accent), “Δ” (Delta), and other ‘special’ characters.

A charset also determines how characters from other languages are displayed. If you have ever mistakenly ended up at a Chinese or Japanese website and seen only little squares instead of symbols, the cause may have been a missing or incorrect charset declaration. It could also, however, be a result of not having a suitable font on your machine.
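To make that a little more concrete, here is a small Python sketch (my own illustration, with made-up variable names) of what goes wrong when bytes are read with the wrong charset. The same character becomes different bytes in utf-8 and iso-8859-1, and decoding with the wrong one produces garbage rather than an error.

```python
# One character, two encodings: the bytes differ.
text = "\u00e9"                              # é
utf8_bytes = text.encode("utf-8")            # b'\xc3\xa9' (two bytes)
latin1_bytes = text.encode("iso-8859-1")     # b'\xe9' (one byte)

# Reading UTF-8 bytes as iso-8859-1 does not fail; it silently
# turns the single character into two unrelated ones ("Ã©").
garbled = utf8_bytes.decode("iso-8859-1")

# Reading them with the charset they were written in restores the text.
restored = utf8_bytes.decode("utf-8")
```

A user agent is in the same position as the decode call here: without a declaration it has to guess which charset to use.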

That is an extremely simple overview of charset, mostly because character encoding is a science in and of itself. If you’d like to read more about the details of character encoding, I would suggest starting out with the Wikipedia entry on character encoding.

How is a charset Declared?

There are generally two ways of declaring a document charset. One method is declaring it server side: your Web server can send a Content-Type HTTP header that tells the user agent which character encoding the document uses. The other, more common method is to include the information in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
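When the charset is declared server side, it travels in the HTTP Content-Type response header, and the value has the same shape as the content attribute in the meta tag above. As a rough sketch, this is how a client might pull the charset out of such a value using Python's standard library. The function name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the name and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))   # utf-8
print(charset_from_content_type("text/html"))                  # None
```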

There is also something called the Byte Order Mark (BOM), a short sequence of bytes at the very start of a file that can indicate that a document is UTF-8, UTF-16 or UTF-32. Although the Validator says that’s okay, it gives a notification:

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.

Using a BOM isn’t required to declare a utf-8 charset, but some prefer to use it.
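If you are curious whether a file starts with a BOM, the check is simple. The following is only a sketch: detect_bom is a name I made up, while the BOM constants and the "utf-8-sig" codec come from Python's standard library.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


with_bom = codecs.BOM_UTF8 + "<html>".encode("utf-8")
print(detect_bom(with_bom))           # utf-8
print(detect_bom(b"<html>"))          # None
# The "utf-8-sig" codec drops the mark while decoding.
print(with_bom.decode("utf-8-sig"))   # <html>
```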

What’s the Difference Between Them?

From my quick look, the two types of charsets used were utf-8 and iso-8859-1. Both are completely valid and totally usable, but there are differences between the two. For instance, iso-8859-1 is considered to be a legacy character encoding. That still doesn’t stop many people from preferring it over utf-8. Older user agents (browsers or other methods of accessing documents) aren’t able to display utf-8, and some designers/developers see that as a risk and therefore stick with iso-8859-1. As with character encoding in general, there is an abundance of information available on iso-8859 if you’re interested. For instance, you’ll notice that the charset I found to be used was iso-8859-1, not iso-8859. That simply reflects which part of the ISO standard is being used, in this case part 1.

Others prefer the use of utf-8 instead of iso-8859-1. This character encoding method is able to represent any character in the Unicode standard, while retaining the original ASCII encoding (so no conversion is needed), which is attractive to many people. It is also the standard encoding for XML documents (along with utf-16).
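Both points are easy to see for yourself. In this sketch (the helper can_encode is my own, not part of any library), plain ASCII text comes out as the same bytes in all three encodings, while “Δ” can be written in utf-8 but has no place in iso-8859-1.

```python
ascii_text = "charset"

# Pure ASCII text produces identical bytes in all three encodings.
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("utf-8")
    == ascii_text.encode("iso-8859-1")
)


def can_encode(text, charset):
    """Return True if every character in text can be written in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


delta = "\u0394"                            # Δ
print(same_bytes)                           # True
print(can_encode(delta, "utf-8"))           # True
print(can_encode(delta, "iso-8859-1"))      # False
```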

Character encoding isn’t only used on the Web, mind you. To be overly general, every non-binary file on your computer has a character encoding either defined by the application that created it, or your operating system itself. It helps to tell applications what to expect when reading the contents of the file.

Always Declare a Character Encoding

As has been said before, always specify a character encoding. It is important that your content be displayed as you intended, and this is one way to help accomplish that. There is quite a bit of science behind character encoding, and if you’d like to find out a bit more, the Wikipedia entry on character encoding is a good place to start.


Comments

  1. Nice summary! I want to add a few things. The default encoding should be UTF-8, so I’m not sure you have to declare a charset if you use that one. You can also specify a charset on the server side by sending the correct HTTP header. This is seen by some as a bit cleaner than requiring parsers to first read the source (in what charset?) just to determine the charset.

    The other addition I’m a bit unsure about. I had the feeling IE triggered quirks mode when you used a UTF BOM. It’s a little too late here to test it right now, but it’s worthwhile to test that out before deciding to use it.
