UTF-8 is your friend – Part 1

This is the first of three blog posts explaining why it is crucial to implement UTF-8 character encoding correctly in your web application when dealing with content in various languages or encodings.

Dealing with websites and their respective character encodings is not as easy as it may sound. This is even more true when you retrieve and store content in various character encodings: you can easily run into trouble if you don’t make sure you are handling the encoding correctly. A real-world case is HTTPMon, which monitors websites in many different languages, such as English, German, Greek and French, and so must be able to parse HTML in any character set.

The magic word for all your encoding problems is UTF-8, and the key is to handle your data in this specific Unicode character encoding all the way through, from the backend of your web application to the web browser. By backend, I mean your SQL database and the scripts that run on the server side and process data. I will explain this at three levels of a web application (server-side scripts, database, and HTML) in three consecutive blog posts. This first part focuses on the server-side scripting level.

Let’s assume your server-side script processes data that could be in any character encoding and feeds it to your database for storage. In this case you have to make sure that any data not encoded in UTF-8 is converted to UTF-8 first. The difficulty lies in reliably detecting whether the content is already UTF-8, because in most cases there isn’t a ready-made function or method for this purpose.

For HTTPMon, we mainly use Perl as the server-side scripting language for the backend, but since we had no ready-made function at hand to detect UTF-8, we had to come up with one. Luckily, we didn’t have to do much, because Martin Dürst from the W3C had already come up with a regular expression that does exactly that, in his article on multilingual form encoding. All we did was wrap Martin’s regexp in an easy-to-call Perl subroutine named is_string_utf8, as you can see below.

sub is_string_utf8 {
    my ($metadesc) = @_;    # raw byte string to check

    # Returns 1 only if the whole string is a sequence of well-formed UTF-8 characters.
    return $metadesc =~
        m/\A(?:
           [\x09\x0A\x0D\x20-\x7E]            # ASCII
         | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
         |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
         | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
         |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
         |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
         | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
         |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x ? 1 : 0;
}
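
For comparison, here is a minimal sketch of a different way to run a similar check, using the Encode module that ships with Perl. This is not what HTTPMon uses. The subroutine name is_valid_utf8_via_encode is made up for this illustration. It simply asks Encode to decode the bytes strictly and reports whether that worked.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical alternative to is_string_utf8 (name invented for this sketch):
# returns 1 if $bytes decodes cleanly as strict UTF-8, 0 otherwise.
sub is_valid_utf8_via_encode {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC asks decode() not to modify the input string.
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Well-formed UTF-8 ("café au lait" with é encoded as C3 A9)
print is_valid_utf8_via_encode("caf\xC3\xA9 au lait"), "\n";
# Latin-1 bytes (é as a lone E9), which are not valid UTF-8
print is_valid_utf8_via_encode("caf\xE9 au lait"), "\n";
```

The two checks are not identical: the regular expression only accepts tab, line feed and carriage return among the ASCII control characters, whereas a strict decode also accepts the others.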

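Thanks to Perl’s built-in utf8:: utility functions, we could then simply encode the string with utf8::encode() if our is_string_utf8() subroutine reported that it was not already in UTF-8. Keep in mind that utf8::encode() treats each byte of such a string as a Latin-1 (ISO-8859-1) character, so content in another legacy encoding, such as ISO-8859-7 for Greek, has to be decoded from its actual character set instead. Basically, that’s all you need in your server-side scripts, but make sure this is done correctly, because you will have to store the result in a UTF-8 formatted SQL database, which is the subject of part 2 of the “UTF-8 is your friend” series.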
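
To make the detect-then-convert step concrete, here is a minimal sketch under a few assumptions. The helper name to_utf8_bytes and its $fallback_charset parameter are hypothetical; in practice the charset might come from an HTTP header or a meta tag. It also uses the core Encode module instead of utf8::encode(), so that source encodings other than Latin-1 can be handled. The is_string_utf8 check is repeated only so the snippet runs on its own.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Same check as is_string_utf8 above, repeated so this sketch is self-contained.
sub is_string_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
          [\x09\x0A\x0D\x20-\x7E]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*\z/x ? 1 : 0;
}

# Hypothetical helper: returns UTF-8 bytes, converting only when needed.
# $fallback_charset is an assumption about where the data came from.
sub to_utf8_bytes {
    my ($bytes, $fallback_charset) = @_;
    return $bytes if is_string_utf8($bytes);     # already UTF-8: leave untouched
    my $chars = decode($fallback_charset, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);                  # characters -> UTF-8 bytes
}

my $latin1 = "caf\xE9";                          # "café" in ISO-8859-1
my $utf8   = to_utf8_bytes($latin1, 'ISO-8859-1');
printf "%vX\n", $utf8;                           # hex dump of the resulting bytes
```

Checking first matters because converting data that is already UTF-8 a second time would double-encode every non-ASCII character.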
