Character Encoding Woes

Recently I’ve been plagued with character encoding issues everywhere I go. People rarely plan for special characters on their website. English does not generally use them, so they often slip the mind of developers. Unfortunately special characters are extremely important, and if you do not cater for them your website can look unprofessional or just plain bad. This is especially true if you have a team of editors working on your site to provide content – content copied from word processors or publishing software can unintentionally contain special characters such as curly quotes ( ‘ ’ “ ” ) and long dashes ( — ).
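To make the problem concrete, here is a small sketch in PHP. It assumes the mbstring extension is loaded, and the variable names are my own. It shows that a left curly quote takes three bytes in UTF-8, and what appears when those bytes are wrongly decoded as Windows-1252.

```php
<?php
// A left curly quote (U+201C) written as its three UTF-8 bytes.
$curlyQuote = "\xE2\x80\x9C";

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byteCount = strlen($curlyQuote);
$charCount = mb_strlen($curlyQuote, 'UTF-8');

// Simulate a system that wrongly assumes the bytes are Windows-1252:
// each byte is turned into a separate character, producing mojibake.
$misread = mb_convert_encoding($curlyQuote, 'UTF-8', 'Windows-1252');

echo "bytes: $byteCount, characters: $charCount, misread as: $misread\n";
```

The misread version is the familiar three-character junk that shows up on pages where the encoding was never agreed on.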

As of PHP 5.6 the default_charset setting defaults to UTF-8, although an explicit value in an existing php.ini will override this. Before 5.6 the setting was empty by default, and before 5.4 functions such as htmlspecialchars() assumed Latin 1 (ISO 8859-1). It seems that the technologies we use are shifting towards more comprehensive and standard character encodings, such as UTF-8 – which is great for us developers! But in the meantime, what does this mean for us?

We have to make a conscious decision about character encoding. Personally I chose UTF-8: it’s comprehensive and easy to switch to, PHP has a lot of functions built around it, and PHP itself is making the change. This will require changes to MySQL and PHP settings. Making this change will also make your cross-platform interactions much easier! This is a big plus in my eyes, as I’ve spent a lot of time migrating data between systems. That can involve different database technologies, new server technologies and different underlying languages, all of which will throw many challenges at you.

HTML

Yes, even your webpage will need to tell the world you are using UTF-8. Without this, web browsers may not interpret your site correctly. Simply add this meta tag to your page in the <head> element:

<meta charset="UTF-8">

Forms can also be told to submit their data as UTF-8:

<form accept-charset="utf-8">
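The accept-charset attribute only asks the browser to submit UTF-8; nothing stops a client from sending other bytes, so it can be worth checking on the server too. A minimal sketch, assuming PHP 7 or later with mbstring available. The helper name isValidUtf8 is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns true when $value is a well-formed UTF-8 string.
 */
function isValidUtf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

// "café" encoded as UTF-8 (é is the two bytes C3 A9).
$good = "caf\xC3\xA9";
// "café" encoded as ISO-8859-1 (é is the single byte E9), which is not valid UTF-8.
$bad = "caf\xE9";

var_dump(isValidUtf8($good)); // bool(true)
var_dump(isValidUtf8($bad));  // bool(false)
```

What you do with a rejected value (refuse it, or try converting it) depends on your site, so I have left that part out.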

XML

If you’re using XML you will also need to specify UTF-8 as its encoding:

<?xml version="1.0" encoding="UTF-8"?>
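If you generate the XML from PHP rather than writing it by hand, the DOM extension can write that declaration for you. A short sketch, assuming the dom extension is enabled; the element name and content are made up for illustration.

```php
<?php
// Create a document whose XML declaration states UTF-8.
$doc = new DOMDocument('1.0', 'UTF-8');

// DOM methods expect UTF-8 input; the text here is "Café" with é as bytes C3 A9.
$root = $doc->createElement('name', "Caf\xC3\xA9");
$doc->appendChild($root);

$xml = $doc->saveXML();
echo $xml;
```

Because the encoding is declared, the é is written out as UTF-8 bytes rather than as a numeric entity.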

PHP

PHP settings – php.ini

If you are using a Unix server then your php.ini file is often found at /etc/php.ini (running php --ini shows which file the command-line PHP actually loads). Set the following:

default_charset = "UTF-8"

Don’t forget to restart your PHP daemon!
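If you can’t edit php.ini (on shared hosting, for example), the same setting can be applied per script and read back to confirm it. A minimal sketch; whether a per-script override suits your setup is an assumption on my part.

```php
<?php
// Per-script fallback for when php.ini cannot be edited.
ini_set('default_charset', 'UTF-8');

// Read the effective value back to confirm what PHP is using.
$charset = ini_get('default_charset');

echo "default_charset is: $charset\n";
```

The ini_get() line on its own is also a quick way to check that your php.ini change was picked up after the restart.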

If you are using a PHP version prior to 5.4.0 then you may need to specify the character encoding for some PHP functions, because until 5.4.0 their default was ISO-8859-1. An example of this is:

htmlspecialchars($string, ENT_NOQUOTES, "UTF-8");
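Here is that call in a small runnable sketch. The sample strings are my own; they are written with byte escapes so the example does not depend on how the file itself is saved.

```php
<?php
// “Tom & Jerry” with the curly quotes written as UTF-8 byte escapes.
$string = "\xE2\x80\x9CTom & Jerry\xE2\x80\x9D";

// Passing the encoding explicitly gives the same result on every PHP version.
$escaped = htmlspecialchars($string, ENT_NOQUOTES, 'UTF-8');

// An ISO-8859-1 "é" (byte E9) is not valid UTF-8, so the result is an empty string.
$rejected = htmlspecialchars("caf\xE9", ENT_NOQUOTES, 'UTF-8');

echo $escaped, "\n";
var_dump($rejected === '');
```

The second call is worth knowing about: if content silently vanishes from a page, input in the wrong encoding is a likely culprit.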

MySQL

Before touching anything to do with data MAKE A BACKUP – this is my number one suggestion whenever you are dealing with data. It does not matter how easily you can replicate it or how important it is. If you make some database changes and your database stops working, you’ve just lost a lot of your time and perhaps some invaluable data.

Make. A. Backup.
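One way to take that backup is mysqldump. The sketch below only builds the command string so you can inspect it before running it. The helper name, database name and file path are hypothetical, and it assumes credentials come from your MySQL client config and that your MySQL version understands utf8mb4.

```php
<?php
/**
 * Hypothetical helper: builds a mysqldump command for one database.
 * Arguments are shell-escaped so unusual names cannot break the command.
 */
function buildDumpCommand(string $database, string $outputFile): string
{
    return sprintf(
        'mysqldump --single-transaction --default-character-set=utf8mb4 --result-file=%s %s',
        escapeshellarg($outputFile),
        escapeshellarg($database)
    );
}

// Example values only; substitute your own database name and path.
$command = buildDumpCommand('my_database', '/tmp/my_database.sql');

echo $command, "\n";
```

Check that the resulting file actually contains your data before you change anything else.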

MySQL settings – my.cnf

If you are using a Unix server, the config file is usually /etc/my.cnf or /etc/mysql/my.cnf (my.ini is the name used on Windows)

[client]
default-character-set=utf8mb4
    
[mysql]
default-character-set=utf8mb4

[mysqld] 
character-set-client-handshake = false # force encoding to utf8mb4
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci

[mysqld_safe] 
default-character-set=utf8mb4

Don’t forget to restart your MySQL daemon!

PHP – MySQL interactions

If you’re not using WordPress, you’ll need to follow the rest of this section. If you are, UTF-8 is already the default; check for this line in your wp-config.php file:

define('DB_CHARSET', 'utf8');

In addition to the config changes above, connections to the MySQL server need to specify the charset. The method here varies based on what you are using. PDO is highly recommended: it provides a lot of protection against SQL injection and malicious interactions with your database, and it encourages good practice with prepared statements and bound variables. Just add the charset to the PDO connection string as shown below:

$connect = new PDO("mysql:host=$host;dbname=$db;charset=utf8mb4", $user, $pass);
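A slightly fuller sketch of that connection, with exceptions switched on so that charset or credential problems are not silent. It uses utf8mb4 so the connection matches the server settings above. The helper names buildDsn and connect are hypothetical, and it assumes PHP 7 or later with the pdo_mysql driver installed.

```php
<?php
/**
 * Hypothetical helper: builds a MySQL DSN that requests utf8mb4 on the connection.
 */
function buildDsn(string $host, string $db, string $charset = 'utf8mb4'): string
{
    return "mysql:host=$host;dbname=$db;charset=$charset";
}

/**
 * Hypothetical helper: opens the connection with exceptions enabled
 * and associative arrays as the default fetch mode.
 */
function connect(string $host, string $db, string $user, string $pass): PDO
{
    return new PDO(buildDsn($host, $db), $user, $pass, [
        PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
    ]);
}

// Only the DSN is printed here; connect() needs a real server and credentials.
echo buildDsn('localhost', 'my_database'), "\n";
```

Keeping the DSN in one helper means the charset is set in a single place instead of being repeated wherever a connection is opened.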

Existing Databases and Tables

If you are applying these changes to an existing database, then you’ll need to modify it, its tables, and its columns:

# just to emphasize that the connection charset is set to `utf8mb4`
SET NAMES utf8mb4;
# For each database: 
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci; 
# For each table: 
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 
# For each column: 
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)
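With more than a handful of tables, writing the per-table statement by hand gets tedious. Below is a sketch that generates the statements for you to review before running them. The helper name and the table names are hypothetical, and it deliberately leaves the per-column statements alone because those depend on each column’s type.

```php
<?php
/**
 * Hypothetical helper: returns one CONVERT statement per table name.
 * Table names are wrapped in backticks, with any backtick inside doubled.
 */
function buildConvertStatements(array $tables): array
{
    $statements = [];
    foreach ($tables as $table) {
        $quoted = '`' . str_replace('`', '``', $table) . '`';
        $statements[] = "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;";
    }
    return $statements;
}

// Example table names; in practice they might come from SHOW TABLES.
foreach (buildConvertStatements(['posts', 'comments']) as $sql) {
    echo $sql, "\n";
}
```

I would print the statements and read them first rather than executing them straight from the script.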

Table Tests

Make sure what you’ve just done worked! Run this SQL statement to check the character set and collation settings of your server, your connection and the current database.

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_general_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_general_ci |
+--------------------------+--------------------+
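The same check can be scripted so that it flags anything that is not utf8mb4. This is a sketch over sample data: the helper name is hypothetical and the rows are made up. With PDO, the real rows could be fetched from the statement above using PDO::FETCH_KEY_PAIR.

```php
<?php
/**
 * Hypothetical helper: given Variable_name => Value pairs, returns the
 * character_set_* entries that are not utf8mb4. character_set_filesystem
 * and character_set_system are skipped because they are expected to differ.
 */
function findNonUtf8mb4(array $variables): array
{
    $ignored = ['character_set_filesystem', 'character_set_system'];
    $problems = [];
    foreach ($variables as $name => $value) {
        // Skip collation rows and the two variables that are allowed to differ.
        if (strpos($name, 'character_set_') !== 0 || in_array($name, $ignored, true)) {
            continue;
        }
        if ($value !== 'utf8mb4') {
            $problems[$name] = $value;
        }
    }
    return $problems;
}

// Sample rows with one deliberate mismatch on the connection.
$sample = [
    'character_set_client'     => 'utf8mb4',
    'character_set_connection' => 'utf8',
    'character_set_filesystem' => 'binary',
    'character_set_system'     => 'utf8',
    'collation_server'         => 'utf8mb4_general_ci',
];

print_r(findNonUtf8mb4($sample));
```

An empty result means every character set that matters is utf8mb4; anything listed points at the setting to revisit.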

Want more information?

Want to know more, perhaps from people who can explain it better than me? See the source links at the bottom of the post for where I found the information in this post (other than personal experience, of course).

Done!

Excellent! I’m now going to try and follow my own guide. If all works out then I’ll leave this post as it is; if I find any errors or omissions I’ll edit the post.

Published by

MHayward

I am a Web Developer who has been creating websites and hacking at WordPress for over 4 years. I'm a graduate of Surrey University where I studied Computer Science.

Find me on LinkedIn: http://uk.linkedin.com/in/mhayward89