Main

1) Making sure the HTML file is saved as UTF-8
2) Changing the charset of the HTML file in the <head> section
3) Sending a header via PHP
4) Using the mbstring extension for PHP
5) The database

Extra

1) Communication with Doctrine
2) Communication with AMFPHP

1) Making sure the HTML file is saved as UTF-8

Displaying UTF-8 characters in the browser.

First we make sure that when we save an HTML file that contains UTF-8 characters, it really is saved as UTF-8 text. I use Eclipse as my IDE, where the setting is found at the following location:

  • Windows: Window > Preferences… > General > Workspace : Text file encoding
  • Mac: Eclipse > Preferences… > General > Workspace : Text file encoding

Check the ‘Other’ radio button and choose ‘UTF-8’ from the dropdown list. Press the ‘OK’ button.

2) Changing the charset of the HTML file in the <head> section

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>

With this META tag in place, a static HTML file that is saved as UTF-8 will display its UTF-8 characters correctly in the browser.

Using a PHP server to generate a page

When a PHP server processes the PHP code and returns HTML, there is another catch to keep in mind. The server may have a default charset setting of its own, and the charset it sends in the HTTP Content-Type header takes precedence over the META tag in the page, which can garble the UTF-8 characters. We solve this by sending the header ourselves in the PHP script.

3) Sending a header via PHP

At the very top of our page, before any output, we put the following PHP code:

header('Content-type: text/html; charset=utf-8');

The server which processes the page now declares the response as UTF-8, so the client decodes the UTF-8 characters correctly.
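If it is not obvious whether output has already started, the call can be guarded. A minimal sketch, where sendUtf8Header is a hypothetical helper name that is not part of the original article:

```php
<?php
// Hypothetical helper: sends the UTF-8 Content-Type header if that is still possible.
function sendUtf8Header(): bool
{
    // headers_sent() is true once any output has already gone to the client,
    // at which point header() can no longer change the response headers.
    if (headers_sent()) {
        return false;
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

// Call this before any echo, print, or literal HTML in the script.
$headerWasSent = sendUtf8Header();
```

A false return value usually means that some whitespace or HTML was emitted before the opening PHP tag.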

Manipulating the UTF-8 data in PHP

Everybody uses the strlen() function to get the length of a string, but there is a problem when using it on a UTF-8 string: strlen() counts the number of bytes instead of the number of characters. So when we execute the command: echo strlen('être');
It will return 5 and not 4 as we expected. This is because the character ê takes 2 bytes in UTF-8.

Obviously this is not the behavior we want.
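To see where the extra byte comes from, the bytes of the string can be inspected directly. A small sketch; the string is written with explicit byte escapes so the result does not depend on how the script file itself is saved:

```php
<?php
// "être" with ê spelled out as its two UTF-8 bytes (0xC3 0xAA).
$word = "\xC3\xAAtre";

echo strlen($word), "\n";   // 5: strlen() counts bytes
echo bin2hex($word), "\n";  // c3aa747265: two bytes for ê, one each for t, r, e
```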

4) Using the mbstring extension for PHP

The solution to this problem is the ‘mbstring’ extension, which has to be enabled in PHP (it is a PHP extension, not an Apache module).
It provides a set of functions that can operate on multibyte strings (such as UTF-8).
More info can be found at http://be2.php.net/mb_string

So instead of using strlen() we use mb_strlen().
But here too there is a catch: ‘mbstring’ has an internal encoding setting, which its functions use when no encoding is passed.
We have to set the internal encoding to UTF-8.

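There are 2 ways of doing this:

1) pass the string ‘UTF-8’ as the second parameter to the function
echo mb_strlen('être', 'UTF-8');

2) or change the ini setting via the command
ini_set('mbstring.internal_encoding', 'UTF-8');
Note that this ini setting is deprecated as of PHP 5.6; calling mb_internal_encoding('UTF-8'); sets the same internal encoding without touching the ini setting.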
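The same applies to the other string functions. A short sketch of a few mbstring counterparts, assuming the mbstring extension is loaded; the encoding is passed explicitly here so the example does not rely on any ini setting:

```php
<?php
$word = "\xC3\xAAtre"; // "être" as UTF-8 bytes

$length = mb_strlen($word, 'UTF-8');        // 4 characters
$first  = mb_substr($word, 0, 1, 'UTF-8');  // "ê", the whole first character
$upper  = mb_strtoupper($word, 'UTF-8');    // "ÊTRE"

// The byte-based substr() cuts the two-byte ê in half.
$broken = substr($word, 0, 1);              // "\xC3", not valid UTF-8 on its own

echo $length, ' ', $first, ' ', $upper, "\n";
```

As a rule of thumb, any function that counts, cuts, or changes the case of characters needs its mb_ variant when the data is UTF-8.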

5) The database

The final step on our way to success is setting up the database correctly.
You have to make sure that your tables and their text fields have a collation that supports UTF-8, such as utf8_general_ci.
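For an existing table, the conversion can be done with a single statement. A minimal sketch, assuming MySQL (utf8_general_ci is a MySQL collation) and a hypothetical table name ‘articles’; buildUtf8ConversionSql is an illustrative helper that only builds the statement, and running it is left to whatever database layer is in use:

```php
<?php
// Hypothetical helper: builds a MySQL statement that converts a table
// and its character columns to utf8 / utf8_general_ci.
function buildUtf8ConversionSql(string $table): string
{
    // Allow only simple identifiers, since the name is interpolated into SQL.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $table)) {
        throw new InvalidArgumentException('Unexpected table name: ' . $table);
    }
    return "ALTER TABLE `{$table}` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci";
}

// 'articles' is a made-up table name for illustration.
echo buildUtf8ConversionSql('articles'), "\n";
```

It is worth taking a backup first, because the statement rewrites the stored data of every text column in the table.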

This should do the trick to display and manipulate UTF-8 based characters in PHP.

EXTRA (Thanks to Filip Heymans)

1) Communication with Doctrine
2) Communication with AMFPHP

1) Communication with Doctrine

Doctrine is a persistence framework for PHP. Instead of writing SQL for your CRUD operations, you create objects and perform actions on those objects. More info at http://www.doctrine-project.org.
It is similar to Hibernate for Java.

When using Doctrine, you must also set the connection to the UTF-8 encoding, or the insert and update statements will not produce the expected result.

Doctrine uses a connection object. You have to set the charset on that connection object via the setCharset method.
$conn = Doctrine_Manager::connection('mysql://username:password@localhost/test');
$conn->setCharset('UTF8');

2) Communication with AMFPHP

AMFPHP is a free open-source PHP implementation of the Action Message Format (AMF). AMF allows for binary serialization of ActionScript (AS2, AS3) native types and objects to be sent to server-side services.

In the gateway.php file in the amfphp folder, you have to find the line

$gateway->setCharsetHandler("utf8_decode", "ISO-8859-1", "ISO-8859-1");

and change it to

$gateway->setCharsetHandler("none", "UTF-8", "UTF-8");

This makes sure that UTF-8 encoded characters are also passed unchanged to your Flash or Flex application.

