Issues with MultiLanguage Support
At the present time UniWakka provides MultiLanguage support by using three different character encodings.
User input is encoded as UTF-8. Data are stored in MySQL as ISO-8859-1, with Unicode entities for every character that cannot be represented in a single byte. When converted into HTML, the data are encoded as plain ASCII plus Unicode entities.
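The three encodings described above can be illustrated with a minimal sketch. It is written in Python purely for illustration; it is not UniWakka's PHP code, and the function names `to_storage` and `to_html` are hypothetical. It assumes the "unicode entities" are decimal numeric character references such as `&#937;`.

```python
def to_storage(text: str) -> bytes:
    """Encode text as ISO-8859-1; characters outside that charset
    become decimal numeric entities such as &#937;."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def to_html(text: str) -> str:
    """Encode text as plain ASCII; every non-ASCII character
    becomes a decimal numeric entity."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


user_input = "caf\u00e9 \u03a9"      # "café Ω", as decoded from a UTF-8 form
stored = to_storage(user_input)      # b"caf\xe9 &#937;"
page = to_html(user_input)           # "caf&#233; &#937;"
```

In this sketch the accented letter fits in one ISO-8859-1 byte and is stored as is, while the Greek letter does not and is stored as an entity.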
Why? The problem is that PHP does not handle multibyte character encodings such as UTF-8. The multibyte library (Multibyte String Functions) exists to deal with these charsets, but this extension is experimental and does not provide Perl-like regular expressions (the kind of regular expressions used in Wakka and UniWakka).
Given this limitation of PHP, two different approaches could be taken. Both of them have some shortcomings.
- the UniWakka approach: data are stored as single-byte characters and presented in forms (for user input) as UTF-8;
- data could be stored in MySQL as UTF-8 and converted to single-byte when PHP must manipulate them.
The shortcoming of the first approach is that MySQL fulltext search cannot be used: characters that cannot be represented in a single byte are stored as Unicode entities, and Unicode entities cannot be searched with fulltext search. On the other hand, the benefit is that data are converted only when stored in the database.
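A rough sketch of why entities defeat word-based search follows. It is an illustration in Python, not MySQL's actual fulltext parser: the `\w+` tokenizer is only a stand-in for "split the text into words", and the names `stored_form` and `rough_tokens` are hypothetical.

```python
import re


def stored_form(text: str) -> str:
    """Return the text as it would sit in an ISO-8859-1 column,
    with numeric entities for everything outside that charset."""
    raw = text.encode("iso-8859-1", errors="xmlcharrefreplace")
    return raw.decode("iso-8859-1")


def rough_tokens(stored: str) -> list:
    """Split on anything that is not a letter, digit or underscore.
    This only approximates a word-based indexer."""
    return re.findall(r"\w+", stored)


greek_word = "\u03bb\u03cc\u03b3\u03bf\u03c2"   # a five-letter Greek word
tokens = rough_tokens(stored_form(greek_word))
# The single word has become five separate numbers, one per letter,
# so an index built from these tokens never contains the word itself.
```

Under these assumptions, a query for the original word has nothing to match against, because the indexer only ever sees the digits inside each entity.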
The problems with the second approach are the following:
- we need MySQL 4.1 or above (otherwise the charset must be chosen at compilation time, or changing it requires administrative privileges);
- the data must be converted to single-byte every time we access the database.
On the other hand, we could use fulltext search.
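The conversion cost of the second approach can be sketched as a round trip performed on every database access. This is a Python illustration under stated assumptions, not a proposed implementation: the function names are hypothetical, and the entities are assumed to be decimal numeric references.

```python
import re


def db_to_single_byte(utf8_bytes: bytes) -> bytes:
    """On read: UTF-8 from the database becomes ISO-8859-1 plus
    numeric entities, so single-byte string code can handle it."""
    text = utf8_bytes.decode("utf-8")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def single_byte_to_db(latin1_bytes: bytes) -> bytes:
    """On write: turn decimal entities back into characters and
    re-encode the result as UTF-8 for storage."""
    text = latin1_bytes.decode("iso-8859-1")
    text = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)
    return text.encode("utf-8")


original = "na\u00efve \u65e5\u672c\u8a9e".encode("utf-8")
working = db_to_single_byte(original)    # what the application manipulates
restored = single_byte_to_db(working)    # what goes back into the database
```

One limitation of this sketch: an entity that a user typed literally would also be turned into a character on the way back, so a real implementation would need to escape such input.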
Now, the big question: do we need fulltext search badly enough to require MySQL 4.1 (not very common among hosting providers)?
Please leave your comments.
--
AndreaRossato