UniWakka : MultiLanguageIssues

Issues with MultiLanguage Support


At present, UniWakka provides MultiLanguage support by using three different character encodings.

User input is encoded as UTF-8. Data are stored in MySQL as ISO-8859-1, with Unicode entities for every character that cannot be represented in a single byte. When converted into HTML, the data are encoded as plain ASCII plus Unicode entities.
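The three encodings can be illustrated with a minimal sketch. It is written in Python rather than PHP and is not UniWakka's actual code; the function names `utf8_to_storage` and `storage_to_html` are made up for this example, and it assumes that "unicode entities" means decimal numeric character references such as `&#937;`.

```python
import html


def utf8_to_storage(raw):
    """Decode UTF-8 form input and re-encode it as ISO-8859-1.

    Characters outside ISO-8859-1 become decimal numeric entities.
    """
    return raw.decode("utf-8").encode("iso-8859-1", errors="xmlcharrefreplace")


def storage_to_html(stored):
    """Re-encode stored ISO-8859-1 bytes as plain ASCII plus entities."""
    return stored.decode("iso-8859-1").encode("ascii", errors="xmlcharrefreplace")


# "é" fits in a single ISO-8859-1 byte, "Ω" does not.
form_input = "café Ω".encode("utf-8")
stored = utf8_to_storage(form_input)   # b'caf\xe9 &#937;'
page = storage_to_html(stored)         # b'caf&#233; &#937;'

# A browser decoding the entities recovers the original text.
round_trip = html.unescape(page.decode("ascii"))
```

Note that in this sketch the conversion to the storage form happens once, when the input is saved; everything afterwards works on single-byte data.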

Why? The problem is that PHP does not deal with multibyte character encodings such as UTF-8. To handle these charsets there is the multibyte library (Multibyte String Functions). But this extension is experimental and does not provide Perl-like regular expressions (the kind of regular expressions used in Wakka and UniWakka).

Given this limitation of PHP, two different approaches could be taken. Both of them have some shortcomings.
  1. the UniWakka approach: data are stored as single-byte characters and presented in forms (for user input) as UTF-8;
  2. data could be stored in MySQL as UTF-8 and converted into single-byte characters whenever PHP must manipulate them.

The shortcoming of the first approach is that MySQL fulltext search cannot be used. Characters that cannot be represented with a single byte are stored as Unicode entities, and Unicode entities cannot be searched with fulltext search. On the other hand, the benefit is that data are converted only when stored in the database.
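To see why entities defeat a word-based search, here is a rough sketch in Python. It is only a toy model: `store` and `index_words` are hypothetical helpers, and the word splitter (runs of letters, digits and underscores) is an assumption made for illustration, not MySQL's actual fulltext parser.

```python
import re


def store(text):
    """Model the first approach: ISO-8859-1 plus decimal numeric entities."""
    stored_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
    return stored_bytes.decode("iso-8859-1")


def index_words(stored):
    """Toy word splitter: runs of letters, digits and underscores."""
    return re.findall(r"\w+", stored)


# Text that fits in ISO-8859-1 is still indexed as real words.
latin = index_words(store("søk etter ord"))

# Greek text is stored as "&#955;&#972;&#947;&#959;&#962;", so the only
# "words" the splitter sees are the entity numbers.
greek = index_words(store("λόγος"))
```

In this model the Greek word never reaches the index as a word, so a search for it has nothing to match, while the ISO-8859-1 text is unaffected.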

The problems with the second approach are the following:
On the other hand, we could use fulltext search.

Now, the big question: do we need fulltext search badly enough to require MySQL 4.1 (not very common among hosting providers)?

Please leave your comments.
--AndreaRossato

