/program/lib/utf8lib.php - utility-routines for UTF-8
This file deals with the idiosyncrasies of UTF-8.
Reference: The Unicode Consortium, The Unicode Standard, Version 6.0.0, (Mountain View, CA: The Unicode Consortium, 2011, ISBN 978-1-936213-01-6) <http://www.unicode.org/versions/Unicode6.0.0>, chapter 3 (Conformance), page 94
Summary of the way valid code points are stored in 1 to 4 byte sequences.
bits  code-points                    1st byte   2nd byte   3rd byte   4th byte
----  -----------------------------  ---------  ---------  ---------  ---------
   7  0000 0000 0000 0000 0xxx xxxx  0xxx.xxxx
  11  0000 0000 0000 0yyy yyxx xxxx  110y.yyyy  10xx.xxxx
  16  0000 0000 zzzz yyyy yyxx xxxx  1110.zzzz  10yy.yyyy  10xx.xxxx
  21  000w wwzz zzzz yyyy yyxx xxxx  1111.0www  10zz.zzzz  10yy.yyyy  10xx.xxxx

bits  byte 1   byte 2    byte 3   byte 4   comments
----  -------  -------   -------  -------  --------
   7  00 - 7F                              U+0000 - U+007F
      80 - BF                              --> ill-formed
      C0 - C1  80 - BF                     --> overlong 2-byte
  11  C2 - DF  80 - BF                     U+0080 - U+07FF
      E0       80 - 9F*  80 - BF           --> overlong 3-byte
  16  E0       A0 - BF*  80 - BF           U+0800 - U+0FFF
  16  E1 - EC  80 - BF   80 - BF           U+1000 - U+CFFF
  16  ED       80 - 9F*  80 - BF           U+D000 - U+D7FF
      ED       A0 - BF*  80 - BF           --> surrogates (U+D800 - U+DFFF)
  16  EE - EF  80 - BF   80 - BF           U+E000 - U+FFFF
      F0       80 - 8F*  80 - BF  80 - BF  --> overlong 4-byte
  21  F0       90 - BF*  80 - BF  80 - BF  U+10000 - U+3FFFF
  21  F1 - F3  80 - BF   80 - BF  80 - BF  U+40000 - U+FFFFF
  21  F4       80 - 8F*  80 - BF  80 - BF  U+100000 - U+10FFFF
      F4       90 - BF*  80 - BF  80 - BF  --> invalid planes 11 - 13
      F5 - F7  80 - BF   80 - BF  80 - BF  --> invalid planes 14 - 1F
      F8 - FB                              --> invalid 5-byte sequence
      FC - FD                              --> invalid 6-byte sequence
      FE - FF                              --> disallowed (BOM)
Note: Non-standard continuation ranges are marked with * (only for byte 2)
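To make the bit layout in the first table concrete, here is a minimal sketch of an encoder that turns one code point into its 1 to 4 byte sequence. The function name sketch_utf8_encode() is hypothetical and not part of utf8lib.php; it is only meant to illustrate how the w, z, y and x bits end up in the individual bytes.

```php
<?php
/**
 * Encode a single Unicode code point as a UTF-8 byte string.
 * Hypothetical helper for illustration only; not part of utf8lib.php.
 * Returns NULL for surrogates and for values outside U+0000 - U+10FFFF.
 */
function sketch_utf8_encode(int $cp): ?string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null; // no well-formed encoding exists
    }
    if ($cp <= 0x7F) {   // 7 bits:  0xxx.xxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {  // 11 bits: 110y.yyyy 10xx.xxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 16 bits: 1110.zzzz 10yy.yyyy 10xx.xxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 21 bits: 1111.0www 10zz.zzzz 10yy.yyyy 10xx.xxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}
```

Because the encoder always picks the shortest form, it can never produce the overlong sequences listed in the second table.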
compare two UTF-8 strings in a case-insensitive way
This compares two UTF-8 strings case-insensitively. We do this by comparing the lower-case variants of the strings with the regular strcmp(). We use lower-casing because that translation table is 'better' than the upper-case one.
Note that this is a quick and dirty approach: we simply use the multibyte strings as-is rather than extracting the actual code points and applying some sort of collation because that would make it much more complicated.
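The approach described above could look roughly like the sketch below. The function name sketch_utf8_strcasecmp() is hypothetical, and the sketch assumes mb_strtolower() as the lower-casing step; without the multibyte extension it falls back to the ASCII-only strtolower(), which is a simplification made for this example only.

```php
<?php
/**
 * Case-insensitive comparison of two UTF-8 strings by comparing their
 * lower-case variants byte by byte. Hypothetical sketch, not the
 * library routine itself.
 *
 * @return int < 0, 0 or > 0, just like strcmp()
 */
function sketch_utf8_strcasecmp(string $a, string $b): int
{
    if (function_exists('mb_strtolower')) {
        $a = mb_strtolower($a, 'UTF-8');
        $b = mb_strtolower($b, 'UTF-8');
    } else {
        // Assumption for this sketch: only ASCII letters get folded here
        $a = strtolower($a);
        $b = strtolower($b);
    }
    // No collation: the resulting order is plain byte order
    return strcmp($a, $b);
}
```

Since UTF-8 byte order equals code point order, the result sorts by code point after folding, which is not necessarily the order a language-aware collation would give.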
calculate the number of code points encoded in a UTF-8 string
This routine uses a trick to calculate the number of code points available in string $str. By first converting the UTF-8 string to ISO-8859-1, all multi-byte code points (i.e. all non-ASCII) are converted to a single byte: the correct ISO-8859-1 character where possible and a '?' for characters not available in the ISO-8859-1 repertoire. The byte length of the converted string then equals the number of code points. Note that utf8_decode() does NOT work very well on ill-formed UTF-8 strings, e.g. it happily interprets codes in invalid planes and translates truncated sequences to chr(0). The good news is that it works for valid UTF-8.
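The trick boils down to something like strlen(utf8_decode($str)). Because utf8_decode() is deprecated as of PHP 8.2, the sketch below uses a different technique that gives the same count for valid UTF-8: every code point has exactly one lead byte, so it suffices to count the bytes that are not continuation bytes (10xx.xxxx). The function name sketch_utf8_count() is hypothetical.

```php
<?php
/**
 * Count the code points in a valid UTF-8 string by counting all bytes
 * that are NOT continuation bytes (0x80 - 0xBF).
 * Hypothetical alternative to the strlen(utf8_decode($str)) trick;
 * like that trick, the result is only meaningful for valid UTF-8.
 */
function sketch_utf8_count(string $str): int
{
    $b_total = strlen($str);
    $c_count = 0;
    for ($b_pos = 0; $b_pos < $b_total; ++$b_pos) {
        if ((ord($str[$b_pos]) & 0xC0) !== 0x80) {
            ++$c_count; // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return $c_count;
}
```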
map some UTF-8 characters to comparable ASCII strings
This maps a lot of UTF-8 characters to more or less comparable ASCII strings, e.g. an A-acute is mapped to 'A', etc. Handy when trying to construct readable filenames from letters with diacritics, e.g. e-acute 'l' e-grave 'v' 'e' (French for pupil) maps to 'e' 'l' 'e' 'v' 'e' rather than to 'l' 'v' 'e', which is what would remain if the diacritics were simply 'eaten'.
Note that UTF-8 characters that are NOT mapped to ASCII are retained, i.e. the result is not plain ASCII but a UTF-8 string with as many characters mapped to ASCII as possible.
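A minimal sketch of such a mapping, assuming a plain lookup table applied with strtr(). Both sketch_ascii_table() and sketch_utf8_to_ascii() are hypothetical names, and the four entries shown are illustrative only; the real table is much larger.

```php
<?php
/**
 * A tiny, illustrative excerpt of a UTF-8 to ASCII mapping table.
 * Hypothetical: the real table in the library has many more entries.
 */
function sketch_ascii_table(): array
{
    return [
        "\u{00C1}" => 'A',  // A-acute
        "\u{00E8}" => 'e',  // e-grave
        "\u{00E9}" => 'e',  // e-acute
        "\u{00DF}" => 'ss', // sharp s: one character may map to several
    ];
}

/**
 * Replace every character found in the table by its ASCII counterpart;
 * characters without a table entry are left untouched.
 */
function sketch_utf8_to_ascii(string $str): string
{
    // strtr() with an array replaces each key at most once per position
    // and never rescans the text it has already replaced
    return strtr($str, sketch_ascii_table());
}
```

For example, the word for pupil becomes 'eleve', while a character such as U+4E2D that has no entry in the table passes through unchanged.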
fold a UTF-8 string to lower case
This routine tries to use the multibyte routine mb_strtolower() for folding; if that is not available we fall back to our own translation table $UTF8_UPPER_LOWER, which is derived straight from the Unicode Character Database.
Informal benchmarking yielded no significant speed difference between mb_strtolower() and strtr(). However, our table needs memory to store the 1100+ pairs and for mb_strtolower() this is already taken care of. OTOH: depending on the version of mbstring, our table (UCD 6.0.0 February 2011) may be more complete and up to date. Oh well.
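The fallback logic could be sketched as follows. The names sketch_upper_lower_table() and sketch_utf8_strtolower() are hypothetical, the $use_mb parameter exists only so the fallback path can be exercised in this example, and the four pairs stand in for the 1100+ pairs of the real $UTF8_UPPER_LOWER table.

```php
<?php
/**
 * Illustrative excerpt of an upper-to-lower translation table.
 * Hypothetical: the real $UTF8_UPPER_LOWER holds 1100+ pairs.
 */
function sketch_upper_lower_table(): array
{
    return [
        'A'        => 'a',
        'B'        => 'b',
        "\u{00C9}" => "\u{00E9}", // E-acute
        "\u{0416}" => "\u{0436}", // Cyrillic ZHE
    ];
}

/**
 * Fold a UTF-8 string to lower case, preferring the multibyte routine
 * and falling back to a table-driven strtr().
 */
function sketch_utf8_strtolower(string $str, bool $use_mb = true): string
{
    if ($use_mb && function_exists('mb_strtolower')) {
        return mb_strtolower($str, 'UTF-8');
    }
    return strtr($str, sketch_upper_lower_table());
}
```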
fold a UTF-8 string to upper case (sort of)
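This routine tries to use the multibyte routine for folding; if that is not available we fall back to our own translation table $UTF8_LOWER_UPPER, which is derived from the Unicode Character Database. Note that this table has some quirks because the underlying upper-to-lower mapping is not one-to-one: several upper-case code points can fold to the same lower-case code point, so the reverse mapping has to settle on one of them.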
return part of a UTF-8 string
This routine returns a valid UTF-8 substring from $utf8str based on the $start and $length parameters in a way comparable to substr(). If available, we use the mb_substr() replacement, otherwise we perform the requested actions ourselves.
If we do it ourselves, we prefix variable names with 'c_' for characters and 'b_' for bytes. A UTF-8 character takes up to 4 bytes.
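A minimal sketch of the do-it-yourself path, using the same 'c_' and 'b_' naming convention. The names sketch_utf8_seqlen() and sketch_utf8_substr() are hypothetical. The sketch assumes valid UTF-8 input and handles only non-negative $c_start and $c_length; the negative offsets that substr() accepts are left out to keep it short.

```php
<?php
/**
 * Length in bytes of the UTF-8 sequence that starts with lead byte $b_lead.
 * Assumes valid UTF-8, so $b_lead is never a continuation byte.
 */
function sketch_utf8_seqlen(string $b_lead): int
{
    $b = ord($b_lead);
    if ($b < 0x80) { return 1; } // 0xxx.xxxx
    if ($b < 0xE0) { return 2; } // 110y.yyyy
    if ($b < 0xF0) { return 3; } // 1110.zzzz
    return 4;                    // 1111.0www
}

/**
 * Return $c_length characters of $utf8str, starting at character $c_start.
 * Hypothetical sketch: non-negative arguments and valid UTF-8 only.
 */
function sketch_utf8_substr(string $utf8str, int $c_start, ?int $c_length = null): string
{
    $b_total = strlen($utf8str);
    $b_pos = 0;
    // Step 1: skip $c_start characters to find the starting byte offset
    for ($c_seen = 0; $c_seen < $c_start && $b_pos < $b_total; ++$c_seen) {
        $b_pos += sketch_utf8_seqlen($utf8str[$b_pos]);
    }
    $b_start = min($b_pos, $b_total);
    // Step 2: walk over $c_length characters (or to the end if NULL)
    $b_pos = $b_start;
    for ($c_taken = 0;
         ($c_length === null || $c_taken < $c_length) && $b_pos < $b_total;
         ++$c_taken) {
        $b_pos += sketch_utf8_seqlen($utf8str[$b_pos]);
    }
    $b_end = min($b_pos, $b_total);
    // Step 3: cut on byte offsets that are known character boundaries
    return (string) substr($utf8str, $b_start, $b_end - $b_start);
}
```

Cutting only at offsets found by stepping over whole sequences is what keeps the result valid UTF-8; a plain substr() with character counts used as byte counts could split a multi-byte character in two.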
check an arbitrary string for UTF-8 conformity
Returns TRUE if the string is valid UTF-8, FALSE otherwise.
The regular expression appears to crash PHP/Apache due to memory exhaustion when used on long(ish) strings. Therefore we validate only the short(ish) strings via an RE, and use the 22% slower algorithm for the longer ones. The pivot point of 4096 is an empirical value. It might mean trouble in the future, with different servers or OSes. We will cross that bridge when we get there... If you want to be on the safe side you could skip the RE routine completely by lowering the pivot point to below 0.
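The two-path strategy could be sketched as below. This is not the library's own code: for the short strings the sketch relies on PCRE's built-in UTF-8 check (an empty pattern with the 'u' modifier) instead of the hand-written RE, and for the long strings it walks the bytes according to the byte-range table at the top of this file. The names sketch_utf8_validate_loop() and sketch_utf8_validate() are hypothetical.

```php
<?php
/** Byte-wise validation following the byte-range table. Hypothetical sketch. */
function sketch_utf8_validate_loop(string $str): bool
{
    $b_total = strlen($str);
    $i = 0;
    while ($i < $b_total) {
        $b1 = ord($str[$i]);
        if ($b1 <= 0x7F) { ++$i; continue; } // plain ASCII
        // $n = number of continuation bytes, $lo..$hi = allowed range for byte 2
        if     ($b1 >= 0xC2 && $b1 <= 0xDF) { $n = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 === 0xE0)               { $n = 2; $lo = 0xA0; $hi = 0xBF; }
        elseif ($b1 === 0xED)               { $n = 2; $lo = 0x80; $hi = 0x9F; }
        elseif ($b1 >= 0xE1 && $b1 <= 0xEF) { $n = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 === 0xF0)               { $n = 3; $lo = 0x90; $hi = 0xBF; }
        elseif ($b1 >= 0xF1 && $b1 <= 0xF3) { $n = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 === 0xF4)               { $n = 3; $lo = 0x80; $hi = 0x8F; }
        else { return false; } // 80-C1 or F5-FF can never start a sequence
        if ($i + $n >= $b_total) { return false; } // truncated sequence
        $b2 = ord($str[$i + 1]);
        if ($b2 < $lo || $b2 > $hi) { return false; } // overlong, surrogate, too high
        for ($k = 2; $k <= $n; ++$k) {
            $b = ord($str[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) { return false; }
        }
        $i += $n + 1;
    }
    return true;
}

/** Pick the RE path for short strings and the loop for long ones. */
function sketch_utf8_validate(string $str, int $pivot = 4096): bool
{
    if (strlen($str) < $pivot) {
        // preg_match() returns FALSE when the subject is not valid UTF-8
        return preg_match('//u', $str) === 1;
    }
    return sketch_utf8_validate_loop($str);
}
```

With a pivot of 0 or lower, strlen($str) is never below the pivot, so every string takes the byte-wise path and the RE is skipped entirely.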
Here is a broken down version of the RE:
Documentation generated on Tue, 28 Jun 2016 19:12:39 +0200 by phpDocumentor 1.4.0