File: /program/lib/utf8lib.php

Description

/program/lib/utf8lib.php - utility-routines for UTF-8

This file deals with the idiosyncrasies of UTF-8.

Reference: The Unicode Consortium, The Unicode Standard, Version 6.0.0, (Mountain View, CA: The Unicode Consortium, 2011, ISBN 978-1-936213-01-6) <http://www.unicode.org/versions/Unicode6.0.0>, chapter 3 (Conformance), page 94

Summary of the way valid code points are stored in sequences of 1 to 4 bytes:

bits   code-points                     1st byte   2nd byte   3rd byte   4th byte
----   -----------------------------   ---------  ---------  ---------  ---------
   7   0000 0000 0000 0000 0xxx xxxx   0xxx.xxxx
  11   0000 0000 0000 0yyy yyxx xxxx   110y.yyyy  10xx.xxxx
  16   0000 0000 zzzz yyyy yyxx xxxx   1110.zzzz  10yy.yyyy  10xx.xxxx
  21   000w wwzz zzzz yyyy yyxx xxxx   1111.0www  10zz.zzzz  10yy.yyyy  10xx.xxxx

bits    byte 1   byte 2   byte 3   byte 4  comments
----   -------  -------  -------  -------  --------
   7   00 - 7F                             U+0000 - U+007F
       80 - BF                             --> ill-formed
       C0 - C1  80 - BF                    --> overlong 2-byte
  11   C2 - DF  80 - BF                    U+0080 - U+07FF
       E0       80 - 9F* 80 - BF           --> overlong 3-byte
  16   E0       A0 - BF* 80 - BF           U+0800 - U+0FFF
  16   E1 - EC  80 - BF  80 - BF           U+1000 - U+CFFF
  16   ED       80 - 9F* 80 - BF           U+D000 - U+D7FF
       ED       A0 - BF* 80 - BF           --> surrogates (U+D800 - U+DFFF)
  16   EE - EF  80 - BF  80 - BF           U+E000 - U+FFFF
       F0       80 - 8F* 80 - BF  80 - BF  --> overlong 4-byte
  21   F0       90 - BF* 80 - BF  80 - BF  U+10000 - U+3FFFF
  21   F1 - F3  80 - BF  80 - BF  80 - BF  U+40000 - U+FFFFF
  21   F4       80 - 8F* 80 - BF  80 - BF  U+100000 - U+10FFFF
       F4       90 - BF* 80 - BF  80 - BF  --> invalid planes 11 - 13
       F5 - F7  80 - BF  80 - BF  80 - BF  --> invalid planes 14 - 1F
       F8 - FB                             --> invalid 5-byte sequence
       FC - FD                             --> invalid 6-byte sequence
       FE - FF                             --> never valid (bytes of the UTF-16 BOM)

Note: continuation ranges that deviate from the usual 80 - BF are marked with *; this only occurs in byte 2.
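The bit layout in the first table can be sketched as a small encoder. This is only an illustration; the function name sketch_codepoint_to_utf8() is hypothetical and not part of utf8lib.php.

```php
<?php
// Hypothetical helper (not part of utf8lib.php): encode one code point
// as a UTF-8 byte sequence, following the bit layout of the first table.
function sketch_codepoint_to_utf8(int $cp): ?string {
    // Reject values outside U+0000 - U+10FFFF and the surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null;
    }
    if ($cp <= 0x7F) {                       // 7 bits:  0xxx.xxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {                      // 11 bits: 110y.yyyy 10xx.xxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {                     // 16 bits: 1110.zzzz 10yy.yyyy 10xx.xxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 21 bits: 1111.0www 10zz.zzzz 10yy.yyyy 10xx.xxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}
```

For example, U+20AC (the euro sign) falls in the 16-bit row and is encoded as the three bytes E2 82 AC.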

Constants

USE_MBSTRING = TRUE (line 72)

Functions
utf8_strcasecmp (line 233)

compare two UTF-8 strings in a case-insensitive way

This compares two UTF-8 strings case-insensitively by comparing the lowercase variants of both strings with the regular strcmp(). We use lowercasing because that translation table is 'better' than the uppercase one (the uppercase table has some quirks, see utf8_strtoupper).

Note that this is a quick and dirty approach: we simply compare the multibyte strings as-is rather than extracting the actual code points and applying some sort of collation, because that would make it much more complicated.

  • return: result < 0 if string 1 < string 2, result > 0 if string 1 > string 2, 0 when equal
int utf8_strcasecmp (string $utf8str1, string $utf8str2)
  • string $utf8str1: first string
  • string $utf8str2: second string
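A minimal sketch of the approach described above (fold both strings to lower case, then compare with strcmp()). The name sketch_utf8_strcasecmp() is hypothetical. The sketch assumes mbstring is available and otherwise falls back to ASCII-only folding, where the library would use its own translation table.

```php
<?php
// Hypothetical sketch, not the library's actual code.
function sketch_utf8_strcasecmp(string $utf8str1, string $utf8str2): int {
    // Assumption: mb_strtolower() does the folding when mbstring is present;
    // the strtolower() fallback is only a stand-in for a full UTF-8 table.
    $fold = function (string $s): string {
        return function_exists('mb_strtolower')
            ? mb_strtolower($s, 'UTF-8')
            : strtolower($s);
    };
    // strcmp() compares byte by byte; no collation is applied.
    return strcmp($fold($utf8str1), $fold($utf8str2));
}
```

Because strcmp() compares bytes and UTF-8 preserves code point order, the result orders strings by code point, not by any language-specific collation.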
utf8_strlen (line 164)

calculate the number of code points encoded in a UTF-8 string

This routine uses a trick to calculate the number of code points in string $utf8str. By first converting the UTF-8 string to ISO-8859-1, all multi-byte code points (i.e. all non-ASCII) are converted to a single byte: the correct ISO-8859-1 character where possible and a '?' for characters not available in the ISO-8859-1 repertoire. Note that utf8_decode() does NOT work very well on ill-formed UTF-8 strings, e.g. it happily interprets codes in invalid planes and translates truncated sequences to chr(0). The good news is that it works for valid UTF-8.

  • return: the number of code points in string $utf8str
int utf8_strlen (string $utf8str)
  • string $utf8str: a valid UTF-8 string to examine
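The trick described above amounts to taking strlen() of the utf8_decode() result. Since utf8_decode() is deprecated as of PHP 8.2, the sketch below uses a different technique that yields the same count for valid UTF-8: it counts every byte that is not a continuation byte (10xx.xxxx). The name sketch_utf8_count() is hypothetical.

```php
<?php
// Hypothetical alternative to the utf8_decode() trick: in valid UTF-8 every
// code point has exactly one byte that is NOT of the form 10xx.xxxx.
function sketch_utf8_count(string $utf8str): int {
    $count = 0;
    $b_len = strlen($utf8str);          // length in bytes
    for ($b = 0; $b < $b_len; $b++) {
        if ((ord($utf8str[$b]) & 0xC0) !== 0x80) {
            $count++;                   // lead byte or ASCII byte
        }
    }
    return $count;
}
```

Like the original routine, this is only meaningful for valid UTF-8; ill-formed input should be rejected beforehand, e.g. with utf8_validate().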
utf8_strtoascii (line 320)

map some UTF-8 characters to comparable ASCII strings

This maps a lot of UTF-8 characters to more or less comparable ASCII strings, e.g. an A-acute is mapped to 'A', etc. This is handy when trying to construct readable filenames from letters with diacritics, e.g. e-acute 'l' e-grave 'v' 'e' (French for pupil) maps to 'e' 'l' 'e' 'v' 'e' rather than to 'l' 'v' 'e', which would be the result if the accented letters were simply 'eaten'.

Note that UTF-8 characters that are NOT mapped to ASCII are retained, i.e. the result is not plain ASCII but a UTF-8 string with as many characters mapped to ASCII as possible.

  • return: a 'best-effort' ASCII-approximation of $utf8str
  • uses: $UTF8_ASCII
string utf8_strtoascii (string $utf8str)
  • string $utf8str: input string possibly with letters with diacriticals etc.
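The mapping can be sketched with strtr() and a translation table. The three-entry table below is a hypothetical excerpt for illustration only; the contents of the real $UTF8_ASCII table are not shown in this document.

```php
<?php
// Hypothetical excerpt; the real $UTF8_ASCII table is assumed to be much larger.
$SKETCH_UTF8_ASCII = [
    "\xC3\x81" => 'A',   // U+00C1 A-acute
    "\xC3\xA9" => 'e',   // U+00E9 e-acute
    "\xC3\xA8" => 'e',   // U+00E8 e-grave
];

// Hypothetical sketch: replace every mapped character, keep the rest as-is.
function sketch_utf8_strtoascii(string $utf8str, array $table): string {
    return strtr($utf8str, $table);
}
```

With this excerpt, the UTF-8 string for the French word for pupil becomes 'eleve', while a character without a table entry (such as the euro sign) is passed through unchanged.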
utf8_strtolower (line 186)

fold a UTF-8 string to lower case

This routine tries to use the multibyte routine for folding, but if it is not available we fall back to our own translation table $UTF8_UPPER_LOWER, which is derived straight from the Unicode Character Database.

Informal benchmarking yielded no significant speed difference between mb_strtolower() and strtr(). However, our table needs memory to store the 1100+ pairs, whereas for mb_strtolower() this is already taken care of. On the other hand, depending on the version of mbstring, our table (UCD 6.0.0, February 2011) may be more complete and up to date. Oh well.

  • return: the lowercase equivalent of $utf8str
string utf8_strtolower (string $utf8str)
  • string $utf8str: a valid UTF-8 string to examine
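The fallback logic can be sketched as follows. The table is a tiny hypothetical excerpt (ASCII plus two accented letters), and the $use_mbstring parameter is an assumed stand-in for the USE_MBSTRING constant; neither is taken from the actual source.

```php
<?php
// Hypothetical excerpt of an upper-to-lower table; the real
// $UTF8_UPPER_LOWER holds 1100+ pairs.
$SKETCH_UPPER_LOWER = array_combine(range('A', 'Z'), range('a', 'z')) + [
    "\xC3\x89" => "\xC3\xA9",   // U+00C9 E-acute -> U+00E9 e-acute
    "\xC3\x88" => "\xC3\xA8",   // U+00C8 E-grave -> U+00E8 e-grave
];

// Hypothetical sketch: prefer mbstring, otherwise translate via the table.
function sketch_utf8_strtolower(string $utf8str, array $table, bool $use_mbstring = true): string {
    if ($use_mbstring && function_exists('mb_strtolower')) {
        return mb_strtolower($utf8str, 'UTF-8');
    }
    // strtr() with an array replaces the longest matching keys first,
    // so multi-byte characters are translated as a whole.
    return strtr($utf8str, $table);
}
```

Passing FALSE as the third argument forces the table-based path, which is a convenient way to compare both variants on the same input.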
utf8_strtoupper (line 207)

fold a UTF-8 string to upper case (sort of)

This routine tries to use the multibyte routine for folding, but if it is not available we fall back to our own translation table $UTF8_LOWER_UPPER, which is derived from the Unicode Character Database. Note that this table has some quirks because the underlying mapping is not injective: distinct characters can share the same case counterpart.

  • return: the uppercase equivalent of $utf8str
string utf8_strtoupper (string $utf8str)
  • string $utf8str: a valid UTF-8 string to examine
utf8_substr (line 253)

return part of a UTF-8 string

This routine returns a valid UTF-8 substring of $utf8str based on the $start and $length parameters, in a way comparable to substr(). If available, we use the mb_substr() replacement; otherwise we perform the requested actions ourselves.

If we do it ourselves, we prefix variable names with 'c_' for characters and 'b_' for bytes. A UTF-8 character takes up to 4 bytes.

  • return: the requested substring of $utf8str or FALSE if $start points beyond the string
string utf8_substr (string $utf8str, int $start, [int $length = NULL])
  • string $utf8str: a valid UTF-8 string to examine
  • int $start: an offset expressed in characters (not bytes)
  • int $length: the length of the string to return (also expressed in characters)
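The 'do it ourselves' path could look roughly like the sketch below, which uses the 'c_' and 'b_' prefixes mentioned above. It is a simplified illustration, not the library's code: the name sketch_utf8_substr() is hypothetical, and the sketch assumes valid UTF-8 and non-negative $start and $length.

```php
<?php
// Hypothetical sketch; assumes valid UTF-8, $start >= 0 and $length >= 0.
function sketch_utf8_substr(string $utf8str, int $start, ?int $length = null) {
    // $b_offsets[c] = byte offset at which character number c begins.
    $b_offsets = [];
    $b_len = strlen($utf8str);
    for ($b = 0; $b < $b_len; $b++) {
        if ((ord($utf8str[$b]) & 0xC0) !== 0x80) {   // not a continuation byte
            $b_offsets[] = $b;
        }
    }
    $c_len = count($b_offsets);                      // length in characters
    if ($start > $c_len) {
        return false;                                // $start beyond the string
    }
    $c_end   = ($length === null) ? $c_len : min($c_len, $start + $length);
    $b_start = $b_offsets[$start] ?? $b_len;         // character offset -> byte offset
    $b_end   = $b_offsets[$c_end] ?? $b_len;
    return substr($utf8str, $b_start, $b_end - $b_start);
}
```

The key step is translating character offsets into byte offsets before calling the byte-oriented substr(), so that a multi-byte character is never cut in half.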
utf8_validate (line 109)

check an arbitrary string for UTF-8 conformity

Returns TRUE if the string is valid UTF-8, FALSE otherwise.

The regular expression appears to crash PHP/Apache due to memory exhaustion when used on long(ish) strings. Therefore we validate only the short(ish) strings via an RE and use the 22% slower algorithm for the longer ones. The pivot point of 4096 is an empirical value. It might mean trouble in the future, with different servers or operating systems. We will cross that bridge when we get there... If you want to be on the safe side you could skip the RE routine completely by lowering the pivot point to below 0.

Here is a broken down version of the RE:

     $pattern = '/^([\\x00-\\x7F]'.                         // ASCII (including ctrl-chars)
                '|[\\xC2-\\xDF][\\x80-\\xBF]'.              // non-overlong 2-byte
                '|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]'.         // 3-byte excluding overlongs
                '|[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}'. // 3-byte (plain)
                '|\\xED[\\x80-\\x9F][\\x80-\\xBF]'.         // 3-byte excluding surrogates
                '|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}'.      // 4-byte excluding overlongs
                '|[\\xF1-\\xF3][\\x80-\\xBF]{3}'.           // 4-byte planes 4-15
                '|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*$/';  // 4-byte plane 16

bool utf8_validate (string $str)
  • string $str: the string to check
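The two-path strategy can be sketched as follows. The byte-wise loop is an illustration written from the byte-range table in the file description; it is not necessarily the algorithm the library uses for long strings. Both function names and the $pivot parameter are hypothetical.

```php
<?php
// Hypothetical byte-wise validator following the byte-range table.
function sketch_utf8_validate_bytes(string $str): bool {
    $n = strlen($str);
    $i = 0;
    while ($i < $n) {
        $b1 = ord($str[$i]);
        // $more = number of continuation bytes; $lo..$hi = allowed range for byte 2
        if ($b1 <= 0x7F) { $i++; continue; }
        elseif ($b1 >= 0xC2 && $b1 <= 0xDF) { $more = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 === 0xE0)               { $more = 2; $lo = 0xA0; $hi = 0xBF; }
        elseif ($b1 === 0xED)               { $more = 2; $lo = 0x80; $hi = 0x9F; }
        elseif ($b1 >= 0xE1 && $b1 <= 0xEF) { $more = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 === 0xF0)               { $more = 3; $lo = 0x90; $hi = 0xBF; }
        elseif ($b1 >= 0xF1 && $b1 <= 0xF3) { $more = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 === 0xF4)               { $more = 3; $lo = 0x80; $hi = 0x8F; }
        else { return false; }              // 80 - C1 and F5 - FF never start a sequence
        if ($i + $more >= $n) { return false; }      // truncated sequence
        $b2 = ord($str[$i + 1]);
        if ($b2 < $lo || $b2 > $hi) { return false; }
        for ($k = 2; $k <= $more; $k++) {            // bytes 3 and 4: always 80 - BF
            $b = ord($str[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) { return false; }
        }
        $i += $more + 1;
    }
    return true;
}

// Hypothetical dispatcher: RE for strings shorter than $pivot, loop otherwise.
function sketch_utf8_validate(string $str, int $pivot = 4096): bool {
    if (strlen($str) < $pivot) {
        $pattern = '/^([\\x00-\\x7F]'.
                   '|[\\xC2-\\xDF][\\x80-\\xBF]'.
                   '|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]'.
                   '|[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}'.
                   '|\\xED[\\x80-\\x9F][\\x80-\\xBF]'.
                   '|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}'.
                   '|[\\xF1-\\xF3][\\x80-\\xBF]{3}'.
                   '|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*$/';
        $result = preg_match($pattern, $str);
        if ($result !== false) {
            return $result === 1;
        }
        // preg_match() failed (e.g. an internal PCRE limit): use the loop instead.
    }
    return sketch_utf8_validate_bytes($str);
}
```

Calling sketch_utf8_validate($str, 0) skips the RE entirely, which mirrors the 'lower the pivot point' advice above.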

Documentation generated on Tue, 28 Jun 2016 19:12:39 +0200 by phpDocumentor 1.4.0