File /program/lib/utf8lib.php

Description

/program/lib/utf8lib.php - utility-routines for UTF-8

This file deals with the idiosyncrasies of UTF-8.

Reference: The Unicode Consortium, The Unicode Standard, Version 6.0.0, (Mountain View, CA: The Unicode Consortium, 2011, ISBN 978-1-936213-01-6) <http://www.unicode.org/versions/Unicode6.0.0>, chapter 3 (Conformance), page 94

Summary of the way valid code points are stored in 1 to 4 byte sequences.

bits  code-points                    1st byte   2nd byte   3rd byte   4th byte
----  -----------------------------  ---------  ---------  ---------  ---------
   7  0000 0000 0000 0000 0xxx xxxx  0xxx.xxxx
  11  0000 0000 0000 0yyy yyxx xxxx  110y.yyyy  10xx.xxxx
  16  0000 0000 zzzz yyyy yyxx xxxx  1110.zzzz  10yy.yyyy  10xx.xxxx
  21  000w wwzz zzzz yyyy yyxx xxxx  1111.0www  10zz.zzzz  10yy.yyyy  10xx.xxxx
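
The bit layout in the table above can be turned into a small encoder. The sketch below is only an illustration: the function name sketch_utf8_encode() is hypothetical and is not part of utf8lib.php, and it makes no attempt to reject surrogates or code points beyond U+10FFFF.

```php
<?php
// Hypothetical helper (not part of utf8lib.php): encode a single code point
// by distributing its bits over 1 to 4 bytes as shown in the table.
function sketch_utf8_encode($cp) {
    if ($cp < 0x80) {                        //  7 bits: 0xxx.xxxx
        return chr($cp);
    }
    if ($cp < 0x800) {                       // 11 bits: 110y.yyyy 10xx.xxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {                     // 16 bits: 1110.zzzz 10yy.yyyy 10xx.xxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 21 bits: 1111.0www 10zz.zzzz 10yy.yyyy 10xx.xxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// U+20AC (euro sign) needs 16 bits and therefore 3 bytes.
echo bin2hex(sketch_utf8_encode(0x20AC)), "\n"; // prints e282ac
```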

bits  byte 1   byte 2    byte 3   byte 4   comments
----  -------  --------  -------  -------  --------
   7  00 - 7F                              U+0000 - U+007F
      80 - BF                              --> ill-formed
      C0 - C1  80 - BF                     --> overlong 2-byte
  11  C2 - DF  80 - BF                     U+0080 - U+07FF
      E0       80 - 9F*  80 - BF           --> overlong 3-byte
  16  E0       A0 - BF*  80 - BF           U+0800 - U+0FFF
  16  E1 - EC  80 - BF   80 - BF           U+1000 - U+CFFF
  16  ED       80 - 9F*  80 - BF           U+D000 - U+D7FF
      ED       A0 - BF*  80 - BF           --> surrogates (U+D800 - U+DFFF)
  16  EE - EF  80 - BF   80 - BF           U+E000 - U+FFFF
      F0       80 - 8F*  80 - BF  80 - BF  --> overlong 4-byte
  21  F0       90 - BF*  80 - BF  80 - BF  U+10000 - U+3FFFF
  21  F1 - F3  80 - BF   80 - BF  80 - BF  U+40000 - U+FFFFF
  21  F4       80 - 8F*  80 - BF  80 - BF  U+100000 - U+10FFFF
      F4       90 - BF*  80 - BF  80 - BF  --> invalid planes 11 - 13
      F5 - F7  80 - BF   80 - BF  80 - BF  --> invalid planes 14 - 1F
      F8 - FB                              --> invalid 5-byte sequence
      FC - FD                              --> invalid 6-byte sequence
      FE - FF                              --> disallowed (BOM)

Note: Non-standard continuation ranges are marked with * (only for byte 2)

Constants
USE_MBSTRING = TRUE (line 70)
Functions
utf8_strcasecmp (line 181)

compare two UTF-8 strings in a case-insensitive way

This compares two UTF-8 strings case-insensitively. We do this by comparing the lowercase variants of the strings with the regular strcmp(). We use lowercasing because that translation table is 'better' than the uppercase one.

Note that this is a quick and dirty approach: we simply use the multibyte strings as-is rather than extracting the actual code points and applying some sort of collation because that would make it much more complicated.

  • return: result < 0 if string 1 < string 2, result > 0 if string 1 > string 2, 0 when equal
int utf8_strcasecmp (string $utf8str1, string $utf8str2)
  • string $utf8str1: first string
  • string $utf8str2: second string
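
The lowercase-then-strcmp() approach might look roughly like the sketch below. The names sketch_fold_lower() and sketch_utf8_strcasecmp() are hypothetical, and the three-line fallback table is only a placeholder for a full case table; the actual implementation in utf8lib.php is not shown here and may differ.

```php
<?php
// Hypothetical lowercasing step: use mbstring when it is loaded, otherwise
// a tiny illustrative table (ASCII letters plus two accented capitals).
function sketch_fold_lower($utf8str) {
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($utf8str, 'UTF-8');
    }
    $table = array_combine(range('A', 'Z'), range('a', 'z')) + array(
        "\xC3\x89" => "\xC3\xA9",   // U+00C9 -> U+00E9 (e-acute)
        "\xC3\x88" => "\xC3\xA8",   // U+00C8 -> U+00E8 (e-grave)
    );
    return strtr($utf8str, $table);
}

// Compare the lowercased byte strings with the regular strcmp().
function sketch_utf8_strcasecmp($utf8str1, $utf8str2) {
    return strcmp(sketch_fold_lower($utf8str1), sketch_fold_lower($utf8str2));
}

// "ELEVE" with accented capitals versus the lowercase spelling: equal.
var_dump(sketch_utf8_strcasecmp("\xC3\x89L\xC3\x88VE", "\xC3\xA9l\xC3\xA8ve") === 0);
```

Because the comparison is byte-wise, "Zebra" sorts before a word that starts with e-acute ('z' is 0x7A, the lead byte of e-acute is 0xC3), which a proper collation would not do. This is the price of the quick and dirty approach.
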
utf8_strlen (line 112)

calculate the number of code points encoded in a UTF-8 string

This routine uses a trick to calculate the number of code points in string $utf8str. By first converting the UTF-8 string to ISO-8859-1 with utf8_decode(), all multi-byte code points (i.e. all non-ASCII) are converted to a single byte: the correct ISO-8859-1 character where possible and a '?' for characters not available in the ISO-8859-1 repertoire. The byte length of the converted string then equals the number of code points. Note that utf8_decode() does NOT work very well on ill-formed UTF-8 strings, e.g. it happily interprets codes in invalid planes and translates truncated sequences to chr(0). The good news is that it works for valid UTF-8.

  • return: the number of code points in string $utf8str
int utf8_strlen (string $utf8str)
  • string $utf8str: a valid UTF-8 string to examine
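
For valid UTF-8 the same count can be obtained without utf8_decode() (which is deprecated as of PHP 8.2) by counting every byte that is not a continuation byte. The sketch below shows that alternative technique, not the routine's own code; sketch_utf8_strlen() is a hypothetical name.

```php
<?php
// Hypothetical alternative to the utf8_decode() trick: every code point has
// exactly one byte that is NOT of the form 10xx.xxxx, so count those bytes.
function sketch_utf8_strlen($utf8str) {
    $c_count = 0;
    $b_len = strlen($utf8str);
    for ($b = 0; $b < $b_len; $b++) {
        if ((ord($utf8str[$b]) & 0xC0) != 0x80) {
            $c_count++;
        }
    }
    return $c_count;
}

// e-acute 'l' e-grave 'v' 'e': 7 bytes, 5 code points.
echo sketch_utf8_strlen("\xC3\xA9l\xC3\xA8ve"), "\n"; // prints 5
```

Like the original trick, this only gives a meaningful answer for valid UTF-8.
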
utf8_strtoascii (line 268)

map some UTF-8 characters to comparable ASCII strings

This maps a lot of UTF-8 characters to more or less comparable ASCII strings, e.g. an A-acute is mapped to 'A', etc. This is handy when constructing readable filenames from letters with diacritics, e.g. e-acute 'l' e-grave 'v' 'e' (French for pupil) maps to 'e' 'l' 'e' 'v' 'e' rather than to 'l' 'v' 'e', which is what would remain if the letters with diacritics were simply dropped.

Note that UTF-8 characters that are NOT mapped to ASCII are retained, i.e. the result is not plain ASCII but a UTF-8 string with as many characters mapped to ASCII as possible.

  • return: a 'best-effort' ASCII-approximation of $utf8str
  • uses: $UTF8_ASCII
string utf8_strtoascii (string $utf8str)
  • string $utf8str: input string possibly with letters with diacriticals etc.
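
A table-driven mapping of this kind can be sketched with strtr(). The four-entry table below is a made-up stand-in for $UTF8_ASCII (whose real contents are not shown here, so the 'ss' entry in particular is an assumption), and sketch_utf8_strtoascii() is a hypothetical name.

```php
<?php
// Tiny illustrative stand-in for $UTF8_ASCII; the real table is much larger.
function sketch_ascii_table() {
    return array(
        "\xC3\x81" => 'A',    // U+00C1 A-acute
        "\xC3\xA9" => 'e',    // U+00E9 e-acute
        "\xC3\xA8" => 'e',    // U+00E8 e-grave
        "\xC3\x9F" => 'ss',   // U+00DF sharp s (assumed mapping)
    );
}

// Replace every mapped character; anything not in the table is retained.
function sketch_utf8_strtoascii($utf8str) {
    return strtr($utf8str, sketch_ascii_table());
}

echo sketch_utf8_strtoascii("\xC3\xA9l\xC3\xA8ve"), "\n"; // prints eleve
```

Since the table keys are complete UTF-8 sequences, strtr() cannot match in the middle of another character in a valid UTF-8 string.
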
utf8_strtolower (line 134)

fold a UTF-8 string to lower case

This routine tries to use the multibyte routine mb_strtolower() for folding, but if it is not available we fall back to our own translation table $UTF8_UPPER_LOWER, which is derived straight from the Unicode Character Database.

Informal benchmarking yielded no significant speed difference between mb_strtolower() and strtr(). However, our table needs memory to store the 1100+ pairs, whereas with mb_strtolower() this is already taken care of. On the other hand, depending on the version of mbstring, our table (UCD 6.0.0, February 2011) may be more complete and up to date. Oh well.

  • return: the lowercase equivalent of $utf8str
string utf8_strtolower (string $utf8str)
  • string $utf8str: a valid UTF-8 string to convert
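
The choice between mb_strtolower() and a strtr() table might be sketched as follows. The table is a tiny made-up stand-in for $UTF8_UPPER_LOWER, and both the function names and the $use_mbstring parameter (standing in for the USE_MBSTRING constant) are hypothetical.

```php
<?php
// Tiny illustrative stand-in for $UTF8_UPPER_LOWER (the real one has 1100+ pairs).
function sketch_upper_lower_table() {
    return array_combine(range('A', 'Z'), range('a', 'z')) + array(
        "\xC3\x89" => "\xC3\xA9",   // U+00C9 -> U+00E9 (e-acute)
        "\xC3\x88" => "\xC3\xA8",   // U+00C8 -> U+00E8 (e-grave)
        "\xCE\xA9" => "\xCF\x89",   // U+03A9 -> U+03C9 (omega)
    );
}

// Prefer mbstring when allowed and loaded, otherwise translate via the table.
function sketch_utf8_strtolower($utf8str, $use_mbstring = true) {
    if ($use_mbstring && function_exists('mb_strtolower')) {
        return mb_strtolower($utf8str, 'UTF-8');
    }
    return strtr($utf8str, sketch_upper_lower_table());
}

// Force the table-based fallback to see it at work.
echo bin2hex(sketch_utf8_strtolower("\xCE\xA9", false)), "\n"; // prints cf89
```
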
utf8_strtoupper (line 155)

fold a UTF-8 string to upper case (sort of)

This routine tries to use the multibyte routine for folding, but if it is not available we fall back to our own translation table $UTF8_LOWER_UPPER, which is derived from the Unicode Character Database. Note that this table has some quirks because the underlying case mapping is not one-to-one.

  • return: the uppercase equivalent of $utf8str
string utf8_strtoupper (string $utf8str)
  • string $utf8str: a valid UTF-8 string to convert
utf8_substr (line 201)

return part of a UTF-8 string

This routine returns a valid UTF-8 substring from $utf8str based on the $start and $length parameters in a way comparable to substr(). If available, we use the mb_substr() replacement, otherwise we perform the requested actions ourselves.

If we do it ourselves, we prefix variable names with 'c_' for characters and 'b_' for bytes. A UTF-8 character takes up to 4 bytes.

  • return: the requested substring of $utf8str or FALSE if $start points beyond the string
string utf8_substr (string $utf8str, int $start, [int $length = NULL])
  • string $utf8str: a valid UTF-8 string to examine
  • int $start: an offset expressed in characters (not bytes)
  • int $length: the length of the string to return (also expressed in characters)
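
The do-it-ourselves path could look something like the sketch below, which borrows the 'c_' (characters) and 'b_' (bytes) prefixes mentioned above. It is a simplified illustration under stated assumptions: the function names are hypothetical, the input must be valid UTF-8, and only non-negative $start and $length values are handled, whereas substr() also accepts negative ones.

```php
<?php
// Number of bytes in the sequence introduced by lead byte $byte.
// Assumes valid UTF-8, so $byte is never a continuation byte.
function sketch_seq_bytes($byte) {
    if ($byte < 0x80) return 1;
    if ($byte < 0xE0) return 2;
    if ($byte < 0xF0) return 3;
    return 4;
}

function sketch_utf8_substr($utf8str, $c_start, $c_length = null) {
    $b_total = strlen($utf8str);
    $b_pos = 0;
    // Skip $c_start characters to find the byte offset of the result.
    for ($c = 0; $c < $c_start; $c++) {
        if ($b_pos >= $b_total) {
            return false;               // $c_start points beyond the string
        }
        $b_pos += sketch_seq_bytes(ord($utf8str[$b_pos]));
    }
    $b_start = $b_pos;
    // Walk $c_length more characters, or to the end if no length was given.
    for ($c = 0; ($c_length === null || $c < $c_length) && $b_pos < $b_total; $c++) {
        $b_pos += sketch_seq_bytes(ord($utf8str[$b_pos]));
    }
    return (string) substr($utf8str, $b_start, $b_pos - $b_start);
}

// Characters 1..3 of e-acute 'l' e-grave 'v' 'e' occupy bytes 2..5.
echo bin2hex(sketch_utf8_substr("\xC3\xA9l\xC3\xA8ve", 1, 3)), "\n"; // prints 6cc3a876
```
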
utf8_validate (line 85)

check an arbitrary string for UTF-8 conformity

Returns TRUE if the string is valid UTF-8, FALSE otherwise.

  • return: TRUE if valid UTF-8, FALSE otherwise
bool utf8_validate (string $str)
  • string $str: the string to check
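
A check that follows the byte-range table in the Description can be sketched as below. This is not necessarily how utf8_validate() is implemented (its code is not shown here); the name sketch_utf8_validate() is hypothetical. For each lead byte it picks the number of continuation bytes and the restricted range for byte 2 (the ranges marked with * in the table).

```php
<?php
// Hypothetical table-driven validator following the byte ranges listed
// in the Description section.
function sketch_utf8_validate($str) {
    $len = strlen($str);
    $i = 0;
    while ($i < $len) {
        $b1 = ord($str[$i]);
        if ($b1 <= 0x7F) {                      // 00 - 7F: plain ASCII
            $i++;
            continue;
        }
        // $n = number of continuation bytes, $lo..$hi = allowed range for byte 2
        if ($b1 >= 0xC2 && $b1 <= 0xDF)     { $n = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 == 0xE0)                { $n = 2; $lo = 0xA0; $hi = 0xBF; } // no overlong 3-byte
        elseif ($b1 == 0xED)                { $n = 2; $lo = 0x80; $hi = 0x9F; } // no surrogates
        elseif ($b1 >= 0xE1 && $b1 <= 0xEF) { $n = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 == 0xF0)                { $n = 3; $lo = 0x90; $hi = 0xBF; } // no overlong 4-byte
        elseif ($b1 >= 0xF1 && $b1 <= 0xF3) { $n = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b1 == 0xF4)                { $n = 3; $lo = 0x80; $hi = 0x8F; } // up to U+10FFFF
        else {
            return false;                       // 80 - C1 and F5 - FF never start a sequence
        }
        if ($i + $n >= $len) {
            return false;                       // truncated sequence
        }
        $b2 = ord($str[$i + 1]);
        if ($b2 < $lo || $b2 > $hi) {
            return false;
        }
        for ($k = 2; $k <= $n; $k++) {          // bytes 3 and 4 are always 80 - BF
            $b = ord($str[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) {
                return false;
            }
        }
        $i += $n + 1;
    }
    return true;
}

var_dump(sketch_utf8_validate("\xC3\xA9"));     // bool(true): U+00E9
var_dump(sketch_utf8_validate("\xC0\x80"));     // bool(false): overlong 2-byte
```
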

Documentation generated on Wed, 11 May 2011 23:45:46 +0200 by phpDocumentor 1.4.0