/program/lib/utf8lib.php - utility-routines for UTF-8
This file deals with the idiosyncrasies of UTF-8.
Reference: The Unicode Consortium, The Unicode Standard, Version 6.0.0 (Mountain View, CA: The Unicode Consortium, 2011, ISBN 978-1-936213-01-6) <http://www.unicode.org/versions/Unicode6.0.0>, chapter 3 (Conformance), page 94
Summary of the way valid code points are stored in 1 to 4 byte sequences.
bits  code-points                    1st byte   2nd byte   3rd byte   4th byte
----  -----------------------------  ---------  ---------  ---------  ---------
   7  0000 0000 0000 0000 0xxx xxxx  0xxx.xxxx
  11  0000 0000 0000 0yyy yyxx xxxx  110y.yyyy  10xx.xxxx
  16  0000 0000 zzzz yyyy yyxx xxxx  1110.zzzz  10yy.yyyy  10xx.xxxx
  21  000w wwzz zzzz yyyy yyxx xxxx  1111.0www  10zz.zzzz  10yy.yyyy  10xx.xxxx

bits  byte 1   byte 2    byte 3   byte 4   comments
----  -------  --------  -------  -------  --------
   7  00 - 7F                              U+0000 - U+007F
      80 - BF                              --> ill-formed
      C0 - C1  80 - BF                     --> overlong 2-byte
  11  C2 - DF  80 - BF                     U+0080 - U+07FF
      E0       80 - 9F*  80 - BF           --> overlong 3-byte
  16  E0       A0 - BF*  80 - BF           U+0800 - U+0FFF
  16  E1 - EC  80 - BF   80 - BF           U+1000 - U+CFFF
  16  ED       80 - 9F*  80 - BF           U+D000 - U+D7FF
      ED       A0 - BF*  80 - BF           --> surrogates (U+D800 - U+DFFF)
  16  EE - EF  80 - BF   80 - BF           U+E000 - U+FFFF
      F0       80 - 8F*  80 - BF  80 - BF  --> overlong 4-byte
  21  F0       90 - BF*  80 - BF  80 - BF  U+10000 - U+3FFFF
  21  F1 - F3  80 - BF   80 - BF  80 - BF  U+40000 - U+FFFFF
  21  F4       80 - 8F*  80 - BF  80 - BF  U+100000 - U+10FFFF
      F4       90 - BF*  80 - BF  80 - BF  --> invalid planes 11 - 13
      F5 - F7  80 - BF   80 - BF  80 - BF  --> invalid planes 14 - 1F
      F8 - FB                              --> invalid 5-byte sequence
      FC - FD                              --> invalid 6-byte sequence
      FE - FF                              --> disallowed (BOM)
Note: Continuation ranges that differ from the standard 80 - BF are marked with * (this occurs only in byte 2)
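As an illustration of how the two tables above can be applied, here is a minimal sketch in PHP. The function names sketch_utf8_encode() and sketch_utf8_is_wellformed() are hypothetical and are not taken from utf8lib.php; the library's actual routines may be organised quite differently. The encoder follows the bit layout of the first table, and the checker follows the byte ranges of the second table, including the restricted byte 2 ranges marked with *.

```php
<?php
// Illustrative sketch only: both helper names are hypothetical and are
// NOT part of utf8lib.php.

// Encode one code point using the bit layout of the first table.
// Returns false for surrogates and for values outside U+0000 - U+10FFFF.
function sketch_utf8_encode($cp) {
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return false;
    }
    if ($cp < 0x80) {                  // 7 bits:  0xxx.xxxx
        return chr($cp);
    }
    if ($cp < 0x800) {                 // 11 bits: 110y.yyyy 10xx.xxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {               // 16 bits: 1110.zzzz 10yy.yyyy 10xx.xxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 21 bits: 1111.0www 10zz.zzzz 10yy.yyyy 10xx.xxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Check a byte string against the byte ranges of the second table.
function sketch_utf8_is_wellformed($s) {
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b = ord($s[$i]);
        // $lo/$hi bound byte 2; the standard continuation range is 80 - BF.
        $lo = 0x80;
        $hi = 0xBF;
        if ($b <= 0x7F)                   { $extra = 0; }
        elseif ($b >= 0xC2 && $b <= 0xDF) { $extra = 1; }
        elseif ($b == 0xE0)               { $extra = 2; $lo = 0xA0; } // no overlong 3-byte
        elseif ($b == 0xED)               { $extra = 2; $hi = 0x9F; } // no surrogates
        elseif ($b >= 0xE1 && $b <= 0xEF) { $extra = 2; }
        elseif ($b == 0xF0)               { $extra = 3; $lo = 0x90; } // no overlong 4-byte
        elseif ($b >= 0xF1 && $b <= 0xF3) { $extra = 3; }
        elseif ($b == 0xF4)               { $extra = 3; $hi = 0x8F; } // stop at U+10FFFF
        else                              { return false; }           // 80 - C1, F5 - FF
        if ($i + $extra >= $n) {
            return false;                 // sequence is truncated
        }
        for ($k = 1; $k <= $extra; $k++) {
            $c = ord($s[$i + $k]);
            if ($c < $lo || $c > $hi) {
                return false;
            }
            $lo = 0x80;                   // bytes 3 and 4 always use 80 - BF
            $hi = 0xBF;
        }
        $i += $extra + 1;
    }
    return true;
}
```

With these definitions, sketch_utf8_is_wellformed(sketch_utf8_encode(0x20AC)) yields true, while the overlong sequence "\xC0\xAF" and the encoded surrogate "\xED\xA0\x80" are both rejected.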
Documentation generated on Tue, 28 Jun 2016 19:12:39 +0200 by phpDocumentor 1.4.0