File /program/modules/crew/server/utf8lib.php

Description

/program/lib/utf8lib.php - utility routines for UTF-8

This file deals with the idiosyncrasies of UTF-8.

Reference: The Unicode Consortium, The Unicode Standard, Version 6.0.0, (Mountain View, CA: The Unicode Consortium, 2011, ISBN 978-1-936213-01-6) <http://www.unicode.org/versions/Unicode6.0.0>, chapter 3 (Conformance), page 94

Summary of how valid code points are stored in sequences of 1 to 4 bytes:

bits   code-points                     1st byte   2nd byte   3rd byte   4th byte
----   -----------------------------   ---------  ---------  ---------  ---------
   7   0000 0000 0000 0000 0xxx xxxx   0xxx.xxxx
  11   0000 0000 0000 0yyy yyxx xxxx   110y.yyyy  10xx.xxxx
  16   0000 0000 zzzz yyyy yyxx xxxx   1110.zzzz  10yy.yyyy  10xx.xxxx
  21   000w wwzz zzzz yyyy yyxx xxxx   1111.0www  10zz.zzzz  10yy.yyyy  10xx.xxxx
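
The bit layout above can be sketched as a small encoder. This is only an illustration, not code taken from utf8lib.php: the function name `sketch_utf8_encode()` is hypothetical, and the scalar type hints assume PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper (not part of utf8lib.php): encode one code point
 * as a UTF-8 byte sequence, following the bit layout table.
 */
function sketch_utf8_encode(int $cp): string
{
    // Reject values outside U+0000 - U+10FFFF and the surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not a valid Unicode scalar value');
    }
    if ($cp <= 0x7F) {
        // 7 bits: 0xxx.xxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 11 bits: 110y.yyyy 10xx.xxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 16 bits: 1110.zzzz 10yy.yyyy 10xx.xxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 21 bits: 1111.0www 10zz.zzzz 10yy.yyyy 10xx.xxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}
```

For example, `bin2hex(sketch_utf8_encode(0x20AC))` yields `e282ac`, the 3-byte sequence for U+20AC.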

bits    byte 1   byte 2   byte 3   byte 4  comments
----   -------  -------  -------  -------  --------
   7   00 - 7F                             U+0000 - U+007F
       80 - BF                             --> ill-formed (stray continuation byte)
       C0 - C1  80 - BF                    --> overlong 2-byte
  11   C2 - DF  80 - BF                    U+0080 - U+07FF
       E0       80 - 9F* 80 - BF           --> overlong 3-byte
  16   E0       A0 - BF* 80 - BF           U+0800 - U+0FFF
  16   E1 - EC  80 - BF  80 - BF           U+1000 - U+CFFF
  16   ED       80 - 9F* 80 - BF           U+D000 - U+D7FF
       ED       A0 - BF* 80 - BF           --> surrogates (U+D800 - U+DFFF)
  16   EE - EF  80 - BF  80 - BF           U+E000 - U+FFFF
       F0       80 - 8F* 80 - BF  80 - BF  --> overlong 4-byte
  21   F0       90 - BF* 80 - BF  80 - BF  U+10000 - U+3FFFF
  21   F1 - F3  80 - BF  80 - BF  80 - BF  U+40000 - U+FFFFF
  21   F4       80 - 8F* 80 - BF  80 - BF  U+100000 - U+10FFFF
       F4       90 - BF* 80 - BF  80 - BF  --> invalid planes 11 - 13
       F5 - F7  80 - BF  80 - BF  80 - BF  --> invalid planes 14 - 1F
       F8 - FB                             --> invalid 5-byte sequence
       FC - FD                             --> invalid 6-byte sequence
       FE - FF                             --> never valid in UTF-8 (UTF-16 BOM bytes)

Note: Ranges that deviate from the usual continuation range 80 - BF are marked with *; this only occurs in byte 2.
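
The byte-range table can be turned almost directly into a validity check. The following is a minimal sketch under stated assumptions, not the implementation found in utf8lib.php: the name `sketch_utf8_is_valid()` is hypothetical, and PHP 7 or later is assumed for the type hints. The first byte selects the sequence length and the allowed range for byte 2; all further bytes must lie in 80 - BF.

```php
<?php
/**
 * Hypothetical helper (not part of utf8lib.php): return true if $s consists
 * only of well-formed UTF-8 sequences according to the byte-range table.
 */
function sketch_utf8_is_valid(string $s): bool
{
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b1 = ord($s[$i]);
        // $extra = number of bytes after byte 1; $lo/$hi = range for byte 2.
        if ($b1 <= 0x7F) {
            $i++;                       // 00 - 7F: single byte
            continue;
        } elseif ($b1 >= 0xC2 && $b1 <= 0xDF) {
            $extra = 1; $lo = 0x80; $hi = 0xBF;
        } elseif ($b1 === 0xE0) {
            $extra = 2; $lo = 0xA0; $hi = 0xBF;   // excludes overlong 3-byte
        } elseif ($b1 === 0xED) {
            $extra = 2; $lo = 0x80; $hi = 0x9F;   // excludes surrogates
        } elseif ($b1 >= 0xE1 && $b1 <= 0xEF) {
            $extra = 2; $lo = 0x80; $hi = 0xBF;   // E1 - EC and EE - EF
        } elseif ($b1 === 0xF0) {
            $extra = 3; $lo = 0x90; $hi = 0xBF;   // excludes overlong 4-byte
        } elseif ($b1 >= 0xF1 && $b1 <= 0xF3) {
            $extra = 3; $lo = 0x80; $hi = 0xBF;
        } elseif ($b1 === 0xF4) {
            $extra = 3; $lo = 0x80; $hi = 0x8F;   // excludes > U+10FFFF
        } else {
            return false;               // 80 - C1 and F5 - FF
        }
        if ($i + $extra >= $n) {
            return false;               // sequence is truncated
        }
        $b2 = ord($s[$i + 1]);
        if ($b2 < $lo || $b2 > $hi) {
            return false;
        }
        for ($k = 2; $k <= $extra; $k++) {
            $b = ord($s[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) {
                return false;
            }
        }
        $i += $extra + 1;
    }
    return true;
}
```

With this sketch, `sketch_utf8_is_valid("\xC3\xA9")` is true, while the overlong `"\xC0\x80"` and the surrogate `"\xED\xA0\x80"` are both rejected. Where the mbstring extension is available, `mb_check_encoding($s, 'UTF-8')` is a built-in alternative.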

Documentation generated on Tue, 28 Jun 2016 19:12:39 +0200 by phpDocumentor 1.4.0