Some important information about UTF-8
--------------------------------------
UTF-8 is the de facto standard text encoding of today. Here's how it
represents each codepoint with a variable number of bytes:
1. Single bytes represent codepoints from 0x00 to 0x7F (identical to 7-bit
ASCII)
2. Multibyte sequences (codepoints from 0x80 to 0x10FFFF):
Header bytes: 0xC2 to 0xF4 (the bit layout alone would allow 0xC0 to 0xFD, but
modern UTF-8, RFC 3629, stops at 4-byte sequences and forbids overlong forms)
- the number of '1' bits above the topmost '0' bit indicates the number of
bytes (including this one) in the whole sequence
- the data payload starts _after_ the topmost '0' bit in this byte
Trailer bytes: 0x80 to 0xBF
- the data payload starts _after_ the topmost '0' bit in this byte (the low 6
bits)
3. Invalid bytes: 0xC0, 0xC1 and 0xF5 to 0xFF - must never occur in a UTF-8
text
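The layout above can be sketched in Python. This is only an illustration, not
a replacement for the built-in codecs: the function names utf8_encode and
utf8_sequence_length are made up for this note, and the code assumes the
modern RFC 3629 limits (at most 4 bytes per codepoint, no overlong forms, no
surrogates), which is stricter than the bare bit layout.

```python
def utf8_encode(cp):
    """Encode one codepoint using the header/trailer byte layout."""
    # Assumption: RFC 3629 rules, so surrogates and values above 0x10FFFF
    # are rejected.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable codepoint")
    if cp < 0x80:
        return bytes([cp])              # single byte, identical to ASCII
    if cp < 0x800:
        n, header = 2, 0xC0             # 110xxxxx
    elif cp < 0x10000:
        n, header = 3, 0xE0             # 1110xxxx
    else:
        n, header = 4, 0xF0             # 11110xxx
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))  # trailer byte: 10xxxxxx, low 6 bits
        cp >>= 6
    out.append(header | cp)             # remaining high bits go in the header
    return bytes(reversed(out))


def utf8_sequence_length(first):
    """Count the '1' bits above the topmost '0' bit of the first byte."""
    if first < 0x80:
        return 1                        # plain ASCII byte
    if first < 0xC2 or first > 0xF4:
        raise ValueError("not a valid first byte")
    n, mask = 0, 0x80
    while first & mask:
        n += 1
        mask >>= 1
    return n
```

For instance, utf8_encode(0x20AC) gives the three bytes E2 82 AC, and
utf8_sequence_length(0xE2) reports 3 for the first of them.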
Surrogate pairs (the UTF-16 representation of codepoints above 0xFFFF = 65535
with two 16-bit values from 0xD800 to 0xDFFF within this range):
1. Convert from a codepoint to the pair:
lead = 0xD7C0 + (codepoint >> 10)
trail = 0xDC00 + (codepoint & 0x3FF)
2. Convert from a pair to the codepoint:
codepoint = (lead << 10) + trail - 0x35FDC00
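The two formulas can be checked with a short Python sketch. The function names
to_surrogates and from_surrogates are made up for this note, and the range
checks are an added assumption (the formulas themselves do no validation).

```python
def to_surrogates(cp):
    """Split a codepoint above 0xFFFF into a (lead, trail) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    # 0xD7C0 is 0xD800 - (0x10000 >> 10), so the offset is folded in
    lead = 0xD7C0 + (cp >> 10)
    trail = 0xDC00 + (cp & 0x3FF)
    return lead, trail


def from_surrogates(lead, trail):
    """Combine a (lead, trail) pair back into one codepoint."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # 0x35FDC00 is (0xD800 << 10) + 0xDC00 - 0x10000
    return (lead << 10) + trail - 0x35FDC00
```

For instance, to_surrogates(0x1F600) gives (0xD83D, 0xDE00), and
from_surrogates(0xD83D, 0xDE00) gives 0x1F600 back.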
--- Luxferre ---