UTF-8 decoder

CH-K
4 min read · Sep 25, 2020
Photo by Paul Zoetemeijer on Unsplash

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format — 8-bit.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.

A byte is 8 bits (binary data).

A byte array is an array of bytes.

You could use a byte array to store a collection of binary data, for example, the contents of a file. The downside to this is that the entire file contents must be loaded into memory.

Encoding

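Since the restriction of the Unicode code space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding. The x characters are replaced by the bits of the code point.

Code point range       Byte 1     Byte 2     Byte 3     Byte 4
U+0000 to U+007F       0xxxxxxx
U+0080 to U+07FF       110xxxxx   10xxxxxx
U+0800 to U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 to U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx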
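To make the bit layout concrete, here is a minimal sketch of an encoder for a single code point. The function name encodeCodePoint is hypothetical (it is not part of Node or of the original article). In real code, Buffer.from(string, 'utf8') does this for you.

```javascript
// Hypothetical helper: encode ONE Unicode code point into its UTF-8 bytes.
function encodeCodePoint(cp) {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || isSurrogate) {
    throw new RangeError(`not a valid code point: ${cp}`)
  }
  if (cp < 0x80) return [cp] // 0xxxxxxx
  if (cp < 0x800) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]
  }
  if (cp < 0x10000) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ]
}

// α is U+03B1 and needs two bytes
console.log(encodeCodePoint(0x3b1).map((b) => '0x' + b.toString(16)))
```

Each continuation byte carries 6 bits of the code point, which is why the shifts step down by 6.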

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode.

For the US-ASCII character set, see https://en.wikipedia.org/wiki/ASCII#Character_set
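These counts can be checked with a short script. The sketch below walks every valid code point and asks Node's Buffer.byteLength how many UTF-8 bytes it needs. The variable name counts is only illustrative.

```javascript
// Tally how many code points need 1, 2, 3 or 4 bytes in UTF-8.
const counts = { 1: 0, 2: 0, 3: 0, 4: 0 }

for (let cp = 0; cp <= 0x10ffff; cp++) {
  // Surrogates (U+D800 to U+DFFF) are not valid code points for UTF-8, skip them.
  if (cp >= 0xd800 && cp <= 0xdfff) continue
  counts[Buffer.byteLength(String.fromCodePoint(cp), 'utf8')]++
}

console.log(counts)
```

The one-byte and two-byte buckets come out as 128 and 1,920, and the four buckets add up to the 1,112,064 valid code points mentioned earlier.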

Examples after encoding

After a string is encoded, its bytes are stored in a buffer array:

“asd” → encode → Buffer([0x61, 0x73, 0x64])

“aαbβ” → encode → Buffer([0x61, 0xce, 0xb1, 0x62, 0xce, 0xb2])

“一些華語” → encode → Buffer([0xe4, 0xb8, 0x80, 0xe4, 0xba, 0x9b, 0xe8, 0x8f, 0xaf, 0xe8, 0xaa, 0x9e])
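In Node, these byte sequences can be reproduced with Buffer.from. The helper name toHexBytes below is hypothetical and only formats the output.

```javascript
// Hypothetical helper: encode a string as UTF-8 and list its bytes in hex.
const toHexBytes = (text) =>
  [...Buffer.from(text, 'utf8')].map((b) => '0x' + b.toString(16).padStart(2, '0'))

for (const text of ['asd', 'aαbβ', '一些華語']) {
  console.log(text, toHexBytes(text).join(', '))
}
```

ASCII letters take one byte each, the Greek letters two, and the Chinese characters three.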

Buffer

Pure JavaScript is Unicode friendly but not nice to binary data. When dealing with TCP streams or the file system, it’s necessary to handle octet streams. Node has several strategies for manipulating, creating, and consuming octet streams.
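One practical consequence of working with octet streams is that a multi-byte character can be split across two chunks. The sketch below simulates that with two hand-made chunks (an assumption for illustration, not real network data) and decodes them with the standard TextDecoder in streaming mode, which is available as a global in current Node versions.

```javascript
// Simulated stream: the 2-byte character α (0xce 0xb1) is split across two chunks.
const chunks = [Buffer.from([0x61, 0xce]), Buffer.from([0xb1, 0x62])]

const decoder = new TextDecoder('utf-8')
let text = ''
for (const chunk of chunks) {
  // stream: true holds back an incomplete trailing sequence until the next chunk
  text += decoder.decode(chunk, { stream: true })
}
text += decoder.decode() // flush anything still buffered

console.log(text) // aαb
```

Decoding each chunk on its own with chunk.toString('utf8') would instead produce replacement characters at the split.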

Raw data is stored in instances of the Buffer class. A Buffer is similar to an array of integers but corresponds to a raw memory allocation outside the V8 heap. A Buffer cannot be resized.
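A small sketch of what "similar to an array of integers" and "cannot be resized" look like in practice. The variable names are illustrative only.

```javascript
const fixed = Buffer.alloc(3) // three zero-filled bytes
fixed[0] = 0x61
fixed[1] = 300 // each slot holds one byte, so 300 is stored as 300 % 256 = 44
fixed[5] = 0x64 // index past the end: the write is ignored, length stays 3

// "Growing" a Buffer really means allocating a new one and copying the bytes over.
const grown = Buffer.concat([fixed, Buffer.from([0x64])])

console.log(fixed.length, grown.length) // 3 4
```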

After getting a buffer array, how do we decode it?

  1. Categorize buffer[i] according to the UTF-8 encoding rules.
  2. Push the bytes of that character into a new buffer array.
  3. Move to the next unread index and repeat from step 1 until the end of the buffer array.

function loopBufferArray(buffer, i = 0, newBuffer = []) {
  if (i >= buffer.length) return newBuffer // end of the buffer array: stop

  let size
  if (buffer[i] < 128) size = 1 // 1st rule of 0xxxxxxx
  else if (buffer[i] < 192) throw new Error('unexpected continuation byte 10xxxxxx at index ' + i)
  else if (buffer[i] < 224) size = 2 // 2nd rule of 110xxxxx
  else if (buffer[i] < 240) size = 3 // 3rd rule of 1110xxxx
  else if (buffer[i] < 248) size = 4 // 4th rule of 11110xxx
  else throw new Error('invalid leading byte at index ' + i)

  newBuffer.push(buffer.slice(i, i + size)) // push the 1 to 4 bytes of this character
  return loopBufferArray(buffer, i + size, newBuffer) // shift to the next character
}
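The function above calls itself once per character, so a very long buffer could exhaust the call stack. As an alternative, here is a loop-based sketch of the same three steps. The names sequenceLength and splitUtf8 are hypothetical, and the sketch only checks leading bytes and truncation, not every rule of well-formed UTF-8.

```javascript
// Hypothetical helper: number of bytes announced by a leading byte.
function sequenceLength(byte) {
  if (byte < 0x80) return 1 // 0xxxxxxx
  if (byte >= 0xc0 && byte < 0xe0) return 2 // 110xxxxx
  if (byte >= 0xe0 && byte < 0xf0) return 3 // 1110xxxx
  if (byte >= 0xf0 && byte < 0xf8) return 4 // 11110xxx
  throw new Error(`0x${byte.toString(16)} cannot start a UTF-8 character`)
}

// Split a Buffer into one sub-Buffer per character, without recursion.
function splitUtf8(buffer) {
  const parts = []
  let i = 0
  while (i < buffer.length) {
    const length = sequenceLength(buffer[i])
    if (i + length > buffer.length) {
      throw new Error('truncated sequence at the end of the buffer')
    }
    parts.push(buffer.subarray(i, i + length)) // a view, no bytes are copied
    i += length
  }
  return parts
}

console.log(splitUtf8(Buffer.from('aαbβ', 'utf8')))
```

subarray returns views that share memory with the original Buffer, which is usually fine for read-only decoding.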

From the above process, we can see that each UTF-8 character ends up in its own buffer.

“asd” → encode → Buffer([0x61, 0x73, 0x64]) → decode → [Buffer([0x61]), Buffer([0x73]), Buffer([0x64])] → “asd”

“aαbβ” → encode → Buffer([0x61, 0xce, 0xb1, 0x62, 0xce, 0xb2]) → decode → [Buffer([0x61]), Buffer([0xce, 0xb1]), Buffer([0x62]), Buffer([0xce, 0xb2])] → “aαbβ”

“一些華語” → encode → Buffer([0xe4, 0xb8, 0x80, 0xe4, 0xba, 0x9b, 0xe8, 0x8f, 0xaf, 0xe8, 0xaa, 0x9e]) → decode → [Buffer([0xe4, 0xb8, 0x80]), Buffer([0xe4, 0xba, 0x9b]), Buffer([0xe8, 0x8f, 0xaf]), Buffer([0xe8, 0xaa, 0x9e])] → “一些華語”
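The last arrow, from the per-character buffers back to a string, can be sketched by stripping the marker bits and joining the remaining payload bits. The function name decodeChunk is hypothetical, and it assumes each chunk is already one well-formed character (it does no validation).

```javascript
// Hypothetical helper: turn the bytes of ONE character into its code point.
function decodeChunk(chunk) {
  if (chunk.length === 1) return chunk[0] // 0xxxxxxx: the byte is the code point
  // Leading byte: drop the 110, 1110 or 11110 marker and keep the payload bits.
  let codePoint = chunk[0] & (0xff >> (chunk.length + 1))
  for (let i = 1; i < chunk.length; i++) {
    // Each continuation byte 10xxxxxx contributes its low 6 bits.
    codePoint = (codePoint << 6) | (chunk[i] & 0x3f)
  }
  return codePoint
}

const chunks = [
  Buffer.from([0x61]),
  Buffer.from([0xce, 0xb1]),
  Buffer.from([0x62]),
  Buffer.from([0xce, 0xb2]),
]
const text = String.fromCodePoint(...chunks.map(decodeChunk))
console.log(text) // aαbβ
```

For production use, buffer.toString('utf8') or TextDecoder handles validation and malformed input as well.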

Some definitions

Octet (computing)

The octet is a unit of digital information in computing and telecommunications that consists of eight bits. The term is often used when the term byte might be ambiguous, as the byte has historically been used for storage units of a variety of sizes.

Unicode

Unicode is an information technology (IT) standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.
