UTF-8 decoder

Sep 25, 2020

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format — 8-bit.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.

A byte is 8 bits (binary data).
A byte array is an array of bytes.

You could use a byte array to store a collection of binary data, for example, the contents of a file. The downside to this is that the entire file contents must be loaded into memory.

Encoding

Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following layout shows the structure of the encoding. The x characters are replaced by the bits of the code point.

U+0000 to U+007F: 0xxxxxxx
U+0080 to U+07FF: 110xxxxx 10xxxxxx
U+0800 to U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
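To make the bit layout concrete, here is a minimal sketch that encodes a single code point by hand. The function name encodeCodePoint is my own (hypothetical, not part of Node.js); in real code Buffer.from does this for you.

```javascript
// Hypothetical helper: encode one Unicode code point into UTF-8 bytes by hand.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError(`not a valid Unicode scalar value: ${cp}`)
  }
  if (cp < 0x80) return [cp] // 0xxxxxxx
  if (cp < 0x800) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]
  }
  if (cp < 0x10000) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ]
}

const alphaBytes = encodeCodePoint(0x3b1) // U+03B1 is α
console.log(alphaBytes.map((b) => b.toString(16))) // [ 'ce', 'b1' ]
```

Each continuation byte carries six bits of the code point, which is why the shifts move in steps of 6.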

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode.
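A quick way to check these byte counts in Node.js is Buffer.byteLength. The sample characters below are my own picks, one for each encoded length.

```javascript
// One sample character per UTF-8 length: a, α, 一, and an emoji (U+1F600).
const samples = ['a', 'α', '一', String.fromCodePoint(0x1f600)]

// Buffer.byteLength reports how many bytes a string occupies in an encoding.
const lengths = samples.map((ch) => Buffer.byteLength(ch, 'utf8'))
console.log(lengths) // [ 1, 2, 3, 4 ]

// The two-byte range is U+0080 through U+07FF.
const twoByteCount = 0x7ff - 0x80 + 1
console.log(twoByteCount) // 1920
```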

US-ASCII character set: https://en.wikipedia.org/wiki/ASCII#Character_set

Examples after encoding

After a string is encoded, its bytes are stored in a Buffer:

“asd” — encode → Buffer([0x61, 0x73, 0x64])

“aαbβ” — encode → Buffer([0x61, 0xce, 0xb1, 0x62, 0xce, 0xb2])

“一些華語” — encode → Buffer([0xe4, 0xb8, 0x80, 0xe4, 0xba, 0x9b, 0xe8, 0x8f, 0xaf, 0xe8, 0xaa, 0x9e])
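These examples can be reproduced in Node.js with Buffer.from, or with the standard TextEncoder. This is a short sketch; the variable names are mine.

```javascript
// Encode a string to UTF-8 bytes with Buffer.from.
const encoded = Buffer.from('aαbβ', 'utf8')
console.log(encoded) // <Buffer 61 ce b1 62 ce b2>
console.log(encoded.length) // 6 (bytes, not characters)

// TextEncoder always produces UTF-8 and returns a Uint8Array.
const viaTextEncoder = new TextEncoder().encode('一些華語')
console.log(viaTextEncoder.length) // 12

// Round trip back to text.
console.log(encoded.toString('utf8')) // aαbβ
```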

Buffer

Pure JavaScript is Unicode friendly but not nice to binary data. When dealing with TCP streams or the file system, it’s necessary to handle octet streams. Node has several strategies for manipulating, creating, and consuming octet streams.
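One practical consequence of octet streams: a multi-byte character can be split across two chunks. Node's built-in string_decoder module handles that case. The sketch below simulates the two chunks by hand, which is an assumption for illustration rather than a real TCP or file stream.

```javascript
const { StringDecoder } = require('string_decoder')

const decoder = new StringDecoder('utf8')

// '一' is e4 b8 80. Pretend the stream delivered it in two pieces.
const first = decoder.write(Buffer.from([0x61, 0xe4, 0xb8]))
const second = decoder.write(Buffer.from([0x80, 0x62]))

console.log(JSON.stringify(first)) // "a" (the incomplete bytes are held back)
console.log(JSON.stringify(second)) // "一b"
console.log(JSON.stringify(decoder.end())) // "" (nothing left over)
```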

Raw data is stored in instances of the Buffer class. A Buffer is similar to an array of integers but corresponds to a raw memory allocation outside the V8 heap. A Buffer cannot be resized.
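A small sketch of what "array of integers" and "cannot be resized" mean in practice. It uses only standard Buffer calls (Buffer.alloc, Buffer.concat, subarray); the values are arbitrary examples.

```javascript
const buf = Buffer.alloc(4) // four zero-filled bytes
buf[0] = 0x61
buf[1] = 0x173 // each slot holds one byte, so this is stored as 0x73
console.log(buf) // <Buffer 61 73 00 00>

// The length is fixed. To "grow" a Buffer, build a new one.
const bigger = Buffer.concat([buf.subarray(0, 2), Buffer.from([0x64])])
console.log(bigger.toString('utf8')) // asd

// subarray returns a view over the same memory, not a copy.
const view = buf.subarray(0, 1)
view[0] = 0x62
console.log(buf[0]) // 98
```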

After getting a buffer, how do we decode it?

1. Categorize buffer[i] according to the UTF-8 encoding rules: the leading bits of that byte tell you how many bytes the character uses.
2. Push that many bytes into a new array as one Buffer.
3. Move to the next unread index and repeat from step 1 until the end of the buffer.

function loopBufferArray(buffer, i = 0, newBuffer = []) {
  if (i >= buffer.length) return newBuffer // end of input: stop recursing
  let size
  if (buffer[i] < 0x80) size = 1 // 1st rule: 0xxxxxxx
  else if (buffer[i] >= 0xc0 && buffer[i] < 0xe0) size = 2 // 2nd rule: 110xxxxx
  else if (buffer[i] >= 0xe0 && buffer[i] < 0xf0) size = 3 // 3rd rule: 1110xxxx
  else if (buffer[i] >= 0xf0 && buffer[i] < 0xf8) size = 4 // 4th rule: 11110xxx
  else throw new Error(`invalid UTF-8 lead byte at index ${i}`) // 10xxxxxx or 11111xxx
  newBuffer.push(buffer.slice(i, i + size)) // push the bytes of one character
  return loopBufferArray(buffer, i + size, newBuffer) // move to the next character
}

After this process, each UTF-8 character ends up in its own Buffer.

“asd” — encode→ Buffer([0x61, 0x73, 0x64]) — decode →[Buffer([0x61]), Buffer([0x73]), Buffer([0x64])] → “asd”

“aαbβ” — encode → Buffer([0x61, 0xce, 0xb1, 0x62, 0xce, 0xb2]) — decode →[Buffer([0x61]), Buffer([0xce, 0xb1]), Buffer([0x62]), Buffer([0xce, 0xb2])] → “aαbβ”

“一些華語” — encode → Buffer([0xe4, 0xb8, 0x80, 0xe4, 0xba, 0x9b, 0xe8, 0x8f, 0xaf, 0xe8, 0xaa, 0x9e]) — decode →[Buffer([0xe4, 0xb8, 0x80]), Buffer([0xe4, 0xba, 0x9b]), Buffer([0xe8, 0x8f, 0xaf]), Buffer([0xe8, 0xaa, 0x9e])] → “一些華語”
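The last arrow in each example, from per-character Buffers back to text, can be sketched as follows. The helper name chunkToCodePoint is hypothetical, and the sketch assumes each chunk already holds exactly one well-formed character. It does not reject overlong encodings or surrogates, so treat it as an illustration rather than a validating decoder.

```javascript
// Hypothetical helper: turn the bytes of ONE UTF-8 character into its code point.
function chunkToCodePoint(chunk) {
  if (chunk.length === 1) return chunk[0] // 0xxxxxxx: the byte is the code point
  // Lead byte: mask off the 110 / 1110 / 11110 prefix, keep the payload bits.
  let cp = chunk[0] & (0xff >> (chunk.length + 1))
  for (let k = 1; k < chunk.length; k++) {
    if ((chunk[k] & 0xc0) !== 0x80) {
      throw new Error('expected a continuation byte (10xxxxxx)')
    }
    cp = (cp << 6) | (chunk[k] & 0x3f) // append six payload bits
  }
  return cp
}

const chunks = [
  Buffer.from([0x61]),
  Buffer.from([0xce, 0xb1]),
  Buffer.from([0x62]),
  Buffer.from([0xce, 0xb2]),
]
const text = String.fromCodePoint(...chunks.map(chunkToCodePoint))
console.log(text) // aαbβ
```

In everyday code, buffer.toString('utf8') performs the whole decode in one call. The manual version is only there to show where the bits go.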

Some definitions

Octet (computing)

The octet is a unit of digital information in computing and telecommunications that consists of eight bits. The term is often used when the term byte might be ambiguous, as the byte has historically been used for storage units of a variety of sizes.

Unicode

Unicode is an information technology (IT) standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.

References:

UTF-8 (en.wikipedia.org)
ASCII (en.wikipedia.org)
Java byte Array - byte Array in Java (www.hudatutorials.com)
UTF-8 Decode - Convert UTF-8 to Text - Online - Browserling Web Developer Tools (www.browserling.com)
UTF-8 encoder/decoder (mothereff.in)
Octet (computing) (en.wikipedia.org)
Buffer (node.readthedocs.io)
Node.js Buffers - Create, Write and Read - Examples (www.tutorialkart.com)
Node.js | Buffer.from() Method - GeeksforGeeks (www.geeksforgeeks.org)
Node.js v14.11.0 Documentation (nodejs.org)

Written by CH-K