What Is Unicode Encoding?

Unicode is a universal character encoding standard that assigns a unique number — called a code point — to every character used in human languages. Before Unicode, there were dozens of incompatible encoding systems. A document written in Russian on one system was unreadable on another. Unicode solves this by providing a single, unified character set that covers over 150 scripts and 140,000 characters.

Strictly speaking, Unicode itself is not a byte encoding. It defines a character repertoire and assigns each character a code point. The encoding layer, meaning how code points are stored as bytes, is handled by encoding forms such as UTF-8, UTF-16, and UTF-32. Understanding the distinction between a code point and its byte representation is the key to avoiding encoding bugs.

Code Points

Every Unicode character has a code point written as U+ followed by a hexadecimal number. The letter A is U+0041, the euro sign is U+20AC, and the emoji 😀 is U+1F600.
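As a small illustration of the U+ notation, the following Python sketch converts between characters and code points using the built-in ord() and chr(). The helper name code_point_label is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as at least 4 hex digits.
    return f"U+{ord(ch):04X}"


for ch in ["A", "€", "😀"]:
    print(ch, code_point_label(ch))

# chr() goes the other way, from a code point number back to a character.
euro = chr(0x20AC)
print(euro)
```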

Code points are organised into 17 planes, each containing 65,536 code points (not all of which are assigned to characters). The first plane (Plane 0) is the Basic Multilingual Plane (BMP), which contains the most common characters including Latin, Greek, Cyrillic, CJK, and many others. Planes 1 through 16 contain supplementary characters like emoji, historical scripts, and rare CJK characters.

Plane	Range	Name
0	U+0000 to U+FFFF	Basic Multilingual Plane (BMP)
1	U+10000 to U+1FFFF	Supplementary Multilingual Plane (SMP)
2	U+20000 to U+2FFFF	Supplementary Ideographic Plane (SIP)
3	U+30000 to U+3FFFF	Tertiary Ideographic Plane (TIP)
4-13	U+40000 to U+DFFFF	Unassigned
14	U+E0000 to U+EFFFF	Supplementary Special-purpose Plane (SSP)
15-16	U+F0000 to U+10FFFF	Private Use Planes
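Because each plane spans exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with plane_of and is_bmp as illustrative names:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    # Each plane holds 0x10000 code points, so the plane is the bits above the low 16.
    return ord(ch) >> 16


def is_bmp(ch: str) -> bool:
    """True if the character lives in the Basic Multilingual Plane."""
    return plane_of(ch) == 0


for ch in ["A", "€", "😀"]:
    print(ch, plane_of(ch), is_bmp(ch))
```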

UTF-8, UTF-16, UTF-32

These three encoding forms differ in how they map code points to bytes.

UTF-8

UTF-8 is the dominant encoding on the web. It uses 1 to 4 bytes per code point and is backward-compatible with ASCII. Characters in the ASCII range (U+0000 to U+007F) use one byte. Accented Latin letters, Greek, Cyrillic, Hebrew, and Arabic use two bytes. Most CJK characters use three bytes, and emoji and other supplementary characters use four.

Code Point Range	Byte 1	Byte 2	Byte 3	Byte 4
U+0000 - U+007F	0xxxxxxx
U+0080 - U+07FF	110xxxxx	10xxxxxx
U+0800 - U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000 - U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
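The bit layout in the table can be turned into code almost line by line. The sketch below is a hand-written encoder for illustration only (real code should call str.encode). The function name utf8_encode is made up, and the rejection of surrogate code points (U+D800 to U+DFFF) is a detail added here that the table does not show.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by following the bit layout table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against Python's built-in encoder for one character per row of the table.
for ch in ["A", "é", "€", "😀"]:
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```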

UTF-8 advantages:

ASCII text is valid UTF-8 (no conversion needed)
Self-synchronising: you can always find character boundaries
No byte order issues (little-endian / big-endian)
Most space-efficient for Latin-script text
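The self-synchronising property can be sketched in a few lines: continuation bytes always match the pattern 10xxxxxx, so from any position you can step backwards until you reach a byte that does not. The helper name char_start is made up for this example, and it assumes the input is valid UTF-8.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index where the character containing byte i begins."""
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "café ☕".encode("utf-8")
print(char_start(data, 4))  # byte 4 is the second byte of é
print(char_start(data, 8))  # byte 8 is the last byte of ☕
```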

UTF-16

UTF-16 uses one 2-byte code unit for BMP characters and two code units (4 bytes), called a surrogate pair, for supplementary characters. Windows, Java, and JavaScript use UTF-16 internally.
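To make the surrogate pair idea concrete, here is a minimal sketch of the arithmetic: subtract 0x10000, then split the remaining 20 bits into a high and a low half. The function name to_surrogate_pair is illustrative.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    v = cp - 0x10000            # 20 bits remain
    high = 0xD800 + (v >> 10)   # top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)  # the emoji 😀
print(f"{high:04X} {low:04X}")
```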

UTF-16 has byte order issues. The same code unit can be stored little-endian (UTF-16LE) or big-endian (UTF-16BE), so a UTF-16 file needs to signal which order it uses, typically with a BOM (byte order mark, U+FEFF) at the start.
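A rough sketch of how the byte order shows up in practice, using Python's codecs module. The function detect_utf16_byte_order is a made-up helper, and it only looks at the first two bytes, so it is not a general encoding detector.

```python
import codecs


def detect_utf16_byte_order(data: bytes):
    """Guess the UTF-16 byte order from a leading BOM, or return None."""
    # Caveat: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE,
    # so this check alone cannot tell UTF-16LE and UTF-32LE apart.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# The same character, two byte orders.
print("A".encode("utf-16-le"))  # low byte first
print("A".encode("utf-16-be"))  # high byte first

# The plain "utf-16" codec prepends a BOM so readers can tell which order was used.
with_bom = "A".encode("utf-16")
print(detect_utf16_byte_order(with_bom))
```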

UTF-32

UTF-32 uses exactly 4 bytes for every code point. It is simple but wasteful: a purely ASCII file becomes four times larger than its UTF-8 equivalent. UTF-32 is rarely used for files or network traffic, though some programs use it for in-memory processing.
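The size difference is easy to check. This sketch encodes the same text in all three forms and counts the bytes. It uses the explicit little-endian codecs so that no BOM is included in the count, and encoded_sizes is an illustrative name.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


print(encoded_sizes("hello"))   # pure ASCII
print(encoded_sizes("café ☕"))  # mixed text
```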

Comparison

Feature	UTF-8	UTF-16	UTF-32
Bytes per code point	1-4	2 or 4	4
ASCII compatibility	Yes	No	No
Web usage	~98%	<0.1%	Negligible
BOM needed	No	Yes, unless the byte order is declared	Yes, unless the byte order is declared
Common in	Web, Linux, JSON	Windows, Java, JS	Internal processing

Encoding Bugs

The most common encoding bug is decoding bytes with the wrong encoding. When you take a UTF-8 string and interpret it as Latin-1 (ISO 8859-1), accented characters appear as garbled symbols: é, for example, shows up as Ã©. This happens when the encoding is not declared or when a tool assumes the wrong encoding.
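The garbling is easy to reproduce on purpose. This sketch encodes a string as UTF-8, decodes the bytes as Latin-1, and then reverses the mistake. The reversal only works here because Latin-1 maps every byte to a character, so nothing is lost along the way.

```python
original = "café"

# Wrong: UTF-8 bytes read as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Undo the mistake by reversing both steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```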

Another common issue is counting bytes instead of characters. In UTF-8, the string "café" is 4 characters but 5 bytes because é uses two bytes. Functions like PHP's strlen count bytes by default, while mb_strlen counts characters.
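To see how the "length" of a string depends on what you count, the sketch below reports code points, UTF-8 bytes, and UTF-16 code units side by side. The helper name text_lengths is made up for this example.

```python
def text_lengths(text: str) -> dict:
    """Count the same text three ways."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; utf-16-le adds no BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


print(text_lengths("café"))
print(text_lengths("😀"))
```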

Programming Examples

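// PHP - use mb_* functions when working with UTF-8 text
$text = 'café ☕';
echo strlen($text), PHP_EOL;             // 9 (bytes, not characters)
echo mb_strlen($text, 'UTF-8'), PHP_EOL; // 6 (characters)

// Convert the UTF-8 string to UTF-16 (big-endian) and print the bytes as hex
echo bin2hex(mb_convert_encoding($text, 'UTF-16BE', 'UTF-8')), PHP_EOL;
// 00630061006600e900202615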

# Python 3 - strings are Unicode by default
text = 'café ☕'
print(len(text))         # 6 characters
bytes_utf8 = text.encode('utf-8')
bytes_utf16 = text.encode('utf-16')  # starts with a BOM
print(bytes_utf8)        # b'caf\xc3\xa9 \xe2\x98\x95'
print(len(bytes_utf8))   # 9 bytes

// JavaScript - strings are UTF-16 internally
const text = 'café ☕';
console.log(text.length);     // 6 UTF-16 code units (an emoji such as 😀 counts as 2)

// Encode as UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
console.log(bytes);  // Uint8Array(9)

Online Tool

The Unicode Encoder & Decoder tool on Help2Code converts text to Unicode code points and between UTF-8, UTF-16, and UTF-32 representations. It is useful for debugging encoding issues and learning how Unicode works.

Conclusion

Unicode encoding is essential knowledge for any developer working with international text. UTF-8 is the standard for the web and should be your default. Understanding the difference between code points and byte representations will help you avoid the most common text encoding bugs. Use the Unicode Encoder & Decoder to experiment with encodings and debug encoding issues.
