What Is Unicode Encoding?
Unicode is a universal character encoding standard that assigns a unique number — called a code point — to every character used in human languages. Before Unicode, there were dozens of incompatible encoding systems. A document written in Russian on one system was unreadable on another. Unicode solves this by providing a single, unified character set that covers over 150 scripts and 140,000 characters.
The Unicode character set is not itself a byte encoding. It is a character repertoire in which each character has a code point. The encoding layer, that is, how code points are stored as bytes, is handled by encoding forms such as UTF-8, UTF-16, and UTF-32. Understanding the distinction between a code point and its byte representation is the key to avoiding encoding bugs.
Code Points
Every Unicode character has a code point written as U+ followed by a hexadecimal number. The letter A is U+0041, the euro sign is U+20AC, and the emoji 😀 is U+1F600.
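As a small illustration, Python's built-in `ord` and `chr` convert between a character and its code point. The helper name `code_point_label` is made up for this sketch and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "€", "😀"]:
    print(ch, code_point_label(ch))

# chr() goes the other way, from a code point to a character
euro = chr(0x20AC)
print(euro)
```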
Code points are organised into 17 planes, each containing 65,536 code points (not all of which are assigned to characters). The first plane (Plane 0) is the Basic Multilingual Plane (BMP), which contains the most common characters including Latin, Greek, Cyrillic, CJK, and many others. Planes 1 through 16 contain supplementary characters like emoji, historical scripts, and rare CJK characters.
| Plane | Range | Name |
|---|---|---|
| 0 | U+0000 to U+FFFF | Basic Multilingual Plane (BMP) |
| 1 | U+10000 to U+1FFFF | Supplementary Multilingual Plane (SMP) |
| 2 | U+20000 to U+2FFFF | Supplementary Ideographic Plane (SIP) |
| 3 | U+30000 to U+3FFFF | Tertiary Ideographic Plane (TIP) |
| 4-13 | U+40000 to U+DFFFF | Unassigned |
| 14 | U+E0000 to U+EFFFF | Supplementary Special-purpose Plane (SSP) |
| 15-16 | U+F0000 to U+10FFFF | Private Use Planes |
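Because each plane spans exactly 0x10000 code points, the plane number can be read off by shifting the code point right by 16 bits. A minimal Python sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16  # each plane covers 0x10000 code points

for ch in ["A", "€", "😀", "\U00020000"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```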
UTF-8, UTF-16, UTF-32
These three encoding forms differ in how they map code points to bytes.
UTF-8
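UTF-8 is the dominant encoding on the web. It uses 1 to 4 bytes per character and is backward-compatible with ASCII. Characters in the ASCII range (U+0000 to U+007F) use one byte. Most other alphabetic scripts of Europe and the Middle East, including accented Latin letters, Greek, Cyrillic, Hebrew, and Arabic, use two bytes. Most CJK characters use three bytes, and supplementary characters such as most emoji use four.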
| Code Point Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000 - U+007F | 0xxxxxxx | | | |
| U+0080 - U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 - U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 - U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
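The bit layout in the table can be turned into code fairly directly. The following Python sketch is for illustration only (in practice `str.encode('utf-8')` does this for you), and the function name `utf8_encode` is an assumption of this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by following the bit layout table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:  # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

for ch in ["A", "é", "€", "😀"]:
    print(ch, utf8_encode(ord(ch)).hex(" "))
```

Comparing the result with the built-in encoder, for example `utf8_encode(ord("€")) == "€".encode("utf-8")`, is a quick way to check the sketch.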
UTF-8 advantages:
- ASCII text is valid UTF-8 (no conversion needed)
- Self-synchronising: you can always find character boundaries
- No byte order issues (little-endian / big-endian)
- Most space-efficient for Latin-script text
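The self-synchronising property follows from the fact that continuation bytes always have the form 10xxxxxx, so a reader that lands in the middle of a character can step back to its first byte. A minimal Python sketch, assuming the input is valid UTF-8 and using the hypothetical helper name `char_start`:

```python
def char_start(data: bytes, i: int) -> int:
    """Step back from index i to the first byte of the character containing it."""
    # Bytes of the form 10xxxxxx are continuation bytes, never a first byte.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "café ☕".encode("utf-8")
start = char_start(data, 8)  # index 8 is the last byte of ☕
print(start, data[start:].decode("utf-8"))
```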
UTF-16
UTF-16 uses 2 bytes (one 16-bit code unit) for BMP characters and 4 bytes (two code units that together form a surrogate pair) for supplementary characters. Windows, Java, and JavaScript use UTF-16 internally.
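The surrogate pair for a supplementary character is computed by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two 10-bit halves. A Python sketch of that arithmetic, where `to_surrogates` is a name chosen for this example:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    v = cp - 0x10000             # 20-bit value
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)  # 😀
print(f"{high:04X} {low:04X}")
```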
UTF-16 has byte order issues. The two bytes of each code unit can be stored in little-endian (UTF-16LE) or big-endian (UTF-16BE) order, so a UTF-16 file needs to indicate which one it uses, typically with a BOM (Byte Order Mark) at the start.
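The effect of byte order is easy to see in Python, whose standard codecs include explicit `utf-16-be` and `utf-16-le` variants as well as a plain `utf-16` codec that writes a BOM. This is only a sketch of the behaviour described above:

```python
import codecs

text = "A"  # U+0041
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")
print(big.hex())     # 0041
print(little.hex())  # 4100

# The plain "utf-16" codec writes a BOM first, in the machine's native byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)

# When decoding, the BOM tells the codec which byte order to use.
print(with_bom.decode("utf-16") == text)
```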
UTF-32
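UTF-32 uses exactly 4 bytes for every code point. It is simple but wasteful: a purely ASCII file becomes four times larger. UTF-32 is rarely used for files or network transfer, although some programs use it in memory because every code point has the same width.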
Comparison
| Feature | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Bytes per code point | 1-4 | 2 or 4 | 4 |
| ASCII compatibility | Yes | No | No |
| Web usage | ~98% | <0.1% | <0.01% |
| BOM needed | No | Yes, unless the byte order is declared | Yes, unless the byte order is declared |
| Common in | Web, Linux, JSON | Windows, Java, JS | Internal processing |
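To make the size differences concrete, the following Python sketch encodes a few sample strings in all three forms and reports the byte counts. The sample strings and the helper name `encoded_sizes` are chosen for this illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text for each encoding form."""
    # The -le variants are used so that no BOM is added to the byte counts.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for sample in ["hello", "café ☕", "日本語", "😀"]:
    print(repr(sample), encoded_sizes(sample))
```

In this sample, UTF-8 is the smallest for the ASCII and mostly Latin text, while UTF-16 is smaller for the Japanese string.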
Encoding Bugs
The most common encoding bug is decoding bytes with the wrong encoding. When UTF-8 bytes are interpreted as Latin-1 (ISO 8859-1), each byte of a multi-byte character is shown as a separate character, so "café" appears as "cafÃ©". This happens when the encoding is not declared or when a tool assumes the wrong encoding.
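This kind of garbling is straightforward to reproduce. The Python sketch below encodes a string as UTF-8, decodes the bytes as Latin-1 by mistake, and then reverses the mistake. The repair step works here only because nothing else altered the text in between.

```python
original = "café"

# Wrong: the bytes are UTF-8 but are decoded as Latin-1
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reversing the mistake: re-encode as Latin-1, then decode as UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```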
Another common issue is counting bytes instead of characters. In UTF-8, the string "café" is 4 characters but 5 bytes because é uses two bytes. PHP's strlen counts bytes, while mb_strlen counts characters.
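The difference matters most when a limit is expressed in bytes, for example a storage field with a fixed byte size (an assumed scenario for this illustration). Cutting the encoded bytes at an arbitrary position can split a character in half. The Python sketch below, with the hypothetical helper `truncate_utf8`, drops any incomplete trailing character instead.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")[:max_bytes]
    # errors="ignore" discards the incomplete sequence left at the cut point
    return data.decode("utf-8", errors="ignore")

text = "café"
print(len(text))                  # 4 characters
print(len(text.encode("utf-8")))  # 5 bytes
print(truncate_utf8(text, 4))     # caf
```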
Programming Examples
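// PHP - use the mb_* functions when working with UTF-8 text
$text = 'café ☕';
echo strlen($text), "\n"; // 9 (bytes, not characters)
echo mb_strlen($text, 'UTF-8'), "\n"; // 6 (characters)
// Convert between encodings; print as hex because raw UTF-16 bytes are not readable
echo bin2hex(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')), "\n"; // 630061006600e90020001526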
# Python 3 - strings are Unicode by default
text = 'café ☕'
print(len(text)) # 6 characters
bytes_utf8 = text.encode('utf-8')
bytes_utf16 = text.encode('utf-16')  # starts with a BOM
print(bytes_utf8) # b'caf\xc3\xa9 \xe2\x98\x95'
// JavaScript - strings are UTF-16 internally
const text = 'café ☕';
console.log(text.length); // 6 UTF-16 code units (a supplementary character such as 😀 would count as 2)
// Encode as UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
console.log(bytes); // Uint8Array(9)
Online Tool
The Unicode Encoder & Decoder tool on Help2Code converts text to Unicode code points and between UTF-8, UTF-16, and UTF-32 representations. It is useful for debugging encoding issues and learning how Unicode works.
Conclusion
Unicode encoding is essential knowledge for any developer working with international text. UTF-8 is the standard for the web and should be your default. Understanding the difference between code points and byte representations will help you avoid the most common text encoding bugs. Use the Unicode Encoder & Decoder to experiment with encodings and debug encoding issues.