What Is Unicode Encoding? Code Points, UTF-8, and UTF-16 Explained

16 Jun 2026

What Is Unicode Encoding?

Unicode is a universal character encoding standard that assigns a unique number — called a code point — to every character used in human languages. Before Unicode, there were dozens of incompatible encoding systems. A document written in Russian on one system was unreadable on another. Unicode solves this by providing a single, unified character set that covers over 150 scripts and 140,000 characters.

Unicode itself is not a byte encoding. It defines a coded character set: the repertoire of characters and the code point assigned to each one. The encoding layer (how code points are stored as bytes) is handled by encoding forms such as UTF-8, UTF-16, and UTF-32. Understanding the distinction between a code point and its byte representation is the key to avoiding encoding bugs.

Code Points

Every Unicode character has a code point written as U+ followed by a hexadecimal number. The letter A is U+0041, the euro sign is U+20AC, and the emoji 😀 is U+1F600.
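As a small illustration, Python's built-in ord() and chr() convert between a character and its code point. The helper name u_notation is made up for this sketch and is not part of any library.

```python
def u_notation(ch: str) -> str:
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits
    return f"U+{ord(ch):04X}"

print(u_notation("A"))    # U+0041
print(u_notation("€"))    # U+20AC
print(u_notation("😀"))   # U+1F600

# chr() goes the other way: from a code point back to the character
print(chr(0x20AC) == "€")  # True
```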

Code points are organised into 17 planes, each containing 65,536 code points (not all of which are assigned to characters). The first plane (Plane 0) is the Basic Multilingual Plane (BMP), which contains the most common characters including Latin, Greek, Cyrillic, CJK, and many others. Planes 1 through 16 contain supplementary characters like emoji, historical scripts, and rare CJK characters.

Plane | Range               | Name
0     | U+0000 to U+FFFF    | Basic Multilingual Plane (BMP)
1     | U+10000 to U+1FFFF  | Supplementary Multilingual Plane (SMP)
2     | U+20000 to U+2FFFF  | Supplementary Ideographic Plane (SIP)
3     | U+30000 to U+3FFFF  | Tertiary Ideographic Plane (TIP)
4-13  | U+40000 to U+DFFFF  | Unassigned
14    | U+E0000 to U+EFFFF  | Supplementary Special-purpose Plane (SSP)
15-16 | U+F0000 to U+10FFFF | Private Use Planes
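Because every plane spans exactly 0x10000 code points, the plane number can be read directly from the bits above the low 16. A minimal sketch, where plane_of is a name chosen for this example:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane holds 0x10000 (65,536) code points, so the plane is
    # whatever remains after shifting away the low 16 bits.
    return code_point >> 16

print(plane_of(ord("A")))   # 0  (BMP)
print(plane_of(0x1F600))    # 1  (SMP)
print(plane_of(0x10FFFF))   # 16 (last plane)
```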

UTF-8, UTF-16, UTF-32

These three encoding forms differ in how they map code points to bytes.

UTF-8

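UTF-8 is the dominant encoding on the web. It uses 1 to 4 bytes per code point and is backward-compatible with ASCII. Characters in the ASCII range (U+0000 to U+007F) use one byte. Accented Latin letters and scripts such as Greek, Cyrillic, Hebrew, and Arabic (U+0080 to U+07FF) use two bytes. Most CJK characters and the rest of the BMP use three bytes, and supplementary characters, including most emoji, use four.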

Code Point Range   | Byte 1   | Byte 2   | Byte 3   | Byte 4
U+0000 - U+007F    | 0xxxxxxx |          |          |
U+0080 - U+07FF    | 110xxxxx | 10xxxxxx |          |
U+0800 - U+FFFF    | 1110xxxx | 10xxxxxx | 10xxxxxx |
U+10000 - U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
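The bit patterns in the table can be applied by hand. The sketch below is for illustration only (real code should call str.encode); the function name utf8_encode is an assumption of this example, and the result is cross-checked against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the bit patterns in the table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    return bytes([0b11110000 | (code_point >> 18),  # 11110xxx + 3 continuation bytes
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

# Cross-check one character from each row against the built-in encoder
for ch in ["A", "é", "€", "😀"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} ->", utf8_encode(ord(ch)).hex())
```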

UTF-8 advantages:

  • ASCII text is valid UTF-8 (no conversion needed)
  • Self-synchronising: you can always find character boundaries
  • No byte order issues (little-endian / big-endian)
  • Most space-efficient for Latin-script text
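The self-synchronising property follows from the bit patterns: continuation bytes always start with 10, and lead bytes never do. A minimal sketch, assuming the input is valid UTF-8 (char_start is a hypothetical helper name):

```python
def char_start(data: bytes, index: int) -> int:
    """Step back from an arbitrary byte index to the first byte of its character."""
    # Continuation bytes always look like 10xxxxxx; lead bytes never do.
    while index > 0 and (data[index] & 0b11000000) == 0b10000000:
        index -= 1
    return index

data = "a€b".encode("utf-8")   # b'a\xe2\x82\xacb'
print(char_start(data, 2))     # 1: index 2 is inside the euro sign, which starts at 1
print(char_start(data, 4))     # 4: 'b' is a single-byte character
```

A parser that lands in the middle of a stream needs to look back at most three bytes to find the next boundary.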

UTF-16

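UTF-16 uses 2 bytes (one 16-bit code unit) for BMP characters and 4 bytes (two code units, called a surrogate pair) for supplementary characters. Surrogates come from the reserved range U+D800 to U+DFFF, which is never assigned to characters. Windows, Java, and JavaScript use UTF-16 internally.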
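The surrogate pair for a supplementary code point is computed by subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. A minimal sketch (to_surrogates is a name chosen for this example), checked against Python's UTF-16 codec:

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00

# Cross-check against the built-in big-endian UTF-16 encoder
assert "😀".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```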

Because each code unit is two bytes, UTF-16 is sensitive to byte order. A UTF-16 file must indicate whether it is little-endian or big-endian, typically with a BOM (Byte Order Mark, U+FEFF) at the start, unless the encoding is declared explicitly as UTF-16LE or UTF-16BE.
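The effect of byte order is easy to see with Python's codecs. This is only a sketch of the behaviour of Python's own "utf-16" family of codecs; other tools may handle the BOM differently.

```python
import codecs

text = "A"  # U+0041
print(text.encode("utf-16-le").hex())   # 4100 (low byte first)
print(text.encode("utf-16-be").hex())   # 0041 (high byte first)

# The generic "utf-16" codec prepends a BOM in the platform's native byte order
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# When decoding, the BOM tells the codec which byte order to use
print((codecs.BOM_UTF16_BE + b"\x00\x41").decode("utf-16"))  # A
print((codecs.BOM_UTF16_LE + b"\x41\x00").decode("utf-16"))  # A
```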

UTF-32

UTF-32 uses exactly 4 bytes for every code point. It is simple but wasteful: a purely ASCII file becomes four times larger. UTF-32 is rarely used for storage or interchange, though some programs use it internally because every code point has the same width.

Comparison

Feature               | UTF-8            | UTF-16            | UTF-32
Bytes per code point  | 1-4              | 2 or 4            | 4
ASCII compatibility   | Yes              | No                | No
Web usage (approx.)   | ~98%             | <0.1%             | <0.01%
BOM needed            | No               | Yes               | Yes
Common in             | Web, Linux, JSON | Windows, Java, JS | Internal processing
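To make the size trade-offs concrete, the sketch below measures a few sample strings in each encoding form. The samples and the helper name encoded_sizes are chosen for this illustration, and the explicit little-endian codecs are used so that no BOM is included in the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for each encoding form (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("hello"))   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("café"))    # {'utf-8': 5, 'utf-16-le': 8, 'utf-32-le': 16}
print(encoded_sizes("日本語"))   # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
print(encoded_sizes("😀"))      # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

UTF-8 is smallest for ASCII and Latin text, while UTF-16 is smaller for text made up mostly of BMP CJK characters.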

Encoding Bugs

The most common encoding bug is decoding bytes with the wrong encoding. When UTF-8 bytes are interpreted as Latin-1 (ISO 8859-1), each byte of a multi-byte character is shown as a separate character, so accented letters appear as garbled symbols (é becomes Ã©). This happens when the encoding is not declared or when a tool assumes the wrong encoding.
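The garbling can be reproduced in a few lines. This is a sketch of the failure mode, not a general repair tool: reversing it works here only because Latin-1 maps every byte to exactly one character.

```python
original = "café"
utf8_bytes = original.encode("utf-8")        # b'caf\xc3\xa9'

# The bug: UTF-8 bytes decoded with the wrong encoding
garbled = utf8_bytes.decode("latin-1")
print(garbled)                               # cafÃ©

# Undo the mistake by re-encoding as Latin-1 and decoding as UTF-8
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)                  # True
```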

Another common issue is counting bytes instead of characters. In UTF-8, the string "café" is 4 characters but 5 bytes because é uses two bytes. Functions like PHP's strlen count bytes by default, while mb_strlen counts characters.
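Mixing up the two counts matters most when text is cut at a byte offset, because the cut can land inside a multi-byte character. A minimal sketch of the problem and one tolerant way to handle it (dropping the incomplete trailing bytes is a choice made for this example, not the only option):

```python
text = "café"
data = text.encode("utf-8")
print(len(text), len(data))    # 4 5

# Cutting at a byte offset can split the two-byte "é" in half
cut = data[:4]                 # b'caf\xc3'
try:
    cut.decode("utf-8")
except UnicodeDecodeError:
    print("byte slice ended in the middle of a character")

# One tolerant option: drop the incomplete trailing sequence
safe = cut.decode("utf-8", errors="ignore")
print(safe)                    # caf
```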

Programming Examples

<?php
// PHP - strlen() counts bytes; use the mb_* functions to count characters
$text = 'café ☕';
echo strlen($text), "\n";              // 9 (bytes in UTF-8, not characters)
echo mb_strlen($text, 'UTF-8'), "\n";  // 6 (characters)

// Convert the string from UTF-8 to UTF-16 and show the resulting bytes as hex
echo bin2hex(mb_convert_encoding($text, 'UTF-16', 'UTF-8')), "\n";

# Python 3 - str is a sequence of Unicode code points
text = 'café ☕'
print(len(text))                     # 6 (code points)
bytes_utf8 = text.encode('utf-8')
bytes_utf16 = text.encode('utf-16')  # includes a 2-byte BOM
print(bytes_utf8)                    # b'caf\xc3\xa9 \xe2\x98\x95' (9 bytes)
print(len(bytes_utf16))              # 14 (BOM + 6 code units of 2 bytes each)

// JavaScript - strings are sequences of UTF-16 code units
const text = 'café ☕';
console.log(text.length);  // 6 (☕ is in the BMP; a supplementary emoji such as 😀 counts as 2)

// Encode as UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
console.log(bytes);        // Uint8Array(9)

Online Tool

The Unicode Encoder & Decoder tool on Help2Code converts text to Unicode code points and between UTF-8, UTF-16, and UTF-32 representations. It is useful for debugging encoding issues and learning how Unicode works.

Conclusion

Unicode encoding is essential knowledge for any developer working with international text. UTF-8 is the standard for the web and should be your default. Understanding the difference between code points and byte representations will help you avoid the most common text encoding bugs. Use the Unicode Encoder & Decoder to experiment with encodings and debug encoding issues.



