UTF-8 vs UTF-16: What Developers Must Know

UTF-8 and UTF-16 are both Unicode encodings, but they store characters differently. The choice between them affects file size, performance, interoperability, and developer experience. Most developers can safely default to UTF-8, but understanding the differences matters when a system imposes its own encoding requirements. This guide explains how each encoding works, compares their characteristics, and gives clear recommendations.

Understanding Unicode and Encoding

Before comparing UTF-8 and UTF-16, it helps to understand the relationship between Unicode and encoding. Unicode is a universal character set that assigns a unique number (called a code point) to every character across all writing systems. As of version 15.0, Unicode defines over 149,000 characters covering 161 scripts. However, Unicode does not specify how to store these code points in bytes. That is where encodings like UTF-8 and UTF-16 come in. They define the rules for converting code points into sequences of bytes.
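To make the distinction concrete, here is a minimal Python sketch. It takes one character, reads its code point with ord(), and then asks two encodings for the bytes. The sample character and variable names are chosen only for illustration.

```python
# A code point is just a number; an encoding turns that number into bytes.
ch = "\u00e9"                           # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)                    # the Unicode code point as an integer

utf8_bytes = ch.encode("utf-8")         # variable-width, 8-bit code units
utf16_bytes = ch.encode("utf-16-be")    # explicit big-endian order, no BOM

print(f"U+{code_point:04X}")            # U+00E9
print(utf8_bytes.hex(" "))              # c3 a9
print(utf16_bytes.hex(" "))             # 00 e9
```

The same code point produces different byte sequences because each encoding applies its own rules.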

How UTF-8 Works

UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. It is designed to be backward-compatible with ASCII: the 128 ASCII characters encode to exactly the same single bytes in UTF-8.

UTF-8 Byte Sequences

Code Point Range	Byte 1	Byte 2	Byte 3	Byte 4
U+0000 to U+007F (ASCII)	0xxxxxxx	-	-	-
U+0080 to U+07FF	110xxxxx	10xxxxxx	-	-
U+0800 to U+FFFF	1110xxxx	10xxxxxx	10xxxxxx	-
U+10000 to U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

The leading bits in the first byte indicate how many bytes follow:

A byte starting with 0 is a single-byte ASCII character.
A byte starting with 110 indicates a 2-byte sequence.
A byte starting with 1110 indicates a 3-byte sequence.
A byte starting with 11110 indicates a 4-byte sequence.
Bytes starting with 10 are continuation bytes that follow the leading byte.

This self-synchronizing property means that even if you start reading in the middle of a UTF-8 stream, you can quickly find the start of the next character by scanning for bytes that do not start with 10.
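The bit layout in the table can be sketched in a few lines of Python. The function names utf8_encode and next_boundary are hypothetical, written here for illustration only. In real code you would call str.encode("utf-8") instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    continuation byte (that is, does not match the pattern 10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# The hand-written encoder agrees with Python's built-in codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")

# Landing in the middle of the 3-byte euro sign, we skip ahead to "b".
data = "a\u20acb".encode("utf-8")        # 61 e2 82 ac 62
print(next_boundary(data, 2))            # 4
```

The resynchronization check needs only a single mask per byte, which is why recovery from a bad offset is cheap in UTF-8.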

How UTF-16 Works

UTF-16 is also a variable-width encoding, but it uses either 2 or 4 bytes per character.

UTF-16 Code Units

Characters in the Basic Multilingual Plane (BMP), which covers U+0000 to U+FFFF, are encoded as a single 16-bit code unit (2 bytes). This covers most characters used in modern writing, including ASCII, Latin, Greek, Cyrillic, the most commonly used CJK (Chinese, Japanese, Korean) ideographs, and many symbols.
Characters outside the BMP, called supplementary characters (U+10000 to U+10FFFF), are encoded as a pair of 16-bit code units called a surrogate pair (4 bytes total). The first unit (high surrogate) comes from U+D800 to U+DBFF and the second (low surrogate) from U+DC00 to U+DFFF. The whole range U+D800 to U+DFFF is reserved for this purpose and never represents a character on its own.
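The surrogate calculation is short enough to sketch directly. The helper name to_surrogates is hypothetical and exists only for this illustration. Python's own "utf-16" codecs do this work for you.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000                # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)       # GRINNING FACE emoji
print(f"{high:04X} {low:04X}")           # D83D DE00

# Cross-check against the built-in big-endian UTF-16 codec.
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```

Subtracting 0x10000 leaves a 20-bit value, and each surrogate carries 10 of those bits.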

Byte Order in UTF-16

UTF-16 is sensitive to byte order. The two bytes of each code unit can be stored as big-endian (most significant byte first) or little-endian (least significant byte first). To indicate the byte order, UTF-16 files often start with a Byte Order Mark (BOM), which is the character U+FEFF. The BOM appears as:

FE FF in big-endian
FF FE in little-endian

Without a BOM or an explicit label such as UTF-16LE or UTF-16BE, software has to assume or guess the byte order, which can lead to misinterpretation of the data.
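A BOM check can be sketched with the constants in Python's codecs module. The function name utf16_byte_order is an assumption made for this example. Note that a UTF-32 little-endian BOM also begins with FF FE, so a production detector would need to check for that first.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the codec name implied by a leading UTF-16 BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):     # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):     # FF FE
        return "utf-16-le"
    return None                                  # no BOM: the order is unknown


text = "Hi"
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(be.hex(" "))                # fe ff 00 48 00 69
print(le.hex(" "))                # ff fe 48 00 69 00
print(utf16_byte_order(be))       # utf-16-be
print(utf16_byte_order(le))       # utf-16-le

# The generic "utf-16" codec reads the BOM and consumes it while decoding.
assert be.decode("utf-16") == le.decode("utf-16") == "Hi"
```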

Key Differences Between UTF-8 and UTF-16

Storage Efficiency

Character Category	UTF-8	UTF-16
ASCII (English text, digits, basic punctuation)	1 byte per character	2 bytes per character
Latin/European (accented characters, ñ, ü)	2 bytes	2 bytes
Greek, Cyrillic, Arabic, Hebrew	2 bytes	2 bytes
CJK ideographs (Chinese, Japanese, Korean)	3 bytes	2 bytes
Emoji and supplementary characters	4 bytes	4 bytes

Key insight: UTF-8 is more efficient for text dominated by ASCII characters, which covers most English-language content and programming code. UTF-16 is more efficient for text heavy in CJK characters, where it saves 1 byte per character compared to UTF-8.
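You can check these per-character costs on your own data. The sketch below measures a few sample strings. The samples and the helper name encoded_sizes are illustrative, and the UTF-16 figure excludes the 2-byte BOM.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "English": "Hello, world",
    "Russian": "привет",
    "Chinese": "你好世界",
    "Emoji": "\U0001F600",
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label:8} {len(text):2} chars  UTF-8: {utf8_size:2}  UTF-16: {utf16_size:2}")
```

Running the same function over a representative sample of real content is usually more informative than the per-category averages.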

ASCII Compatibility

UTF-8 is fully backward-compatible with ASCII. Any valid ASCII text is also valid UTF-8, so existing ASCII text files, C source code, HTML files, and configuration files work without conversion. UTF-16 is not ASCII-compatible. Every ASCII character takes 2 bytes in UTF-16, one of which is zero. Those zero bytes break tools and libraries that treat text as byte strings, such as C functions that stop at the first null byte.
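A short Python sketch shows both halves of this claim. The sample string is arbitrary, and the split on a zero byte is only a rough stand-in for how a C function such as strlen would see the data.

```python
text = "key=value\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Pure ASCII text is byte-for-byte identical in UTF-8.
assert ascii_bytes == utf8_bytes

# In UTF-16, every ASCII character is paired with a zero byte.
print(utf16_bytes[:6])                   # b'k\x00e\x00y\x00'
print(utf16_bytes.count(0))              # 10, one zero byte per character

# A null-terminated view of the UTF-16 data ends after the first character.
c_string_view = utf16_bytes.split(b"\x00")[0]
print(c_string_view)                     # b'k'
```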

Web Usage

UTF-8 dominates the web. According to W3Techs, as of 2025, over 98 percent of websites use UTF-8. The HTML standard requires UTF-8 for new documents. JSON, XML, YAML, and most modern data formats mandate or strongly recommend UTF-8. UTF-16 web pages are extremely rare.

Programming Language Support

JavaScript: Uses UTF-16 internally for strings. This means that supplementary characters (like most emoji) are represented as two UTF-16 code units (4 bytes), which can cause issues with string length, indexing, and iteration.
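Python 3: Since version 3.3, CPython stores each string in one of three fixed-width forms (Latin-1, UCS-2, or UCS-4) depending on the largest code point it contains, so indexing always works on code points. Source files default to UTF-8. File I/O has traditionally defaulted to the locale encoding, which is UTF-8 on most Linux and macOS systems but often not on Windows, so pass encoding="utf-8" explicitly.
Java: Exposes strings as sequences of UTF-16 code units, similar to JavaScript. Since Java 9, the JVM may store Latin-1-only strings in one byte per character internally, but the API still counts UTF-16 code units.
Go: Strings are byte sequences that hold UTF-8 by convention, and a range loop over a string decodes one code point (rune) at a time.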
Rust: Uses UTF-8 internally for strings (String and &str types).
C/C++: Support both encodings through library functions. The choice depends on the platform and requirements.
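The length mismatch described for JavaScript and Java can be reproduced from Python, which counts code points instead. The helper name utf16_length is hypothetical. It reports what a UTF-16-based language would give as the string length.

```python
def utf16_length(s: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length counts."""
    return len(s.encode("utf-16-le")) // 2


s = "a\U0001F600"                        # the letter a followed by one emoji

print(len(s))                            # 2 code points (Python's view)
print(utf16_length(s))                   # 3 UTF-16 code units (JavaScript, Java)
print(len(s.encode("utf-8")))            # 5 UTF-8 bytes (Go, Rust byte length)
```

None of the three numbers is wrong. They count different units, which is why length limits and string slicing should state the unit they use.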

Operating System Support

OS	Default Encoding	Notes
Linux	UTF-8	Nearly all modern distros default to UTF-8
macOS	UTF-8	The filesystem (APFS) uses UTF-8 for filenames
Windows (modern)	UTF-16	Windows NT API uses UTF-16 natively. UTF-8 support has improved significantly in Windows 10 and 11

File Size Comparison

For a file containing 1,000 characters:

Content Type	UTF-8 Size	UTF-16 Size	Winner
English text	~1,000 bytes	~2,000 bytes + BOM	UTF-8
French/Spanish text	~1,000-1,200 bytes	~2,000 bytes + BOM	UTF-8
Chinese text	~3,000 bytes	~2,000 bytes + BOM	UTF-16
Mixed (25% English, 75% CJK)	~2,500 bytes	~2,000 bytes + BOM	UTF-16
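For mixed content, the winner depends on the ratio of ASCII to CJK characters. The sketch below estimates both sizes under simplifying assumptions: every character is either ASCII (1 byte in UTF-8) or a BMP CJK ideograph (3 bytes in UTF-8), and the UTF-16 file carries a 2-byte BOM. The function name estimated_sizes is hypothetical.

```python
def estimated_sizes(total_chars: int, cjk_fraction: float) -> tuple[int, int]:
    """Estimate (UTF-8 bytes, UTF-16 bytes including BOM) for ASCII/CJK text."""
    cjk_chars = round(total_chars * cjk_fraction)
    ascii_chars = total_chars - cjk_chars
    utf8_size = ascii_chars * 1 + cjk_chars * 3
    utf16_size = total_chars * 2 + 2          # 2 bytes per BMP character + BOM
    return utf8_size, utf16_size


for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    utf8_size, utf16_size = estimated_sizes(1000, fraction)
    print(f"{fraction:4.0%} CJK  UTF-8: {utf8_size:5}  UTF-16: {utf16_size:5}")
```

Under these assumptions the two encodings break even when about half of the characters are CJK. Markup, whitespace, and digits in real documents push the balance further toward UTF-8.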

When to Use UTF-8

UTF-8 is the right choice for almost all new projects:

Web content: HTML, CSS, JavaScript, JSON, XML, SVG, and APIs.
Configuration files: YAML, TOML, INI, and environment files.
Source code: Nearly all programming languages recommend UTF-8 for source files.
Data interchange: CSV files, log files, and text-based protocols.
Cross-platform applications: UTF-8 works consistently across Linux, macOS, and Windows (with proper configuration).
Network protocols: HTTP, SMTP, and most application-level protocols are ASCII-based at the header level, and UTF-8 is the usual encoding for the text they carry.

When to Use UTF-16

UTF-16 remains relevant in specific scenarios:

Windows system programming: The wide-character Windows API functions take wchar_t strings, and wchar_t is a 16-bit UTF-16 code unit on Windows. If you are writing Windows desktop applications using Win32 or COM, UTF-16 is often unavoidable.
Java and .NET internals: Both Java and .NET use UTF-16 for internal string representation. This matters when you are working with raw memory, serialization, or performance-critical code.
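Some database systems: SQL Server uses UTF-16 for NCHAR, NVARCHAR, and NTEXT columns. MySQL defaults to UTF-8 (utf8mb4 in MySQL 8.0) and also offers UTF-16 character sets. PostgreSQL is commonly run with UTF-8 and does not support UTF-16 as a server-side encoding.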
Legacy systems: Older systems and file formats may require UTF-16. Migration strategies should include careful encoding conversion.

Byte Order Mark (BOM) Considerations

UTF-8 does not need a BOM because it has no byte-order ambiguity. However, some Windows applications (older versions of Notepad, for example) add a BOM (the bytes EF BB BF) to the beginning of UTF-8 files. A leading BOM can break Unix tools that expect a shebang (#!) as the very first bytes of a script, and in PHP it is sent as output before any HTTP headers can be set. It is generally recommended to save UTF-8 files without a BOM unless you have a specific compatibility requirement.

UTF-16 needs a BOM, an explicit label (UTF-16LE or UTF-16BE), or some other agreement on byte order. When UTF-16 data moves between systems that may use different byte orders, a BOM is the most common way to make the order unambiguous.
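When you cannot control whether incoming UTF-8 carries a BOM, Python's "utf-8-sig" codec is one way to tolerate both cases. The sample script content below is made up for illustration.

```python
import codecs

# A UTF-8 file as a BOM-writing editor might save it.
raw = codecs.BOM_UTF8 + "#!/bin/sh\necho hi\n".encode("utf-8")
print(raw[:3].hex(" "))                  # ef bb bf

# The plain codec keeps the BOM as an invisible first character (U+FEFF).
plain = raw.decode("utf-8")
print(plain.startswith("#!"))            # False

# "utf-8-sig" strips a leading BOM if present and is harmless if absent.
clean = raw.decode("utf-8-sig")
print(clean.startswith("#!"))            # True
assert "no bom".encode("utf-8").decode("utf-8-sig") == "no bom"
```

For writing, the plain "utf-8" codec never emits a BOM, while "utf-8-sig" always does.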

Migration Between Encodings

If you need to convert between UTF-8 and UTF-16, most programming languages provide built-in functions:

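Python: 'text'.encode('utf-8') or 'text'.encode('utf-16') to produce bytes, and the matching bytes.decode() call to read them back. The 'utf-16' codec prepends a BOM; 'utf-16-le' and 'utf-16-be' do not.
JavaScript: TextEncoder and TextDecoder APIs. TextEncoder only produces UTF-8, while TextDecoder can also read UTF-16 input.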
Command line: iconv -f UTF-8 -t UTF-16 input.txt > output.txt
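For whole files, the same conversion can be sketched in Python. The function name convert_file and its parameters are assumptions for this example. It reads the entire file into memory, so very large files would need a streaming approach instead.

```python
def convert_file(src: str, dst: str,
                 src_encoding: str = "utf-8",
                 dst_encoding: str = "utf-16") -> None:
    """Re-encode a text file. newline="" keeps line endings exactly as they are."""
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()                # raises UnicodeDecodeError on bad input
    with open(dst, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)                 # "utf-16" writes a BOM first


# Example usage, with hypothetical file names:
# convert_file("input.txt", "output.txt")
# convert_file("legacy.txt", "clean.txt", src_encoding="utf-16", dst_encoding="utf-8")
```

A decoding error during conversion usually means the source file is not in the encoding you assumed, so it is safer to let the error surface than to pass errors="ignore".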

Recommendation

Choose UTF-8 for almost all new projects. It is the dominant encoding for the web, most programming languages, and modern systems. The space saved by UTF-16 for CJK-heavy content rarely justifies the compatibility costs. UTF-8's ASCII compatibility, self-synchronizing property, and widespread tool support make it the clear winner for general-purpose use. Reserve UTF-16 for specific scenarios where Windows API compatibility, Java/.NET internals, or legacy system requirements force its use.
