UTF-8 vs UTF-16: What Developers Must Know
UTF-8 and UTF-16 are both Unicode encoding formats, but they store characters differently. Choosing between them affects file sizes, performance, interoperability, and developer experience. Most developers can safely default to UTF-8, but understanding the differences matters when you work with systems that have specific encoding requirements. This guide explains how each encoding works, compares their characteristics, and gives recommendations.
Understanding Unicode and Encoding
Before comparing UTF-8 and UTF-16, it helps to understand the relationship between Unicode and encoding. Unicode is a universal character set that assigns a unique number (called a code point) to every character across all writing systems. As of version 15.0, Unicode defines over 149,000 characters covering 161 scripts. However, Unicode does not specify how to store these code points in bytes. That is where encodings like UTF-8 and UTF-16 come in. They define the rules for converting code points into sequences of bytes.
How UTF-8 Works
UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. It is designed to be backward-compatible with ASCII: the first 128 code points encode to the same single bytes as in ASCII.
UTF-8 Byte Sequences
| Code Point Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000 to U+007F (ASCII) | 0xxxxxxx | - | - | - |
| U+0080 to U+07FF | 110xxxxx | 10xxxxxx | - | - |
| U+0800 to U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | - |
| U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The leading bits in the first byte indicate how many bytes follow:
- A byte starting with `0` is a single-byte ASCII character.
- A byte starting with `110` indicates a 2-byte sequence.
- A byte starting with `1110` indicates a 3-byte sequence.
- A byte starting with `11110` indicates a 4-byte sequence.
- Bytes starting with `10` are continuation bytes that follow the leading byte.
This self-synchronizing property means that even if you start reading in the middle of a UTF-8 stream, you can quickly find the start of the next character by scanning for bytes that do not start with 10.
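The bit patterns in the table can be turned into a small hand-written encoder. The following is a minimal sketch for illustration, not a replacement for Python's built-in `str.encode`; the function names `encode_utf8` and `next_char_start` are hypothetical helpers introduced here.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the byte-sequence table."""
    # Surrogates (U+D800 to U+DFFF) and values above U+10FFFF are not encodable.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point <= 0x7F:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx + three continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def next_char_start(data: bytes, offset: int) -> int:
    """Return the index of the first non-continuation byte at or after offset."""
    # Continuation bytes match the bit pattern 10xxxxxx.
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


# The hand-written encoder agrees with the built-in codec.
print(encode_utf8(0x20AC) == "\u20ac".encode("utf-8"))  # True

# Landing in the middle of the 3-byte euro sign, we skip ahead to "b".
data = "a\u20acb".encode("utf-8")
print(next_char_start(data, 2))  # 4
```

Resynchronizing never needs to skip more than three bytes in well-formed UTF-8, because a character has at most three continuation bytes.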
How UTF-16 Works
UTF-16 is also a variable-width encoding, but it uses either 2 or 4 bytes per character.
UTF-16 Code Units
- Characters in the Basic Multilingual Plane (BMP), which covers U+0000 to U+FFFF, are encoded as a single 16-bit code unit (2 bytes). This covers most characters used in modern writing, including ASCII, Latin, Greek, Cyrillic, the most common CJK (Chinese, Japanese, Korean) ideographs, and many symbols.
- Characters outside the BMP, called supplementary characters (U+10000 to U+10FFFF), are encoded as a pair of 16-bit code units called a surrogate pair (4 bytes total). Surrogate code units come from the range U+D800 to U+DFFF, which is reserved specifically for this purpose: the first (high) surrogate is in U+D800 to U+DBFF and the second (low) surrogate is in U+DC00 to U+DFFF.
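The surrogate pair for a supplementary character can be computed with a little arithmetic: subtract 0x10000 to get a 20-bit value, then split it into two 10-bit halves. The sketch below illustrates this; `to_surrogate_pair` is a hypothetical helper name, not a standard library function.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600, a common emoji
print(hex(high), hex(low))               # 0xd83d 0xde00
```

The result matches the bytes Python's own codec produces: `"\U0001F600".encode("utf-16-be")` is `D8 3D DE 00`.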
Byte Order in UTF-16
UTF-16 is sensitive to byte order. The two bytes of each code unit can be stored as big-endian (most significant byte first) or little-endian (least significant byte first). To indicate the byte order, UTF-16 files often start with a Byte Order Mark (BOM), which is the character U+FEFF. The BOM appears as:
- `FE FF` in big-endian
- `FF FE` in little-endian
Without a BOM, software must guess the byte order, which can lead to misinterpretation of the data.
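As a rough illustration, the byte order of a UTF-16 buffer can be read off its first two bytes. This is a minimal sketch: `utf16_byte_order` is a hypothetical helper, and it assumes the data is already known to be UTF-16 (a UTF-32 little-endian BOM also begins with `FF FE`).

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the byte order announced by a UTF-16 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return None


# The same character, stored in each byte order (no BOM is written here).
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100

# Python's plain "utf-16" codec prepends a BOM in the machine's native order.
print(utf16_byte_order("A".encode("utf-16")))
```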
Key Differences Between UTF-8 and UTF-16
Storage Efficiency
| Character Category | UTF-8 | UTF-16 |
|---|---|---|
| ASCII (English text, digits, basic punctuation) | 1 byte per character | 2 bytes per character |
| Latin/European (accented characters, ñ, ü) | 2 bytes | 2 bytes |
| Greek, Cyrillic, Arabic, Hebrew | 2 bytes | 2 bytes |
| CJK ideographs (Chinese, Japanese, Korean) | 3 bytes | 2 bytes |
| Emoji and supplementary characters | 4 bytes | 4 bytes |
Key insight: UTF-8 is more efficient for text dominated by ASCII characters, which covers most English-language content and programming code. UTF-16 is more efficient for text heavy in CJK characters, where it saves 1 byte per character compared to UTF-8.
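The per-character sizes in the table are easy to check. The sketch below uses one sample character per category (the samples and the helper name `bytes_per_char` are chosen here for illustration) and encodes with `utf-16-le` so that no BOM is counted.

```python
def bytes_per_char(ch: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "ASCII": "A",
    "Latin": "\u00e9",        # e with acute accent
    "Cyrillic": "\u0416",     # Cyrillic capital letter zhe
    "CJK": "\u4e2d",          # a common CJK ideograph
    "Emoji": "\U0001F600",    # supplementary character
}

for category, ch in samples.items():
    utf8_size, utf16_size = bytes_per_char(ch)
    print(f"{category:9} UTF-8: {utf8_size}  UTF-16: {utf16_size}")
```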
ASCII Compatibility
UTF-8 is fully backward-compatible with ASCII. Any valid ASCII text is also valid UTF-8. This means existing ASCII text files, C source code, HTML files, and configuration files work without any conversion. UTF-16 is not ASCII-compatible. Every ASCII character takes 2 bytes in UTF-16, with one byte being zero. This breaks tools and libraries that expect ASCII-compatible bytes, such as C string functions that treat a zero byte as the end of the string.
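A quick way to see the difference is to compare the encoded bytes of an ASCII-only string. This is a small sketch; `is_ascii_transparent` is a hypothetical helper name used only for this example.

```python
def is_ascii_transparent(text: str, encoding: str) -> bool:
    """True if encoding ASCII-only text yields exactly the ASCII bytes."""
    return text.encode(encoding) == text.encode("ascii")


config_line = "key=value"
print(is_ascii_transparent(config_line, "utf-8"))      # True
print(is_ascii_transparent(config_line, "utf-16-le"))  # False

# Every other byte is zero, which byte-oriented ASCII tools do not expect.
print(config_line.encode("utf-16-le")[:6])             # b'k\x00e\x00y\x00'
```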
Web Usage
UTF-8 dominates the web. According to W3Techs, as of 2025, over 98 percent of websites use UTF-8. The HTML standard tells authors to encode documents as UTF-8. JSON, XML, YAML, and most modern data formats mandate or strongly recommend UTF-8. UTF-16 web pages are extremely rare.
Programming Language Support
- JavaScript: Uses UTF-16 internally for strings. This means that supplementary characters (like most emoji) are represented as two UTF-16 code units (4 bytes), which can cause issues with string length, indexing, and iteration.
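- Python 3: Since Python 3.3, CPython stores each string in one of three fixed-width representations (Latin-1, UCS-2, or UCS-4), chosen by the largest code point in the string. File I/O uses the locale's preferred encoding unless you specify one; that is UTF-8 on most Linux and macOS systems but often a legacy code page on Windows, so passing `encoding='utf-8'` explicitly is safer.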
- Java: Uses UTF-16 internally for strings, similar to JavaScript.
- Go: Uses UTF-8 internally for strings, making range-based iteration over strings natural.
- Rust: Uses UTF-8 internally for strings (String and &str types).
- C/C++: Support both encodings through library functions. The choice depends on the platform and requirements.
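The string-length issue mentioned for UTF-16-based languages can be reproduced from Python, which counts code points rather than UTF-16 code units. The sketch below uses a hypothetical helper, `utf16_code_units`, to compute the length a UTF-16-based runtime such as JavaScript or Java would report.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


text = "hi\U0001F600"              # two ASCII letters plus one emoji
print(len(text))                   # 3 code points
print(utf16_code_units(text))      # 4 code units (the emoji is a surrogate pair)
print(len(text.encode("utf-8")))   # 6 bytes in UTF-8
```

Three different "lengths" for the same string is the main reason to be explicit about whether an API counts bytes, code units, or code points.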
Operating System Support
| OS | Default Encoding | Notes |
|---|---|---|
| Linux | UTF-8 | Nearly all modern distros default to UTF-8 |
| macOS | UTF-8 | The filesystem (APFS) uses UTF-8 for filenames |
| Windows (modern) | UTF-16 | Windows NT API uses UTF-16 natively. UTF-8 support has improved significantly in Windows 10 and 11 |
File Size Comparison
For a file containing 1,000 characters:
| Content Type | UTF-8 Size | UTF-16 Size | Winner |
|---|---|---|---|
| English text | ~1,000 bytes | ~2,000 bytes + BOM | UTF-8 |
| French/Spanish text | ~1,000-1,200 bytes | ~2,000 bytes + BOM | UTF-8 |
| Chinese text | ~3,000 bytes | ~2,000 bytes + BOM | UTF-16 |
| Mixed (about 25% English, 75% CJK) | ~2,500 bytes | ~2,000 bytes + BOM | UTF-16 |
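For mixed content, the result depends on the ratio of ASCII to CJK characters. The sketch below builds synthetic 1,000-character strings to show this; `sizes_for_mix` and the sample characters are assumptions made for illustration, and real text with spaces, punctuation, and markup will shift the numbers.

```python
def sizes_for_mix(ascii_chars: int, cjk_chars: int) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes without BOM) for a synthetic mix."""
    text = "a" * ascii_chars + "\u4e2d" * cjk_chars
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for ascii_chars, cjk_chars in [(1000, 0), (500, 500), (250, 750), (0, 1000)]:
    utf8_size, utf16_size = sizes_for_mix(ascii_chars, cjk_chars)
    print(f"{ascii_chars:4} ASCII + {cjk_chars:4} CJK: "
          f"UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

With 1-byte ASCII and 3-byte BMP CJK characters, the two encodings break even when the counts are equal: UTF-16 only wins on size when CJK characters outnumber ASCII ones.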
When to Use UTF-8
UTF-8 is the right choice for almost all new projects:
- Web content: HTML, CSS, JavaScript, JSON, XML, SVG, and APIs.
- Configuration files: YAML, TOML, INI, and environment files.
- Source code: Nearly all programming languages recommend UTF-8 for source files.
- Data interchange: CSV files, log files, and text-based protocols.
- Cross-platform applications: UTF-8 works consistently across Linux, macOS, and Windows (with proper configuration).
- Network protocols: HTTP, SMTP, and most application-level protocols are built on ASCII-compatible text, so UTF-8 is the natural choice for their non-ASCII content.
When to Use UTF-16
UTF-16 remains relevant in specific scenarios:
- Windows system programming: Windows API functions that use `wchar_t` (which is UTF-16 on Windows). If you are writing Windows desktop applications using Win32 or COM, UTF-16 is often unavoidable.
- Java and .NET internals: Both Java and .NET use UTF-16 for internal string representation. This matters when you are working with raw memory, serialization, or performance-critical code.
- Some database systems: SQL Server uses UTF-16 for `NCHAR`, `NVARCHAR`, and `NTEXT` columns. MySQL 8.0 defaults to UTF-8 (`utf8mb4`) and also offers a `utf16` character set; PostgreSQL typically uses UTF-8 and does not offer UTF-16 as a database encoding.
- Legacy systems: Older systems and file formats may require UTF-16. Migration strategies should include careful encoding conversion.
Byte Order Mark (BOM) Considerations
UTF-8 technically does not need a BOM because it is byte-order independent. However, some Windows applications (like Notepad) add a BOM (the bytes EF BB BF) to the beginning of UTF-8 files. This can cause issues with Unix tools, PHP output buffering, and HTTP headers. It is generally recommended to save UTF-8 files without a BOM unless you have a specific compatibility requirement.
UTF-16 requires either a BOM or external knowledge of the byte order. When exchanging UTF-16 data between systems with different architectures, the BOM is essential for correct interpretation.
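For the UTF-8 case, Python ships a codec named `utf-8-sig` that drops a leading BOM when decoding and leaves BOM-free input untouched. A minimal sketch (the helper name `strip_utf8_bom` is hypothetical):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return UTF-8 bytes with a leading BOM (EF BB BF) removed, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "name,value\n".encode("utf-8")

# The plain codec keeps the BOM as a U+FEFF character at the start of the text.
print(repr(with_bom.decode("utf-8")[0]))       # '\ufeff'

# "utf-8-sig" removes it while decoding.
print(repr(with_bom.decode("utf-8-sig")))      # 'name,value\n'

# Stripping at the byte level gives the same content.
print(strip_utf8_bom(with_bom) == b"name,value\n")  # True
```

Reading with `utf-8-sig` and writing with plain `utf-8` is one way to normalize files to the BOM-free form recommended above.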
Migration Between Encodings
If you need to convert between UTF-8 and UTF-16, most programming languages provide built-in functions:
- Python: `'text'.encode('utf-8')` or `'text'.encode('utf-16')`
- JavaScript: `TextEncoder` and `TextDecoder` APIs (`TextEncoder` only produces UTF-8; `TextDecoder` can decode UTF-16 input)
- Command line: `iconv -f UTF-8 -t UTF-16 input.txt > output.txt`
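As a rough Python equivalent of the `iconv` command, a whole file can be converted by decoding its bytes and re-encoding them. This is a minimal sketch that loads the file into memory; `convert_file` and its parameter names are hypothetical, and it works on raw bytes so that line endings are left untouched.

```python
from pathlib import Path


def convert_file(src: Path, dst: Path,
                 src_encoding: str = "utf-8",
                 dst_encoding: str = "utf-16") -> None:
    """Re-encode a text file; raises UnicodeDecodeError if src is not valid."""
    text = src.read_bytes().decode(src_encoding)
    # Python's "utf-16" codec writes a BOM followed by native-order code units.
    dst.write_bytes(text.encode(dst_encoding))


# Example usage (the file names are placeholders):
# convert_file(Path("input.txt"), Path("output.txt"))
```

Passing `utf-16-le` or `utf-16-be` as `dst_encoding` selects a byte order explicitly, and those codecs write no BOM.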
Recommendation
Choose UTF-8 for almost all new projects. It is the dominant encoding for the web, most programming languages, and modern systems. The space saved by UTF-16 for CJK-heavy content rarely justifies the compatibility costs. UTF-8's ASCII compatibility, self-synchronizing property, and widespread tool support make it the clear winner for general-purpose use. Reserve UTF-16 for specific scenarios where Windows API compatibility, Java/.NET internals, or legacy system requirements force its use.