Fix: Python UnicodeDecodeError – 'utf-8' codec can't decode byte
Quick Answer
How to fix Python's UnicodeDecodeError: 'utf-8' codec can't decode byte 0xXX in position N: invalid start byte, covering chardet detection, encoding fallbacks, BOM handling, pandas CSV encoding, and PYTHONIOENCODING.
The Error
You try to read a file in Python and get this traceback:
```
Traceback (most recent call last):
  File "app.py", line 2, in <module>
    content = f.read()
              ^^^^^^^^
  File "/usr/lib/python3.12/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 128: invalid start byte
```

Or a variation like:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 54: invalid start byte
```

The hex byte value (0xe9, 0xff, 0xc0, etc.) and position change depending on your file, but the error is the same: Python tried to read the file as UTF-8, hit a byte that isn’t valid UTF-8, and crashed.
Why This Happens
Python 3 defaults to UTF-8 when opening text files. If the file was saved in a different encoding — Latin-1 (ISO-8859-1), Windows-1252 (cp1252), Shift-JIS, GB2312, or any other non-UTF-8 encoding — some bytes in that file won’t be valid UTF-8 sequences.
Here’s what’s happening at the byte level. UTF-8 uses specific patterns to represent characters:
- Single-byte characters (ASCII): 0x00–0x7F
- Lead bytes that start multi-byte sequences: 0xC2–0xF4
- Continuation bytes: 0x80–0xBF
When Python encounters a byte like 0xe9 that starts a 3-byte UTF-8 sequence but isn’t followed by the correct continuation bytes, it raises UnicodeDecodeError. In Latin-1 encoding, 0xe9 is simply the character é — a single byte, no continuation needed. The mismatch between what Python expects (UTF-8) and what the file actually is (Latin-1) causes the crash.
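You can reproduce the mismatch in a few lines of plain Python (the byte string stands in for a file saved by a Latin-1 editor):

```python
# 0xE9 is "é" in Latin-1, but in UTF-8 it announces a 3-byte
# sequence whose continuation bytes never arrive.
raw = b"r\xe9sum\xe9"  # "résumé" as saved in Latin-1

try:
    raw.decode("utf-8")              # fails: 's' (0x73) is not a continuation byte
except UnicodeDecodeError as e:
    print(f"UTF-8 failed: {e}")

print(raw.decode("latin-1"))         # résumé — every byte maps directly to a character
```

The same bytes decode cleanly or crash depending solely on which encoding you tell Python to use.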
The most common scenarios:
- Legacy files. Files created by older Windows applications are often saved in cp1252 (Windows-1252), not UTF-8. This includes CSV exports from Excel, log files from legacy software, and database dumps.
- Mixed-encoding data. A database or API returns data with inconsistent encoding. Some rows are UTF-8, others are Latin-1.
- BOM (Byte Order Mark). A file starts with 0xFF 0xFE or 0xFE 0xFF — a BOM from UTF-16 encoding. Python’s default utf-8 codec doesn’t handle this.
- Binary data in a text file. The file contains embedded binary data (images, compressed chunks) that you’re trying to read as text.
- Terminal/locale mismatch. Your system’s locale or PYTHONIOENCODING doesn’t match the data you’re processing.
Fix 1: Detect the File’s Actual Encoding with chardet
Don’t guess the encoding — detect it. The chardet library analyzes the byte patterns in a file and tells you the most likely encoding.
Install it:
```
pip install chardet
```

Then detect the encoding before reading:
```python
import chardet

# Read the file as raw bytes first
with open("data.csv", "rb") as f:
    raw_data = f.read()

# Detect the encoding
result = chardet.detect(raw_data)
print(result)
# {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

# Now read with the detected encoding
with open("data.csv", "r", encoding=result["encoding"]) as f:
    content = f.read()
```

The confidence value tells you how sure chardet is about its guess. Anything above 0.7 is usually reliable. Below that, you may need to try the encoding manually or inspect the file.
For large files, you don’t need to read the entire thing for detection. Feed chunks instead:
```python
import chardet

detector = chardet.UniversalDetector()
with open("large_file.csv", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)
```

Pro Tip: If chardet is too slow for your use case, try charset-normalizer instead. It’s the library requests uses internally, and it’s often faster while maintaining good accuracy. Install it with pip install charset-normalizer and use from charset_normalizer import from_bytes. If you’re having trouble installing either package, check our guide on fixing ModuleNotFoundError.
Fix 2: Open the File with the Correct Encoding
If you already know the file’s encoding, pass it directly to open():
```python
# For Latin-1 encoded files
with open("data.txt", "r", encoding="latin-1") as f:
    content = f.read()

# For Windows-1252 encoded files (common with Windows-created files)
with open("data.txt", "r", encoding="cp1252") as f:
    content = f.read()

# For Shift-JIS (common with Japanese text)
with open("data.txt", "r", encoding="shift_jis") as f:
    content = f.read()
```

Latin-1 (ISO-8859-1) is a special case. It maps every byte value from 0x00 to 0xFF to a character, which means it never raises a UnicodeDecodeError. Opening any file with encoding="latin-1" will always succeed. This makes it useful as a fallback, but be aware that the decoded text may contain wrong characters if the file isn’t actually Latin-1.
cp1252 vs. Latin-1: Windows-1252 is almost identical to Latin-1, but it defines extra characters in the 0x80–0x9F range (like curly quotes, em dashes, and the euro sign). If you’re dealing with Windows-origin files, cp1252 is usually the better choice over latin-1.
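The difference is easy to see with byte 0x93, one of the Windows "smart quote" bytes in the 0x80–0x9F range:

```python
# 0x93/0x94 are curly quotes in cp1252 but invisible control
# characters in Latin-1.
smart_quoted = b"\x93hello\x94"          # “hello” as written by Word or Excel

print(smart_quoted.decode("cp1252"))     # “hello”
print(smart_quoted.decode("latin-1"))    # also succeeds, but yields control chars

# Latin-1 never fails — all 256 byte values map to a character:
assert len(bytes(range(256)).decode("latin-1")) == 256
```

This is why a Latin-1 fallback never crashes but can quietly produce the wrong text for Windows-origin files.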
Here’s a practical fallback pattern:
```python
def read_file_safe(filepath):
    """Try UTF-8 first, fall back to cp1252."""
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(filepath, "r", encoding="cp1252") as f:
            return f.read()
```

This covers the vast majority of files you’ll encounter in Western-language environments.
Fix 3: Use errors="replace" or errors="ignore"
If you need to read a file and don’t care about a few garbled characters, use the errors parameter:
```python
# Replace undecodable bytes with the Unicode replacement character (�)
with open("data.txt", "r", encoding="utf-8", errors="replace") as f:
    content = f.read()

# Silently skip undecodable bytes
with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()
```

errors="replace" substitutes each bad byte with � (U+FFFD). You can see where the problems are, and the rest of the text is intact.

errors="ignore" drops the bad bytes entirely. The output is clean but you lose data — characters disappear without a trace.
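The difference is easy to see on a small byte string (a stand-in for a Latin-1 file read as UTF-8):

```python
data = b"caf\xe9 au lait"  # 0xE9 is not valid UTF-8 here

print(data.decode("utf-8", errors="replace"))  # caf� au lait — the damage is visible
print(data.decode("utf-8", errors="ignore"))   # caf au lait — the byte vanishes silently
```

With "replace" you can grep for � later; with "ignore" the evidence is gone.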
There’s also a third option:

```python
# Replace bad bytes with backslash escape sequences
with open("data.txt", "r", encoding="utf-8", errors="backslashreplace") as f:
    content = f.read()
```

errors="backslashreplace" converts bad bytes to escape sequences like \xe9. This is useful if you need to preserve the original byte values for debugging. (Note: errors="xmlcharrefreplace" is not an option here — Python only supports it when encoding, not when decoding.)
Common Mistake: Don’t use errors="ignore" as a permanent fix in production code. You’re silently losing data. If the file is a CSV, missing characters can shift column values. If it’s a config file, you might lose critical settings. Use errors="replace" instead so you can at least spot where data was mangled, or better yet, detect and use the correct encoding.
Fix 4: Handle BOM (Byte Order Mark)
If your error is specifically UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0 or byte 0xfe in position 0, the file likely has a BOM (Byte Order Mark).
A BOM is a special marker at the beginning of a file that indicates its encoding and byte order. Common BOMs:
| BOM Bytes | Encoding |
|---|---|
| EF BB BF | UTF-8 with BOM |
| FF FE | UTF-16 LE |
| FE FF | UTF-16 BE |
| FF FE 00 00 | UTF-32 LE |
Python has a built-in encoding that handles the UTF-8 BOM automatically:
```python
# utf-8-sig strips the BOM if present, reads normally if not
with open("data.txt", "r", encoding="utf-8-sig") as f:
    content = f.read()
```

For UTF-16 files, Python’s utf-16 codec handles the BOM automatically:

```python
with open("data.txt", "r", encoding="utf-16") as f:
    content = f.read()
```

If you’re not sure whether a file has a BOM, check the first few bytes:
```python
with open("data.txt", "rb") as f:
    start = f.read(4)

if start[:3] == b"\xef\xbb\xbf":
    encoding = "utf-8-sig"
elif start[:2] == b"\xff\xfe":
    encoding = "utf-16-le"
elif start[:2] == b"\xfe\xff":
    encoding = "utf-16-be"
else:
    encoding = "utf-8"

# Note: decoding with utf-16-le/-be keeps the BOM as U+FEFF at the
# start of the text; strip it with content.lstrip("\ufeff") if needed.
with open("data.txt", "r", encoding=encoding) as f:
    content = f.read()
```

Note: Older versions of Windows Notepad (before Windows 10 1903) added a BOM when saving as “UTF-8”, and some legacy editors still do. Modern Notepad and editors like VS Code default to BOM-free UTF-8.
Fix 5: Set Encoding in pandas read_csv
If you get the UnicodeDecodeError while reading a CSV with pandas, pass the encoding parameter:
```python
import pandas as pd

# Default (fails on non-UTF-8 files)
# df = pd.read_csv("data.csv")  # UnicodeDecodeError

# Specify the correct encoding
df = pd.read_csv("data.csv", encoding="cp1252")
```

If you don’t know the encoding, use chardet first:
```python
import chardet
import pandas as pd

with open("data.csv", "rb") as f:
    result = chardet.detect(f.read(100000))  # Read first 100KB

df = pd.read_csv("data.csv", encoding=result["encoding"])
```

pandas also supports the encoding_errors parameter (added in pandas 1.3.0):
```python
# Replace bad bytes instead of crashing
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="replace")
```

For Excel files exported as CSV, cp1252 is almost always the right encoding on Windows systems. Excel on macOS may use mac_roman. If the CSV was exported from a database, check the database’s character set configuration.
When dealing with large CSV files that take a long time to process and you encounter this error midway through, consider using chunksize with error handling so you don’t lose all progress:
```python
import pandas as pd

chunks = []
for chunk in pd.read_csv("data.csv", encoding="cp1252", chunksize=10000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
```

Fix 6: Set Database Connection Charset
If the data causing the error comes from a database, the problem may be at the connection level. The database client needs to know what encoding to use when transferring data.
MySQL / MariaDB:
```python
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="password",
    database="mydb",
    charset="utf8mb4"  # Use utf8mb4, not utf8
)
```

Warning: MySQL’s utf8 charset is actually UTF-8 limited to 3 bytes (no emoji, no rare CJK characters). Always use utf8mb4 for full UTF-8 support.
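To see why the 3-byte limit matters, check how many bytes a character needs in UTF-8 — no database required:

```python
# Emoji and rare CJK characters live outside the Basic Multilingual
# Plane and need 4 bytes in UTF-8, which legacy MySQL "utf8" rejects.
print(len("é".encode("utf-8")))   # 2 bytes — fits in utf8
print(len("🚀".encode("utf-8")))  # 4 bytes — requires utf8mb4
```

Any column that might store user-generated text should use utf8mb4 for this reason.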
PostgreSQL with psycopg2:
```python
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="mydb",
    user="postgres",
    password="password",
    options="-c client_encoding=UTF8"
)
```

SQLAlchemy:
```python
from sqlalchemy import create_engine

# MySQL
engine = create_engine("mysql+pymysql://user:pass@localhost/mydb?charset=utf8mb4")

# PostgreSQL
engine = create_engine("postgresql://user:pass@localhost/mydb?client_encoding=utf8")
```

If your database contains data that was stored with inconsistent encodings (a common problem with legacy databases), you may need to clean the data at the database level before reading it in Python. This is often the case when applications didn’t enforce encoding on input, leading to a mix of Latin-1, cp1252, and UTF-8 data in the same column.
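One common repair for such columns is reversing classic mojibake: UTF-8 bytes that were decoded as Latin-1 somewhere in the pipeline (so "café" shows up as "cafÃ©"). A minimal sketch, assuming a hypothetical `fix_mojibake` helper — note this heuristic can misfire on text that legitimately contains characters like Ã, so review results before writing them back:

```python
def fix_mojibake(s: str) -> str:
    """Undo a latin-1-for-utf-8 mixup; return s unchanged if not applicable."""
    try:
        # Re-encode to the original bytes, then decode them correctly.
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # already clean, or not this kind of corruption

print(fix_mojibake("cafÃ©"))  # café
print(fix_mojibake("café"))   # café (left alone — round-trip fails safely)
```

Run this over the suspect column in Python, or apply the equivalent CONVERT logic in SQL.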
Fix 7: Set PYTHONIOENCODING for Terminal/Subprocess Issues
Sometimes the error happens not when reading files, but when printing output or piping data between processes. This is a terminal encoding issue.
Set the PYTHONIOENCODING environment variable:
Linux/macOS:

```
export PYTHONIOENCODING=utf-8
python script.py
```

Windows (Command Prompt):

```
set PYTHONIOENCODING=utf-8
python script.py
```

Windows (PowerShell):

```
$env:PYTHONIOENCODING = "utf-8"
python script.py
```

You can also set it permanently in your shell profile (.bashrc, .zshrc, etc.):

```
export PYTHONIOENCODING=utf-8
```

If the issue happens specifically with subprocess calls, set the encoding there:
```python
import subprocess

result = subprocess.run(
    ["some_command"],
    capture_output=True,
    text=True,
    encoding="utf-8",
    errors="replace"  # Don't crash on bad bytes
)
print(result.stdout)
```

On Windows, you might also need to rewrap standard output so it emits UTF-8:

```python
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")
```

If your Python installation itself isn’t being found, check our guide on fixing “python: command not found”.
Fix 8: Convert the File to UTF-8
Sometimes the cleanest solution is to convert the file to UTF-8 once, then use it normally. This is especially useful for files you’ll read repeatedly.
Using Python:
```python
import chardet

# Step 1: Detect the current encoding
with open("data.txt", "rb") as f:
    raw = f.read()

detected = chardet.detect(raw)
print(f"Detected: {detected['encoding']} (confidence: {detected['confidence']})")

# Step 2: Decode with the detected encoding, re-encode as UTF-8
text = raw.decode(detected["encoding"])
with open("data.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

Using iconv (Linux/macOS command line):

```
iconv -f WINDOWS-1252 -t UTF-8 data.txt > data_utf8.txt
```

Using PowerShell (Windows):

```
Get-Content data.txt -Encoding Default | Set-Content data_utf8.txt -Encoding UTF8
```

(In Windows PowerShell 5.1, -Encoding Default means the system’s ANSI code page; newer PowerShell versions no longer accept Default, so name the source encoding explicitly there.)

After conversion, all your existing code that uses encoding="utf-8" (or relies on the default) will work without changes.
Still Not Working?
If you’ve tried the fixes above and still get UnicodeDecodeError, try these less common solutions:
Check if the file is actually binary. Some files that look like text files are actually binary (compressed, encrypted, or serialized data). Open in binary mode to check:
```python
with open("mystery_file", "rb") as f:
    print(f.read(100))
```

If the output looks like random bytes rather than readable text with occasional bad characters, the file isn’t a text file at all. Read it in binary mode ("rb") and handle accordingly.
Inspect the exact problematic byte. The error message tells you the position. Find the offending byte:
```python
with open("data.txt", "rb") as f:
    data = f.read()

# If error says "position 128"
pos = 128
print(f"Byte at position {pos}: {hex(data[pos])}")
print(f"Context: {data[max(0, pos-10):pos+10]}")
```

This helps you determine whether the issue is a single corrupted byte or a systemic encoding mismatch.
Check your Python script’s own encoding. If the error is in your Python source file itself (not in a file you’re reading), add an encoding declaration at the top:
```python
# -*- coding: utf-8 -*-
```

This is rarely needed in Python 3 (which defaults to UTF-8 for source files), but it’s required in Python 2 and can help if your editor saves files in a different encoding. If you’re seeing other syntax-related errors in your scripts, check our guide on fixing Python IndentationError.
Handle mixed-encoding data line by line. If a file has mostly UTF-8 data with a few lines in a different encoding (common in log files), process it line by line:
```python
results = []
with open("mixed.log", "rb") as f:
    for line_num, line in enumerate(f, 1):
        try:
            decoded = line.decode("utf-8")
        except UnicodeDecodeError:
            decoded = line.decode("cp1252", errors="replace")
            print(f"Warning: Line {line_num} was not UTF-8")
        results.append(decoded)
```

Check for null bytes. Some files contain null bytes (0x00) that indicate they’re UTF-16 encoded but have been misidentified:
```python
with open("data.txt", "rb") as f:
    content = f.read(100)

if b"\x00" in content:
    print("File may be UTF-16 encoded")
    with open("data.txt", "r", encoding="utf-16") as f:
        text = f.read()
```

Watch for circular import issues masking encoding errors. In rare cases, if your Python project has circular imports, the real error might be masked. If you’re getting unexpected errors during module loading, check our guide on fixing circular imports in Python.
Debug recursive file processing. If you’re processing files recursively and the error appears deep in a directory tree, add error handling to identify which file is causing the problem:
```python
import os

for root, dirs, files in os.walk("data_directory"):
    for filename in files:
        filepath = os.path.join(root, filename)
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
        except UnicodeDecodeError as e:
            print(f"Encoding error in {filepath}: {e}")
        except Exception as e:
            print(f"Other error in {filepath}: {e}")
```

If your recursive processing is hitting Python’s recursion limit, see our guide on fixing Python RecursionError.
Set the locale on Linux servers. If the error happens only in production (SSH, cron jobs, Docker containers), the locale may not be set:
```
# Check current locale
locale

# If LANG is empty or set to "C" / "POSIX", set it:
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```

In a Dockerfile:

```
ENV LANG=en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
```

Python 2 to 3 migration issues. If you’re porting code from Python 2 to Python 3, the handling of strings changed fundamentally. Python 2 strings are byte strings by default, while Python 3 strings are Unicode by default. Code that worked in Python 2 without any encoding declarations will often break in Python 3. The fix is always the same: explicitly specify the encoding when opening files.
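The core of the change is that bytes and text are now distinct types, and the decode step between them is explicit:

```python
# In Python 3, reading bytes and turning them into text are two
# separate, explicit steps.
raw = b"na\xefve"               # bytes — what Python 2 called str
text = raw.decode("latin-1")    # explicit decode with a named encoding

assert isinstance(raw, bytes) and isinstance(text, str)
print(text)  # naïve
```

Anywhere Python 2 code mixed the two implicitly, Python 3 needs you to say which encoding applies.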