Universal character encoding detector.
chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x, just much faster and more accurate. Python 3.10+, zero runtime dependencies, works on PyPy.
98.1% accuracy on 2,510 test files. 43x faster than chardet 6.0.0 and 6.8x faster than charset-normalizer. Language detection for every result. MIT licensed.
| | chardet 7.0 (mypyc) | chardet 7.0 (pure) | chardet 6.0.0 | charset-normalizer |
|---|---|---|---|---|
| Accuracy (2,510 files) | 98.1% | 98.1% | 88.2% | 78.5% |
| Speed | 546 files/s | 383 files/s | 13 files/s | 80 files/s |
| Language detection | 95.1% | 95.1% | -- | -- |
| Peak memory | 26.2 MiB | 26.3 MiB | 29.5 MiB | 101.2 MiB |
| Streaming detection | yes | yes | yes | no |
| Encoding era filtering | yes | yes | no | no |
| Supported encodings | 99 | 99 | 84 | 99 |
| License | MIT | MIT | LGPL | MIT |
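In practice, detection is usually followed by decoding. A minimal sketch of that pattern — the `decode_best_effort` helper below is illustrative, not part of the chardet API:

```python
import chardet

def decode_best_effort(data: bytes) -> str:
    """Decode bytes using chardet's best guess, falling back to UTF-8."""
    result = chardet.detect(data)
    # result["encoding"] can be None when detection fails (e.g. empty input)
    encoding = result["encoding"] or "utf-8"
    return data.decode(encoding, errors="replace")

print(decode_best_effort("Grüße aus Köln".encode("latin-1")))
```

With `errors="replace"`, the helper always returns a string even when the guess is wrong, which is usually the right trade-off for logging and display.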
```bash
pip install chardet
```

```python
import chardet

# Plain ASCII is reported as its superset Windows-1252 by default,
# in keeping with the WHATWG guidelines for encoding detection.
chardet.detect(b"Hello, world!")
# {'encoding': 'Windows-1252', 'confidence': 1.0, 'language': 'en'}

# UTF-8 with typographic punctuation
chardet.detect("It\u2019s a lovely day \u2014 let\u2019s grab coffee.".encode("utf-8"))
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': 'en'}

# Japanese EUC-JP
chardet.detect("これは日本語のテストです。文字コードの検出を行います。".encode("euc-jp"))
# {'encoding': 'euc-jis-2004', 'confidence': 1.0, 'language': 'ja'}

# Get all candidate encodings ranked by confidence
text = "Le café est une boisson très populaire en France et dans le monde entier."
results = chardet.detect_all(text.encode("windows-1252"))
for r in results:
    print(r["encoding"], r["confidence"])
# windows-1252 0.44
# iso-8859-15 0.44
# mac-roman 0.42
# cp858 0.42
```

For large files or network streams, use `UniversalDetector` to feed data incrementally:
```python
from chardet import UniversalDetector

detector = UniversalDetector()
with open("unknown.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
result = detector.close()
print(result)
```

Restrict detection to specific encoding eras to reduce false positives:
```python
from chardet import detect_all
from chardet.enums import EncodingEra

data = "Москва является столицей Российской Федерации и крупнейшим городом страны.".encode("windows-1251")

# All encoding eras are considered by default — 4 candidates across eras
for r in detect_all(data):
    print(r["encoding"], round(r["confidence"], 2))
# windows-1251 0.5
# mac-cyrillic 0.47
# kz-1048 0.22
# ptcp154 0.22

# Restrict to modern web encodings — 1 confident result
for r in detect_all(data, encoding_era=EncodingEra.MODERN_WEB):
    print(r["encoding"], round(r["confidence"], 2))
# windows-1251 0.5
```

The `chardetect` command-line tool is also included:

```bash
chardetect somefile.txt
# somefile.txt: utf-8 with confidence 0.99

chardetect --minimal somefile.txt
# utf-8

# Pipe from stdin
cat somefile.txt | chardetect
```

- MIT license (previous versions were LGPL)
- Ground-up rewrite — 12-stage detection pipeline using BOM detection, structural probing, byte validity filtering, and bigram statistical models
- 43x faster than chardet 6.0.0 with mypyc (30x pure Python), 6.8x faster than charset-normalizer
- 98.1% accuracy — +9.9pp vs chardet 6.0.0, +19.6pp vs charset-normalizer
- Language detection — 95.1% accuracy across 49 languages, returned with every result
- 99 encodings — full coverage including EBCDIC, Mac, DOS, and Baltic/Central European families
- `EncodingEra` filtering — scope detection to modern web encodings, legacy ISO/Mac/DOS, mainframe, or all
- Optional mypyc compilation — 1.42x additional speedup on CPython
- Thread-safe — `detect()` and `detect_all()` are safe to call concurrently; scales on free-threaded Python
- Same API — `detect()`, `detect_all()`, `UniversalDetector`, and the `chardetect` CLI all work as before
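Because `detect()` is safe to call concurrently, batch detection parallelizes with a plain thread pool and no locking. A sketch — the sample payloads are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

import chardet

# Illustrative payloads; in practice these would come from files or sockets.
samples = [
    "café au lait".encode("utf-8"),
    "Здравствуйте".encode("utf-8"),
    b"plain ascii",
]

# detect() is thread-safe, so the pool can share it without synchronization.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(chardet.detect, samples))

for sample, result in zip(samples, results):
    print(len(sample), result["encoding"], result["confidence"])
```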
Full documentation is available at chardet.readthedocs.io.