Charset Normalizer Versions Save

Truly universal encoding detector in pure Python

3.3.2

6 months ago

3.3.2 (2023-10-31)

Fixed

  • Unintentional memory usage regression when using large payloads that match several encodings (#376)
  • Regression on some detection cases showcased in the documentation (#371)

Added

  • Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)

3.3.1

6 months ago

3.3.1 (2023-10-22)

Changed

  • Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
  • Improved the general detection reliability based on reports from the community

3.3.0

7 months ago

3.3.0 (2023-09-30)

Added

  • Allow to execute the CLI (e.g. normalizer) through python -m charset_normalizer.cli or python -m charset_normalizer
  • Support for 9 forgotten encodings that are supported by Python but unlisted in encoding.aliases as they have no alias (#323)

Removed

  • (internal) Redundant utils.is_ascii function and unused function is_private_use_only
  • (internal) charset_normalizer.assets is moved inside charset_normalizer.constant

Changed

  • (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
  • Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8

Fixed

  • Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in __lt__ (#350)

3.2.0

10 months ago

3.2.0 (2023-06-07)

Changed

  • Typehint for function from_path no longer enforce PathLike as its first argument
  • Minor improvement over the global detection reliability

Added

  • Introduce function is_binary that relies on main capabilities, and is optimized to detect binaries
  • Propagate enable_fallback argument throughout from_bytes, from_path, and from_fp that allow a deeper control over the detection (default True)
  • Explicit support for Python 3.12

Fixed

  • Edge case detection failure where a file would contain 'very-long' camel-cased word (Issue #289)

3.1.0

1 year ago

3.1.0 (2023-03-06)

Added

  • Argument should_rename_legacy for legacy function detect and disregard any new arguments without errors (PR #262)

Removed

  • Support for Python 3.6 (PR #260)

Changed

  • Optional speedup provided by mypy/c 1.0.1

3.0.1

1 year ago

3.0.1 (2022-11-18)

Fixed

  • Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)

Changed

  • Speedup provided using mypy/c 0.990 on Python >= 3.7

3.0.0

1 year ago

3.0.0 (2022-10-20)

Added

  • Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
  • Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
  • Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
  • normalizer --version now specify if the current version provides extra speedup (meaning mypyc compilation whl)

Changed

  • Build with static metadata (not pyproject.toml yet)
  • Make language detection stricter
  • Optional: Module md.py can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1

Fixed

  • CLI with opt --normalize fail when using full path for files
  • TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha characters have been fed to it
  • Sphinx warnings when generating the documentation

Removed

  • Coherence detector no longer returns 'Simple English' instead returns 'English'
  • Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'
  • Breaking: Method first() and best() from CharsetMatch
  • UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)
  • Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
  • Breaking: Top-level function normalize
  • Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch
  • Support for the backport unicodedata2

This is the last version (3.0.x) to support Python 3.6 We plan to drop it for 3.1.x

3.0.0rc1

1 year ago

This is the last pre-release. If everything goes well, I will publish the stable tag.

3.0.0rc1 (2022-10-18)

Added

  • Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
  • Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
  • Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio

Changed

  • Build with static metadata using 'build' frontend
  • Make language detection stricter

Fixed

  • CLI with opt --normalize fail when using full path for files
  • TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha characters have been fed to it

Removed

  • Coherence detector no longer returns 'Simple English' instead returns 'English'
  • Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'

3.0.0b2

1 year ago

3.0.0b2 (2022-08-21)

Added

  • normalizer --version now specify if current version provide extra speedup (meaning mypyc compilation whl)

Removed

  • Breaking: Method first() and best() from CharsetMatch
  • UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)

Fixed

  • Sphinx warnings when generating the documentation

2.1.1

1 year ago

2.1.1 (2022-08-19)

Deprecated

  • Function normalize scheduled for removal in 3.0

Changed

  • Removed useless call to decode in fn is_unprintable (#206)

Fixed

  • Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from @aleksandernovikov (#204)