Lingua Versions

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

v1.2.2

1 year ago

Bug Fixes

Due to a bug in the Moshi JSON serialization library, language detection was not possible in certain cases. (#144, #147)
Lingua could not be used properly when a security manager was enabled in the JVM. (#141)

v1.2.1

1 year ago

Bug Fixes

An exception was thrown when trying to detect the language of unigrams and bigrams in low accuracy mode, which operates only on trigrams and larger strings. This has been fixed.
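
The failure mode above can be illustrated with a minimal, self-contained Java sketch. It is not Lingua's actual code: the class TrigramGuard and its method trigrams are hypothetical names chosen for illustration. The sketch assumes that a trigram-only mode has to handle inputs shorter than three characters explicitly instead of failing on them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Lingua's API.
class TrigramGuard {

    // Returns every trigram of the text, or an empty list when the text
    // is too short to contain one (unigram or bigram input).
    static List<String> trigrams(String text) {
        List<String> result = new ArrayList<>();
        if (text.length() < 3) {
            return result; // explicit guard instead of failing on short input
        }
        for (int i = 0; i + 3 <= text.length(); i++) {
            result.add(text.substring(i, i + 3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("ab"));   // prints []
        System.out.println(trigrams("abcd")); // prints [abc, bcd]
    }
}
```

A caller can then treat an empty trigram list as "no reliable detection possible" rather than as an error.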

v1.2.0

1 year ago

Features

The library can now be used as a Java 9 module. Thanks to @Marcono1234 for helping with the implementation. (#120, #138)
The new method LanguageDetectorBuilder.withLowAccuracyMode() has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance. (#136)

Improvements

The memory footprint has been reduced significantly by applying several internal optimizations. Thanks to @Marcono1234, @fvasco and @sigpwned for their help. (#101, #127)
Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint and a 36% smaller jar file.

Bug Fixes

A bug in the rule engine that caused incorrect language detection for certain texts has been fixed. Thanks to @bdecarne for finding it.

Other changes

Due to a refactoring of how the internal thread pool works, the method LanguageDetector.destroy() has been deprecated in favor of the newly introduced method LanguageDetector.unloadLanguageModels().

v1.1.1

2 years ago

Improvements

The new method LanguageDetector.destroy() has been introduced that frees internal resources to prevent memory leaks within application server deployments. (#110, #116)
Language model loading performance has been improved by creating a manually optimized internal thread pool. This replaces the coroutines used in the previous release. (#116)

Bug Fixes

The character â was erroneously not treated as a possible indicator for the French language. (#115)
Language detection was non-deterministic when multiple alphabets had the same occurrence count. (#105)
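
To make the second fix more concrete, here is a hedged sketch of one way to break ties between alphabets deterministically. It is an assumption about the general technique, not a copy of Lingua's implementation, and the class name ScriptTieBreak is hypothetical. It counts letters per Unicode script in an EnumMap, whose iteration order is fixed, and keeps the first script among those with the highest count.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical illustration; not part of Lingua's API.
class ScriptTieBreak {

    // Returns the script with the most letters in the text, or null if the
    // text contains no letters. Ties are resolved by enum declaration order.
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
            new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        Character.UnicodeScript best = null;
        int bestCount = 0;
        // EnumMap iterates in a fixed order; the strict '>' keeps the first
        // of several scripts that share the highest count.
        for (Map.Entry<Character.UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two Latin and two Cyrillic letters, in either order of appearance.
        System.out.println(
            dominantScript("ab\u0432\u0433") == dominantScript("\u0432\u0433ab")); // prints true
    }
}
```

The important property is that the result depends only on the counts, not on the iteration order of a hash-based collection.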

v1.1.0

3 years ago

Languages

There is now support for the Maori language, which was contributed to the Rust implementation of Lingua. (#93)

Features

Language models are now loaded asynchronously and in parallel using Kotlin coroutines, making this step more performant. (#84)
Language models can now be loaded either lazily (default) or eagerly. (#79)
Instead of loading multiple copies of the language models into memory for each separate instance of LanguageDetector, multiple instances now share the same language models and access them asynchronously. (#91)
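
The idea of sharing models between detector instances, with lazy or eager loading, can be sketched as follows. This is a simplified illustration under stated assumptions, not Lingua's implementation: SharedModelCache, loadFromDisk, model and preload are hypothetical names, and a model is reduced to a map from ngram to probability.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not part of Lingua's API.
class SharedModelCache {

    // A single static cache shared by every detector instance.
    private static final Map<String, Map<String, Double>> MODELS = new ConcurrentHashMap<>();

    // Counts how often a model was actually loaded (for demonstration only).
    static final AtomicInteger LOADS = new AtomicInteger();

    // Stand-in for reading a language model file.
    private static Map<String, Double> loadFromDisk(String language) {
        LOADS.incrementAndGet();
        return Collections.singletonMap("the", 0.5);
    }

    // Lazy access: the model is loaded on first use and reused afterwards.
    static Map<String, Double> model(String language) {
        return MODELS.computeIfAbsent(language, SharedModelCache::loadFromDisk);
    }

    // Eager loading: fill the cache up front for the given languages.
    static void preload(String... languages) {
        for (String language : languages) {
            model(language);
        }
    }

    public static void main(String[] args) {
        preload("en", "de");             // eager: both models are loaded now
        model("en");                     // already cached, no second load
        System.out.println(LOADS.get()); // prints 2
    }
}
```

ConcurrentHashMap.computeIfAbsent ensures that concurrent callers asking for the same language trigger at most one load.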

Improvements

Language detection for sentences with more than 120 characters is now faster because it iterates through trigrams only, which is enough to achieve high detection accuracy.
Textual input that includes logograms from Chinese, Japanese or Korean is now split at each logogram and not only at whitespace. This provides for more reliable language detection for sentences that include multi-language content. (#85)
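
The splitting behavior described in the last item can be sketched like this. The code is an illustration only, not Lingua's tokenizer: LogogramSplitter is a hypothetical name, and treating the Unicode scripts Han, Hiragana, Katakana and Hangul as split points is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Lingua's API.
class LogogramSplitter {

    // Splits at whitespace and additionally emits each CJK character
    // as a token of its own.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isWhitespace(codePoint)) {
                flush(current, tokens);
            } else if (isLogogram(codePoint)) {
                flush(current, tokens);
                tokens.add(new String(Character.toChars(codePoint)));
            } else {
                current.appendCodePoint(codePoint);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Moves the pending word, if any, into the token list.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    // Assumption: these four scripts mark characters that are split individually.
    static boolean isLogogram(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    public static void main(String[] args) {
        // "I love" + two Han characters + "text": three words plus two single-character tokens.
        System.out.println(split("I love \u4e2d\u6587 text").size()); // prints 5
    }
}
```

Splitting this way lets each part of a mixed-language sentence be evaluated separately.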

Bug Fixes

For an odd number of words as input, the method LanguageDetector.computeLanguageConfidenceValues computed wrong values under certain circumstances. (#87)
When Lingua was used in projects with an explicitly set Kotlin version that differed from Lingua's implicitly set version in the Gradle script, several errors occurred at runtime. By explicitly setting Lingua's Kotlin version, these errors are now hopefully gone. (#88, #89)
Errors in the rule engine for the Latvian language have been resolved. (#92)

v1.0.3

3 years ago

Bug Fixes

When two languages had exactly the same confidence values, one of them was erroneously removed from the result map. Thanks to @mmedek for reporting this bug. (#72)
There was still a problem with the classification of texts consisting of certain alphabets. Thanks to @nicolabertoldi for reporting this bug. (#76)
The language detection for Spanish did not take the rarely used accented characters á, é, í, ó, ú and ü into account. Thanks to @joeporter for reporting this bug. (#73)
A bug in the rule engine led to weak detection accuracy for Macedonian and Serbian. This has been fixed.
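
Regarding the first fix in this list, the following sketch shows one way to rank languages by confidence without losing entries that share the same value. It is not taken from Lingua's source, and ConfidenceRanking is a hypothetical name. The assumption is that such a bug typically arises when a sorted collection compares entries by value alone and therefore treats ties as duplicates; the sketch avoids this by falling back to the language name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Lingua's API.
class ConfidenceRanking {

    // Returns the languages ordered by descending confidence. Languages with
    // equal confidence are all kept and ordered by name.
    static Map<String, Double> rank(Map<String, Double> confidences) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(confidences.entrySet());
        entries.sort((a, b) -> {
            int byConfidence = Double.compare(b.getValue(), a.getValue());
            return byConfidence != 0 ? byConfidence : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap preserves the sorted insertion order.
        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : entries) {
            ranked.put(entry.getKey(), entry.getValue());
        }
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Double> input = new LinkedHashMap<>();
        input.put("GERMAN", 0.5);
        input.put("ENGLISH", 1.0);
        input.put("DUTCH", 0.5);
        System.out.println(rank(input).keySet()); // prints [ENGLISH, DUTCH, GERMAN]
    }
}
```

Because the language remains the map key, two languages with the same confidence can never overwrite each other.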

Other Changes

The Kotlin compiler and runtime have been updated to version 1.4. This includes the current stable release 1.0.0 of the kotlinx-serialization framework.
The accuracy report files have been moved to their own Gradle source set. This allows for separate compilation of unit tests and accuracy report tests, leading to more flexible and slightly faster compilation.

v1.0.2

3 years ago

Bug Fixes

The language mapping for the character ë was incorrect. This has been fixed. Thanks to @sandernugterenedia for reporting this bug. (#66)
The implementation of LanguageDetector made use of functionality that was introduced in Java 8, which made the library unusable for Java 6 and 7. Thanks to @levant916 for reporting this bug. (#69)
The Gradle shadow plugin has been added so that ./gradlew jarWithDependencies produces a jar file whose dependencies no longer conflict with different versions of the same dependencies in the same project. (#67)

v1.0.1

3 years ago

Bug Fixes

If no ngram probabilities were found for a given input text, a NullPointerException would be thrown. Thanks to @fsonntag for finding and fixing this bug. (#63)
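
A defensive way to handle missing ngram probabilities can be sketched as follows. This is an illustration under assumptions, not the actual fix: NgramScorer and score are hypothetical names, and a language model is simplified to a map from ngram to probability. The sketch returns an empty result instead of dereferencing a missing value.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical illustration; not part of Lingua's API.
class NgramScorer {

    // Sums the log probabilities of all ngrams known to the model.
    // Returns an empty result if none of the ngrams is known.
    static OptionalDouble score(List<String> ngrams, Map<String, Double> model) {
        double sum = 0.0;
        boolean found = false;
        for (String ngram : ngrams) {
            Double probability = model.get(ngram); // null if the ngram is unknown
            if (probability != null) {
                sum += Math.log(probability);
                found = true;
            }
        }
        return found ? OptionalDouble.of(sum) : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        Map<String, Double> model = Collections.singletonMap("abc", 0.5);
        System.out.println(score(Arrays.asList("xyz"), model).isPresent()); // prints false
        System.out.println(score(Arrays.asList("abc"), model).isPresent()); // prints true
    }
}
```

A caller can map the empty result to an "unknown language" outcome.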

v1.0.0

3 years ago

Languages

Added 9 new languages, this time with a focus on Africa: Ganda, Shona, Sotho, Swahili, Tsonga, Tswana, Xhosa, Yoruba, Zulu.
Removed the language Norwegian in favor of Bokmal and Nynorsk. (#59)

Features

LanguageDetector can now provide confidence scores for each evaluated language. (#11)
The public API for creating language model files (LanguageModelFilesWriter) and test data files (TestDataFilesWriter) has been stabilized. (#37)
New convenience methods have been added to LanguageDetectorBuilder in order to build LanguageDetector from languages written in a certain script. (#61)

Improvements

The rule-based detection algorithm has been made less sensitive so that single words in a different language can no longer mislead it so easily.
The fastutil library has been added again to reduce memory consumption. (#58)
The language model-based algorithm has been optimized so that language detection performs approximately 25% faster now. (#58)
Support for the Kotlin linter ktlint has been added to help with a consistent coding style. (#47)
Third-party dependencies have been updated to their latest versions. (#36)

Bug Fixes

Incorrect regex character classes caused the library to not work properly on Android. (#32)

Test Coverage

Test coverage has been extended from 59% to 72%.

Documentation

The README contains a new section describing how users can add their own languages to Lingua.

Other changes

There is a breaking change in this release:

Methods with the prefix fromAllBuiltIn... have been renamed to fromAll... to make them more succinct and clear. (#61)

v0.6.1

4 years ago

Bug Fixes

The rule-based engine did not take language subset filtering from the public API into account. (#23)
It was possible to pass Language.UNKNOWN through the public API. (#24)
A bug in the rule-based engine's alphabet detection algorithm, which could be misled by single characters, has been fixed. (#25)