The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
There are some breaking changes in this release:
LanguageDetector.detectLanguagesOf(text: Iterable<String>) has been removed because the sorting order of the returned languages was undefined for input collections such as a HashSet. From now on, LanguageDetector.detectLanguageOf(text: String) is the only detection method to use.
LanguageDetector can now be built with the following additional methods:
LanguageDetectorBuilder.fromIsoCodes639_1(vararg isoCodes: IsoCode639_1)
LanguageDetectorBuilder.fromIsoCodes639_3(vararg isoCodes: IsoCode639_3)
LanguageDetectorBuilder.fromIsoCodes(isoCode: String, vararg isoCodes: String)
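One way to adapt to the removal of detectLanguagesOf() is to call the detector once per text and keep each result next to its input, so the outcome no longer depends on the iteration order of the collection. The Java sketch below is only an illustration: it uses a stand-in Function<String, String> in place of a real detector, and the class name PerTextDetectionSketch and the method detectAll are hypothetical, not part of Lingua.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Lingua.
class PerTextDetectionSketch {

    // Applies a single-text detection function to every text and stores
    // each result under the text it belongs to. The function parameter
    // stands in for a call such as detector.detectLanguageOf(text).
    static Map<String, String> detectAll(
            Iterable<String> texts,
            Function<String, String> detectLanguageOf) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String text : texts) {
            result.put(text, detectLanguageOf.apply(text));
        }
        return result;
    }
}
```

Because every result is looked up by its text, it does not matter whether the input is a List, a HashSet, or any other Iterable.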
The LanguageDetectorBuilder now supports the additional method withMinimumRelativeDistance(), which specifies the minimum distance required between the logarithmized and summed up probabilities of the possible languages. If two or more languages yield nearly the same probability for a given input text, the wrong language may well be returned. With a higher value for the minimum relative distance, Language.UNKNOWN is returned instead of risking false positives.
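The idea behind such a threshold can be illustrated with a small Java sketch. It is not Lingua's actual implementation: the class name, the use of plain strings for languages, and the use of the plain difference between the two best scores are assumptions made for illustration. How Lingua scales the distance internally is not described in these notes.

```java
import java.util.Map;

// Hypothetical illustration, not part of Lingua.
class MinimumDistanceSketch {

    static final String UNKNOWN = "UNKNOWN";

    // scores maps a language name to its summed log probability
    // (higher means more likely). The best language is returned only
    // if it beats the runner-up by at least minimumDistance.
    static String mostLikely(Map<String, Double> scores, double minimumDistance) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        double secondScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            double score = entry.getValue();
            if (score > bestScore) {
                secondScore = bestScore;
                bestScore = score;
                best = entry.getKey();
            } else if (score > secondScore) {
                secondScore = score;
            }
        }
        if (best == null) {
            return UNKNOWN; // no candidates at all
        }
        if (secondScore == Double.NEGATIVE_INFINITY) {
            return best; // only one candidate, nothing to compare against
        }
        return (bestScore - secondScore) >= minimumDistance ? best : UNKNOWN;
    }
}
```

With scores of -10.0 for one language and -10.5 for another, a threshold of 0.0 returns the first language, whereas a threshold of 1.0 returns UNKNOWN because the two candidates are too close.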
Test report generation can now use multiple CPU cores, running as many reports in parallel as there are CPU cores available. This has been implemented as an additional attribute for the respective Gradle task: ./gradlew writeAccuracyReports -PcpuCores=...
The REPL now lets you freely specify the languages you want to try out by entering the desired ISO 639-1 codes. Previously, it was only possible to choose between certain language combinations.
Thanks to the great work of contributor Bernhard Geisberger, two bugs could be fixed.
The fix in pull request #8 solves the problem of not being able to recreate the MapDB cache files automatically in case the data has been corrupted.
The fix in pull request #9 makes the class LanguageDetector completely thread-safe. Previously, in some rare cases, two threads could mutate one of the internal variables at the same time, yielding inaccurate language detection results.
Thank you, Bernhard.
This release took some time, but here it is.
Language models are now lazy-loaded into memory upon first access rather than when an instance of LanguageDetector is created. This way, if the rule-based engine can filter out some unlikely languages, their language models are not loaded into memory because they are not needed at that point. This further reduces overall memory consumption.
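The general technique of loading on first access can be sketched in Java as follows. This is a minimal illustration under assumed names (LazyModelsSketch, a loader function, string keys, double[] as the model type). It does not reproduce Lingua's internal code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration, not part of Lingua.
class LazyModelsSketch {

    private final Map<String, double[]> loaded = new ConcurrentHashMap<>();
    private final Function<String, double[]> loader;

    // Creating the object stores the loader but loads nothing.
    LazyModelsSketch(Function<String, double[]> loader) {
        this.loader = loader;
    }

    // Loads the model on first access and reuses it afterwards.
    // computeIfAbsent runs the loader at most once per key.
    double[] modelFor(String language) {
        return loaded.computeIfAbsent(language, loader);
    }

    // Number of models currently held in memory.
    int loadedCount() {
        return loaded.size();
    }
}
```

Languages that are never requested, for example because an earlier filtering step ruled them out, never trigger the loader and therefore never occupy memory.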
The fastutil library is used to compress the probability values of the language models in memory. They are now stored as primitive data types (double) instead of objects (Double), which reduces memory consumption by approximately 500 MB if all language models are selected.
This minor update fixes a critical bug reported in issue #1.
The bug caused a kotlin.KotlinNullPointerException. This has been fixed in this release. Instead, Language.UNKNOWN is now returned as expected.
This minor update contains some significant detection accuracy improvements.
LanguageDetectorBuilder.fromIsoCodes() now accepts vararg arguments instead of a List in order to have a consistent API with the other methods of LanguageDetectorBuilder.
If an ISO code that does not exist is passed to LanguageDetectorBuilder.fromIsoCodes(), an IllegalArgumentException is thrown. Previously, Language.UNKNOWN was returned. However, this could lead to bugs because a LanguageDetector with Language.UNKNOWN was built. This is now prevented.
This major release offers a lot of new features, including new languages. Finally! :-)
You can now specify the attribute language and do not need to run the reports for all languages anymore: mvn test -P accuracy-reports -D detector=lingua -D language=German
The public API is stored in com.github.pemistahl.lingua.api. Breaking changes therein are kept to a minimum in 0.*.* versions and will no longer be made starting from version 1.0.0. All other code is stored in com.github.pemistahl.lingua.internal and is subject to change without further notice.
The class com.github.pemistahl.lingua.api.LanguageDetectorBuilder is now responsible for building and configuring instances of com.github.pemistahl.lingua.api.LanguageDetector.
/accuracy-reports/accuracy-reports-analysis-notebook.ipynb
This minor version update provides the following:
This minor version update provides the following:
This release provides both new features and bug fixes. It is the first release that has been published to JCenter. Publication on Maven Central will follow soon.
mvn test -P accuracy-reports
LanguageDetector.detectLanguageFrom() has been renamed to LanguageDetector.detectLanguageOf() to use the grammatically correct English preposition.
In version 0.1.0, the method now called LanguageDetector.detectLanguageOf() returned null for strings whose language could not be detected reliably. Now, Language.UNKNOWN is returned instead in those cases to prevent NullPointerExceptions, especially in Java code.
This is the very first release of Lingua. It aims at accurate language detection results for both long and especially short text. Detection on short text fragments such as Twitter messages is a weak spot of many similar libraries.
Supported languages so far: