Bitextor generates translation memories from multilingual websites
"I've seen things you people wouldn't believe.", Roy Batty, The Preverticant
`paragraphIdentification` by @lpla in https://github.com/bitextor/bitextor/pull/241
`directories` and `directoriesFile` documentation, by @aarongaliano in https://github.com/bitextor/bitextor/pull/247
`PDFprocessing` option (previously `PDFextract`). It is now a list that lets you choose whether to use pdf2html, pdfextract or Apache Tika (new PDF processor), by @aarongaliano in https://github.com/bitextor/bitextor/pull/247
`multilang` option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in https://github.com/bitextor/bitextor/pull/247
`bicleanerExtraArgs` to pass extra arguments to Bicleaner(-AI), by @lpla in https://github.com/bitextor/bitextor/pull/250
Full Changelog: https://github.com/bitextor/bitextor/compare/v8.2...v8.3
The `bitextor-v8.3.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.3.zip` tarball or cloning the repo at the `v8.3` tag.
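As a minimal sketch of the two routes described in this notice, the shell functions below either clone a tagged release together with its submodules, or initialize the submodules of an existing clone. The function names and the default destination directory are illustrative and not part of Bitextor; the repository URL is the one used in the changelog links.

```shell
# Hypothetical helpers (names are illustrative, not part of Bitextor).

# Clone a tagged release and fetch its submodules in one step.
clone_bitextor_tag() {
    if [ -z "${1:-}" ]; then
        echo "usage: clone_bitextor_tag <tag> [destination]" >&2
        return 2
    fi
    local tag="$1"
    local dest="${2:-bitextor-$tag}"
    git clone --branch "$tag" --recurse-submodules \
        https://github.com/bitextor/bitextor.git "$dest"
}

# For a clone made without --recurse-submodules, fetch the submodules afterwards.
# The optional argument is the path to the clone (defaults to the current directory).
init_bitextor_submodules() {
    git -C "${1:-.}" submodule update --init --recursive
}
```

For example, `clone_bitextor_tag v8.3` would create a `bitextor-v8.3` directory, while `init_bitextor_submodules path/to/clone` covers a repository that was cloned without its submodules.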
We will support the Bitextor 8.x branch until the next major version is released.
"I told you to run.", The Huntsman
Full Changelog: https://github.com/bitextor/bitextor/compare/v8.1.1...v8.2
The `bitextor-v8.2.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.2.zip` tarball or cloning the repo at the `v8.2` tag.
We will support the Bitextor 8.x branch until the next major version is released.
`dnf` commands.
`tests/run-tests.sh` to run those tests either sequentially (low-resource server, using bash variable `CI="true"`) or in parallel.
`wget` crawler, as it has issues with URLs without extension.
`bicleaner` will enable or disable Bicleaner, and `bicleanerModel` will contain the path to the model.
`bicleanerGenerateModel` instead of checking whether the model provided through the `bicleanerModel` config setting exists.
`generateDic` instead of checking whether the dictionary exists.

The `bitextor-v8.1.1.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.1.1.zip` tarball or cloning the repo at the `v8.1.1` tag.
We will support the Bitextor 8.x branch until the next major version is released.
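The notes for this release mention that `tests/run-tests.sh` runs sequentially when the bash variable `CI="true"` is set. A minimal sketch of such an invocation follows. The wrapper name is hypothetical, it assumes the current directory is the root of a Bitextor checkout, and extra arguments are passed through unchanged because the script's own options are not described here.

```shell
# Hypothetical wrapper (name is illustrative, not part of Bitextor):
# run the test suite in sequential mode from the repository root.
run_tests_sequential() {
    # CI="true" is set only for this one command, not for the calling shell.
    CI="true" bash tests/run-tests.sh "$@"
}
```

Setting the variable inline keeps it out of the rest of the shell session, so later runs without it use whatever mode the script selects when `CI` is not set.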
"Oh my God! A snake! Help me!", Dr. Robert Burke
The `bitextor-v8.1.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.1.zip` tarball or cloning the repo at the `v8.1` tag.
We will support the Bitextor 8.x branch until the next major version is released.
`bitextor-buildTMX.py` dedup option
`warc2text` removed URLs lowercasing

The `bitextor-v8.0.1.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.0.1.zip` tarball or cloning the repo at the `v8.0.1` tag.
We will support the Bitextor 8.x branch until the next major version is released.
"We have unfinished business.", Beatrix
`deferred-annotation-reconstructor.sh`
`giawarc` (now deprecated) and `warc2preprocess`.
`warc2preprocess`.
`bitextor-warc2htmlwarc.py` and `bitextor-warc2preprocess.py`:
  `lxml` text extraction parsing library option, and `html5lib` as optional and additional parsing
  `html5lib` is the cleanest supported parser but also the slowest
`alcazar` as all code and references from upstream vanished.
`ftfy` is now disabled by default.
`document-aligner` folder)
`slurm`, `nmt` workflow for MarianNMT or `pdf-extract` (replaced by wrappers in WARC processors).
`edge` tag in Dockerhub).

The `bitextor-v8.0.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.0.zip` tarball or cloning the repo at the `v8.0` tag.
We will support the Bitextor 8.x branch until the next major version is released.
`hunalign` options and rules, will be integrated soon.
`lxml` text extraction parsing library option to `bitextor-warc2htmlwarc.py` and `html5lib` optional and additional parsing.
`warc2preprocess` preprocessor.
`html5lib` is the cleanest supported parser (like a web browser) but also the slowest.
`simple` text extraction parser in `bitextor-warc2preprocess.py` for some table tags and new HTML5 tags.
`ftfy` is now disabled by default.
`wordTokenizers` and `sentenceSplitters` are not defined.
`requirements.txt` before installing them.
`bleualign-cpp` code.
`kpu/preprocess` and `cache` fixes.

Docker image will be updated once v8.0.0 gets released.
The `bitextor-v8.0.0-pre.zip` tarball includes submodule code, but you still need to compile binaries such as bleualign. If you compile the project after cloning the git repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v8.0.0-pre.zip` tarball or cloning the repo at the `v8.0.0-pre` tag.
`warc2htmlwarc.py` optional non-compressed output.
`bicleaner` and `bifixer` cached call from Bitextor, improving performance.
`heritrix` waiting time while creating initial crawling files.
`bicleaner` cached call by creating a Bitextor optional parameter called `bicleanerCacheWithSents`.
`cld3` input in `bitextor-warc2preprocess.py`, causing most documents to be detected as 'English'.
`<span>` by adding a space after their content, in the `warc2preprocess` text extractor `simple`.
`requirements.txt` for security and dependency issues.
`v7.3.2`.

We started integrating Bitextor 8.0 development branches into the `master` branch. If you don't need the latest features and prefer more stable code, please use released versions/tags or the stable branch `7.x`.
The `bitextor-v7.3.2.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v7.3.2.zip` tarball or cloning the repo at the `v7.3.2` tag.
We will support the Bitextor 7.x branch until Bitextor 8 is released.
`/home/user`) when used in config files
`bitextor-warc2htmlwarc.py` stdin and stdout run mode.

The `bitextor-v7.3.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v7.3.zip` tarball or cloning the repo at the `v7.3` tag.
We will support the Bitextor 7.x branch until Bitextor 8 is released.
"Always look on the end (side) of life", PEP-373
`plainTextHashes` option (incremental recrawling using mmh3).
`cld3` in both `giawarc` and `warc2preprocess` WARC processors, with optional install and use instructions.
`onlyPreprocessing`, `preprocessLangs` and `targetLangs` to allow processing more than two languages in the same run.
`wordTokenizers` and `sentenceSplitters` and added new ones like `reverseOutputPair`.
`hunalign` sentence alignment.
`dataDir` as a folder with the data produced during the WARC preprocessing step.

The `bitextor-v7.3.zip` tarball includes submodule code and binaries. If you compile the project after cloning the repository, first run `git submodule update --init --recursive`. This command cannot be run on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend using the `bitextor-v7.3.zip` tarball or cloning the repo at the `v7.3` tag.
We will support the Bitextor 7.x branch until Bitextor 8 is released.