Bitextor generates translation memories from multilingual websites
- bicleaner and bifixer
- bifixer silent error output
- bifixer output when using flag --ignore_duplicates

Note: the bitextor-v7.2.1.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.2.1.zip tarball or cloning the repo at the v7.2.1 tag.
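
The check described in the note above can be sketched in Python. This is a minimal sketch and not part of Bitextor: the helper names submodule_command and init_submodules are hypothetical, and it assumes a cloned working tree can be recognised by a .git entry at its top level, which the source archives generated by GitHub do not contain.

```python
import subprocess
from pathlib import Path


def submodule_command(source_dir):
    """Return the git command that fetches submodules, or None.

    None means source_dir has no .git entry (for example, an extracted
    GitHub source archive), so the submodule command cannot run there.
    """
    if not (Path(source_dir) / ".git").exists():
        return None
    return ["git", "-C", str(source_dir),
            "submodule", "update", "--init", "--recursive"]


def init_submodules(source_dir):
    """Fetch submodules when possible; return True if the command was run."""
    command = submodule_command(source_dir)
    if command is None:
        return False
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(command, check=True)
    return True
```

Calling init_submodules on an extracted source archive returns False without invoking git, which is the case where the release tarball or a clone of the tag is the better starting point.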
Even young Luke Skywalker had to face some pythons on his way to Yoda! This release brings with it the Force, thanks to all the Jedis who helped:
- WARCFiles: ["/home/user/warc1.warc.gz", "/home/user/warc2.warc.gz"]
- .gz compressed format (records individually compressed).
- warc3-wet completely replaced with warcio (thanks @zuny26!)
- xzlang Snakefile parameter allows grouping preprocessing output by language (creating a separate file for each language found, not just LANG1 and LANG2) (thanks @zuny26!)
- giawarc WARC preprocessor (thanks @wwaites!)
- giawarc: true in Snakemake config parameters
- bitextor-warc2preprocess for the HTML/XML Python parser selectolax
- pdf-extract is now installed and used through its PyPI package (thanks @dionwiggins!)
- bleualign-cpp implementation is now an external dependency
- restorative-cleaning is deprecated. bifixer now replaces it! (thanks @mbanon!)
- giawarc in Dependencies
- kenlm instructions to install it from upstream (thanks @kpu!)

Note: the bitextor-v7.2.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.2.zip tarball or cloning the repo at the v7.2 tag.
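
The ".gz compressed format (records individually compressed)" item in this release presumably refers to the common WARC convention of storing every record as its own gzip member. The sketch below illustrates that layout with the Python standard library only. It is an assumption-laden illustration, not Bitextor or warcio code: the byte strings stand in for real WARC records, and the function names are hypothetical.

```python
import gzip
import zlib


def write_records(records):
    """Compress each record as a separate gzip member and concatenate them.

    Returns the combined bytes and the byte offset where each member starts.
    """
    chunks, offsets, position = [], [], 0
    for record in records:
        member = gzip.compress(record)
        offsets.append(position)
        chunks.append(member)
        position += len(member)
    return b"".join(chunks), offsets


def read_all(data):
    """Decompress the whole multi-member stream into one byte string."""
    return gzip.decompress(data)


def read_record_at(data, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return decompressor.decompress(data[offset:])
```

Because every member is a complete gzip stream, the concatenated file is still a valid .gz file, and a single record can be read from a known offset without decompressing what comes before it.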
- wget.
- python-pdfextract wrapper.
- idnum value in TMX when entries were duplicated.
- run-tests.sh being silent when any dependency is missing.

Note: the bitextor-v7.1.2.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.1.2.zip tarball or cloning the repo at the v7.1.2 tag.
- warc3-wet dependency. Still used in several options of Bitextor.

Note: the bitextor-v7.1.1.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.1.1.zip tarball or cloning the repo at the v7.1.1 tag.
"Snakes. Why'd it have to be snakes?", Indy
- warc3-wet WARC library with the more robust warcio (thanks @PinzhenChen!)
- wget as a new crawler with WARC support.
- wget, creepy or httrack in "crawler" Snakemake config option.
- pdftohtml and pdf-extract (thanks @dionwiggins!)
- pdf-converter: pdftohtml or pdf-converter: pdf-extract.
- wget support for crawling PDF and HTML files.
- crawler: wget and crawlFileTypes: "html,pdf" settings for that.
- requirements.txt versions, fixing issues with Python 3.7 (thanks @wwaites!).
- restorative-cleaning now supports Cyrillic languages and Slavic diacritics, and provides better fixing of mojibake (thanks @mbanon!)
- robots.txt 'Crawl-delay' value
- bleualign-cpp issues with swapped arguments and output scores (thanks @mespla!).
- features/bitextor-structuredistance.py crashing with non HTML/XML files (thanks @txellgb!)

Note: the bitextor-v7.1.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.1.zip tarball or cloning the repo at the v7.1 tag.
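
The crawler and PDF options listed for this release can be combined into one configuration. The following is a minimal sketch under stated assumptions: the option names and values (crawler, crawlFileTypes, pdf-converter) are taken from the list above, while the helper build_crawl_config is hypothetical, the JSON output relies on Snakemake accepting JSON as well as YAML config files, and every other option a real Bitextor run needs is left out.

```python
import json

# Values named in the release notes for each option.
CRAWLERS = ("wget", "creepy", "httrack")
PDF_CONVERTERS = ("pdftohtml", "pdf-extract")


def build_crawl_config(crawler="wget", file_types=("html", "pdf"),
                       pdf_converter="pdf-extract"):
    """Build the crawl-related part of a config, rejecting unknown values."""
    if crawler not in CRAWLERS:
        raise ValueError(f"unknown crawler: {crawler!r}")
    if pdf_converter not in PDF_CONVERTERS:
        raise ValueError(f"unknown pdf-converter: {pdf_converter!r}")
    return {
        "crawler": crawler,
        # The notes show the file types as one comma-separated string.
        "crawlFileTypes": ",".join(file_types),
        "pdf-converter": pdf_converter,
    }


if __name__ == "__main__":
    print(json.dumps(build_crawl_config(), indent=2))
```

Validating the values up front keeps a typo such as "httrac" from surfacing only after the pipeline has started.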
What a release! We redesigned Bitextor to be more flexible and scalable, running it like make, but in Python. Say thanks to snakemake!
Furthermore, we made all Python scripts compatible with Python 3.5-3.8, which is good news for long-term support. Let's take a look at the full changelog:
- bitextor.sh reworked.
- snakemake/Snakefile.
- snakemake/nmt/Snakefile.
- snakemake/example/tests.
- ftfy.
- bleualign (C++ implementation) as an alternative sentence aligner.
- pip packages.
- README.md.

Note: the bitextor-v7.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v7.zip tarball or cloning the repo at the v7 tag or the 7.x branch.
Just a small bugfix for our rock-solid Bitextor 6. Thanks for all your feedback!
Note that we will support Bitextor 6 until Bitextor 7 is released.
- sed "\s" group (spaces) when rearranging hunalign dictionaries (thanks @JiaYueHuang!)

Note: the bitextor-v6.0.1.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v6.0.1.zip tarball or cloning the repo at the v6.0.1 tag or the 6.x branch (for the latest stable release of series 6).
Here it is! The final release of Bitextor 6! Hope you enjoy this nearly-stable release, as we expect this to be the last one before we get into Bitextor 7 development with full Python 3 support (and Python 2 deprecation), stand-off annotation and more goodies.
Note that we will support Bitextor 6 until Bitextor 7 is released.
- bitextor-crawl (thanks @bhaddow!)
- (--ignore-boilerpipe-cleaning or -B) now also working with HTTrack (thanks @mespla!)
- kpu/preprocess submodule to deal with Moses dependencies, and kpu/kenlm repo for latest Zipporah (thanks @kpu!)
- bitextor-ett2lett script and Python language detection part refactored to bitextor-lett-language-detector (thanks @hieuhoang!)

Note: the bitextor-v6.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v6.zip tarball or cloning the repo at the v6 tag.
After so much feedback on v6.0.0-rc.1, we fixed lots of issues, so we are releasing v6.0.0-rc.2! The wiki has also been updated.
- README.md about issues with specific versions of TensorFlow and AMD CPUs
- requirements.txt
- httrack now downloads redirected pages (thanks @hieuhoang!)
- .jar files from the repository. Now they are downloaded using mvn (Maven), which is now a new dependency
- (-v or --vocabulary arguments are no longer mandatory if --paracrawl-aligner-command is used)

Note: the bitextor-v6.0.0-rc.2.zip tarball includes the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages generated by GitHub, so we recommend the bitextor-v6.0.0-rc.2.zip tarball or cloning the repo.
Hi there! Here we go with v6.0.0-rc.1 of Bitextor. This release is related to the code release of the ParaCrawl project. There are lots of changes since Bitextor v5.0, and it is the first release since we moved to GitHub.
- README.md with new dependencies, commands and troubleshooting
- tika input/output management
- nltk as sentence splitter
- bitextor to control most parts of the pipeline and long named versions of them (see --help)
- mkcls with clustercat and giza-pp with mgiza
- bitextor. See README.md.
- bicleaner and zipporah classifiers and thresholds for filtering
- httrack as alternative crawler
- (--jhu-lett)
- (--jhu-aligner-command TRANSLATIONCOMMAND)

Note: the bitextor-v6.0.0-rc.1.zip tarball does not include the submodule code. If you build the project from this tarball, you first need to run git submodule update --init --recursive. You can't run this command on the source code .tar.gz and .zip packages, so we recommend the bitextor-v6.0.0-rc.1.zip tarball or cloning the repo.