Bitextor Versions

Bitextor generates translation memories from multilingual websites

v7.2.1

4 years ago
  • Updated submodules
    • Fixed and updated requirements from bicleaner and bifixer
    • Fixed bifixer silent error output
    • Fixed bifixer output when using flag --ignore_duplicates
  • Fixed possible tabs in URLs from malformed WARCs
  • Improved documentation

Note: the bitextor-v7.2.1.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.2.1.zip tarball or cloning the repo at the v7.2.1 tag.

v7.2

4 years ago

Even young Luke Skywalker had to face some pythons on his way to Yoda! This release brings with it the Force, thanks to all the Jedis who helped:

v7.2 Changelog

  • Now you can set a list of WARC files in addition to the URLs as input for Bitextor (thanks @zuny26!)
    • For example, in the Snakemake config file:
      • WARCFiles: ["/home/user/warc1.warc.gz", "/home/user/warc2.warc.gz"]
  • Switched to the WARC standard .gz compressed format (records individually compressed).
  • warc3-wet completely replaced with warcio (thanks @zuny26!)
  • The xzlang Snakefile parameter allows grouping preprocessing output by language (creating a separate file for each language found, not just LANG1 and LANG2) (thanks @zuny26!)
    • This avoids repeating the preprocessing step when processing the same domain for different language pairs.
  • Added support for giawarc WARC preprocessor (thanks @wwaites!)
    • Activate it using giawarc: true in Snakemake config parameters
    • Installation instructions in README.md
  • Added support in bitextor-warc2preprocess for the HTML/XML Python parser selectolax
    • Select the parser with the parser option in the Snakemake config. Options are 'alcazar', 'bs4' (default) and 'modest'.
      • NOTE: this option has no effect when giawarc: true or xzlang: true is set
  • pdf-extract is now installed and used via its PyPI package (thanks @dionwiggins!)
  • Fixed sentence splitter in MT-based document alignment for the target language (thanks @kirefu!)
  • bleualign-cpp implementation is now an external dependency
  • Replaced MD5 with MurmurHash3 in Creepy crawler, WARC preprocessors and deferred module
  • Some PEP8 code compliance changes and code cleaning using latest Snakecharm
  • Updated submodules
    • restorative-cleaning is deprecated. Now bifixer is replacing it! (thanks @mbanon!)
  • Updated documentation.
  • General system stability improvements to enhance the user's experience.
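
The "records individually compressed" layout mentioned in the changelog above can be illustrated with a minimal sketch that uses only the Python standard library. The two byte strings are stand-ins for real WARC records (an assumption for illustration), and this is not Bitextor's or warcio's actual code; it only shows why per-record gzip members are convenient.

```python
import gzip

# Two stand-in "records"; real WARC records carry headers and a payload.
records = [b"record one\n", b"record two\n"]

# Compress each record as its own gzip member, then concatenate the members.
members = [gzip.compress(record) for record in records]
archive = b"".join(members)

# The concatenation is still a valid gzip stream: decompressing it yields
# the records back to back.
assert gzip.decompress(archive) == b"".join(records)

# Every member is self-contained, so a reader that knows a member's byte
# offset can decompress that record without reading the ones before it.
offset = len(members[0])
second = gzip.decompress(archive[offset:])
```

Compressing the whole file as a single gzip stream would not allow this kind of per-record access, which is why tools that index WARC files generally expect one gzip member per record.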

Note: the bitextor-v7.2.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.2.zip tarball or cloning the repo at the v7.2 tag.

v7.1.2

4 years ago
  • Fixed input in idx2ridx.
  • Fixed WARC arbitrary compression using wget.
  • Fixed inefficient PDFextract calls (the JVM was started for every PDF) by using the python-pdfextract wrapper.
  • Fixed idnum value in TMX when entries were duplicated.
  • Fixed some missing dependencies.
  • Fixed run-tests.sh being silent when any dependency is missing.
  • Updated submodules.

Note: the bitextor-v7.1.2.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.1.2.zip tarball or cloning the repo at the v7.1.2 tag.

v7.1.1

4 years ago
  • Bugfix for WARC preprocessing crash using WARC files without HTTP response in the payload.
  • Bugfix for the warc3-wet dependency, which is still used by several Bitextor options.

Note: the bitextor-v7.1.1.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.1.1.zip tarball or cloning the repo at the v7.1.1 tag.

v7.1

4 years ago

"Snakes. Why'd it have to be snakes?", Indy

v7.1 Changelog

  • Deleted html5lib parsing in WARC preprocess for improved performance.
  • Added WARC record size limit (thanks @hieuhoang!).
  • Fixed duplicate detection in WARC (thanks @hisashi-ito!).
  • Replaced warc3-wet WARC library with the more robust warcio (thanks @PinzhenChen!)
  • Added wget as a new crawler with WARC support.
    • Choose the crawler by setting the "crawler" Snakemake config option to wget, creepy or httrack.
  • Added PDF detection and extraction support from WARC records using pdftohtml and pdf-extract (thanks @dionwiggins!)
    • Choose the PDF converter engine with Snakemake config parameter pdf-converter: pdftohtml or pdf-converter: pdf-extract.
    • Only wget supports crawling both PDF and HTML files.
      • Use crawler: wget and crawlFileTypes: "html,pdf" settings for that.
  • Added OpenDocument, Office Open XML and ePub formats detection and extraction from WARC records.
  • Added filters for unsupported documents in WARC file preprocessing (thanks @wwaites!)
  • Added back support for deferred crawling.
  • Now there is no default tokenizer and sentence splitter. Specifying it is mandatory.
  • Improved MT-based document aligner, with better performance and more accurate default options.
  • Updated dependencies and Python requirements.txt versions, fixing issues with Python 3.7 (thanks @wwaites!).
  • Updated submodules
    • restorative-cleaning now supports Cyrillic languages and Slavic diacritics, and provides better fixing of mojibake (thanks @mbanon!)
  • 'Crawl-delay' is now parsed.
    • Crawlers will use the highest given delay:
      • Snakemake default
      • User defined
      • robots.txt 'Crawl-delay' value
  • Fixed bleualign-cpp issues with swapped arguments and output scores (thanks @mespla!).
  • Fixed bug in features/bitextor-structuredistance.py crashing with non HTML/XML files (thanks @txellgb!)
  • Added feature request, bug report and Code of Conduct documents.
  • Updated documentation.
  • General system stability improvements to enhance the user's experience.
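
The Crawl-delay rule listed in this changelog (use the highest of the Snakemake default, the user-defined value and the robots.txt value) could be sketched as follows with Python's standard urllib.robotparser. The function name, its parameters and the sample robots.txt are hypothetical and chosen for illustration; Bitextor's crawlers may implement this differently.

```python
from urllib.robotparser import RobotFileParser


def effective_crawl_delay(robots_txt, user_agent, default_delay, user_delay=None):
    """Return the largest of the default, user-defined and robots.txt delays."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    robots_delay = parser.crawl_delay(user_agent)
    candidates = [default_delay, user_delay, robots_delay]
    return max(d for d in candidates if d is not None)


# Hypothetical robots.txt content for illustration.
robots = """User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

delay = effective_crawl_delay(robots, "examplebot", default_delay=5, user_delay=2)
```

Taking the maximum means a site's requested delay is never undercut by a more aggressive local setting, while a larger local setting still applies when robots.txt asks for less.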

Note: the bitextor-v7.1.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.1.zip tarball or cloning the repo at the v7.1 tag.

v7

5 years ago

What a release! We redesigned Bitextor to be more flexible and scalable, running it like make, but in Python. Say thanks to Snakemake!

Furthermore, we made all Python scripts compatible with Python 3.5-3.8, which is good news for long-term support. Let's take a look at the full changelog:

v7 Changelog

  • bitextor.sh reworked.
    • Now it calls rules from snakemake/Snakefile.
    • It also uses the NMT rules from snakemake/nmt/Snakefile when translation-based document alignment is used.
      • Example YAML config files at snakemake/example/tests.
  • Crawling output format changed to WARC.
  • Reworked file formats.
    • The ETT, LETT and LETTR formats do not exist anymore. Now, there is one file per column to avoid redundancy.
  • Sentence and word tokeniser paths can be specified by the user.
  • Deleted Zipporah and Ulysses.
  • Replaced Tika with Python3 ftfy.
  • Added optional restorative cleaning.
  • Updated installation instructions.
  • Added bleualign (C++ implementation) as an alternative sentence aligner.
  • Reworked Makefile.
    • No need to install Bitextor to run it (or any of the included scripts).
  • Updated submodules, APT dependencies and pip packages.
  • PEP 8 style guidelines in Python scripts.
  • Fully compatible with Python 3.
  • Added computational requirements documentation and reworked README.md.
  • General system stability improvements to enhance the user's experience.

Note: the bitextor-v7.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.zip tarball or cloning the repo at the v7 tag or the 7.x branch.

v6.0.1

5 years ago

Just a small bugfix for our rock-solid Bitextor 6. Thanks for all your feedback!

Note that we will support Bitextor 6 until Bitextor 7 is released.

How do I install Bitextor?

How do I run Bitextor?

Any example to check if it is working?

v6.0.1 Changelog

  • Fix sed "\s" group (spaces) when rearranging hunalign dictionaries (thanks @JiaYueHuang!)

Note: the bitextor-v6.0.1.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v6.0.1.zip tarball or cloning the repo at the v6.0.1 tag or the 6.x branch (for the latest stable release of the 6 series).

v6

5 years ago

Here it is! The final release of Bitextor 6! Hope you enjoy this nearly-stable release, as we expect this to be the last one before we get into Bitextor 7 development with full Python 3 support (and Python 2 deprecation), stand-off annotation and more goodies.

Note that we will support Bitextor 6 until Bitextor 7 is released.

v6 Changelog

  • Assume http if no protocol given with URL to Creepy crawler bitextor-crawl (thanks @bhaddow!)
  • The option to skip boilerplate cleaning (--ignore-boilerpipe-cleaning or -B) now also works with HTTrack (thanks @mespla!)
  • Added kpu/preprocess submodule to deal with Moses dependencies, and kpu/kenlm repo for latest Zipporah (thanks @kpu!)
  • Restored the old I/O behavior of the bitextor-ett2lett script and refactored its Python language detection part into bitextor-lett-language-detector (thanks @hieuhoang!)
  • Some checks, renames, minor changes and stability fixes
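
The "assume http if no protocol given" behaviour from this changelog can be sketched in a few lines. The helper name and regular expression below are illustrative assumptions, not the code used by bitextor-crawl.

```python
import re

# Matches a leading scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def with_default_scheme(url, default="http"):
    """Prefix the URL with a default scheme when it does not specify one."""
    url = url.strip()
    if _SCHEME.match(url):
        return url
    return f"{default}://{url}"


crawl_target = with_default_scheme("example.com/page")
```

Checking for an explicit "://" rather than relying on a generic URL parser avoids misreading inputs such as host:port as a scheme.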

Note: the bitextor-v6.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v6.zip tarball or cloning the repo at the v6 tag.

v6.0.0-rc.2

5 years ago

After so much feedback on v6.0.0-rc.1, we fixed lots of issues, so we are releasing v6.0.0-rc.2! The wiki has also been updated.

v6.0.0-rc.2 Changelog

  • Added information to the README.md about issues with specific TensorFlow versions on AMD CPUs
  • Added requirements.txt
  • Improved error and logs management (thanks @bhaddow!)
  • httrack now downloads redirected pages (thanks @hieuhoang!)
  • Updated all submodules to fix licensing problems
  • Translated and expanded some ancient comments and variables
  • Moved some inline/embedded Python code blocks to script files
  • Shifted to NLTK as a default sentence tokeniser instead of Ulysses
  • Deleted Java .jar files from the repository. They are now downloaded using mvn (Maven), which is a new dependency
  • Fixed use of Paracrawl Document Aligner without dictionary file (-v or --vocabulary arguments are not mandatory anymore if --paracrawl-aligner-command is used)
  • Some checks, renames, minor changes and stability fixes

Note: the bitextor-v6.0.0-rc.2.zip tarball does include the submodule code. If you build the project from a clone of the repository, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v6.0.0-rc.2.zip tarball or cloning the repo.

v6.0.0-rc.1

5 years ago

Hi there! Here we go with v6.0.0-rc.1 of Bitextor. This release is related to the code release of the Paracrawl project. There have been lots of changes since Bitextor v5.0, and it is the first release since we moved to GitHub.

v6.0.0-rc.1 Changelog

  • Updated documentation and README.md with new dependencies, commands and troubleshooting
  • Added original repositories for most of the compiled dependencies (mgiza, clustercat, bicleaner...)
  • Fixed encoding errors in tika input/output management
  • Added option to use nltk as sentence splitter
  • Added lots of parameters and options to bitextor for controlling most parts of the pipeline, along with long-named versions of them (see --help)
  • Replaced mkcls with clustercat and giza-pp with mgiza
  • Added option for a config file in bitextor. See README.md.
  • Added ELRC metrics and filters
  • Added bicleaner and zipporah classifiers and thresholds for filtering
  • Added httrack as alternative crawler
  • Added a JHU processing script for processing crawler content (option --jhu-lett)
  • Added an alternative, translation-based document aligner (Paracrawl) (option --jhu-aligner-command TRANSLATIONCOMMAND)
  • Minor changes and bugfixes

Note: the bitextor-v6.0.0-rc.1.zip tarball does not include the submodule code. If you build the project from this tarball, you first need to run git submodule update --init --recursive. This command cannot be run on the source code .tar.gz and .zip packages, so we recommend using the bitextor-v6.0.0-rc.1.zip tarball or cloning the repo.