Bitextor Versions

Bitextor generates translation memories from multilingual websites

v8.3

11 months ago

"I've seen things you people wouldn't believe.", Roy Batty, The Preverticant

What's Changed

New Contributors

Full Changelog: https://github.com/bitextor/bitextor/compare/v8.2...v8.3

Notes

The bitextor-v8.3.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.3.zip archive or cloning the repository at the v8.3 tag.

We will support the Bitextor 8.x branch until the next major version is released.
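
As a rough sketch of the cloning route described in the note above, the following Python helper builds the two git commands involved. The clone URL is assumed from the changelog link, and the helper name build_clone_commands and the target directory are hypothetical. Nothing is executed here; the commands are only printed.

```python
import shlex

# Assumed clone URL, derived from the changelog link in these notes.
REPO_URL = "https://github.com/bitextor/bitextor.git"


def build_clone_commands(tag, target_dir="bitextor"):
    """Return the git commands to clone a release tag and fetch its submodules."""
    return [
        # Clone the repository and check out the requested tag.
        ["git", "clone", "--branch", tag, REPO_URL, target_dir],
        # Fetch the submodule code, as the release note asks.
        ["git", "-C", target_dir, "submodule", "update", "--init", "--recursive"],
    ]


if __name__ == "__main__":
    for cmd in build_clone_commands("v8.3"):
        print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to subprocess.run(cmd, check=True).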

v8.2

2 years ago

"I told you to run.", The Huntsman

What's Changed

Full Changelog: https://github.com/bitextor/bitextor/compare/v8.1.1...v8.2

Notes

The bitextor-v8.2.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.2.zip archive or cloning the repository at the v8.2 tag.

We will support the Bitextor 8.x branch until the next major version is released.

v8.1.1

2 years ago
  • Added support for Fedora installation. Check INSTALL.md for dnf commands.
  • Fixed tests/run-tests.sh so that the tests can run either sequentially (for low-resource servers, using the bash variable CI="true") or in parallel.
  • Removed default file type filter in wget crawler, as it has issues with URLs without extension.
  • Bicleaner model training and dictionary generation options reworked:
    • bicleaner enables or disables Bicleaner, and bicleanerModel contains the path to the model.
    • Bicleaner model training must be explicitly enabled with bicleanerGenerateModel, instead of being triggered by checking whether the model given in the bicleanerModel config setting exists.
    • Dictionary generation must be enabled through generateDic, instead of being triggered by checking whether the dictionary exists.
  • Updated Python requirements.
  • Minor bug fixes.
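
The reworked Bicleaner and dictionary options above can be pictured with a minimal Python sketch. The four option names come from the changelog; the boolean values, the placeholder model path, the step names and the plan_steps helper are assumptions made for illustration, and real Bitextor configuration files may look different.

```python
def plan_steps(config):
    """Decide which optional steps to run from explicit flags only.

    Hypothetical helper: it never checks whether files exist on disk,
    mirroring the explicit-flag behaviour described in the changelog.
    """
    steps = []
    if config.get("generateDic", False):
        steps.append("generate-dictionary")
    if config.get("bicleaner", False):
        if not config.get("bicleanerModel"):
            raise ValueError("bicleaner is enabled but bicleanerModel is not set")
        if config.get("bicleanerGenerateModel", False):
            steps.append("train-bicleaner-model")
        steps.append("run-bicleaner")
    return steps


example = {
    "bicleaner": True,
    "bicleanerModel": "/path/to/model.yaml",  # placeholder path
    "bicleanerGenerateModel": False,
    "generateDic": True,
}

if __name__ == "__main__":
    print(plan_steps(example))
```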

Notes

The bitextor-v8.1.1.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.1.1.zip archive or cloning the repository at the v8.1.1 tag.

We will support the Bitextor 8.x branch until the next major version is released.

v8.1

2 years ago

"Oh my God! A snake! Help me!", Dr. Robert Burke

v8.1 Changelog

  • Major rework on paths and installation folders to allow Bitextor to be installed in a specific location
    • Check out installation instructions and details in INSTALL.md
  • Replaced Tensorflow and Keras in the dictionary-based document aligner with scikit-learn
  • General clean up of Python code
  • Updated submodules and Python requirements versions

Notes

The bitextor-v8.1.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.1.zip archive or cloning the repository at the v8.1 tag.

We will support the Bitextor 8.x branch until the next major version is released.

v8.0.1

2 years ago
  • The deferred crawling standoff annotation reconstruction script has been rewritten for better performance
    • It uses an LRU dict as a limited-size, in-memory hash cache
    • It uses native warcio and the Moses sentence splitter (Python port)
  • Fixed the bitextor-buildTMX.py dedup option
    • Dedup was keeping the sentence strings from the occurrence with the best Bifixer score, but the other columns (URL, deferred crawling standoff annotation, Bicleaner score...) from the last occurrence
  • Bitextor now checks whether a provided host is valid
  • Updated submodules
    • warc2text no longer lowercases URLs
  • Added more tests to the CI, including Bitextor with deferred crawling standoff annotation and its reconstruction.
  • Updated requirements and submodules to their latest stable version.
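
The dedup fix in the list above comes down to taking every column from the same occurrence. The following is a minimal sketch of that idea, not the code in bitextor-buildTMX.py: the column names (hash, score, src, url) and the dedup_keep_best helper are hypothetical.

```python
def dedup_keep_best(rows, key="hash", score="score"):
    """Keep one whole row per distinct key: the one with the highest score.

    On ties the earliest row wins. All columns come from the same row,
    so text and metadata cannot be mixed across occurrences.
    """
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or row[score] > current[score]:
            best[row[key]] = row
    return list(best.values())


rows = [
    {"hash": "h1", "score": 0.9, "src": "Hello world", "url": "http://a.example/1"},
    {"hash": "h1", "score": 0.5, "src": "hello world", "url": "http://b.example/2"},
    {"hash": "h2", "score": 0.7, "src": "Bye", "url": "http://c.example/3"},
]

if __name__ == "__main__":
    for row in dedup_keep_best(rows):
        print(row["src"], row["url"])
```

With the buggy behaviour described above, the first group would have paired the text "Hello world" with the URL of the second occurrence; here both come from the best-scoring row.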

Notes

The bitextor-v8.0.1.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.1.zip archive or cloning the repository at the v8.0.1 tag.

We will support the Bitextor 8.x branch until the next major version is released.

v8.0.0

3 years ago

"We have unfinished business.", Beatrix

v8.0 Changelog

  • Deep rewrite of Bitextor Snakefile for a vast performance improvement.
  • Added a new crawler: linguacrawl, specialized in full TLD crawling.
  • Added a new method for deferred crawling only using Murmurhash hashes at the sentence alignment step.
    • A reconstructor is also provided: deferred-annotation-reconstructor.sh
  • Added sharding, which groups domains into 1 GB shards for a more balanced job running, done via giashard (Golang Internet Archive SHARDing).
  • A new WARC processor has been implemented in C++: warc2text
    • It is faster than the previous text extraction tool giawarc (now deprecated) and warc2preprocess.
    • Although it has the same features as giawarc, it still lacks features like PDF processing or boilerplate removal that are available in warc2preprocess.
  • Multiple improvements to bitextor-warc2htmlwarc.py and bitextor-warc2preprocess.py:
    • Added lxml as a text extraction parsing library option, and html5lib as an optional additional parser
      • html5lib is the cleanest supported parser but also the slowest
    • Deleted alcazar as all code and references from upstream vanished.
    • Fixed ‘simple’ text extraction parser for some table tags and new HTML5 tags.
    • ftfy is now disabled by default.
  • New translation-based document aligner written in C++ (document-aligner folder)
    • Faster and with lower memory requirements than the previous Python code.
  • Moses tokenizers are now used by default through an efficient wrapper.
    • This will run by default if "wordTokenizers" is not defined in Bitextor configuration.
    • This is the recommended option if your language is supported by Moses.
  • Moses sentence splitter original script has been replaced with a faster port by Mediacloud.
    • This will run by default if "sentenceSplitters" is not defined in Bitextor configuration.
    • This is the recommended option if your language is supported by the latest Moses release version of the sentence splitter script.
  • Added support for Biroamer
  • Deprecated autotools and replaced them with CMake.
  • Refactored and updated requirements and submodules for lots of performance and security improvements.
    • You can now skip the Python dependencies of modules you don't need to run by commenting out those lines in requirements.txt before installing them.
    • Updated Snakemake to v6.0.5.
    • Refactored bleualign-cpp code to improve efficiency and memory requirements.
    • pdf-extract now processes text with sentence-join (consult Bitextor documentation for instructions)
    • Deleted old and deprecated files and folders, like slurm, nmt workflow for MarianNMT or pdf-extract (replaced by wrappers in WARC processors).
  • General system stability improvements to enhance the user's experience.

  • Conda release builds are up.
  • Docker builds have the same automatic build system, adding nightlies from GitHub master branch pushes (edge tag in Docker Hub).
  • Continuous integration has been activated through GitHub Actions.
  • Discussions are now open on GitHub! Use them to chat about releases or topics that don't fit in the issues section.
  • A Discord server is also up for live chat with other users and developers! There are also some bots to keep you updated with news about Bitextor development and related projects.
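
The hash-based deferred crawling mentioned in the changelog above can be sketched roughly as follows. This is an illustration of the general idea only, not Bitextor's implementation: MurmurHash is not in the Python standard library, so a 64-bit BLAKE2b digest is used as a stand-in, and the function names are hypothetical.

```python
import hashlib


def sentence_hash(sentence):
    """Stand-in for MurmurHash: 64-bit BLAKE2b digest of the UTF-8 text."""
    return hashlib.blake2b(sentence.encode("utf-8"), digest_size=8).hexdigest()


def annotate(sentences):
    """Replace each sentence by its hash so no text needs to be distributed."""
    return [sentence_hash(s) for s in sentences]


def reconstruct(hashes, recrawled_sentences):
    """Recover the text for each hash from a fresh crawl; None if not found."""
    lookup = {sentence_hash(s): s for s in recrawled_sentences}
    return [lookup.get(h) for h in hashes]


if __name__ == "__main__":
    released = annotate(["Hola mundo.", "Adiós."])
    # Later, someone re-downloads the page and splits it into sentences again.
    print(reconstruct(released, ["Adiós.", "Otra frase.", "Hola mundo."]))
```

A sentence whose page has changed since the original crawl simply comes back as None, which is the main trade-off of distributing hashes instead of text.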

Notes

The bitextor-v8.0.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.zip archive or cloning the repository at the v8.0 tag.

We will support the Bitextor 8.x branch until the next major version is released.

v8.0.0-pre

3 years ago

v8.0.0-pre Changelog

  • Deep rewrite of Bitextor Snakefile for a vast performance improvement.
    • The dictionary-based document aligner and hunalign options and rules are still missing and will be integrated soon.
    • We recommend reviewing the Bitextor README.md to check the new option names and formats.
    • Some intermediate files also changed, so reusing old runs would introduce issues.
  • Added sharding mode, which groups domains into 1 GB shards for a more balanced job running.
  • Added lxml as a text extraction parsing library option in bitextor-warc2htmlwarc.py, and html5lib as an optional additional parser.
    • This is needed for proper deferred crawling in newest Bitextor code.
      • Deferred crawling is still only supported under warc2preprocess preprocessor.
    • html5lib is the cleanest supported parser (like a web browser) but also the slowest.
  • Fixed simple text extraction parser in bitextor-warc2preprocess.py for some table tags and new HTML5 tags.
  • ftfy is now disabled by default.
  • Moses sentence splitter and tokenizer are now used by default through an efficient Python wrapper.
    • This will happen if wordTokenizers and sentenceSplitters are not defined.
    • This is the recommended option if your language is supported by these scripts.
  • Updated README.md.
  • Refactored and updated requirements and submodules for lots of performance improvements.
    • You can now skip the Python dependencies of modules you don't need to run by commenting out those lines in requirements.txt before installing them.
    • Deferred crawling functions now can be easily imported.
    • Refactored bleualign-cpp code.
      • Faster and with lower memory requirements.
    • New translation-based document aligner written in C++.
      • Faster and with lower memory requirements than the previous Python code.
    • New base64 scripts from kpu/preprocess and cache fixes.
    • Bifixer now filters out sentence pairs if one side has more than 1024 characters.
  • General system stability improvements to enhance the user's experience.
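
To make the sharding idea in the changelog above more concrete, here is a minimal greedy sketch that groups domains into shards of at most 1 GB. It is only an illustration under stated assumptions: the actual sharding strategy in Bitextor may differ, 1 GB is taken as 10**9 bytes, and the domain names, sizes and the group_into_shards helper are made up.

```python
SHARD_LIMIT = 1_000_000_000  # 1 GB, taken as 10**9 bytes for this sketch


def group_into_shards(domain_sizes, limit=SHARD_LIMIT):
    """Greedily pack domains (name -> size in bytes) into shards of at most `limit` bytes.

    Domains are visited in name order. A domain larger than the limit
    ends up in a shard of its own.
    """
    shards = []
    current, current_size = [], 0
    for domain, size in sorted(domain_sizes.items()):
        # Start a new shard when adding this domain would exceed the limit.
        if current and current_size + size > limit:
            shards.append(current)
            current, current_size = [], 0
        current.append(domain)
        current_size += size
    if current:
        shards.append(current)
    return shards


sizes = {
    "a.example": 600_000_000,
    "b.example": 300_000_000,
    "c.example": 200_000_000,
    "d.example": 1_500_000_000,
}

if __name__ == "__main__":
    print(group_into_shards(sizes))
```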

Notes

Docker image will be updated once v8.0.0 gets released.

The bitextor-v8.0.0-pre.zip archive includes the submodules' code, but you still need to compile binaries such as bleualign. If you compile the project after cloning the git repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.0.0-pre.zip archive or cloning the repository at the v8.0.0-pre tag.

v7.3.2

4 years ago
  • Fixed warc2htmlwarc.py optional non-compressed output.
  • Fixed bicleaner and bifixer cached call from Bitextor, improving performance.
  • Fixed paths in test files.
  • Fixed heritrix waiting time while creating initial crawling files.
  • Fixed some deprecation errors from exceptions and old options.
  • Fixed TMX and TXT deduplicated output, which now writes the text of the first occurrence of a deduplicated sentence.
  • Fixed reproducibility issues using bicleaner cached call by creating a Bitextor optional parameter called bicleanerCacheWithSents.
  • Updated submodules to fix some bugs.
    • Bifixer: fixed crash on empty segments.
    • Bicleaner: version 0.13, less aggressive hardrules for short sentences (3-word sentences).
  • Fixed the cld3 input in bitextor-warc2preprocess.py, which was causing most documents to be detected as 'English'.
  • Fixed the text extracted from <span> elements by adding a space after their content, in the 'simple' text extractor of warc2preprocess.
  • Updated some requirements.txt for security and dependency issues.
  • Updated latest docker image and tagged as v7.3.2.
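
The <span> fix in the list above can be illustrated with a short sketch built on Python's standard html.parser module. This is not the warc2preprocess code; the SimpleTextExtractor class and extract_text function are hypothetical names used only to show why the extra space matters.

```python
from html.parser import HTMLParser


class SimpleTextExtractor(HTMLParser):
    """Collect text nodes, adding a space after each closing </span>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Without this, adjacent spans would be glued into a single word.
        if tag == "span":
            self.parts.append(" ")

    def text(self):
        # Collapse runs of whitespace introduced by the extra spaces.
        return " ".join("".join(self.parts).split())


def extract_text(html):
    parser = SimpleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(extract_text("<p><span>Hello</span><span>world</span></p>"))
```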

Notes

We have started integrating the Bitextor 8.0 development branches into the master branch. If you need stable code rather than the latest features, please use the released versions/tags or the stable 7.x branch.

The bitextor-v7.3.2.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.3.2.zip archive or cloning the repository at the v7.3.2 tag.

We will support the Bitextor 7.x branch until Bitextor 8 is released.

v7.3.1

4 years ago
  • Fixed typos and the new Bicleaner model filenames in the example and test config files
  • Fixed tilde paths (~ as /home/user) when used in config files
  • Fixed modification of the warcio HTTP headers without recalculating the content length (reported upstream)
  • Fixed bitextor-warc2htmlwarc.py stdin and stdout run mode.
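
For the tilde path fix in the list above, the usual way to resolve a leading ~ in Python is os.path.expanduser. The sketch below shows that approach; the expand_config_paths helper and the config keys in the example are illustrative assumptions, not Bitextor's actual code.

```python
import os


def expand_config_paths(config, path_keys):
    """Return a copy of config with a leading ~ expanded in the given keys."""
    expanded = dict(config)
    for key in path_keys:
        if key in expanded:
            expanded[key] = os.path.expanduser(expanded[key])
    return expanded


if __name__ == "__main__":
    # Illustrative keys and values only.
    cfg = {"dataDir": "~/bitextor-data", "lang1": "en"}
    print(expand_config_paths(cfg, ["dataDir"]))
```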

Notes

The bitextor-v7.3.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.3.zip archive or cloning the repository at the v7.3 tag.

We will support the Bitextor 7.x branch until Bitextor 8 is released.

v7.3

4 years ago

"Always look on the end (side) of life", PEP-373

v7.3 Changelog

Notes

The bitextor-v7.3.zip archive includes the submodules' code and binaries. If you compile the project after cloning the repository, you first need to run git submodule update --init --recursive. This command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v7.3.zip archive or cloning the repository at the v7.3 tag.

We will support the Bitextor 7.x branch until Bitextor 8 is released.