Unstructured Versions Save

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

0.6.0

1 year ago

0.6.0

Enhancements

  • Adds an ssl_verify kwarg to partition and partition_html to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
  • Allows users to pass in ocr language to partition_pdf and partition_image through the ocr_language kwarg. ocr_language corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.

Features

  • Table extraction is now possible for pdfs from partition and partition_pdf.
  • Adds support for extracting attachments from .msg files

Fixes

0.5.13

1 year ago

0.5.13

Enhancements

  • Allow headers to be passed into partition when url is used.

Features

  • bytes_string_to_string cleaning brick for bytes string output.

Fixes

  • Fixed typo in call to exactly_one in partition_json
  • unstructured-documents encode xml string if document_tree is None in _read_xml.
  • Update to _read_xml so that Markdown files with embedded HTML process correctly.
  • Fallback to "fast" strategy only emits a warning if the user specifies the "hi_res" strategy.
  • unstructured-partition-text_type exceeds_cap_ratio fix returns and how capitalization ratios are calculated
  • partition_pdf and partition_text group broken paragraphs to avoid fragmented NarrativeText elements.
  • .json files resolved as "application/json" on centos7 (or other installs with older libmagic libs)

0.5.12

1 year ago

0.5.12

Enhancements

  • Add OS mimetypes DB to docker image, mainly for unstructured-api compat.
  • Use the image registry as a cache when building Docker images.
  • Adds the ability for partition_text to group together broken paragraphs.

Features

  • Add --partition-by-api parameter to unstructured-ingest
  • Added partition_rtf for processing rich text files.
  • partition now accepts a url kwarg in addition to file and filename.

Fixes

  • Allow encoding to be passed into replace_mime_encodings.
  • unstructured-ingest connector-specific dependencies are imported on demand.
  • unstructured-ingest --flatten-metadata supported for local connector.
  • unstructured-ingest fix runtime error when using --metadata-include.

0.5.11

1 year ago

0.5.11

Enhancements

Features

Fixes

  • Guard against null style attribute in docx document elements
  • Update HTML encoding to better support foreign language characters

0.5.10

1 year ago

0.5.10

Enhancements

  • Updated inference package
  • Add sender, recipient, date, and subject to element metadata for emails

Features

  • Added --download-only parameter to unstructured-ingest

Fixes

  • FileNotFound error when filename is provided but file is not on disk

0.5.9

1 year ago

0.5.9

Enhancements

Features

Fixes

  • Convert file to str in helper split_by_paragraph for partition_text

0.5.8

1 year ago

0.5.8

Enhancements

  • Update elements_to_json to return string when filename is not specified
  • elements_from_json may take a string instead of a filename with the text kwarg
  • detect_filetype now does a final fallback to file extension.
  • Empty tags are now skipped during the depth check for HTML processing.

Features

  • Add local file system to unstructured-ingest
  • Add --max-docs parameter to unstructured-ingest
  • Added partition_msg for processing MSFT Outlook .msg files.

Fixes

  • convert_file_to_text now passes through the source_format and target_format kwargs. Previously they were hard coded.
  • Partitioning functions that accept a text kwarg no longer raise an error if an empty string is passed (and empty list of elements is returned instead).
  • partition_json no longer fails if the input is an empty list.
  • Fixed bug in chunk_by_attention_window that caused the last word in segments to be cut-off in some cases.

BREAKING CHANGES

  • stage_for_transformers now returns a list of elements, making it consistent with other staging bricks

0.5.7

1 year ago

0.5.7

Enhancements

  • Refactored codebase using exactly_one
  • Adds ability to pass headers when passing a url in partition_html()
  • Added optional content_type and file_filename parameters to partition() to bypass file detection

Features

  • Add --flatten-metadata parameter to unstructured-ingest
  • Add --fields-include parameter to unstructured-ingest

Fixes

0.5.6

1 year ago

0.5.6

  • Fix problem with PDF partition (duplicated test)

Enhancements

  • contains_english_word(), used heavily in text processing, is 10x faster.

Features

  • Add --metadata-include and --metadata-exclude parameters to unstructured-ingest
  • Add clean_non_ascii_chars to remove non-ascii characters from unicode string

Fixes

  • Fixes duplicated elements issue with partition_pdf(..., strategy="fast")

0.5.4

1 year ago

0.5.4

Enhancements

  • Added Biomedical literature connector for ingest cli.
  • Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
  • Rename s3_connector.py to s3.py for readability and consistency with the rest of the connectors.
  • Now S3Connector relies on s3fs instead of on boto3, and it inherits from FsspecConnector.
  • Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language specific checks like vocabulary and POS tagging are applied. Set to "true" for higher resolution partitioning and "false" for faster processing.
  • Improves detect_filetype warning to include filename when provided.
  • Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
  • Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in favor of --remote-url.

Features

  • Add AzureBlobStorageConnector based on its fsspec implementation inheriting from FsspecConnector
  • Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

  • Fixes processing for text files with message/rfc822 MIME type.
  • Open xml files in read-only mode when reading contents to construct an XMLDocument.