Unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

0.6.10

11 months ago

Enhancements

  • Add XLS support to auto partition

Features

Fixes

0.6.9

11 months ago

Enhancements

  • The fast strategy for PDFs now keeps element bounding box data
  • Refactor setup.py

Features

Fixes

  • Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds additional MIME types for CSV
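
The encoding fallback described in the first fix can be illustrated with a minimal sketch. This is not the library's code: the function name read_text_with_fallback and the candidate list COMMON_ENCODINGS are assumptions made for illustration, and the encodings the library actually tries may differ.

```python
from typing import List, Optional, Tuple

# Hypothetical fallback list; the library's real candidates may differ.
COMMON_ENCODINGS: List[str] = ["utf-8", "cp1252", "latin-1"]


def read_text_with_fallback(
    data: bytes, encoding: Optional[str] = None
) -> Tuple[str, str]:
    """Decode bytes, trying common encodings only when none was specified.

    Returns the decoded text and the name of the encoding that worked.
    """
    if encoding is not None:
        # The caller chose an encoding, so respect it and let errors propagate.
        return data.decode(encoding), encoding

    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            # This candidate does not fit the data; try the next one.
            continue

    raise ValueError("Could not decode the data with any common encoding")
```

Because the fallback only runs when no encoding is passed, an explicit but wrong encoding still raises an error instead of being silently replaced.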

0.6.8

1 year ago

Enhancements

Features

  • Add partition_csv for CSV files.

Fixes

0.6.7

1 year ago

Enhancements

  • Deprecate --s3-url in favor of --remote-url in CLI
  • Refactor out non-connector-specific config variables
  • Add file_directory to metadata
  • Add page_name to metadata. Currently used for the sheet name in XLSX documents.
  • Added a --partition-strategy parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, --partition-strategy fast.
  • Added metadata for filetype.
  • Add Discord connector to pull messages from a list of channels
  • Refactor unstructured/file-utils/filetype.py to make better use of a hashmap when returning the MIME type.
  • Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.
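
As an illustration of the hashmap-based MIME lookup mentioned in the filetype refactor, here is a minimal sketch. The table EXT_TO_MIME and the function mime_type_for are hypothetical names, and the entries are a small sample rather than the library's actual mapping.

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical extension-to-MIME table (a small illustrative sample).
EXT_TO_MIME: Dict[str, str] = {
    ".csv": "text/csv",
    ".xml": "application/xml",
    ".html": "text/html",
    ".docx": (
        "application/vnd.openxmlformats-officedocument"
        ".wordprocessingml.document"
    ),
    ".xlsx": (
        "application/vnd.openxmlformats-officedocument"
        ".spreadsheetml.sheet"
    ),
}


def mime_type_for(filename: str) -> Optional[str]:
    """Return the MIME type for a filename, or None if the extension is unknown."""
    # A single dictionary lookup replaces a chain of if/elif comparisons.
    return EXT_TO_MIME.get(Path(filename).suffix.lower())
```

A dictionary keeps the mapping in one place, so supporting a new file type means adding one entry instead of another branch.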

Features

  • Add partition_xml for XML files.
  • Add partition_xlsx for Microsoft Excel documents.

Fixes

  • Supports hml filetype for partition as a variation of html filetype.
  • Makes pytesseract a function level import in partition_pdf so you can use the "fast" or "hi_res" strategies if pytesseract is not installed. Also adds the required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
  • Fix to ensure filename is tracked in metadata for docx tables.
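
The combination of a function-level import and a dependency-checking decorator, as described in the pytesseract fix, might look roughly like the sketch below. The decorator name require_modules and the function parse_numbers are hypothetical, and the standard-library json module stands in for an optional package such as pytesseract.

```python
import functools
import importlib.util
from typing import Any, Callable, List


def require_modules(names: List[str]) -> Callable:
    """Hypothetical decorator: fail early if an optional module is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # find_spec returns None when a top-level module is not installed.
            missing = [n for n in names if importlib.util.find_spec(n) is None]
            if missing:
                raise ImportError(
                    "Missing optional dependencies: " + ", ".join(missing)
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_modules(["json"])  # json stands in for an optional package
def parse_numbers(text: str) -> list:
    # Function-level import: the module is loaded only when this runs,
    # so importing the enclosing file never fails on a missing package.
    import json

    return json.loads(text)
```

With this layout, code paths that do not need the optional package keep working when it is absent, and the paths that do need it fail with a clear message.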

0.6.6

1 year ago

Enhancements

  • Adds an "auto" strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for partition_pdf and partition_image. Users can maintain existing behavior by explicitly setting strategy="hi_res".
  • Added an additional trace logger for NLP debugging.
  • Add get_date method to ElementMetadata for converting the datestring to a datetime object.
  • Clean up the filename attribute on ElementMetadata to remove the full filepath.
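
The two ElementMetadata changes above (a get_date helper and a filename without the full path) can be sketched with a small stand-in class. SketchMetadata is hypothetical and is not the library's ElementMetadata; it also assumes the date is stored as an ISO 8601 string, which may not match every format the library accepts.

```python
import os
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for a metadata object (not the library's class)."""

    filename: Optional[str] = None
    date: Optional[str] = None  # Assumed to be an ISO 8601 string.

    def __post_init__(self) -> None:
        # Keep only the final path component so no directory is stored.
        if self.filename is not None:
            self.filename = os.path.basename(self.filename)

    def get_date(self) -> Optional[datetime]:
        """Convert the stored date string to a datetime, if one is present."""
        if self.date is None:
            return None
        return datetime.fromisoformat(self.date)
```

For example, SketchMetadata(filename="/tmp/docs/report.pdf") stores only "report.pdf" as the filename.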

Features

  • Added table reading as HTML with URL parsing to partition_docx
  • Added metadata field for text_as_html for docx files

Fixes

  • The fileutils/file_type checks for JSON and EML now ignore decoding errors
  • partition_email was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata returns None if the time does not match RFC-2822 at all.
  • Include all metadata fields when converting to dataframe or CSV
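
The "return None when the time does not match RFC-2822" behavior from the partition_email fix can be sketched with the standard library. The helper parse_email_date is a hypothetical name, and this sketch shows only the None fallback, not the more flexible parsing the fix also describes.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 date header, returning None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError and newer ones raise
        # ValueError for unparseable input, so both are caught here.
        return None
```

Returning None instead of raising lets the rest of the email be processed even when the date header is malformed.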

0.6.5

1 year ago

Enhancements

  • Added support for passing a SpooledTemporaryFile as the file argument.
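
For context, a SpooledTemporaryFile is the standard-library file object that stays in memory until it grows past a size limit, and it is a common wrapper for uploaded files. The sketch below shows such an object being handed to a function that expects a binary file argument; read_all_bytes is a hypothetical stand-in, not one of the library's partition functions.

```python
import tempfile
from typing import BinaryIO


def read_all_bytes(file: BinaryIO) -> bytes:
    """Hypothetical consumer of a binary file-like object."""
    # Rewind first, since the writer may have left the cursor at the end.
    file.seek(0)
    return file.read()


# The buffer stays in memory until more than max_size bytes are written.
with tempfile.SpooledTemporaryFile(max_size=1024) as spooled:
    spooled.write(b"Title\n\nSome narrative text.")
    content = read_all_bytes(spooled)
```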

Features

Fixes

0.6.4

1 year ago

Enhancements

  • Added an "ocr_only" strategy for partition_pdf. Refactored the strategy decision logic into its own module.

Features

Fixes

0.6.3

1 year ago

Enhancements

  • Add an "ocr_only" strategy for partition_image.

Features

  • Added partition_multiple_via_api for partitioning multiple documents in a single REST API call.
  • Added stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
  • Added partition_odt for processing Open Office documents.

Fixes

  • Updates the grouping logic in the partition_pdf fast strategy to group together text in the same bounding box.
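
Grouping text that shares a bounding box, as in the fix above, can be sketched as follows. The input shape (a list of bounding box and text pairs) and the function name group_by_bounding_box are assumptions for illustration; the library's internal representation is not described in these notes.

```python
from typing import Dict, List, Tuple

# Hypothetical bounding box type: (x0, y0, x1, y1).
BBox = Tuple[float, float, float, float]


def group_by_bounding_box(
    fragments: List[Tuple[BBox, str]]
) -> List[Tuple[BBox, str]]:
    """Merge text fragments that share the same bounding box.

    Boxes keep the order in which they first appear, and the fragments
    inside a box are joined with single spaces.
    """
    grouped: Dict[BBox, List[str]] = {}
    for bbox, text in fragments:
        grouped.setdefault(bbox, []).append(text)
    return [(bbox, " ".join(parts)) for bbox, parts in grouped.items()]
```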

0.6.2

1 year ago

Enhancements

  • Added logic to partition_pdf for detecting copy-protected PDFs and falling back to the hi_res strategy when necessary.
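
The fallback described above amounts to a small decision rule, sketched here under stated assumptions. The function resolve_pdf_strategy and its text_extractable flag are hypothetical; how the library actually detects copy protection is not covered in these notes.

```python
def resolve_pdf_strategy(requested: str, text_extractable: bool) -> str:
    """Pick the strategy to run, falling back when text cannot be extracted.

    text_extractable is a hypothetical flag that is False for a
    copy-protected PDF whose text layer cannot be read directly.
    """
    if requested == "fast" and not text_extractable:
        # Direct text extraction is blocked, so use the model-based strategy.
        return "hi_res"
    return requested
```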

Features

  • Add partition_via_api for partitioning documents through the hosted API.

Fixes

  • Fix how exceeds_cap_ratio handles empty input (it now returns True instead of False)
  • Updates detect_filetype to properly detect JSONs when the MIME type is text/plain.
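
The JSON detection fix can be illustrated with a content check that runs when the reported MIME type is only text/plain. This is a sketch, not the detect_filetype implementation: the names is_json_text and refine_mime_type are hypothetical, and treating only objects and arrays as JSON documents is an assumption made here.

```python
import json


def is_json_text(text: str) -> bool:
    """Return True if the text parses as a JSON object or array."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    # Assumption: bare scalars such as 42 are not treated as JSON documents.
    return isinstance(parsed, (dict, list))


def refine_mime_type(mime_type: str, text: str) -> str:
    """Upgrade text/plain to application/json when the content is JSON."""
    if mime_type == "text/plain" and is_json_text(text):
        return "application/json"
    return mime_type
```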

0.6.1

1 year ago

Enhancements

  • Updated the table extraction parameter name to be more descriptive

Features

Fixes