Unstructured Versions

Open source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning workflows.

0.6.0

1 year ago

Enhancements

Adds an ssl_verify kwarg to partition and partition_html to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
Allows users to pass an OCR language to partition_pdf and partition_image through the ocr_language kwarg. ocr_language corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.
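
A minimal usage sketch of the two kwargs above. The import paths unstructured.partition.html and unstructured.partition.pdf are assumed, the wrapper function names are hypothetical, and "deu" is only a placeholder language code. Imports are deferred into the functions so the block loads even when the optional dependencies are not installed.

```python
def partition_html_without_ssl_check(url):
    """Partition a remote HTML page with SSL verification turned off."""
    # Assumed import path; deferred so this module loads without unstructured.
    from unstructured.partition.html import partition_html

    # SSL verification is on by default; turn it off only for hosts you trust.
    return partition_html(url=url, ssl_verify=False)


def partition_pdf_with_language(filename, ocr_language="deu"):
    """Partition a PDF, passing a Tesseract language code for OCR."""
    # Assumed import path; deferred for the same reason as above.
    from unstructured.partition.pdf import partition_pdf

    # The matching Tesseract language pack must be installed separately.
    return partition_pdf(filename=filename, ocr_language=ocr_language)
```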

Features

Table extraction is now possible for PDFs from partition and partition_pdf.
Adds support for extracting attachments from .msg files.

Fixes

0.5.13

1 year ago

Enhancements

Allow headers to be passed into partition when url is used.

Features

Adds the bytes_string_to_string cleaning brick for bytes string output.
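
A rough sketch of what a bytes-string cleaner does, using only the standard library. The function name bytes_string_to_string_sketch is hypothetical and this is not presented as the library's implementation: it treats each character of the input as one raw byte and decodes the result.

```python
def bytes_string_to_string_sketch(text, encoding="utf-8"):
    """Reinterpret a str whose characters are raw byte values as encoded text."""
    # Every character must have a code point below 256 to map onto one byte.
    raw = bytes(ord(char) for char in text)
    return raw.decode(encoding)


# The four UTF-8 bytes of an emoji, stored as four separate characters.
garbled = "Hello \xf0\x9f\x98\x80"
fixed = bytes_string_to_string_sketch(garbled)
```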

Fixes

Fixed typo in call to exactly_one in partition_json
unstructured-documents: encode the XML string if document_tree is None in _read_xml.
Update to _read_xml so that Markdown files with embedded HTML process correctly.
Fallback to "fast" strategy only emits a warning if the user specifies the "hi_res" strategy.
unstructured-partition-text_type: fixes what exceeds_cap_ratio returns and how capitalization ratios are calculated.
partition_pdf and partition_text group broken paragraphs to avoid fragmented NarrativeText elements.
.json files now resolve as "application/json" on CentOS 7 (or other installs with older libmagic libs).
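
To make the capitalization-ratio fix above concrete, here is a simplified sketch of such a check. The function names, whitespace tokenization, and the 0.5 threshold are assumptions for illustration and may differ from the library's own logic.

```python
def capitalization_ratio(text):
    """Share of alphabetic tokens that are title-cased or fully upper-cased."""
    tokens = [tok for tok in text.split() if tok.isalpha()]
    if not tokens:
        return 0.0
    capitalized = sum(tok.istitle() or tok.isupper() for tok in tokens)
    return capitalized / len(tokens)


def exceeds_cap_ratio_sketch(text, threshold=0.5):
    """True when the capitalized share is strictly above the threshold."""
    return capitalization_ratio(text) > threshold
```

A heading-like string such as "Risk Factors And Uncertainties" scores above the threshold, while ordinary sentence text usually does not.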

0.5.12

1 year ago

Enhancements

Add OS mimetypes DB to docker image, mainly for unstructured-api compat.
Use the image registry as a cache when building Docker images.
Adds the ability for partition_text to group together broken paragraphs.
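
The idea behind grouping broken paragraphs can be sketched as follows. This is a standard-library illustration with a hypothetical function name, assuming that blank lines separate paragraphs and single line breaks inside a paragraph are just wrapping.

```python
import re

# One or more blank (or whitespace-only) lines mark a paragraph boundary.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def group_broken_paragraphs_sketch(text):
    """Join hard-wrapped lines inside each paragraph into a single line."""
    grouped = []
    for paragraph in PARAGRAPH_BREAK.split(text):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        if lines:
            grouped.append(" ".join(lines))
    return "\n\n".join(grouped)
```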

Features

Add --partition-by-api parameter to unstructured-ingest
Added partition_rtf for processing rich text files.
partition now accepts a url kwarg in addition to file and filename.

Fixes

Allow encoding to be passed into replace_mime_encodings.
unstructured-ingest connector-specific dependencies are imported on demand.
unstructured-ingest --flatten-metadata supported for local connector.
unstructured-ingest fix runtime error when using --metadata-include.

0.5.11

1 year ago

Enhancements

Features

Fixes

Guard against null style attribute in docx document elements
Update HTML encoding to better support foreign language characters

0.5.10

1 year ago

Enhancements

Updated inference package
Add sender, recipient, date, and subject to element metadata for emails

Features

Added --download-only parameter to unstructured-ingest

Fixes

FileNotFound error when filename is provided but file is not on disk

0.5.9

1 year ago

Enhancements

Features

Fixes

Convert file to str in helper split_by_paragraph for partition_text

0.5.8

1 year ago

Enhancements

Update elements_to_json to return string when filename is not specified
elements_from_json may take a string instead of a filename with the text kwarg
detect_filetype now does a final fallback to file extension.
Empty tags are now skipped during the depth check for HTML processing.
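
A final fallback to the file extension, as described for detect_filetype above, could look roughly like this. The function name, the small extension table, and the sniffed_mime parameter (standing in for whatever content-based detection ran first) are all assumptions made for the sketch.

```python
import os

# Small illustrative table; a real detector would cover many more types.
EXTENSION_TO_MIME = {
    ".json": "application/json",
    ".html": "text/html",
    ".md": "text/markdown",
    ".txt": "text/plain",
}

# Results from content sniffing that carry no useful information.
UNINFORMATIVE = {None, "", "application/octet-stream"}


def detect_mime_sketch(filename, sniffed_mime=None):
    """Prefer the content-based result; otherwise fall back to the extension."""
    if sniffed_mime not in UNINFORMATIVE:
        return sniffed_mime
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_MIME.get(extension.lower())
```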

Features

Add local file system to unstructured-ingest
Add --max-docs parameter to unstructured-ingest
Added partition_msg for processing Microsoft Outlook .msg files.

Fixes

convert_file_to_text now passes through the source_format and target_format kwargs. Previously they were hard coded.
Partitioning functions that accept a text kwarg no longer raise an error if an empty string is passed (an empty list of elements is returned instead).
partition_json no longer fails if the input is an empty list.
Fixed bug in chunk_by_attention_window that caused the last word in segments to be cut off in some cases.
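
As an illustration of window-based chunking that keeps words whole, the sketch below splits on whitespace and counts words. That is a stand-in assumption: the real brick measures its window with a model tokenizer, and the function name here is hypothetical.

```python
def chunk_by_word_window(text, window=5):
    """Split text into chunks of at most `window` whitespace-separated words."""
    if window < 1:
        raise ValueError("window must be at least 1")
    words = text.split()
    # Slice on word boundaries so no word is ever split between chunks.
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
```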

BREAKING CHANGES

stage_for_transformers now returns a list of elements, making it consistent with other staging bricks

0.5.7

1 year ago

Enhancements

Refactored codebase using exactly_one
Adds ability to pass headers when passing a url in partition_html()
Added optional content_type and file_filename parameters to partition() to bypass file detection
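
A helper in the spirit of exactly_one can be sketched as below. The name exactly_one_sketch and the error message are illustrative assumptions; the point is that a function taking filename, file, or url can reject calls that supply none or several of them.

```python
def exactly_one_sketch(**kwargs):
    """Raise ValueError unless exactly one keyword argument is set."""
    provided = [
        name for name, value in kwargs.items()
        if value is not None and value != ""
    ]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(
            f"Exactly one of {names} must be specified; got {len(provided)}."
        )
```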

Features

Add --flatten-metadata parameter to unstructured-ingest
Add --fields-include parameter to unstructured-ingest
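
What flattening metadata and restricting output fields might mean for a single element record is sketched here. The function names, the "_" separator, and the sample record are assumptions for illustration; the CLI's actual output layout may differ.

```python
def flatten_dict_sketch(nested, parent_key="", separator="_"):
    """Collapse nested dicts into one level, joining keys with a separator."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict_sketch(value, full_key, separator))
        else:
            flat[full_key] = value
    return flat


def include_fields_sketch(record, fields):
    """Keep only the requested top-level fields that exist in the record."""
    return {key: record[key] for key in fields if key in record}


element = {
    "type": "Title",
    "text": "Hello",
    "metadata": {"filename": "a.pdf", "page_number": 1},
}
flat = flatten_dict_sketch(element)
```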

Fixes

0.5.6

1 year ago

Fix problem with PDF partition (duplicated test)

Enhancements

contains_english_word(), used heavily in text processing, is 10x faster.

Features

Add --metadata-include and --metadata-exclude parameters to unstructured-ingest
Add clean_non_ascii_chars to remove non-ascii characters from unicode string
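
Removing non-ASCII characters from a unicode string can be done with an encode/decode round trip. The sketch below uses a hypothetical function name and only the standard library; it is not presented as the brick's exact implementation.

```python
def clean_non_ascii_chars_sketch(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters outside the ASCII range.
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned = clean_non_ascii_chars_sketch("\x88This text contains non-ascii characters!\x88")
```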

Fixes

Fixes duplicated elements issue with partition_pdf(..., strategy="fast")

0.5.4

1 year ago

Enhancements

Added Biomedical literature connector for ingest CLI.
Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
Rename s3_connector.py to s3.py for readability and consistency with the rest of the connectors.
S3Connector now relies on s3fs instead of boto3 and inherits from FsspecConnector.
Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language-specific checks like vocabulary and POS tagging are applied. Set to "true" for higher resolution partitioning and "false" for faster processing.
Improves detect_filetype warning to include filename when provided.
Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in favor of --remote-url.
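
Reading a boolean toggle such as UNSTRUCTURED_LANGUAGE_CHECKS from the environment can be sketched as follows. The function name, the optional environ parameter, and the default of "false" when the variable is unset are assumptions made for this example.

```python
import os


def language_checks_enabled(environ=None):
    """Return True when UNSTRUCTURED_LANGUAGE_CHECKS is set to "true"."""
    environ = os.environ if environ is None else environ
    # Assumed default for this sketch: checks are off when the variable is unset.
    value = environ.get("UNSTRUCTURED_LANGUAGE_CHECKS", "false")
    return value.strip().lower() == "true"
```

Passing a plain dict as environ makes the toggle easy to exercise without touching the real process environment.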

Features

Add AzureBlobStorageConnector based on its fsspec implementation inheriting from FsspecConnector
Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

Fixes processing for text files with message/rfc822 MIME type.
Open xml files in read-only mode when reading contents to construct an XMLDocument.