Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- Add `ssl_verify` kwarg to `partition` and `partition_html` to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
- Allow specifying the OCR language for `partition_pdf` and `partition_image` through the `ocr_language` kwarg. `ocr_language` corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.
- `partition` and `partition_pdf`.
- `.msg` files
- `partition` when `url` is used.
- Add `bytes_string_to_string` cleaning brick for bytes string output.
- `exactly_one` in `partition_json`
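To illustrate what a cleaning brick like `bytes_string_to_string` is for, here is a minimal sketch. It assumes the input is a string in which every character stands for one raw byte of encoded text (for example, UTF-8 bytes that were read as Latin-1). The function name `bytes_string_to_string_sketch` is hypothetical, and this is not the library's implementation.

```python
def bytes_string_to_string_sketch(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret a 'bytes string' as properly decoded text.

    Assumption: every character in `text` has a code point below 256,
    so it can be mapped back to a single byte. Characters outside that
    range raise ValueError.
    """
    # Map each character back to the byte value it represents.
    raw = bytes(ord(char) for char in text)
    # Decode the recovered bytes with the intended encoding.
    return raw.decode(encoding)


# "caf\xc3\xa9" holds the two UTF-8 bytes of "é" as separate characters.
print(bytes_string_to_string_sketch("caf\xc3\xa9"))
```

Plain ASCII input passes through unchanged, since each ASCII character already maps to the same single byte.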
- `None` in `_read_xml`.
- `_read_xml` so that Markdown files with embedded HTML process correctly.
- `partition_pdf` and `partition_text` group broken paragraphs to avoid fragmented `NarrativeText` elements.
- `partition_text` to group together broken paragraphs.
- Add `partition_rtf` for processing rich text files.
- `partition` now accepts a `url` kwarg in addition to `file` and `filename`.
- `replace_mime_encodings`.
- `elements_to_json` to return string when filename is not specified.
- `elements_from_json` may take a string instead of a filename with the `text` kwarg.
- `detect_filetype` now does a final fallback to file extension.
- `unstructured-ingest`
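The paragraph-grouping entries above can be pictured with a small sketch. It assumes that a blank line marks a real paragraph boundary and that a single line break inside a paragraph is an artifact of hard wrapping. The function name `group_broken_paragraphs_sketch` is hypothetical, and the library's own logic may differ.

```python
import re

# Assumption: one or more blank lines separate real paragraphs.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
# A single line break (with any surrounding whitespace) inside a paragraph.
LINE_BREAK = re.compile(r"\s*\n\s*")


def group_broken_paragraphs_sketch(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes one line."""
    grouped = []
    for paragraph in PARAGRAPH_BREAK.split(text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty chunks from leading or trailing blank lines
        # Replace in-paragraph line breaks with a single space.
        grouped.append(LINE_BREAK.sub(" ", paragraph))
    return "\n\n".join(grouped)


wrapped = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane,\nthe fox met a bear."
)
print(group_broken_paragraphs_sketch(wrapped))
```

With the sample input, the two hard-wrapped paragraphs come back as two single-line paragraphs separated by one blank line, which is the shape that avoids fragmented narrative elements downstream.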
- Add `--max-docs` parameter to `unstructured-ingest`.
- Add `partition_msg` for processing MSFT Outlook .msg files.
- `convert_file_to_text` now passes through the `source_format` and `target_format` kwargs. Previously they were hard coded.
- `text` kwarg no longer raises an error if an empty string is passed (an empty list of elements is returned instead).
- `partition_json` no longer fails if the input is an empty list.
- Fix a bug in `chunk_by_attention_window` that caused the last word in segments to be cut off in some cases.
- `stage_for_transformers` now returns a list of elements, making it consistent with other staging bricks.
- `exactly_one`
- Add `content_type` and `file_filename` parameters to `partition()` to bypass file detection.
- Add `--flatten-metadata` parameter to `unstructured-ingest`.
- Add `--fields-include` parameter to `unstructured-ingest`.
- `contains_english_word()`, used heavily in text processing, is 10x faster.
- Add `--metadata-include` and `--metadata-exclude` parameters to `unstructured-ingest`.
- Add `clean_non_ascii_chars` to remove non-ASCII characters from Unicode strings.
- `partition_pdf(..., strategy="fast")`
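As a rough picture of what a cleaner like `clean_non_ascii_chars` does, the sketch below drops every character outside the ASCII range by round-tripping through an ASCII encode with errors ignored. The function name `clean_non_ascii_chars_sketch` is hypothetical, and this is one common approach rather than a statement of how the library implements it.

```python
def clean_non_ascii_chars_sketch(text: str) -> str:
    """Remove characters that cannot be represented in ASCII."""
    # errors="ignore" silently drops anything outside the ASCII range.
    ascii_bytes = text.encode("ascii", errors="ignore")
    return ascii_bytes.decode("ascii")


sample = "\x88This text contains non-ascii characters!\x88"
print(clean_non_ascii_chars_sketch(sample))
```

Note that this removes accented letters entirely instead of transliterating them, so "café" becomes "caf".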
- Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
- Rename `s3_connector.py` to `s3.py` for readability and consistency with the rest of the connectors.
- `S3Connector` relies on `s3fs` instead of on `boto3`, and it inherits from `FsspecConnector`.
- Add `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not language-specific checks like vocabulary and POS tagging are applied. Set to `"true"` for higher resolution partitioning and `"false"` for faster processing.
- `detect_filetype` warning to include filename when provided.
- `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`.
- Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector`.
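The `UNSTRUCTURED_LANGUAGE_CHECKS` toggle described above can be sketched as a simple environment lookup. The helper name `language_checks_enabled` is hypothetical, and treating an unset variable as `"false"` is an assumption made for this sketch; the notes above do not state the default.

```python
import os
from typing import Mapping, Optional


def language_checks_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when language-specific checks are switched on.

    Assumption: an unset variable is treated as "false" here.
    """
    env = os.environ if environ is None else environ
    value = env.get("UNSTRUCTURED_LANGUAGE_CHECKS", "false")
    # Compare case-insensitively so "True" and "TRUE" also enable the checks.
    return value.strip().lower() == "true"


# Favor higher resolution partitioning over speed for this process.
os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "true"
print(language_checks_enabled())
```

Passing a plain dictionary as `environ` keeps the helper easy to test without touching the real process environment.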
- Add `partition_epub` for partitioning e-books in EPUB3 format.
- `message/rfc822` MIME type.