Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- `--s3-url` in favor of `--remote-url` in CLI
- `file_directory` to metadata
- `page_name` to metadata. Currently used for the sheet name in XLSX documents.
- `--partition-strategy` parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, `--partition-strategy fast`.
- `unstructured/file-utils/filetype.py` to better utilise hashmap to return mime type.
- `test_filetype.py`.
- `partition_xml` for XML files.
- `partition_xlsx` for Microsoft Excel documents.
- `hml` filetype for partition as a variation of html filetype.
- `pytesseract` a function level import in `partition_pdf` so you can use the `"fast"` or `"hi_res"` strategies if `pytesseract` is not installed. Also adds the `required_dependencies` decorator for the `"hi_res"` and `"ocr_only"` strategies.
- `filename` is tracked in metadata for `docx` tables.
- `"auto"` strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for `partition_pdf` and `partition_image`. Users can maintain existing behavior by explicitly setting `strategy="hi_res"`.
- `get_date` method to `ElementMetadata` for converting the datestring to a `datetime` object.
- `filename` attribute on `ElementMetadata` to remove the full filepath.
- `partition_docx` in docx
- `fileutils/file_type` check json and eml decode ignore error
- `partition_email` was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata returns `None` if the time does not match RFC-2822 at all.
- `partition_image`.
- `partition_multiple_via_api` for partitioning multiple documents in a single REST API call.
- `stage_for_baseplate` function to prepare outputs for ingestion into Baseplate.
- `partition_odt` for processing Open Office documents.
- `partition_pdf` fast strategy to group together text in the same bounding box.
- `partition_pdf` for detecting copy-protected PDFs and falling back to the hi res strategy when necessary.
- `partition_via_api` for partitioning documents through the hosted API.
- `exceeds_cap_ratio` handles empty (returns `True` instead of `False`)
- `detect_filetype` to properly detect JSONs when the MIME type is `text/plain`.
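
The `partition_email` entry above says the time in the metadata is `None` when the date does not match RFC-2822 at all. The following is a minimal sketch of that kind of lenient parsing, using only the Python standard library. The helper name `rfc2822_to_iso` and the choice of an ISO 8601 string as the return format are assumptions made for illustration; this is not the library's actual implementation.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def rfc2822_to_iso(date_header: str) -> Optional[str]:
    """Hypothetical helper: convert an email Date header to an ISO 8601 string.

    Returns None when the header cannot be parsed as an RFC-2822 date.
    """
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for unparseable input,
        # newer ones (3.10+) raise ValueError, so both are caught here.
        return None
    return parsed.isoformat()


# A well-formed header, a header without the optional day name,
# and a string that is not a date at all.
print(rfc2822_to_iso("Fri, 16 Dec 2022 17:04:16 -0500"))
print(rfc2822_to_iso("16 Dec 2022 17:04:16 -0500"))
print(rfc2822_to_iso("not a date"))
```

Returning `None` rather than raising keeps a single malformed header from aborting the processing of an otherwise valid message, which matches the behavior the entry describes.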