Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
parse_email for partition_eml so that unstructured-api passes the smoke tests
partition_email now works if there is no message content
"fast" strategy for partition_pdf so that it's able to recursively
MIME encodings for eml files with one of the common encodings if a unicode error occurs
detect_file_encoding
eml files
html_assemble_articles kwarg to partition_html to enable users to control whether content outside of <article> tags is captured when <article> tags are present.
xml attribute on element before looking for pagebreaks in partition_docx.
.docx and .doc when user- or renderer-created page breaks are present.
partition_docx to include headers and footers in the output.
partition_tsv and associated tests. Make additional changes to detect_filetype.
partition_via_api since we now require valid/empty api keys
None instead of 1 when page number is not present in the metadata. A page number of None indicates that page numbers are not being tracked for the document or that page numbers do not apply to the element in question.
None.
partition_pdf for fast strategy
--fast strategy on PDF documents
partition_rst for processed ReStructured Text documents.
partition_via_api and partition_multiple_via_api
detect_filetype and partition.
grpcio import issue on weaviate.schema.validate_schema for Python 3.9 and 3.10
detectron2 from source in Dockerfile
strategy parameter down from partition for partition_image
text/plain MIME type
convert_office_doc no longer prints file conversion info messages to stdout.
partition_via_api reflects the actual filetype for the file processed in the API.
elements_to_json and elements_from_json
read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
read_txt_file utility function to handle file-like object from URL
encoding from partition_pdf
None default for encoding
tabulate explicitly to dependencies
metadata.page_number of pptx files
stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
detectron2 from source is no longer required when using the local-inference extra.
.pptx parsing to include text in tables.
_add_element_metadata that caused all elements to have page_number=1 in the element metadata.
.log as a file extension for TXT files.
(.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
replace_mime_encodings
partition_html when include_metadata=False
ValueError is now raised if file_filename is not specified when you use partition_via_api with a file-like object.
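
The file_filename requirement in the last entry can be illustrated with a small stand-in. This is only a sketch: check_api_inputs is a hypothetical helper, not the library's implementation, and it assumes the only rule being enforced is that a file-like object must be accompanied by file_filename.

```python
import io
from typing import IO, Optional


def check_api_inputs(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    file_filename: Optional[str] = None,
) -> str:
    """Hypothetical helper: return the name that would accompany the upload."""
    if filename is not None:
        # A path on disk already carries its own name.
        return filename
    if file is not None:
        # A file-like object has no reliable name, so one must be supplied.
        if file_filename is None:
            raise ValueError(
                "file_filename must be specified when a file-like object is passed."
            )
        return file_filename
    raise ValueError("Either filename or file must be specified.")


buffer = io.BytesIO(b"Subject: hello\n\nbody")
print(check_api_inputs(file=buffer, file_filename="example.eml"))
```

Calling check_api_inputs(file=buffer) without file_filename raises ValueError in this sketch, which mirrors the behavior the entry describes.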
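
The encoding fallback mentioned in the (.eml) entry could look roughly like the following. The helper name decode_with_fallback and the list of candidate encodings are assumptions made for illustration; the changelog does not say which encodings the library actually tries or in what order.

```python
from typing import Optional, Sequence, Tuple

# Assumed candidate list, for illustration only.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: return (text, encoding_used)."""
    if encoding is not None:
        # The user chose an encoding, so no fallback is attempted.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue  # try the next common encoding
    raise ValueError("Could not decode data with any of the common encodings.")


text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```

The fallback only runs when no encoding is passed; an explicit encoding that does not fit the data still raises UnicodeDecodeError, matching the "user has not specified an encoding" condition in the entry.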
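
The page_number convention from the entries above (None rather than 1 when no page number is available) can be sketched with a stand-in metadata class. ElementMetadataSketch and elements_on_page are hypothetical names used only for illustration; they are not the library's classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElementMetadataSketch:
    """Hypothetical stand-in for an element and its metadata."""

    text: str
    # None means pages are not tracked or do not apply to this element.
    page_number: Optional[int] = None


def elements_on_page(elements: List[ElementMetadataSketch], page: int) -> List[str]:
    """Return the text of elements known to be on the given page."""
    return [e.text for e in elements if e.page_number == page]


elements = [
    ElementMetadataSketch("Title", page_number=1),
    ElementMetadataSketch("Plain-text paragraph"),  # no page tracking
]
print(elements_on_page(elements, 1))
```

Because untracked elements carry None in this sketch, they are not mistaken for content on page 1.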