Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Summary: More control over the {left-to-right, right-to-left, top-to-bottom, bottom-to-top} direction in which pdfplumber
reads/writes text (many thanks to @afriedman412 for the idea and prototype in #1040), plus an upgrade to pdfminer.six's
latest release (which provides more detailed paths for curves), and some fixes.
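
As a rough illustration of what the four direction values mean, the sketch below orders a handful of character dictionaries by a line direction and a character direction. It is not pdfplumber's implementation: read_text and DIRECTION_SORT are hypothetical names, the x0/top keys are assumed to hold each character's distance from the left and top edges of the page, and lines are grouped by exact coordinate match rather than with a tolerance.

```python
from itertools import groupby

# Hypothetical mapping (not part of pdfplumber): for each direction,
# the coordinate to sort on and whether to sort in descending order.
DIRECTION_SORT = {
    "ltr": ("x0", False),   # left-to-right: increasing x0
    "rtl": ("x0", True),    # right-to-left: decreasing x0
    "ttb": ("top", False),  # top-to-bottom: increasing distance from top
    "btt": ("top", True),   # bottom-to-top: decreasing distance from top
}


def read_text(chars, line_dir="ttb", char_dir="ltr"):
    """Join chars into text, ordering lines by line_dir and the
    characters within each line by char_dir (illustrative only)."""
    line_key, line_rev = DIRECTION_SORT[line_dir]
    char_key, char_rev = DIRECTION_SORT[char_dir]
    if line_key == char_key:
        raise ValueError("line_dir and char_dir must use different axes")

    # Order the lines first, then group chars that share a line coordinate.
    ordered = sorted(chars, key=lambda c: c[line_key], reverse=line_rev)
    lines = []
    for _, group in groupby(ordered, key=lambda c: c[line_key]):
        line = sorted(group, key=lambda c: c[char_key], reverse=char_rev)
        lines.append("".join(c["text"] for c in line))
    return "\n".join(lines)


# A 2x2 grid of characters: "a b" on the first row, "c d" on the second.
chars = [
    {"text": "a", "x0": 0, "top": 0},
    {"text": "b", "x0": 10, "top": 0},
    {"text": "c", "x0": 0, "top": 10},
    {"text": "d", "x0": 10, "top": 10},
]
print(read_text(chars))                                # rows, left to right
print(read_text(chars, line_dir="ltr", char_dir="ttb"))  # columns, top to bottom
```

Keeping the line direction and the character direction as separate settings is what lets the same character data be read as rows or as columns.
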
Add {line,char}_dir{,_rotated,_render} params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). (850fd45)
Add curve["path"] and curve["dash"], thanks to pdfminer.six upgrade (see below). (1820247)
Upgrade pdfminer.six from 20221105 to 20231228. (cd2f768)
Change word["direction"] from {1,-1} to {"ltr","rtl","ttb","btt"}. (850fd45)
Remove vertical_ttb and horizontal_ltr in favor of char_dir and char_dir_rotated. (850fd45)
Add x_tolerance_ratio parameter to extract_text and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
Page.structure_tree (h/t @dhdaines). (#963)
repair.py (h/t @echedey-ls). (#1032)
Add Page.close() method, have PDF.close() close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
Add force_mediabox parameter to Page.to_image(...). (#1054)
Fix Page.get_textmap caching to allow for extra_attrs=[...], by preconverting list kwargs to tuples. (#1030)
pypdfium2.PdfDocument in get_page_image (h/t @dhdaines). (#1090)
In PDFPageAggregatorWithMarkedContent.tag_cur_item, check self.cur_item._objs length before trying to access [-1]. (4f39d03)
Add mcid and tag attributes on char/rect/line/curve/image objects (h/t @dhdaines). (#961)
Add gs_path argument to pdfplumber.open(...) and pdfplumber.repair(...), to allow passing a custom Ghostscript path to be used for repairing. (#953)
use_text_flow in extract_text (h/t @dhdaines). (#983)
A simple release:
Add antialias boolean parameter to Page.to_image(...) and associated methods (h/t @cmdlineluser). (7e28931)
tuple[float|int, ...] (#917). (57d51bb)
pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
WordExtractor.char_begins_new_word(...)) more explicit and rigorous; should help in catching edge cases in the future. (6acd580 + ebb93ea + #840)
Use curve_edge objects (instead of just line and rect_edge objects) in default table-detection strategy. (6f6b465 + #858)
ﬃ to ffi), and add the expand_ligatures boolean parameter to text-extraction methods. (86e935d + #598)
Add Page.extract_text_lines(...) method. (4b37397 + #852)
Add main_group, return_groups, return_chars parameters to Page.search(...). (4b37397)
Add .curve_edges property to PDF and Page. (6f6b465)
Change Page/utils.extract_text(layout=True) approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)
Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
None. (#811) [h/t @toshi1127]
Split utils.py into utils/ submodules. Retains same interface, just an improvement in organization. (6351d97)
utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)
Upgrade pdfminer.six version to 20221105. (e63a038)
Restore text attribute to .textboxhorizontal/etc., regression introduced in 9587cc7 / v0.6.2. (8a0c126)
lru_cache usage, which is discouraged for class methods due to garbage-collection issues. (e3142a0)
Upgrade nbexec development requirement from 0.1.0 to 0.2.0. (30dac25)
Fix issue where py.typed file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
utils.cluster_objects(...) with any hashable value (str, int, tuple, etc.) as the key_fn parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
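
To make the key_fn entry above concrete, here is a minimal sketch of a clustering helper that accepts either a callable or any hashable lookup key. It is an assumption-laden illustration, not the code in pdfplumber's utils: cluster_by_key is a hypothetical name, and the chaining rule (start a new cluster when a value is more than tolerance away from the previous one) is chosen here for simplicity.

```python
from operator import itemgetter


def cluster_by_key(objs, key_fn, tolerance=0):
    """Group objs whose key values lie within `tolerance` of their
    neighbor (hypothetical helper, for illustration only)."""
    # A callable is used directly; anything else (str, int, tuple, ...)
    # is treated as a lookup key, i.e. obj[key_fn].
    get_value = key_fn if callable(key_fn) else itemgetter(key_fn)

    clusters = []
    last = None
    for obj in sorted(objs, key=get_value):
        value = get_value(obj)
        # Start a new cluster when the gap to the previous value is too big.
        if last is None or value - last > tolerance:
            clusters.append([])
        clusters[-1].append(obj)
        last = value
    return clusters


words = [{"top": 10}, {"top": 20}, {"top": 11}]
points = [(0, 5), (1, 5.5), (2, 9)]

by_str = cluster_by_key(words, "top", tolerance=1)          # dict key
by_int = cluster_by_key(points, 1, tolerance=1)             # tuple index
by_fn = cluster_by_key(words, lambda w: w["top"], tolerance=0)  # callable
print(by_str, by_int, by_fn, sep="\n")
```

Checking callable(key_fn) rather than testing for specific types such as str or int is what allows tuples and other hashable keys to work without special handling.
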