PdfPig Versions

Read and extract text and other content from PDFs in C# (port of PDFBox)

v0.1.8

11 months ago

This is a release with various bug-fixes and quality of life improvements but no new major features. It adds many of the supporting classes necessary for PDF rendering.

Breaking Changes

IColor can now be of type PatternColor. This implementation throws an exception when ToRGBValues() is called, so you may need to check that IColor.ColorSpace != ColorSpace.Pattern before calling it
Remove Details suffix from ColorSpaceDetails property names
AlternateColorSpaceDetails renamed to AlternateColorSpace
BaseColorSpaceDetails renamed to BaseColorSpace
Seal IColor implementations
Use double instead of decimal in color spaces and colors
Move IColorSpaceContext from IOperationContext to CurrentGraphicsState
Removed ColorSpace property from IPdfImage. Use ColorSpaceDetails.Type to get the enum value
IColorSpaceContext's CurrentStrokingColorSpace and CurrentNonStrokingColorSpace are now of type ColorSpaceDetails (not a ColorSpace enum anymore). Use CurrentStrokingColorSpace.Type or CurrentNonStrokingColorSpace.Type to get the enum value
Logic change to DefaultWordExtractor: a bug in the existing implementation was fixed, so the output of the default page.GetWords() may change in this version
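
The PatternColor check described in the first breaking change might look roughly like the sketch below. It assumes the UglyToad.PdfPig NuGet package at 0.1.8, that Letter.Color may be null, and that ToRGBValues() returns a three-value tuple. The file name "input.pdf" and the class name are placeholders for illustration only.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Graphics.Colors;

public static class PatternColorGuardExample
{
    public static void Main()
    {
        // "input.pdf" is a placeholder path, not a file shipped with PdfPig.
        using (PdfDocument document = PdfDocument.Open("input.pdf"))
        {
            foreach (Page page in document.GetPages())
            {
                foreach (Letter letter in page.Letters)
                {
                    IColor color = letter.Color;

                    // Skip letters without a color, and pattern colors,
                    // because ToRGBValues() throws for PatternColor.
                    if (color == null || color.ColorSpace == ColorSpace.Pattern)
                    {
                        continue;
                    }

                    // Assumption: the result deconstructs into three components.
                    var (r, g, b) = color.ToRGBValues();
                    Console.WriteLine($"{letter.Value}: r={r}, g={g}, b={b}");
                }
            }
        }
    }
}
```

Skipping pattern colors is only one possible policy; code that needs the pattern itself would have to inspect the PatternColor instance instead.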

.NET 4.5

Note that this version removes support for .NET 4.5. Consumers should upgrade to .NET 4.5.1 or 4.5.2.

Release notes

Fix support for using the ZapfDingbats Standard 14 font when creating files
Address issue with extracting CJK text from PDFs
Fix issue with writing ShowText operations to output files when the text contained parentheses
Error handling for Type 2 charstring parsing
New letter properties, TextRenderingMode, StrokeColor and FillColor
Fix for copying inline images to output files
Enums for PDF/A-3 compliance
Fix for library embedding PNGs with invalid information on output
Resolve PageSize enum for landscape orientation documents
Fix to rotation handling. The coordinates used for letters etc. are different now for rotated and/or cropped pages
Fix to calculated positions of annotations
Fix to adding JPG files to output documents
Add height to Type 3 font bounding boxes and default width/height for zero values
CreationDate and ModifiedDate are now available in DocumentInformationBuilder
Images can be added to the document builder without specifying a placement rectangle; the image is then placed at 0,0 with its full width and height
PdfAction exposed by Annotation class. InReplyTo property also added
GetFields extension method for the AcroForm type
Fix for internal links when using existing documents with annotations with PdfDocumentBuilder
Handle name conflicts when using PdfDocumentBuilder with one or more existing documents
Replace internal uses of Rijndael and RijndaelManaged with Aes, since the former were marked as obsolete

v0.1.7

1 year ago

Changes since 0.1.6:

Add page.SetRotation for PdfPageBuilder
Add SkipMissingFonts to parsing options to ignore content where the font is missing or corrupt. This can cause content to be missed during extraction, but it allows the retrievable content on a page to be extracted from corrupted files.
Multiple bug fixes thanks to @fnatzke
Fix to page number order bug on extraction thanks to @grinay
Various shape drawing utilities on PdfPageBuilder thanks to @Jonowa
Fix to issue in GrahamScan thanks to @BobLd
Remove stray Debugger.Break from the encryption handler
Various other bug fixes
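
As a rough illustration of the SkipMissingFonts option listed above, the sketch below opens a file with the option enabled and reports how much text each page yields. It assumes the UglyToad.PdfPig NuGet package at 0.1.7 or later; "damaged.pdf" and the class name are placeholders.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

public static class SkipMissingFontsExample
{
    public static void Main()
    {
        // Ignore content whose font is missing or corrupt instead of failing.
        var options = new ParsingOptions { SkipMissingFonts = true };

        // "damaged.pdf" is a placeholder path for a partially corrupted file.
        using (PdfDocument document = PdfDocument.Open("damaged.pdf", options))
        {
            foreach (Page page in document.GetPages())
            {
                // Some text may be absent, since skipped content is not extracted.
                Console.WriteLine($"Page {page.Number}: {page.Text.Length} characters extracted");
            }
        }
    }
}
```

Because skipped content is dropped without an error, the option is probably best reserved for files that cannot be parsed without it.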

v0.1.6

2 years ago

Mainly bug fixes. There are some compatibility changes in the document layout analysis API. See here: https://github.com/UglyToad/PdfPig/wiki/Migration-to-0.1.6

Fix transparency being applied for PDF/A-1
Fixes to string handling
.NET 6.0 support
Handle null rather than missing encryption data
Fixes bug with size of JPG files in documents created by PdfPig
Better handling for unusual Type1 fonts
Support for invisible/hidden text in document builder
Fixes stack overflow when parsing page tree for some documents
Fixes bug in some glyph bounding boxes for Type2 fonts
Handle non-contiguous xref ranges when building a document
Better location of version headers for non-compliant documents

v0.1.5

2 years ago

Changes since v0.1.4: https://github.com/UglyToad/PdfPig/compare/v0.1.4...v0.1.5

v0.1.5-alpha002

3 years ago

Some more bug-fixes:

Fix for object streams in files which require brute force searching.
Handle NullToken presence when creating documents.
Support for PDFs where the filters are defined as indirect references (against specification).
Support for CMYK when generating PNG images from IPdfImage.
Support for indexed ColorSpaces where palette is stored in a string.
Handle UTF16 strings in encrypted document dictionaries.
Handle documents with an XMP metadata stream instead of an information dictionary.
CCITTFaxDecode filter support.
Tweaks to DefaultWordExtractor to try to detect word gap size based on preceding text instead of a global gap threshold.

Note that changes to DefaultWordExtractor may change the output of calls to Page.GetWords() in this version.

v0.1.5-alpha001

3 years ago

First alpha version of 0.1.5

Fix glyph bounding boxes and paths for Type1 fonts using flexpoints.
Fix stack overflow when merging some documents.
Support loading existing documents into PdfDocumentBuilder.
Performance improvements for multithreaded scenarios.
Fix checked value for AcroForm checkboxes where the checked state is appearance only.
New page.GetOptionalContents() method with partial support for retrieving optional content.
Partial support for colorspace details on IPdfImages.
Multiple bug-fixes for various font related issues.

Breaking changes:

PdfDocumentBuilder now implements IDisposable. This disposes the underlying stream by default, but that stream is normally a MemoryStream, so there are no serious consequences if it is left undisposed.
PdfPageBuilder had the AdvancedEditing property removed. The API is now available in the ContentStream methods / properties (this was from #250).

v0.1.4

3 years ago

Adds support for filling rectangles when using PdfDocumentBuilder. The DrawRectangle method now takes an optional boolean parameter, fill.
Fix bug recognising Standard 14 fonts with Arial MT naming.
Handle unusual object streams containing endobj tokens.
Support broken Differences arrays for encodings.
Support very long xref streams by making infinite loop detection more relaxed.
Fix issue with parsing Type0 fonts that are using indirect references.
Internal structure changes to support PDF-to-image work.
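
The optional fill parameter on DrawRectangle, listed first above, could be used along the lines of this sketch. It assumes the UglyToad.PdfPig NuGet package at 0.1.4 or later and that the arguments before fill are position, width, height and line width, in that order. That ordering is an assumption, as are the output file name and class name.

```csharp
using System.IO;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Writer;

public static class FilledRectangleExample
{
    public static void Main()
    {
        var builder = new PdfDocumentBuilder();
        PdfPageBuilder page = builder.AddPage(PageSize.A4);

        // Assumed argument order: position, width, height, line width.
        // The named argument "fill" is the optional boolean from the release note.
        page.DrawRectangle(new PdfPoint(50, 600), 200, 100, 1, fill: true);

        // Build() returns the finished document as a byte array.
        byte[] bytes = builder.Build();

        // "rectangle.pdf" is a placeholder output path.
        File.WriteAllBytes("rectangle.pdf", bytes);
    }
}
```

Omitting the fill argument should keep the earlier outline-only behaviour, since the parameter is described as optional.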

v0.1.3

3 years ago

Fixes a set of bugs for font handling and PDF parsing.
Improves font detection on Linux systems
Improves calculation of PointSize for letters accounting for rotation and other transformations
Improves document layout analysis results in some cases
Fixes writing UTF strings when using document builder
Improvements to PDF graphics path API

v0.1.3-alpha001

3 years ago

First alpha version of 0.1.3

v0.1.2

3 years ago

Some new features, performance tweaks and improved Document Layout Analysis tools:

PDF/A compliance for PdfDocumentBuilder; use PdfDocumentBuilder.ArchiveStandard to select a PDF/A compliance level.
Performance improvements to parsing.
Clipping support for PdfPaths, now PdfSubpath. Use ParsingOptions.ClipPaths to enable clipping.
SVG Exporter in Document Layout Analysis
Improvements to Recursive XY Cut algorithm in Document Layout Analysis.
Fixes to PDF Merging to support more use-cases. Use PdfMerger.Merge to generate merged PDFs.
Proper support for letters and paths in rotated PDF documents; previous locations were incorrect when the page dictionary contained a rotation value.
Better support for guessing point size for letters.
ContentTextOrderExtractor in Document Layout Analysis uses the existing content order of text from the page's content stream to generate text as a string.
IPdfImage now supports TryGetBytes() instead of Bytes. TryGetBytes returns false for JPXDecode and DCTDecode image filters for which RawBytes represent a valid JPEG image.
Font flags such as bold and italic available on Letter.
Bugfix for CID fonts.
TextDirection is now TextOrientation. Various fixes to the calculations of orientation and bounding box for Words.
Most Document Layout Analysis algorithms now take in a DlaOptions parameter to specify behaviour.
Bugfix to files with large amounts of trailing data.
Support for OpenType in CID fonts.
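
Two of the entry points named in the list above, PdfMerger.Merge and ParsingOptions.ClipPaths, might be combined as in the sketch below. It assumes the UglyToad.PdfPig NuGet package at 0.1.2 or later, and that Merge accepts two file paths and returns the merged document as bytes. The file names and class name are placeholders.

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Writer;

public static class MergeAndClipExample
{
    public static void Main()
    {
        // "first.pdf" and "second.pdf" are placeholder input paths.
        byte[] merged = PdfMerger.Merge("first.pdf", "second.pdf");
        File.WriteAllBytes("merged.pdf", merged);

        // Re-open the merged bytes with path clipping enabled.
        var options = new ParsingOptions { ClipPaths = true };
        using (PdfDocument document = PdfDocument.Open(merged, options))
        {
            Console.WriteLine($"Merged document has {document.NumberOfPages} pages");
        }
    }
}
```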