PdfPig Versions

Read and extract text and other content from PDFs in C# (port of PDFBox)

v0.1.8

11 months ago

This release contains various bug fixes and quality-of-life improvements but no major new features. It adds many of the supporting classes necessary for PDF rendering.

Breaking Changes

  • IColor can now be of type PatternColor. This implementation throws an error when ToRGBValues() is called. You may need to check that IColor.ColorSpace != ColorSpace.Pattern before calling this method
  • Remove Details suffix from ColorSpaceDetails property names
  • AlternateColorSpaceDetails renamed to AlternateColorSpace
  • BaseColorSpaceDetails renamed to BaseColorSpace
  • Seal IColor implementations
  • Use double instead of decimal in color spaces and colors
  • Move IColorSpaceContext from IOperationContext to CurrentGraphicsState
  • Removed ColorSpace property from IPdfImage. Use ColorSpaceDetails.Type to get the enum value
  • IColorSpaceContext's CurrentStrokingColorSpace and CurrentNonStrokingColorSpace are now of type ColorSpaceDetails (not a ColorSpace enum anymore). Use CurrentStrokingColorSpace.Type or CurrentNonStrokingColorSpace.Type to get the enum value
  • Logic change to DefaultWordExtractor: a logic bug in the existing implementation was fixed, so the output of the default page.GetWords() may change in this version
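
The PatternColor change above can be handled with a guard along the following lines. This is a minimal sketch, assuming version 0.1.8 (where ToRGBValues() returns doubles) and the UglyToad.PdfPig.Graphics.Colors namespace; the ColorGuard class and TryGetRgb helper are hypothetical names used for illustration, not part of the library.

```csharp
using UglyToad.PdfPig.Graphics.Colors;

public static class ColorGuard
{
    // Hypothetical helper: reports failure instead of letting
    // ToRGBValues() throw for pattern colors.
    public static bool TryGetRgb(IColor color, out (double r, double g, double b) rgb)
    {
        rgb = default;

        // Pattern colors have no direct RGB representation, so skip them.
        if (color == null || color.ColorSpace == ColorSpace.Pattern)
        {
            return false;
        }

        rgb = color.ToRGBValues();
        return true;
    }
}
```

A caller could pass letter.Color, letter.StrokeColor or letter.FillColor to the helper and ignore letters for which it returns false.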

.NET 4.5

Note that this version removes support for .NET 4.5. Consumers should upgrade to .NET 4.5.1 or 4.5.2

Release notes

  • Fix support for using the ZapfDingbats Standard 14 font when creating files
  • Address issue with extracting CJK text from PDFs
  • Fix issue with writing ShowText operations to output files when the text contained parentheses
  • Error handling for Type 2 charstring parsing
  • New letter properties, TextRenderingMode, StrokeColor and FillColor
  • Fix for copying inline images to output files
  • Enums for PDF/A-3 compliance
  • Fix for library embedding PNGs with invalid information on output
  • Resolve PageSize enum for landscape orientation documents
  • Fix to rotation handling. The coordinates used for letters etc. are different now for rotated and/or cropped pages
  • Fix to calculated positions of annotations
  • Fix to adding JPG files to output documents
  • Add height to Type 3 font bounding boxes and default width/height for zero values
  • CreationDate and ModifiedDate are now available in DocumentInformationBuilder
  • Images can be added to the document builder without specifying a placement rectangle; the image is then placed at 0,0 with full width and height
  • PdfAction exposed by Annotation class. InReplyTo property also added
  • GetFields extension method for AcroForm type
  • Fix for internal links when using existing documents with annotations with PdfDocumentBuilder
  • Handle name conflicts when using PdfDocumentBuilder with one or more existing documents
  • Replace internal uses of Rijndael and RijndaelManaged with Aes, since the former were marked as obsolete

v0.1.7

1 year ago

Changes since 0.1.6:

  • Add page.SetRotation for PdfPageBuilder
  • Add SkipMissingFonts to parsing options to ignore content where the font is not present or corrupt. Can result in content being missed during extraction but will enable partial extraction of retrievable content on page for corrupted files.
  • Multiple bug fixes thanks to @fnatzke
  • Fix to page number order bug on extraction thanks to @grinay
  • Various shape drawing utilities on PdfPageBuilder thanks to @Jonowa
  • Fix to issue in GrahamScan thanks to @BobLd
  • Remove stray Debugger.Break from the encryption handler
  • Various other bug fixes
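
The SkipMissingFonts option mentioned above is set on ParsingOptions when opening a document. The following is a minimal sketch; the LenientExtraction class and its method names are hypothetical, and the path passed to PrintText is whatever file the caller wants to read.

```csharp
using System;
using UglyToad.PdfPig;

public static class LenientExtraction
{
    // Hypothetical helper: options that ignore content whose font is
    // missing or corrupt rather than failing the whole extraction.
    public static ParsingOptions CreateOptions()
    {
        return new ParsingOptions { SkipMissingFonts = true };
    }

    // Prints whatever text can be recovered from each page.
    public static void PrintText(string path)
    {
        using (PdfDocument document = PdfDocument.Open(path, CreateOptions()))
        {
            foreach (var page in document.GetPages())
            {
                Console.WriteLine(page.Text);
            }
        }
    }
}
```

As the release note says, text drawn with a skipped font will be absent from the output, so this option suits best-effort extraction from damaged files rather than everyday use.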

v0.1.6

2 years ago

Mainly bug fixes. There are some compatibility changes in the document layout analysis API. See here: https://github.com/UglyToad/PdfPig/wiki/Migration-to-0.1.6

  • Fix transparency being applied for PDF/A-1
  • Fixes to string handling
  • .NET 6.0 support
  • Handle null rather than missing encryption data
  • Fixes bug with size of JPG files in documents created by PdfPig
  • Better handling for unusual Type1 fonts
  • Support for invisible/hidden text in document builder
  • Fixes stack overflow when parsing page tree for some documents
  • Fixes bug in some glyph bounding boxes for Type2 fonts
  • Handle non-contiguous xref ranges when building a document
  • Better location of version headers for non-compliant documents

v0.1.5

2 years ago

v0.1.5-alpha002

3 years ago

Some more bug-fixes:

  • Fix for object streams in files which require brute force searching.
  • Handle NullToken presence when creating documents.
  • Support for PDFs where the filters are defined as indirect references (against specification).
  • Support for CMYK when generating PNG images from IPdfImage.
  • Support for indexed ColorSpaces where palette is stored in a string.
  • Handle UTF16 strings in encrypted document dictionaries.
  • Handle documents with an XMP metadata stream instead of an information dictionary.
  • CCITTFaxDecode filter support.
  • Tweaks to DefaultWordExtractor to try and detect word gap size based on preceding text instead of a global gap threshold.

Note that changes to DefaultWordExtractor may change the output of calls to Page.GetWords() in this version.

v0.1.5-alpha001

3 years ago

First alpha version of 0.1.5

  • Fix glyph bounding boxes and paths for Type1 fonts using flexpoints.
  • Fix stack overflow when merging some documents.
  • Support loading existing documents into PdfDocumentBuilder.
  • Performance improvements for multithreaded scenarios.
  • Fix checked value for AcroForm checkboxes where the checked state is appearance only.
  • New page.GetOptionalContents() method with partial support for retrieving optional content.
  • Partial support for colorspace details on IPdfImages.
  • Multiple bug-fixes for various font related issues.

Breaking changes:

  • PdfDocumentBuilder now implements IDisposable. This disposes the underlying stream by default, but since this is normally a MemoryStream there are no serious consequences if it is left undisposed.
  • PdfPageBuilder had the AdvancedEditing property removed. The API is now available in the ContentStream methods / properties (this was from #250).

v0.1.4

3 years ago

  • Adds support for filling rectangles when using PdfDocumentBuilder. The DrawRectangle method now takes an optional boolean parameter, fill.
  • Fix bug recognising Standard 14 fonts with Arial MT naming.
  • Handle unusual object streams containing endobj tokens.
  • Support broken Differences arrays for encodings.
  • Support very long xref streams by making infinite loop detection more relaxed.
  • Fix issue with parsing Type0 fonts that are using indirect references.
  • Internal structure changes to support pdf to image work.
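
The optional fill parameter on DrawRectangle can be used as sketched below. This is a rough example under stated assumptions: the FilledRectangle class is a hypothetical name, the position, width and height values are arbitrary, and the numeric parameter types of DrawRectangle have differed between releases, so integer literals are used to stay neutral.

```csharp
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Writer;

public static class FilledRectangle
{
    // Hypothetical helper: draws one filled rectangle on an A4 page.
    public static byte[] Build()
    {
        var builder = new PdfDocumentBuilder();
        PdfPageBuilder page = builder.AddPage(PageSize.A4);

        // Bottom-left corner at (50, 600), 200 wide and 100 high.
        // Passing fill: true fills the shape instead of only stroking it.
        page.DrawRectangle(new PdfPoint(50, 600), 200, 100, fill: true);

        return builder.Build();
    }
}
```

Omitting the fill argument keeps the previous behaviour of drawing only the outline.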

v0.1.3

3 years ago

  • Fixes a set of bugs for font handling and PDF parsing.
  • Improves font detection on Linux systems
  • Improves calculation of PointSize for letters accounting for rotation and other transformations
  • Improves document layout analysis results in some cases
  • Fixes writing UTF strings when using document builder
  • Improvements to PDF graphics path API

v0.1.3-alpha001

3 years ago

First alpha version of 0.1.3

v0.1.2

3 years ago

Some new features, performance tweaks and improved Document Layout Analysis tools:

  • PDF/A compliance for PdfDocumentBuilder, use PdfDocumentBuilder.ArchiveStandard to select a PDF/A compliance level.
  • Performance improvements to parsing.
  • Clipping support for PdfPaths, now PdfSubpath. Use ParsingOptions.ClipPaths to enable clipping.
  • SVG Exporter in Document Layout Analysis
  • Improvements to Recursive XY Cut algorithm in Document Layout Analysis.
  • Fixes to PDF Merging to support more use-cases. Use PdfMerger.Merge to generate merged PDFs.
  • Proper support for letters and paths in rotated PDF documents; previous locations were incorrect when the page dictionary contained a rotation value.
  • Better support for guessing point size for letters.
  • ContentTextOrderExtractor in Document Layout Analysis uses the existing content order of text from the page's content stream to generate text as a string.
  • IPdfImage now supports TryGetBytes() instead of Bytes. TryGetBytes returns false for JPXDecode and DCTDecode image filters for which RawBytes represent a valid JPEG image.
  • Font flags such as bold and italic available on Letter.
  • Bugfix for CID fonts.
  • TextDirection is now TextOrientation; various fixes to the calculations of orientation and bounding box for Words.
  • Most Document Layout Analysis algorithms now take in a DlaOptions parameter to specify behaviour.
  • Bugfix to files with large amounts of trailing data.
  • Support for OpenType in CID fonts.
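
Two of the writer-side features listed above, PdfDocumentBuilder.ArchiveStandard and PdfMerger.Merge, can be combined as in the following sketch. It assumes the PdfAStandard enum exposes an A2B value and that PdfMerger.Merge has an overload accepting a list of byte arrays; the ArchiveAndMerge class and its methods are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Writer;

public static class ArchiveAndMerge
{
    // Hypothetical helper: a blank one-page document that requests
    // a PDF/A compliance level via ArchiveStandard.
    public static byte[] BuildBlankPdfA()
    {
        var builder = new PdfDocumentBuilder
        {
            ArchiveStandard = PdfAStandard.A2B
        };

        builder.AddPage(PageSize.A4);
        return builder.Build();
    }

    // Merges two in-memory documents into one.
    public static byte[] BuildAndMerge()
    {
        byte[] first = BuildBlankPdfA();
        byte[] second = BuildBlankPdfA();

        return PdfMerger.Merge(new List<byte[]> { first, second });
    }
}
```

Whether the merged output still satisfies the chosen PDF/A level is not something this sketch checks; a validator would be needed for that.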