A PDF processor written in Go.
PDF 2.0 encryption is now supported and you are free to use the following commands with your PDF 2.0 input files:
We can report another 🚀 @fancycode parser improvement resulting in a significant performance boost and lower memory overhead especially for large files:
Before:
$ time go run test.go
2024/03/21 09:03:55.874443 Parsing ...
2024/03/21 09:04:07.947987 Done, uses 4244 MiBytes heap memory, 6755 MiBytes system memory
2024/03/21 09:04:07.948013 Parsed 1133 pages
real 0m12,743s
user 0m21,830s
sys 0m2,589s
After:
$ time go run test.go
2024/03/21 09:04:30.639673 Parsing ...
2024/03/21 09:04:30.899588 Done, uses 12 MiBytes heap memory, 11 MiBytes system memory
2024/03/21 09:04:30.899609 Parsed 1133 pages
real 0m0,568s
user 0m0,881s
sys 0m0,228s
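The test program behind these log lines is not shown. As a rough sketch, heap and system figures like the ones above can be read from the Go runtime. This assumes the numbers correspond to runtime.MemStats HeapAlloc and Sys, which may differ from what the original test.go measures, and parse is a hypothetical stand-in for the real parsing work:

```go
package main

import (
	"log"
	"runtime"
)

// toMiB converts a byte count to whole mebibytes.
func toMiB(b uint64) uint64 { return b / (1024 * 1024) }

// memUsage returns the live heap and the total memory obtained from the OS,
// both in MiB, as reported by the Go runtime.
func memUsage() (heapMiB, sysMiB uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return toMiB(m.HeapAlloc), toMiB(m.Sys)
}

// parse is a hypothetical placeholder for parsing a PDF file.
// It only allocates some memory so that the measurement has something to show.
func parse() [][]byte {
	pages := make([][]byte, 8)
	for i := range pages {
		pages[i] = make([]byte, 1<<20)
	}
	return pages
}

func main() {
	log.SetFlags(log.LstdFlags | log.Lmicroseconds)
	log.Println("Parsing ...")
	pages := parse()
	heap, sys := memUsage()
	log.Printf("Done, uses %d MiBytes heap memory, %d MiBytes system memory", heap, sys)
	log.Printf("Parsed %d pages", len(pages))
	runtime.KeepAlive(pages) // keep the data alive until after the measurement
}
```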
We have added options to skip some optimization steps or disable internal optimization altogether:
If you disable the following option there will be no internal optimization of the cross reference table once it is loaded into memory.
This only affects commands that do not rely on optimization. Commands that do, e.g. optimize, are unaffected.
# toggle optimization
optimize: true
Setting the following option to false disables the parsing of page content streams, which is used to detect unused resources like images or fonts.
# optimize page resources via content stream analysis.
optimizeResourceDicts: true
The following option decides if pdfcpu will scan for and remove duplicate content streams.
# optimize duplicate content streams across pages.
optimizeDuplicateContentStreams: false
⚡ Caution is advised: you have to know what you are doing when using these options. Tuning or turning off optimization can make sense in environments where you deal with large PDF files that usually share the same structure, so there are no surprises.
Since the pdfcpu configuration has changed you are encouraged to recreate your config.yml:
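Locate your pdfcpu folder using pdfcpu conf
Remove the pdfcpu folder
Recreate the pdfcpu folder by executing any pdfcpu command on the CLI, e.g. pdfcpu conf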
Thanks to all of you test driving pdfcpu and reporting 🐛s along the way.
Special PR thanks 👍🏻 also to @adamgreenhall for improving the booklet command and to @xelan as well.
🧑🔬 We packed lots of goodies into this release for you..
You will like this ✨ Thanks to @fancycode we have improved PDF parsing significantly. While this is not easily comparable, running the pdfcpu test suite now takes about 8 seconds less user CPU time under macOS 14.2.1:
Before:
./coverage.sh 67.60s user 13.35s system 119% cpu 1:07.93 total
After:
./coverage.sh 59.64s user 12.55s system 107% cpu 1:07.01 total
We now have basic support for writing back PDF 2.0 files. This means you may start using all pdfcpu operations that update validated PDF 2.0 files. Basic support means your mileage may vary, especially when you try to process a file using one of the new 2.0 features.
Since it is hard to get hold of PDF 2.0 files using a specific new 2.0 feature, there is a disclaimer printed on the command line asking for your input and contribution. Please open an issue and share your file in case pdfcpu has a problem digesting it. The same applies if you just want to see some specific 2.0 feature supported.
In general, please 🙏🏻 report back any issues - there is no way to fix something that does not get reported!
pdfcpu zoom [-p(ages) selectedPages] -- description inFile [outFile]
Zoom in/out of selected pages either by magnification factor or corresponding margin. When zooming out, the unused page content space results in horizontal and vertical margins. These differ from each other but correspond to the same factor.
Examples:
Zoom into magnification of 200%
pdfcpu zoom -- "factor: 2" in.pdf out.pdf
Zoom out to magnification of 50%
pdfcpu zoom -- "factor: .5" in.pdf out.pdf
Zoom out to a magnification equivalent to a horizontal margin of 1 cm
pdfcpu zoom -unit cm -- "hmargin: 1" in.pdf out.pdf
Zoom out to a magnification equivalent to a vertical margin of 30 points. Draw a border around zoomed out page content and fill unused page space light gray
pdfcpu zoom -- "vmargin: 30, border:true, bgcolor:lightgray" in.pdf out.pdf ...
Please consult pdfcpu help zoom for more, and also the official documentation.
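The relation between a zoom-out factor and the resulting margins can be sketched with a little arithmetic. This is a minimal illustration, assuming the scaled content stays centered on the page and that a margin is measured per side. The function names are hypothetical and not part of pdfcpu:

```go
package main

import "fmt"

// marginsForFactor returns the horizontal and vertical margin (per side) left
// when page content of size w x h is scaled by factor f (0 < f <= 1) and centered.
func marginsForFactor(w, h, f float64) (hMargin, vMargin float64) {
	return w * (1 - f) / 2, h * (1 - f) / 2
}

// factorForHMargin returns the zoom factor that leaves the given
// horizontal margin on each side of a page of width w.
func factorForHMargin(w, hMargin float64) float64 {
	return (w - 2*hMargin) / w
}

func main() {
	const w, h = 595.0, 842.0 // roughly A4 in points
	f := factorForHMargin(w, 59.5)
	hm, vm := marginsForFactor(w, h, f)
	fmt.Printf("factor=%.2f hmargin=%.1f vmargin=%.1f\n", f, hm, vm)
}
```

On a portrait page the vertical margin comes out larger than the horizontal one for the same factor, which is why the two margins differ.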
Thanks to @adamgreenhall we have an even more powerful booklet command for producing zines:
We now have booklet styles 2, 4, 6 and 8 and you may choose one of the following booklet types, each representing a certain method for arranging pages into a booklet:
booklet, bookletadvanced, perfectbound
Examples:
Arrange pages of in.pdf 2 per sheet side (4 per sheet, back and front) onto out.pdf
pdfcpu booklet -- "formsize:Letter" out.pdf 2 in.pdf
Arrange pages of in.pdf 4 per sheet side (8 per sheet, back and front) onto out.pdf:
pdfcpu booklet -- "formsize:Ledger" out.pdf 4 in.pdf
Arrange pages of in.pdf 6 per sheet side (12 per sheet, back and front) onto out.pdf
pdfcpu booklet -- "formsize:Ledger" out.pdf 6 in.pdf
Arrange pages of in.pdf 8 per sheet side (16 per sheet, back and front) onto out.pdf
pdfcpu booklet -- "formsize:A3" out.pdf 8 in.pdf
Arrange pages of in.pdf 4 per sheet side, with short-edge binding onto out.pdf
pdfcpu booklet -- "formsize:A3, binding:short" out.pdf 4 in.pdf
Arrange pages of in.pdf 2 per sheet side as a sequence of folios covering 4*foliosize pages each.
pdfcpu booklet -- "formsize:A4, multifolio:on" hardbackbook.pdf 2 in.pdf
Arrange pages of in.pdf 2 per sheet side, arranged for perfect binding, onto out.pdf
pdfcpu booklet -- "formsize:A4, btype:perfectbound" out.pdf 2 in.pdf
Arrange pages of in.pdf 4 per sheet side, arranged for advanced binding, onto out.pdf
pdfcpu booklet -- "formsize:A3, btype:bookletadvanced" out.pdf 4 in.pdf
Please consult pdfcpu help booklet for more, and also the official documentation.
There are two changes to the configuration:
validationNone was eliminated
postProcessValidate is new and enables safeguard validation
Validation mode ValidationNone has been eliminated for a couple of reasons.
First of all, a lot happens during validation, like internalizing and caching, that is needed for command processing.
Secondly, PDF validation has become quite performant.
We are introducing the new config flag postProcessValidate.
This flag, which is turned on by default, enables the validation of your processed cross reference table right before writing.
This is considered a useful safeguard: a problematic cross reference table may be written back without any error, in which case only the next read/parse/validation attempt will notice the problem.
If you disable this you will get an additional performance boost overall, but with the caveat described above.
As usual please renew your configuration!
Form filling now expects the user font Roboto-Regular when using Eastern European scripts.
You can do this manually or just remove your pdfcpu configuration altogether and recreate it like so:
Locate your pdfcpu folder using pdfcpu conf
Remove the pdfcpu folder
Recreate the pdfcpu folder by executing any pdfcpu command on the CLI, e.g. execute pdfcpu conf one more time
All of this complements the official documentation.
To get a better understanding of pdfcpu's operations please make sure you check out all tests and the corresponding PDF output and all json input where appropriate:
pdfcpu/pkg/samples/* comes loaded with 230 MB worth of PDFs produced by corresponding tests and json input located at:
🙏 to all bug reporters and feature requestors. Special thanks for contributed PRs go to @adamgreenhall, @fancycode, @kalimit, @sivukhin and @afh
pdfcpu is in need of more frequent financial supporters! Please consider becoming a sponsor especially if you are a (small) business 🙏 If you are a developer within a business please go to your superior or team lead and have them compare the benefits/costs vs. commercial solutions. If you prefer to operate in stealth mode that's fine - you can always become a private sponsor. What's important is to keep the project funded and on a clear, steady path 🚀
I will be in the San Francisco Bay Area this fall. If you are a recurring sponsor, or a business using pdfcpu that is not yet one, I would like to get to know you and your pdfcpu use case. I'll also be happy to meet one-on-one, possibly over 🍻, for a technical chat/discussion and to get feedback right from the trenches. Just get in touch with me.
Support for PDF 2.0 encryption will be tackled next, after that digital signatures. A Beta version is within reach 👍🏻
Have fun 💚 with pdfcpu!
This release comes ready for you to play around with during the 🎄 holidays. It is packed with new features and the first one 🔥 dealing with PDF 2.0 (ISO 32000-2) support. Let's get right into it..
We start with basic support for validation and you can play around with the validate and info commands.
The work around this is ongoing and will stretch over the next couple of releases.
Please 🙏🏻 report back any issues.
There are three new commands:
Manage the page layout which shall be used when the document is opened:
pdfcpu pagelayout list inFile
pdfcpu pagelayout set inFile value
pdfcpu pagelayout reset inFile
➡️ pdfcpu help pagelayout and the pagelayout documentation
Manage how the document shall be displayed when opened:
pdfcpu pagemode list inFile
pdfcpu pagemode set inFile value
pdfcpu pagemode reset inFile
➡️ pdfcpu help pagemode and the pagemode documentation
Manage the way the document shall be displayed on the screen and shall be printed:
pdfcpu viewerpref list [-a(ll)] [-j(son)] inFile
pdfcpu viewerpref set inFile (inFileJSON | JSONstring)
pdfcpu viewerpref reset inFile
➡️ pdfcpu help viewerpref and the viewerpref documentation
The split command now also allows for splitting along page boundaries:
pdfcpu split [-m(ode) span|bookmark|page] inFile outDir [span|pageNr...]
➡️ pdfcpu help split and the split documentation
The merge command allows for divider pages at file boundaries and zipping two files together:
pdfcpu merge [-m(ode) create|append|zip] [-s(ort) -b(ookmarks) -d(ivider)] outFile inFile...
➡️ pdfcpu help merge and the merge documentation
The permissions command is now more useful:
pdfcpu permissions list [-upw userpw] [-opw ownerpw] inFile...
pdfcpu permissions set [-perm none|print|all|max4Hex|max12Bits] [-upw userpw] -opw ownerpw inFile
It now also allows for conveniently setting individual PDF access bits either via a binary or hexadecimal number.
➡️ pdfcpu help permissions and the permissions documentation
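To illustrate what setting individual access bits via a hexadecimal number means, here is a sketch that decodes a 4 digit hex value into the user access permission bits defined by the PDF specification. It also shows how such a value relates to a negative integer, assuming all higher bits are set. The helper names are hypothetical and this is not pdfcpu's own code:

```go
package main

import (
	"fmt"
	"strconv"
)

// permBits lists the user access permission bits by their 1-based bit
// position as defined in the PDF specification.
var permBits = []struct {
	bit  int
	name string
}{
	{3, "print"},
	{4, "modify"},
	{5, "extract"},
	{6, "annotate"},
	{9, "fill forms"},
	{10, "extract for accessibility"},
	{11, "assemble"},
	{12, "print high quality"},
}

// parseHex parses up to 4 hex digits (without 0x prefix) into a permission value.
func parseHex(s string) (uint16, error) {
	v, err := strconv.ParseUint(s, 16, 16)
	return uint16(v), err
}

// hasBit reports whether the 1-based bit is set in p.
func hasBit(p uint16, bit int) bool { return p&(1<<(bit-1)) != 0 }

// asSigned returns the signed 32 bit integer with the same low 16 bits
// and all higher bits set (assumption: this mirrors the old negative form).
func asSigned(p uint16) int32 { return int32(0xFFFF0000 | uint32(p)) }

func main() {
	p, err := parseHex("F0C7")
	if err != nil {
		panic(err)
	}
	fmt.Printf("0x%04X as signed integer: %d\n", p, asSigned(p))
	for _, pb := range permBits {
		fmt.Printf("bit %2d %-26s %v\n", pb.bit, pb.name, hasBit(p, pb.bit))
	}
}
```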
Thanks to @vsenko the stamp command in combination with PDF stamps has become more powerful.
Populate the PDF attribute in your Watermark struct with a cached reader, which should save you some memory.
pdfcpu stamp add -mode pdf -- "stamp.pdf" "" in.pdf out.pdf
There is now a way to fine tune multi stamping, e.g. pdfcpu stamp add -mode pdf -- "stamp.pdf:2:3" "" in.pdf out.pdf will initiate multi stamping at page 2 of stamp.pdf and page 3 of in.pdf.
There are three changes to the configuration:
headerBufSize was eliminated since PDF 2.0 comes with a flexible header location specification.
permissions are now a 4 digit hex number instead of a negative integer.
needAppearances is a flag you can set for form filling.
Since the pdfcpu configuration has changed you are encouraged to recreate your config.yml:
Locate your pdfcpu folder using pdfcpu conf
Remove the pdfcpu folder
Recreate the pdfcpu folder by executing any pdfcpu command on the CLI, e.g. pdfcpu conf
pdfcpu is in need of financial supporters. There are membership fees, meetings and countless hours I am putting into this project. Please 🙏 consider supporting me in any way you can by becoming a sponsor. Go to your superior or team lead and have them compare the benefits/costs vs. commercial solutions.
🙏 to all bug reporters and PRs. Have fun 💚 with pdfcpu!
Hello!
This release features the new bookmark command and there are substantial changes to the API including better support for scenarios with parallel execution.
Finally you are able to get rid of unwanted bookmarks, replace existing bookmarks or even create new ones. There are four commands to list your bookmarks, import/export bookmarks via JSON or remove all bookmarks:
pdfcpu bookmarks list inFile
pdfcpu bookmarks import [-r(eplace)] inFile inFileJSON [outFile]
pdfcpu bookmarks export inFile [outFileJSON]
pdfcpu bookmarks remove inFile [outFile]
Please check out the documentation.
pdfcpu is ready for Go 1.21! Many API calls now return structs for corresponding objects and thanks to @semvis123 and @yyoshiki41 we were able to remove two significant points of contention. These changes should result in a much better experience running pdfcpu within goroutines. Your feedback is highly appreciated 💚
As always there is steady improvement to the PDF parser and thanks go to every single user reporting issues. Remember, just because you are stuck parsing a specific file does not mean we can't do anything about it, but you are encouraged to take the time and file an issue.
This is a long term commitment for using the optimal resources, going in the right direction in order to make pdfcpu a sound tool for both developers and CLI users for the time to come.
There are membership fees, meetings and countless hours I am putting into this project. Please 🙏 consider supporting me in any way you can by becoming a sponsor. Go to your superior or team lead and have them compare the benefits/costs vs. commercial solutions.
As always 🙏 to all bug reporters and PRs. Have fun 💚 processing your PDFs with pdfcpu!
This release is a cut-off after fixing a couple of issues.
A notable feature of this release is bookmark support for merging PDFs. Existing bookmarks will be preserved during merging and from now on the output file also has a bookmark hierarchy representing the input files.
Bookmarks are created by default during merging. You can skip bookmark creation on the CLI by supplying -bookmarks=false.
If you want to skip bookmark creation when using the API you need to set a new configuration parameter to false:
# merge creates bookmarks
createBookmarks: true
Since the pdfcpu configuration has changed you are encouraged to recreate your config.yml:
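Locate your pdfcpu folder using pdfcpu conf
Remove the pdfcpu folder
Recreate the pdfcpu folder by executing any pdfcpu command on the CLI, e.g. pdfcpu conf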
pdfcpu is in need of more supporters. If you use it please consider the hard work put into this and consider sponsorship. Go to your superior or team lead and have them compare the benefits/costs vs. commercial solutions.
As always 🙏 to all bug reporters and PRs. Have fun 💚 processing your PDFs with pdfcpu!
There are three new commands that will cut your PDF page in one way or another:
pdfcpu cut [-p(ages) selectedPages] -- description inFile outDir [outFileName]
A low level command for fine grained custom page cutting. Apply any number of horizontal or vertical page cuts.
pdfcpu ndown [-p(ages) selectedPages] -- [description] n inFile outDir [outFileName]
Cut selected page into n pages symmetrically. Think of it as the inverse operation of n-up.
pdfcpu poster [-p(ages) selectedPages] -- description inFile outDir [outFileName]
Create a poster with full control over scaling, tile size and more.
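Cutting a page symmetrically boils down to computing a grid of equally sized rectangles. The sketch below does just that for an explicit rows x cols grid. How pdfcpu maps n to a grid is not covered here, so rows, cols, rect and tiles are hypothetical names used for illustration only:

```go
package main

import "fmt"

// rect is a rectangle in PDF user space: origin at the lower left corner,
// dimensions in points.
type rect struct {
	x, y, w, h float64
}

// tiles splits a page of size w x h into rows*cols equally sized
// rectangles, ordered row by row starting at the top left.
func tiles(w, h float64, rows, cols int) []rect {
	tw, th := w/float64(cols), h/float64(rows)
	out := make([]rect, 0, rows*cols)
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			out = append(out, rect{
				x: float64(c) * tw,
				y: h - float64(r+1)*th, // rows are counted from the top
				w: tw,
				h: th,
			})
		}
	}
	return out
}

func main() {
	// Cut a 600 x 800 point page into 4 pieces (2 rows, 2 columns).
	for i, t := range tiles(600, 800, 2, 2) {
		fmt.Printf("tile %d: x=%.0f y=%.0f w=%.0f h=%.0f\n", i+1, t.x, t.y, t.w, t.h)
	}
}
```

Each rectangle could then serve as the visible region of one output page, which is the inverse of placing n pages on one sheet.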
A notable API change is an additional parameter for AddBookmarks.
The replace flag enforces deleting any old bookmarks that may be present in the input file:
AddBookmarksFile(inFile, outFile string, bms []pdf.Bookmark, replace bool, conf *model.Configuration)
As always 🙏 to all bug reporters and PRs. pdfcpu is getting better & better every day 💚 Have fun fiddling around with your PDFs!
.. Ohh, and before jumping right in please do me a favor and click here
This release has been a long time coming 😓 Thank you 🙏 for your patience!
API-Users, please proceed with caution! The codebase has been refactored heavily and there may be some side effects.
E.g. usages of: api.Merge(rsc []io.ReadSeeker,...) need to be migrated to api.MergeRaw(rsc []io.ReadSeeker, ...)
pdfcpu v0.4.0 is now based on go1.20 and comes with two new commands:
The form command solves all major form handling use cases:
pdfcpu form list inFile...
pdfcpu form remove inFile [outFile] fieldID...
pdfcpu form lock inFile [outFile] [fieldID...]
pdfcpu form unlock inFile [outFile] [fieldID...]
pdfcpu form reset inFile [outFile] [fieldID...]
pdfcpu form export inFile [outFileJSON]
pdfcpu form fill inFile inFileJSON [outFile]
pdfcpu form multifill [-m(ode) single|merge] inFile inFileData outDir [outName]
DISCLAIMER 1 You are free to export and fill already existing PDF forms - this may or may not work! Feel free to open an issue and we may be able to make it work.
DISCLAIMER 2 All forms generated with pdfcpu create are optimized for Adobe Reader. Mac Preview is not well suited for form handling!
The following workflows are supported:
The Regular Workflow
Create your form using the create command.
Get a list of your form fields on the command line.
export your form to JSON,
fill your form with this JSON payload.
Lock your form fields using lock.
The Fill & Merge Workflow
"Give me access to your contacts db and I will generate a single PDF containing a page sequence of contact sheets"
This use case is implemented by extending the regular workflow by an additional integrated merge step.
As for automatically filling a form with your data pdfcpu gives you two options:
The resize command comes to the rescue when you're stuck with some large pages in front of a regular small form printer:
pdfcpu resize [-p(ages) selectedPages] -- [description] inFile [outFile]
Scale your pages up or down or resize to one of the many supported standard form sizes (pdfcpu paper prints a list), optionally enforcing portrait or landscape mode:
Examples:
pdfcpu resize "scale:2" in.pdf out.pdf
Enlarge pages by doubling the page dimensions, keep orientation.
pdfcpu resize -pages 1-3 -- "sc:.5" in.pdf out.pdf
Shrink first 3 pages by cutting in half the page dimensions, keep orientation.
pdfcpu resize -u cm -- "dim:40 0" in.pdf out.pdf
Resize pages to width of 40 cm, keep orientation.
pdfcpu resize "form:A4" in.pdf out.pdf
Resize pages to A4, keep orientation.
pdfcpu resize "f:A4P, bgcol:#d0d0d0" in.pdf out.pdf
Resize pages to A4 and enforce orientation (here: portrait mode), apply background color.
pdfcpu resize "dim:400 200" in.pdf out.pdf
Resize pages to 400 x 200 points, keep orientation.
pdfcpu resize "dim:400 200, enforce:true" in.pdf out.pdf
Resize pages to 400 x 200 points, enforce orientation.
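The arithmetic behind these examples can be sketched as follows. It assumes that a dimension of 0 (as in "dim:40 0") means "derive this side from the other one while keeping the aspect ratio", and it converts centimeters to points using 72 points per inch. resolveDim and scale are hypothetical helpers, not pdfcpu functions:

```go
package main

import "fmt"

// pointsPerCm converts centimeters to PDF points (72 points per inch, 2.54 cm per inch).
const pointsPerCm = 72 / 2.54

// scale multiplies both page dimensions by factor f.
func scale(pageW, pageH, f float64) (float64, float64) {
	return pageW * f, pageH * f
}

// resolveDim returns the target width and height in points.
// A zero dimension is derived from the other one, preserving the
// aspect ratio of the original page (assumption, see above).
func resolveDim(pageW, pageH, w, h float64) (float64, float64) {
	switch {
	case w == 0 && h == 0:
		return pageW, pageH
	case h == 0:
		return w, w * pageH / pageW
	case w == 0:
		return h * pageW / pageH, h
	}
	return w, h
}

func main() {
	const pageW, pageH = 595.0, 842.0 // roughly A4 portrait in points

	w, h := scale(pageW, pageH, 2)
	fmt.Printf("scale:2    -> %.1f x %.1f points\n", w, h)

	w, h = resolveDim(pageW, pageH, 40*pointsPerCm, 0)
	fmt.Printf("dim:40 0   -> %.1f x %.1f points (width given in cm)\n", w, h)

	w, h = resolveDim(pageW, pageH, 400, 200)
	fmt.Printf("dim:400 200 -> %.1f x %.1f points\n", w, h)
}
```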
Countless bugs have been rolling in and many of those have been fixed. Thank you 🙏 all also for your PRs 👍 💚 Unfortunately due to heavy refactoring of the code base some of them had to be merged in manually or still will be.
A new command interprets a JSON structure representing page declarations and renders PDF pages accordingly:
pdfcpu create in.json [in.pdf] out.pdf
in.json contains a page sequence with content composed of text, images, colored boxes, tables and more. Each content element follows a box model consisting of margin, border and padding and may define fonts and colors where appropriate. You may also set general page attributes like paper size, background color or your crop box and you may also define your page headers and footers.
in.pdf if present, existing page content of in.pdf will be modified by appending to it.
out.pdf is where rendered pages are written to.
The way this command is set up allows for repeatedly adding content to a PDF. This fosters an incremental approach to PDF generation, be it during the design phase or in production.
The best documentation for this command is the combination of the content of:
If you are a Go developer you can play around with createFromJSON_test.go by modifying the JSON file and then executing the test which will give you immediate visual feedback by regenerating the corresponding result PDF.
The JSON is self explanatory and I highly recommend working through all the examples!
Many of the examples are multi page PDFs so make sure you don't miss anything!
You will learn about
Although this command allows for the modification of any PDF it works best for PDFs generated by pdfcpu itself.
Eventually (and this release is really the preparation) you will be able to create your PDF forms with this command.
To catch a glimpse of this effort have a look at:
and again the corresponding PDF files in pkg/samples/create/
You are welcome to experiment with form generation based on these form element samples but PDF form creation has NOT been released!
🙏 Thanks everybody for submitting issues and PRs 🙏
Thank you for using pdfcpu 💚
e89570a Fix OS agnostic fileName resolving
dc0561f Bump version
72e4e71 Relax validation for FontDescriptor Lang
9616d4b Add create cmd
3c08a45 Add Stefan Huber as contributor
92bfd3e Extend relaxed validation for CIDToGIDMap
c3eb4c0 Fix #335, #358
ba9f089 Fix #349
281745f Fix #353
b80c13b Fix #354
64e3df6 Fix #356
5e1ae87 Fix #362
befba81 Fix #366
5a7da1d Fix #371
437ef37 Fix #380, #387
4adf70c Fix #381
0c4c829 Fix #386
af9e334 Fix #388
6509cea Fix #394
8837dd1 Fix io.Reader based encryption #372
ce70c15 Merge branch 'pr/signalwerk/360' into master
16f43aa stamping: Fix OCG reuse
7e1546b validate cmd: wildcard support
This release is based on go1.16 and comes with two new commands and a wide array of improvements and fixes for both CLI and API:
The new annotation handling supports listing, adding and removing annotations via the API and listing and removing annotations via the CLI:
pdfcpu annotations list [-p(ages) selectedPages] inFile
pdfcpu annotations remove [-p(ages) selectedPages] inFile [objNr...]
Image extraction was improved in several ways. There is support for cascaded filters and as of now pdfcpu also recognizes thumbnails. There is also a related command that prints a list of images for selected pages including available details:
pdfcpu images list [-p(ages) selectedPages] inFile
The boxes and info commands now include the effective page rotation and orientation.
Stamps may now include links via the url parameter.
Using the API you can now generate your own bookmarks aka outline hierarchy. See more: bookmarks_test.go
🙏 Thanks everybody for submitting issues and PRs 🙏 There were lots of bugs fixed and PRs processed. Not all of the PRs were actually merged in, because of the state of things, but rest assured they were digested into the code base and are honored in the contributors list in the README.md. Special thanks go to @petervwyatt and @kpym for their continued suggestions and valuable input.
b911d21 info/boxes: add page rotation and orientation
6d519ce Add simple annotation handling
5b54825 Add stamps with links, Add annotation cmd
b4ed6bb Embed config.yml
cafccf8 Extract file writing
fd08184 Extract images in memory
49c75a6 Fix #300, #323, #329, #336
b0c73af Fix #320
075feb9 Fix #322
7100142 Fix #324
680ae89 Fix #325
7a62cd0 Fix #326
e4890f2 Fix #330
694d519 Fix #331
c796f33 Fix #332
573ad83 Fix #333
2616a40 Fix #334, #342
662f7e8 Fix #350
f2aa684 Fix #335
19c3126 Fix #338
9fe5913 Fix #341
b8f382a Fix #343
c7d720b Fix #347
6c0ee1e Fix date validation
f06c608 Fix rename err
ac9adc6 Improve error context during validation
72223c3 Merge AcroForms
4ae05ab Merge branch 'extract-img-inmem' into master
f189660 Simplify test
This release introduces stamping/watermark support for right to left scripts such as Arabic, Hebrew, Persian or Urdu:
Check out some samples at pdfcpu/pkg/samples/stamp/text/utf8 and the corresponding test at pdfcpu/pkg/api/test/stampUserFont_test.go
❗ Since this is really a piggyback release don't forget to check the v0.3.10 release notes as well.
956f29f Add right to left stamping