🚜 Parse text and tables from PDF files.
Read text and parse tables from PDF files.
Supports tabular data with automatic column detection, and rule-based parsing.
Dependencies: it is based on pdf2json, which itself relies on Mozilla's pdf.js.
🆕 Now includes TypeScript type definitions!
ℹ️ Important notes:
Summary:
After installing Node.js:
git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf
To install pdfreader as a dependency of your Node.js project:
npm install pdfreader
Then, see below for examples of use.
This module exposes the PdfReader class, to be instantiated. You can pass { debug: true } to the constructor to log debugging information, which is useful for troubleshooting.
Your instance has two methods for parsing a PDF. They return the same output and differ only in input: PdfReader.parseFileItems (as below) for a filename, and PdfReader.parseBuffer (see: "Raw PDF reading from a PDF already in memory (buffer)") for data that you don't want to reference from the filesystem.
Whichever method you choose, it asks for a callback, which gets called each time the instance finds what it denotes as a PDF item.
An item object can match one of the following objects:
null, when the parsing is over or an error occurred.
{file:{path:string}}, when a PDF file is being opened. This is always the first item.
{page:integer, width:float, height:float}, when a new page is being parsed. It provides the page number, starting at 1. This basically acts as a carriage return for the coordinates of text items to be processed.
{text:string, x:float, y:float, w:float, ...}, which you can think of as simple objects with a text property and floating-point 2D AABB coordinates on the page.
It's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.
For example:
import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
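Since the callback receives a flat stream of items, a common next step is to group text items into lines. The sketch below is one possible way to do that, not part of pdfreader itself: makeRowCollector is a hypothetical helper, and it assumes that items on the same visual line share exactly the same y value and that y grows from the top of the page downwards.

```javascript
// Hypothetical helper (not part of pdfreader): groups text items into rows,
// keyed by page and by y coordinate.
// Assumption: items on the same line carry exactly the same y value.
function makeRowCollector() {
  const pages = []; // one entry per page item seen: { page, rows: Map(y -> texts) }
  return {
    // Feed every non-null item received by the parsing callback.
    add(item) {
      if (item.page) {
        // A page item starts a fresh set of rows.
        pages.push({ page: item.page, rows: new Map() });
      } else if (item.text && pages.length > 0) {
        const { rows } = pages[pages.length - 1];
        if (!rows.has(item.y)) rows.set(item.y, []);
        rows.get(item.y).push(item.text);
      }
      // Other items (e.g. the file item) are ignored.
    },
    // Returns [{ page, lines }], with lines sorted by ascending y.
    result() {
      return pages.map(({ page, rows }) => ({
        page,
        lines: [...rows.entries()]
          .sort(([y1], [y2]) => y1 - y2)
          .map(([, texts]) => texts.join(" ")),
      }));
    },
  };
}

// Usage sketch with the callback shown above:
// const collector = makeRowCollector();
// new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
//   if (err) console.error("error:", err);
//   else if (!item) console.log(collector.result());
//   else collector.add(item);
// });
```

Keying rows on the exact y value keeps the sketch short. Real documents may need a small tolerance when comparing y coordinates.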
To parse a password-protected PDF file, pass the password to the constructor:
new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);
As above, but reading from a buffer in memory rather than from a file referenced by path. For example:
import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (readErr, pdfBuffer) => {
  // stop early if the file could not be read
  if (readErr) return console.error("error:", readErr);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});
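Both parsing methods report items through the same callback contract (an error, an item, or a final null), so they can be wrapped in a Promise when you would rather collect everything first. This is a minimal sketch under that assumption: collectItems and its startParsing parameter are hypothetical names, not part of pdfreader.

```javascript
// Hypothetical helper (not part of pdfreader): collects every item emitted by
// a callback-based parse call, and resolves when the final null item arrives.
// `startParsing` is a function that receives the (err, item) callback to use.
function collectItems(startParsing) {
  return new Promise((resolve, reject) => {
    const items = [];
    startParsing((err, item) => {
      if (err) reject(err); // parsing failed
      else if (!item) resolve(items); // null item: end of input
      else items.push(item); // file, page or text item
    });
  });
}

// Usage sketch:
// const fromFile = await collectItems((cb) =>
//   new PdfReader().parseFileItems("test/sample.pdf", cb)
// );
// const fromBuffer = await collectItems((cb) =>
//   new PdfReader().parseBuffer(pdfBuffer, cb)
// );
```

Collecting all items in memory is convenient for small files. For large documents, processing items as they arrive in the callback avoids holding the whole list.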
Source code of the examples above: parsing a CV/résumé.
For more, see Examples of use.
The Rule class can be used to define and process data extraction rules while parsing a PDF document.
Rule instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.
Example:
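import { PdfReader, Rule } from "pdfreader";

// illustrative handlers, added so the example runs: print what each rule extracts
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

const processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});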
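To make the first rule's pattern concrete, the sketch below shows what its regular expression captures from a matching text, using plain JavaScript only. captureGroups is a hypothetical helper for illustration, not pdfreader's implementation, and the value the library actually hands to the then handler may be shaped differently.

```javascript
// Hypothetical helper, for illustration only (not pdfreader's implementation):
// returns the capture groups of `regexp` in `text`, or null if there is no match.
function captureGroups(regexp, text) {
  const match = regexp.exec(text);
  return match ? match.slice(1) : null;
}

// Same pattern as the first rule in the example above.
const helloRule = /^Hello "(.*)"$/;

console.log(captureGroups(helloRule, 'Hello "world"')); // [ 'world' ]
console.log(captureGroups(helloRule, "Goodbye")); // null
```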
Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.
Cannot read property 'userAgent' of undefined error from an express-based node.js app
Dmitry found out that you may need to run these instructions before including the pdfreader module:
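global.navigator = {
  userAgent: "node",
};
// `window` is not defined in plain Node.js: only patch it when a shim provides it
if (typeof window !== "undefined") {
  window.navigator = {
    userAgent: "node",
  };
}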