GitHub - cloudnepal/zerox: Zero shot pdf OCR with gpt-4o-mini

Zerox OCR

A dead simple way of OCR-ing a document for AI ingestion. Documents are meant to be a visual representation after all, with weird layouts, tables, charts, etc., so vision models just make sense!

The general logic:

Pass in a PDF (URL or file buffer)
Turn the PDF into a series of images
Pass each image to GPT and ask nicely for Markdown
Aggregate the responses and return Markdown
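
The four steps above can be sketched roughly as follows. This is an illustration, not Zerox's actual source: `pdfToImages` and `imageToMarkdown` are hypothetical stand-ins for the conversion step and the model call, passed in as parameters so the flow can be read without any real dependencies.

```typescript
// Hypothetical helper signatures; Zerox's real internals may differ.
type PdfToImages = (pdfPath: string) => Promise<string[]>; // one image path per page
type ImageToMarkdown = (imagePath: string) => Promise<string>; // vision model call

// Convert a PDF to Markdown: split into page images, OCR each page, then join.
async function pdfToMarkdown(
  pdfPath: string,
  pdfToImages: PdfToImages,
  imageToMarkdown: ImageToMarkdown
): Promise<string> {
  const images = await pdfToImages(pdfPath);
  // Promise.all keeps results in page order even if requests finish out of order.
  const pages = await Promise.all(images.map((image) => imageToMarkdown(image)));
  return pages.join("\n\n");
}
```

Injecting the two helpers keeps the sketch independent of any particular PDF renderer or model client.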

Sounds pretty basic! But with gpt-4o-mini, this method is price-competitive with existing products and gives meaningfully better results.

Pricing Comparison

This is how the pricing stacks up against other document processors. Running 1,000 pages with Zerox uses about 25M input tokens and 0.4M output tokens.

Service	Cost	Accuracy	Table Quality
AWS Textract [1]	$1.50 / 1,000 pages	Low	Low
Google Document AI [2]	$1.50 / 1,000 pages	Low	Low
Azure Document AI [3]	$1.50 / 1,000 pages	High	Mid
Unstructured (PDF) [4]	$10.00 / 1,000 pages	Mid	Mid
------------------------	--------------------	--------	-------------
Zerox (gpt-4o-mini)	$4.00 / 1,000 pages	High	High
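
As a rough check on the Zerox row, the token counts above can be turned into a dollar figure. The per-token prices in this sketch are an assumption (gpt-4o-mini launch list prices of $0.15 per 1M input tokens and $0.60 per 1M output tokens), so check current OpenAI pricing before relying on them.

```typescript
// Assumed gpt-4o-mini list prices in USD per 1M tokens; these may have changed.
const INPUT_PRICE_PER_MILLION = 0.15;
const OUTPUT_PRICE_PER_MILLION = 0.6;

// Estimate the cost in USD of a run from its token counts.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1e6) * INPUT_PRICE_PER_MILLION;
  const outputCost = (outputTokens / 1e6) * OUTPUT_PRICE_PER_MILLION;
  return inputCost + outputCost;
}

// 1,000 pages: about 25M input tokens and 0.4M output tokens.
const costPerThousandPages = estimateCostUsd(25e6, 0.4e6);
```

Under those assumed prices this comes to about $3.99 ($3.75 for input plus $0.24 for output), which lines up with the roughly $4.00 per 1,000 pages in the table.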

Installation

npm install zerox

Zerox uses graphicsmagick and ghostscript for the PDF => image processing step. These should be pulled in automatically, but you may need to install them manually.

Usage

With file URL

import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

From local path

import path from "path";
import { zerox } from "zerox";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

Options

const result = await zerox({
  // Required
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,

  // Optional
  concurrency: 10, // Number of pages to run at a time.
  maintainFormat: false, // Slower but helps maintain consistent formatting.
  cleanup: true, // Clear images from tmp after run.
  outputDir: undefined, // Save combined result.md to a file
  tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory)
});
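
To make the concurrency option concrete, here is one way a limit like this can be applied: process the pages in batches of at most `concurrency` at a time. This is a sketch of the general technique, not necessarily how Zerox schedules its requests, and `processPage` is a hypothetical callback.

```typescript
// Run `processPage` over all items, at most `concurrency` at a time, preserving order.
async function mapInBatches<T, R>(
  items: T[],
  concurrency: number,
  processPage: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (concurrency < 1) {
    throw new Error("concurrency must be at least 1");
  }
  const results: R[] = [];
  for (let start = 0; start < items.length; start += concurrency) {
    const batch = items.slice(start, start + concurrency);
    // Wait for the whole batch to finish before starting the next one.
    const batchResults = await Promise.all(
      batch.map((item, offset) => processPage(item, start + offset))
    );
    for (const result of batchResults) {
      results.push(result);
    }
  }
  return results;
}
```

Batching is simple but leaves slots idle while the slowest page in a batch finishes. A worker-pool approach would keep all slots busy at the cost of more code.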

The maintainFormat option tries to return the Markdown in a consistent format by passing the output of the prior page in as additional context for the next page. This requires the requests to run sequentially, one page at a time, so it's a lot slower. But it's valuable if your documents have a lot of tabular data, or frequently have tables that cross pages.

Request #1 => page_1_image
Request #2 => page_1_markdown + page_2_image
Request #3 => page_2_markdown + page_3_image
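
The request sequence above amounts to a loop that threads each page's Markdown into the next request. A minimal sketch, assuming a hypothetical `imageToMarkdown(image, priorMarkdown)` function that wraps the model call (this is not Zerox's actual API):

```typescript
// Hypothetical model call: takes a page image plus the previous page's Markdown, if any.
type ImageToMarkdownWithContext = (
  image: string,
  priorMarkdown?: string
) => Promise<string>;

// Process pages strictly in order, passing each result on as context for the next page.
async function ocrWithPriorContext(
  images: string[],
  imageToMarkdown: ImageToMarkdownWithContext
): Promise<string[]> {
  const pages: string[] = [];
  let prior: string | undefined = undefined;
  for (const image of images) {
    // Each request waits for the previous one, so nothing runs in parallel.
    const markdown: string = await imageToMarkdown(image, prior);
    pages.push(markdown);
    prior = markdown;
  }
  return pages;
}
```

Because every request depends on the previous response, total latency grows with the page count, which is why this mode is slower than running pages concurrently.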

Example Output

{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      content: '# INVOICE # 36258\n' +
        '**Date:** Mar 06 2012  \n' +
        '**Ship Mode:** First Class  \n' +
        '**Balance Due:** $50.10  \n' +
        '## Bill To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '## Ship To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '\n' +
        '| Item                                       | Quantity | Rate   | Amount  |\n' +
        '|--------------------------------------------|----------|--------|---------|\n' +
        "| Global Push Button Manager's Chair, Indigo | 1        | $48.71 | $48.71  |\n" +
        '| Chairs, Furniture, FUR-CH-4421             |          |        |         |\n' +
        '\n' +
        '**Subtotal:** $48.71  \n' +
        '**Discount (20%):** $9.74  \n' +
        '**Shipping:** $11.13  \n' +
        '**Total:** $50.10  \n' +
        '---\n' +
        '**Notes:**  \n' +
        'Thanks for your business!  \n' +
        '**Terms:**  \n' +
        'Order ID : CA-2012-AB10015140-40974  ',
      page: 1,
      contentLength: 747
    }
  ]
}
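
Going by the shape of the example above, the per-page results can be stitched back into a single document. The `ZeroxPage` and `ZeroxResult` types below are inferred from that sample output rather than taken from the package's type definitions, so treat the field list as an assumption.

```typescript
// Types inferred from the sample output; the package's own typings may differ.
interface ZeroxPage {
  content: string;
  page: number;
  contentLength: number;
}

interface ZeroxResult {
  completionTime: number;
  fileName: string;
  inputTokens: number;
  outputTokens: number;
  pages: ZeroxPage[];
}

// Join all pages into one Markdown string, ordered by page number.
function combinePages(result: ZeroxResult): string {
  return result.pages
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.page - b.page)
    .map((p) => p.content)
    .join("\n\n");
}
```

Sorting by `page` guards against results arriving out of order. If the library already returns pages in order, the sort is a harmless no-op.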

License

This project is licensed under the MIT License.
