This repository contains scripts that convert physics papers into text files and parse them with LLMs.

The different folders are:

old_pdf_pipeline/ Uses off-the-shelf models and heuristics to parse the PDFs in a directory.

The workflow is as follows:

Open main.py and set the source directory to the directory you want to process.
Run main.py in an environment with the dependencies from requirements.yml installed (the file may list a few packages that are not strictly necessary).
The script walks through all subdirectories of the source directory looking for PDFs.
It runs paragraph and figure recognition using the Detectron model.
It runs text and LaTeX recognition on the extracted paragraphs using the Pix2Text package.
The script heuristically decides which paragraphs are captions: for each figure, it picks the paragraph whose bounding-box centroid is closest to the bottom center of the figure.
The script fits a Gaussian mixture model to the paragraph coordinates to decide which paragraphs belong to the same column in an N-column layout.
It combines all non-caption paragraphs and writes them to a target directory inside output/, named after the source PDF without the .pdf extension.
It extracts all the figures and writes them as X.png into the target directory.
It extracts all caption paragraphs and writes them to a separate text file named caption.txt, in the same order as the figures.
Errors are caught and the affected PDFs are skipped. Their names are written to failed.txt.
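
The caption and column heuristics in the workflow can be illustrated with a minimal, standard-library sketch. This is not the code in main.py. The function names, the box format (x0, y0, x1, y1) with y growing downward, and the choice to cluster only the horizontal centers of the paragraphs are all assumptions made for illustration, and a small hand-written EM loop stands in for whatever Gaussian mixture implementation the pipeline actually uses.

```python
import math

# Assumed box format: (x0, y0, x1, y1) in page pixels, y growing downward.

def centroid(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def bottom_center(box):
    x0, _, x1, y1 = box
    return ((x0 + x1) / 2, y1)

def match_captions(figures, paragraphs):
    """For each figure, return the index of the paragraph whose centroid
    is closest to the figure's bottom center."""
    matches = []
    for fig in figures:
        anchor = bottom_center(fig)
        nearest = min(range(len(paragraphs)),
                      key=lambda i: math.dist(anchor, centroid(paragraphs[i])))
        matches.append(nearest)
    return matches

def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for a single point."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300  # guard against underflow to zero
    return [d / total for d in dens]

def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1D Gaussian mixture to xs with plain EM."""
    n = len(xs)
    ordered = sorted(xs)
    # Start the means at evenly spaced quantiles of the data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    var0 = ((ordered[-1] - ordered[0]) / (2 * k)) ** 2 or 1.0
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft assignment of every point to every component.
        resp = [responsibilities(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
            weights[j] = nj / n
    return weights, means, variances

def assign_columns(paragraphs, n_columns):
    """Label each paragraph with a column index, 0 being the leftmost."""
    xs = [centroid(p)[0] for p in paragraphs]
    weights, means, variances = fit_gmm_1d(xs, n_columns)
    # Rank components by mean so labels run left to right.
    order = sorted(range(n_columns), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = []
    for x in xs:
        r = responsibilities(x, weights, means, variances)
        labels.append(rank[max(range(n_columns), key=lambda j: r[j])])
    return labels
```

With a figure box and a list of paragraph boxes from one page, match_captions([figure], paragraphs) gives the caption paragraph index for that figure, and assign_columns(paragraphs, 2) gives one column label per paragraph for a two-column layout.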

extraction/ Loads a Llama 3 LLM into memory and reads the text files in the directory printed. It then outputs the model's answers to the question, which can be checked against a reference.

new_pdf_pipeline/ Does the same thing as the pdf_processor but relies entirely on the Nougat OCR transformer.

finetune/ Contains training scripts, testing scripts, and dataset schema for finetuning LLMs on our domain.

latex_detection/ Preliminary work on improving OCR for LaTeX.

acquire_pdfs/ Some example scripts for working with publisher APIs to collect papers. They are not particularly useful.

chat/ A simple Streamlit chat.

louisprimeau/pdf_processor

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages