Refactor: support merging extracted layout with inferred layout (Unstructured-IO#2158)

### Summary
This PR is the second part of the `pdfminer` refactor, which moves the
`pdfminer` code from the `unstructured-inference` repo to the
`unstructured` repo. The first part was done in
Unstructured-IO/unstructured-inference#294. This
PR adds the logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (creating a temp file/dir as needed) and splits
the document by pages
  * if `is_image` is `True`, return the passed
  `inferred_layout` (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed
    document (data/filename) with `pdfminer`
    * merge `extracted_layout` (TextRegions) with the passed
    `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements
    (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
module, which first opens the document (creating a temp file/dir as needed)
and splits the document by pages (converting PDF pages to image pages for
a PDF file)
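
The merge step for a single page can be sketched roughly as follows. This is an illustration only, not the code added in this PR: `Region`, `merge_page`, the coverage-based matching rule, and the `0.5` threshold are all hypothetical stand-ins for the real `TextRegion`/`LayoutElement` handling.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a TextRegion / LayoutElement."""
    bbox: BBox
    text: str = ""
    source: str = "inferred"


def area(box: BBox) -> float:
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def intersection_area(a: BBox, b: BBox) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def merge_page(
    inferred: List[Region],
    extracted: List[Region],
    is_image: bool,
    threshold: float = 0.5,  # assumed value, for illustration only
) -> List[Region]:
    """Merge pdfminer-extracted regions into the inferred regions of one page."""
    # Image input has no embedded text to extract: return the inferred layout as is.
    if is_image:
        return list(inferred)

    merged = list(inferred)
    unmatched: List[Region] = []
    for ext in extracted:
        ext_area = area(ext.bbox)
        best_index, best_cover = -1, 0.0
        for index, inf in enumerate(merged):
            # Fraction of the extracted region covered by this inferred region.
            cover = intersection_area(ext.bbox, inf.bbox) / ext_area if ext_area else 0.0
            if cover > best_cover:
                best_index, best_cover = index, cover
        if best_index >= 0 and best_cover >= threshold:
            # Attach the extracted text to the inferred region that contains it.
            target = merged[best_index]
            merged[best_index] = replace(target, text=f"{target.text} {ext.text}".strip())
        else:
            # No inferred region covers it well enough: keep it as its own element.
            unmatched.append(ext)
    return merged + unmatched


inferred_page = [Region((0, 0, 100, 50))]
extracted_page = [
    Region((10, 10, 90, 40), "Hello", "pdfminer"),
    Region((0, 200, 100, 250), "Footer", "pdfminer"),
]
merged_page = merge_page(inferred_page, extracted_page, is_image=False)
```

In this sketch the inferred regions keep their own bounding boxes and only gain text, while extracted regions that match nothing are appended unchanged. The actual merge rules in `unstructured` may differ.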

### Note
This PR also fixes issue Unstructured-IO#2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
christinestraub committed Dec 1, 2023
1 parent e5bdf7f commit 69d0ee1
Showing 53 changed files with 482 additions and 228 deletions.
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,7 +1,9 @@
## 0.11.4-dev0
## 0.11.4-dev1

### Enhancements

* **Refactor pdfminer code.** The pdfminer code is moved from `unstructured-inference` to `unstructured`.

### Features

### Fixes
@@ -23,8 +25,8 @@
## 0.11.1

### Enhancements
* **Use `pikepdf` to repair invalid PDF structure** for PDFminer when we see error `PSSyntaxError` when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.

* **Use `pikepdf` to repair invalid PDF structure** for PDFminer when we see error `PSSyntaxError` when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.
* **Batch Source Connector support** For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which created multiple ingest docs after reading them in in batches per process.

### Features
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -36,7 +36,7 @@ idna==3.6
# requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via sphinx
jinja2==3.1.2
# via
@@ -8,7 +8,7 @@
from PIL import Image

from unstructured.documents.elements import PageBreak
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.pdf_image.pdf import partition_pdf
from unstructured.partition.utils.constants import SORT_MODE_BASIC, SORT_MODE_DONT, SORT_MODE_XY_CUT
from unstructured.partition.utils.xycut import (
bbox2points,
2 changes: 1 addition & 1 deletion examples/layout-analysis/visualization.py
@@ -7,7 +7,7 @@
from unstructured_inference.visualize import draw_bbox

from unstructured.documents.elements import PageBreak
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.pdf_image.pdf import partition_pdf

CUR_DIR = pathlib.Path(__file__).parent.resolve()

2 changes: 1 addition & 1 deletion requirements/build.txt
@@ -36,7 +36,7 @@ idna==3.6
# requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via sphinx
jinja2==3.1.2
# via
12 changes: 6 additions & 6 deletions requirements/dev.txt
@@ -91,7 +91,7 @@ idna==3.6
# anyio
# jsonschema
# requests
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via
# build
# jupyter-client
@@ -138,7 +138,7 @@ jsonschema[format-nongpl]==4.20.0
# jupyter-events
# jupyterlab-server
# nbformat
jsonschema-specifications==2023.11.1
jsonschema-specifications==2023.11.2
# via jsonschema
jupyter==1.0.0
# via -r dev.in
@@ -301,7 +301,7 @@ qtconsole==5.5.1
# via jupyter
qtpy==2.4.1
# via qtconsole
referencing==0.31.0
referencing==0.31.1
# via
# jsonschema
# jsonschema-specifications
@@ -319,7 +319,7 @@ rfc3986-validator==0.1.1
# via
# jsonschema
# jupyter-events
rpds-py==0.13.1
rpds-py==0.13.2
# via
# jsonschema
# referencing
@@ -354,7 +354,7 @@ tomli==2.0.1
# jupyterlab
# pip-tools
# pyproject-hooks
tornado==6.3.3
tornado==6.4
# via
# ipykernel
# jupyter-client
@@ -395,7 +395,7 @@ urllib3==1.26.18
# -c constraints.in
# -c test.txt
# requests
virtualenv==20.24.7
virtualenv==20.25.0
# via pre-commit
wcwidth==0.2.12
# via prompt-toolkit
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -4,7 +4,7 @@
#
# pip-compile --output-file=extra-markdown.txt extra-markdown.in
#
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via markdown
markdown==3.5.1
# via -r extra-markdown.in
2 changes: 1 addition & 1 deletion requirements/extra-msg.txt
@@ -6,5 +6,5 @@
#
msg-parser==1.2.0
# via -r extra-msg.in
olefile==0.46
olefile==0.47
# via msg-parser
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.txt
@@ -59,7 +59,7 @@ imageio==2.33.0
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via flask
importlib-resources==6.1.1
# via matplotlib
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -8,7 +8,7 @@ pikepdf
pypdf
# Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.7.15
unstructured-inference==0.7.17
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.txt
@@ -250,7 +250,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.7.15
unstructured-inference==0.7.17
# via -r extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
2 changes: 1 addition & 1 deletion requirements/ingest/airtable.txt
@@ -19,7 +19,7 @@ idna==3.6
# requests
inflection==0.5.1
# via pyairtable
pyairtable==2.2.0
pyairtable==2.2.1
# via -r ingest/airtable.in
pydantic==1.10.13
# via
4 changes: 1 addition & 3 deletions requirements/ingest/azure.txt
@@ -76,9 +76,7 @@ portalocker==2.8.2
pycparser==2.21
# via cffi
pyjwt[crypto]==2.8.0
# via
# msal
# pyjwt
# via msal
requests==2.31.0
# via
# -c ingest/../base.txt
4 changes: 1 addition & 3 deletions requirements/ingest/box.txt
@@ -9,9 +9,7 @@ attrs==23.1.0
boxfs==0.2.1
# via -r ingest/box.in
boxsdk[jwt]==3.9.2
# via
# boxfs
# boxsdk
# via boxfs
certifi==2023.11.17
# via
# -c ingest/../base.txt
2 changes: 1 addition & 1 deletion requirements/ingest/confluence.txt
@@ -4,7 +4,7 @@
#
# pip-compile --output-file=ingest/confluence.txt ingest/confluence.in
#
atlassian-python-api==3.41.3
atlassian-python-api==3.41.4
# via -r ingest/confluence.in
certifi==2023.11.17
# via
8 changes: 5 additions & 3 deletions requirements/ingest/embed-aws-bedrock.txt
@@ -46,6 +46,8 @@ frozenlist==1.4.0
# via
# aiohttp
# aiosignal
greenlet==3.0.1
# via sqlalchemy
idna==3.6
# via
# -c ingest/../base.txt
@@ -62,11 +64,11 @@ jsonpatch==1.33
# langchain-core
jsonpointer==2.4
# via jsonpatch
langchain==0.0.341
langchain==0.0.344
# via -r ingest/embed-aws-bedrock.in
langchain-core==0.0.6
langchain-core==0.0.8
# via langchain
langsmith==0.0.67
langsmith==0.0.68
# via
# langchain
# langchain-core
8 changes: 5 additions & 3 deletions requirements/ingest/embed-huggingface.txt
@@ -51,6 +51,8 @@ fsspec==2023.9.1
# -c ingest/../constraints.in
# huggingface-hub
# torch
greenlet==3.0.1
# via sqlalchemy
huggingface==0.0.1
# via -r ingest/embed-huggingface.in
huggingface-hub==0.19.4
@@ -77,11 +79,11 @@ jsonpatch==1.33
# langchain-core
jsonpointer==2.4
# via jsonpatch
langchain==0.0.341
langchain==0.0.344
# via -r ingest/embed-huggingface.in
langchain-core==0.0.6
langchain-core==0.0.8
# via langchain
langsmith==0.0.67
langsmith==0.0.68
# via
# langchain
# langchain-core
11 changes: 7 additions & 4 deletions requirements/ingest/embed-openai.txt
@@ -43,6 +43,8 @@ frozenlist==1.4.0
# via
# aiohttp
# aiosignal
greenlet==3.0.1
# via sqlalchemy
h11==0.14.0
# via httpcore
httpcore==1.0.2
@@ -62,11 +64,11 @@ jsonpatch==1.33
# langchain-core
jsonpointer==2.4
# via jsonpatch
langchain==0.0.341
langchain==0.0.344
# via -r ingest/embed-openai.in
langchain-core==0.0.6
langchain-core==0.0.8
# via langchain
langsmith==0.0.67
langsmith==0.0.68
# via
# langchain
# langchain-core
@@ -87,7 +89,7 @@ numpy==1.24.4
# -c ingest/../base.txt
# -c ingest/../constraints.in
# langchain
openai==1.3.5
openai==1.3.7
# via -r ingest/embed-openai.in
packaging==23.2
# via
@@ -116,6 +118,7 @@ sniffio==1.3.0
# via
# anyio
# httpx
# openai
sqlalchemy==2.0.23
# via langchain
tenacity==8.2.3
2 changes: 1 addition & 1 deletion requirements/ingest/gcs.txt
@@ -46,7 +46,7 @@ google-api-core==2.14.0
# via
# google-cloud-core
# google-cloud-storage
google-auth==2.23.4
google-auth==2.24.0
# via
# gcsfs
# google-api-core
4 changes: 1 addition & 3 deletions requirements/ingest/github.txt
@@ -30,9 +30,7 @@ pycparser==2.21
pygithub==2.1.1
# via -r ingest/github.in
pyjwt[crypto]==2.8.0
# via
# pygithub
# pyjwt
# via pygithub
pynacl==1.5.0
# via pygithub
python-dateutil==2.8.2
4 changes: 2 additions & 2 deletions requirements/ingest/google-drive.txt
@@ -17,9 +17,9 @@ charset-normalizer==3.3.2
# requests
google-api-core==2.14.0
# via google-api-python-client
google-api-python-client==2.108.0
google-api-python-client==2.109.0
# via -r ingest/google-drive.in
google-auth==2.23.4
google-auth==2.24.0
# via
# google-api-core
# google-api-python-client
10 changes: 5 additions & 5 deletions requirements/ingest/hubspot.txt
@@ -2,19 +2,19 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/ingest-hubspot.in
# pip-compile --output-file=ingest/hubspot.txt ingest/hubspot.in
#
certifi==2023.7.22
certifi==2023.11.17
# via hubspot-api-client
hubspot-api-client==8.1.1
# via -r requirements/ingest-hubspot.in
# via -r ingest/hubspot.in
python-dateutil==2.8.2
# via hubspot-api-client
six==1.16.0
# via
# hubspot-api-client
# python-dateutil
urllib3==1.26.17
urllib3==2.1.0
# via
# -r requirements/ingest-hubspot.in
# -r ingest/hubspot.in
# hubspot-api-client
2 changes: 1 addition & 1 deletion requirements/ingest/jira.txt
@@ -4,7 +4,7 @@
#
# pip-compile --output-file=ingest/jira.txt ingest/jira.in
#
atlassian-python-api==3.41.3
atlassian-python-api==3.41.4
# via -r ingest/jira.in
certifi==2023.11.17
# via
2 changes: 1 addition & 1 deletion requirements/ingest/mongodb.txt
@@ -6,5 +6,5 @@
#
dnspython==2.4.2
# via pymongo
pymongo==4.6.0
pymongo==4.6.1
# via -r ingest/mongodb.in
4 changes: 1 addition & 3 deletions requirements/ingest/onedrive.txt
@@ -40,9 +40,7 @@ office365-rest-python-client==2.4.2
pycparser==2.21
# via cffi
pyjwt[crypto]==2.8.0
# via
# msal
# pyjwt
# via msal
pytz==2023.3.post1
# via office365-rest-python-client
requests==2.31.0
4 changes: 1 addition & 3 deletions requirements/ingest/outlook.txt
@@ -34,9 +34,7 @@ office365-rest-python-client==2.4.2
pycparser==2.21
# via cffi
pyjwt[crypto]==2.8.0
# via
# msal
# pyjwt
# via msal
pytz==2023.3.post1
# via office365-rest-python-client
requests==2.31.0
4 changes: 2 additions & 2 deletions requirements/ingest/pinecone.in
@@ -1,3 +1,3 @@
-c constraints.in
-c base.txt
-c ../constraints.in
-c ../base.txt
pinecone-client