Skip to content

Commit

Permalink
rfctr(pptx): extract _PptxPartitionerOptions (Unstructured-IO#2853)
Browse files Browse the repository at this point in the history
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitioningOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
  • Loading branch information
scanny committed Apr 8, 2024
1 parent a9b6506 commit 2c7e028
Show file tree
Hide file tree
Showing 11 changed files with 537 additions and 289 deletions.
12 changes: 12 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -116,13 +116,20 @@ jobs:
- name: Test
env:
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
TESSERACT_VERSION : "5.3.4"
run: |
source .venv/bin/activate
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
installed_tesseract_version=$(tesseract --version | grep -oP '(?<=tesseract )\d+\.\d+\.\d+')
if [ "$installed_tesseract_version" != "${{env.TESSERACT_VERSION}}" ]; then
echo "Tesseract version ${{env.TESSERACT_VERSION}} is required but found version $installed_tesseract_version"
exit 1
fi
# FIXME (yao): sometimes there is cache but we still miss argilla in the env; so we add make install-ci again
make install-ci
make test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
Expand Down Expand Up @@ -156,6 +163,7 @@ jobs:
sudo apt-get install -y poppler-utils
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
make test-chipper CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
Expand Down Expand Up @@ -224,6 +232,7 @@ jobs:
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
make test-extra-${{ matrix.extra }} CI=true
Expand Down Expand Up @@ -343,6 +352,7 @@ jobs:
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y tesseract-ocr-kor
sudo apt-get install diffstat
Expand Down Expand Up @@ -408,6 +418,7 @@ jobs:
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
make install-pandoc
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install -y tesseract-ocr
sudo apt-get install -y tesseract-ocr-kor
sudo apt-get install diffstat
Expand Down Expand Up @@ -454,6 +465,7 @@ jobs:
make install-ci
sudo apt-get update && sudo apt-get install --yes poppler-utils libreoffice
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
tesseract --version
make install-nltk-models
Expand Down
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -398,7 +398,7 @@ check-flake8-print:
.PHONY: check-ruff
check-ruff:
# -- ruff options are determined by pyproject.toml --
ruff .
ruff check .

.PHONY: check-autoflake
check-autoflake:
Expand Down
File renamed without changes.
3 changes: 0 additions & 3 deletions test_unstructured/partition/pptx/test_files/README.md

This file was deleted.

Loading

0 comments on commit 2c7e028

Please sign in to comment.