Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[On top of PR #39] Feat. Postprocessing control #40

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

pradhyumna85
Copy link
Contributor

@pradhyumna85 pradhyumna85 commented Sep 15, 2024

To accommodate and resolve #37

Changes

Note: This PR adds changes on top of PR #39. If merged, this will accommodate changes of PR #39, which won't require the previous PR to be merged.

  • added post_process_function param to override/skip Zerox's default format_markdown post processing on the model's text output.
  • removed output_dir param and added output_file_path which is more flexible for arbitrary file extensions
  • page_separator param added (used when writing the consolidated output to the output_file_path

Edit:
Fixes #42

Pradyumna Singh Rathore added 3 commits September 14, 2024 00:46
Changes:
- Feature added: now specific pages can be processed with the python sdk using "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for python sdk
- workflow for the above feature: create a new temperory pdf in the tempdir if select_pages is specified and follow the rest of the process as usual and finally map the page number in the formatted markdown to get the actual number instead of index.
- raise warning when both select_pages and maintain used.
- required adaptations and updates in messages, exceptions, types, processor, utils etc

Fixes/improvements:
- memory efficient pdf to image conversion, utilizing paths only option to directly get sorted image paths from pdf2image api

Misc:
- Bump the version tag
- documentation updated
- added post_process_function param to override/skip Zerox's default format_markdown post processing on the model's text output.
- removed output_dir param and added output_file_path which is more flexible for arbitrary file extensions
- page_separator param added (used when writing the consolidated output to the output_file_path
@pradhyumna85 pradhyumna85 changed the title [To be applied after PR #39] Feat. Postprocessing control [On top of PR #39] Feat. Postprocessing control Sep 15, 2024
@tylermaran
Copy link
Contributor

I'll take a look and test this one as well. I like the page_separator optional param. I was thinking of adding a <=== Page {x} ===> to our own use of zerox the other day. So make sense!

@pradhyumna85
Copy link
Contributor Author

pradhyumna85 commented Sep 18, 2024

I'll take a look and test this one as well. I like the page_separator optional param. I was thinking of adding a <=== Page {x} ===> to our own use of zerox the other day. So make sense!

@tylermaran, you want to provide an option to pass a string with fixed placeholder like {page_no} (if this placeholder is not found then we don't populate anything) and populate that with the page number while writing the output markdown file (if the user has choosed to output)?

@tylermaran
Copy link
Contributor

Hey @pradhyumna85, thinking through this a bit more. Right now we return an array of objects (including page number, content, etc.). So for our day to day use I was doing:

const result = await zerox(...)
const aggregatedText = result.pages.map((el) => el.content).join('\n\n');

But if we're writing to the output directory, it makes sense to have a configurable page deliminator built in. Although something like === Page {X} === might be a better default than \n\n.

@pradhyumna85
Copy link
Contributor Author

@tylermaran, Added the functionality for you to have a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants