[On top of PR #39] Feat. Postprocessing control #40

pradhyumna85 · 2024-09-15T19:17:29Z

To accommodate and resolve #37

Changes

Note: This PR builds on top of PR #39. If this PR is merged, it will include the changes from PR #39, so PR #39 won't need to be merged separately.

Added post_process_function param to override or skip Zerox's default format_markdown post-processing of the model's text output.
Removed output_dir param and added output_file_path, which is more flexible for arbitrary file extensions.
Added page_separator param (used when writing the consolidated output to output_file_path).
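To illustrate how the three parameters above could interact, here is a minimal sketch. It is not the code from this PR: `consolidate_pages` and `default_cleanup` are hypothetical names, and the default cleanup is only a stand-in for Zerox's format_markdown step. It assumes that passing `None` as post_process_function skips post-processing entirely.

```python
from pathlib import Path
from typing import Callable, List, Optional


def default_cleanup(text: str) -> str:
    """Hypothetical stand-in for the default post-processing step."""
    return text.strip()


def consolidate_pages(
    raw_pages: List[str],
    output_file_path: Optional[str] = None,
    post_process_function: Optional[Callable[[str], str]] = default_cleanup,
    page_separator: str = "\n\n",
) -> str:
    """Post-process each page, join the pages, and optionally write one file."""
    if post_process_function is None:
        # Assumption: None means "skip post-processing" and keep raw model output.
        pages = list(raw_pages)
    else:
        pages = [post_process_function(page) for page in raw_pages]

    combined = page_separator.join(pages)

    if output_file_path is not None:
        # A full file path (rather than a directory) lets the caller pick
        # any extension, e.g. .md, .json, or .txt.
        path = Path(output_file_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(combined, encoding="utf-8")

    return combined
```

With this shape, a caller who prompts the model for JSON could pass `post_process_function=None` and an `output_file_path` ending in `.json`, so that markdown cleanup does not alter the output.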

Edit:
Fixes #42

Changes:
- Feature added: specific pages can now be processed with the Python SDK using the "select_pages" param. Incorporates getomni-ai#23, getomni-ai#24 for the Python SDK.
- Workflow for the above feature: create a new temporary PDF in the tempdir if select_pages is specified, follow the rest of the process as usual, and finally map the page numbers in the formatted markdown to the actual page numbers instead of indices.
- Raise a warning when both select_pages and maintain are used.
- Required adaptations and updates in messages, exceptions, types, processor, utils, etc.

Fixes/improvements:
- Memory-efficient PDF-to-image conversion, using the paths-only option to get sorted image paths directly from the pdf2image API.

Misc:
- Bump the version tag
- Documentation updated
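The page-number mapping in the select_pages workflow can be sketched as follows. This is an illustration under assumptions, not the PR's code: `normalize_select_pages` and `map_to_original_pages` are hypothetical helpers, pages are assumed to be 1-indexed, and the temporary PDF is assumed to contain the selected pages in ascending order. Handling `None` explicitly before sorting avoids the kind of error reported in #42.

```python
from typing import Iterable, List, Optional, Union


def normalize_select_pages(
    select_pages: Union[int, Iterable[int], None],
) -> Optional[List[int]]:
    """Return a sorted, de-duplicated page list, or None when nothing is selected."""
    if select_pages is None:
        # Guard first: sorting None would raise a TypeError.
        return None
    if isinstance(select_pages, int):
        return [select_pages]
    return sorted(set(select_pages))


def map_to_original_pages(
    num_processed: int,
    select_pages: Optional[List[int]],
) -> List[int]:
    """Map positions in the temporary PDF back to original page numbers."""
    if select_pages is None:
        # No selection: position i in the output is simply page i + 1.
        return list(range(1, num_processed + 1))
    # With a selection: the i-th processed page is the i-th selected page.
    return [select_pages[i] for i in range(num_processed)]
```

For example, selecting pages 7, 2, and 5 yields a three-page temporary PDF whose results are reported as pages 2, 5, and 7 rather than 1, 2, and 3.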


tylermaran · 2024-09-17T23:05:46Z

I'll take a look and test this one as well. I like the page_separator optional param. I was thinking of adding a <=== Page {x} ===> to our own use of zerox the other day. So make sense!

pradhyumna85 · 2024-09-18T12:55:03Z

> I'll take a look and test this one as well. I like the page_separator optional param. I was thinking of adding a <=== Page {x} ===> to our own use of zerox the other day. So make sense!

@tylermaran, do you want to provide an option to pass a string with a fixed placeholder like {page_no} (if the placeholder is not found, we don't populate anything) and fill it with the page number while writing the output markdown file (if the user has chosen to write output)?
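A rough sketch of the placeholder idea described above, assuming the placeholder is the literal text `{page_no}`. The helper names `render_separator` and `join_with_page_markers` are hypothetical. Plain string replacement is used instead of `str.format` so that a separator containing other braces, or no placeholder at all, passes through unchanged.

```python
from typing import List, Tuple


def render_separator(template: str, page_no: int) -> str:
    """Fill in {page_no} if present; otherwise return the template unchanged."""
    return template.replace("{page_no}", str(page_no))


def join_with_page_markers(pages: List[Tuple[int, str]], template: str) -> str:
    """Emit a rendered separator before each page's content."""
    parts: List[str] = []
    for page_no, content in pages:
        parts.append(render_separator(template, page_no))
        parts.append(content)
    return "\n\n".join(parts)
```

Calling `join_with_page_markers([(2, "A"), (5, "B")], "== {page_no} ==")` labels each page with its actual page number, while a template without the placeholder behaves as a fixed separator.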

tylermaran · 2024-09-18T20:09:11Z

Hey @pradhyumna85, thinking through this a bit more. Right now we return an array of objects (including page number, content, etc.). So for our day to day use I was doing:

const result = await zerox(...)
const aggregatedText = result.pages.map((el) => el.content).join('\n\n');

But if we're writing to the output directory, it makes sense to have a configurable page delimiter built in. Although something like === Page {X} === might be a better default than \n\n.

pradhyumna85 · 2024-09-19T07:52:24Z

@tylermaran, Added the functionality for you to have a look.

Pradyumna Singh Rathore added 3 commits September 14, 2024 00:46

Merge remote-tracking branch 'origin/main' into feat-page-number-input

1ecb275

pradhyumna85 changed the title ~~[To be applied after PR #39] Feat. Postprocessing control~~ [On top of PR #39] Feat. Postprocessing control Sep 15, 2024

pradhyumna85 mentioned this pull request Sep 15, 2024

Ability to prompt for json output ? #37

Open

pradhyumna85 and others added 2 commits September 18, 2024 18:07

Merge branch 'main' into formatting-control

ff77b36

bump the python sdk version tag

c96b83b

Pradyumna Singh Rathore added 4 commits September 19, 2024 12:12

added default page number separator which will use the page numbers

316e8f6

fix: bump version tag in setup.py

d79740f

fix issue after getomni-ai#39: select pages error when select pages is not passed

71d7d1b

better default to prevent mixup in markdown output

5846b57

pradhyumna85 mentioned this pull request Sep 19, 2024

select pages broken in #39 when select_pages in not passed -> error sorting None type #42

Open

minor update

eccc73b
