Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfctr: prepare to add orig_elements serde #2668

Merged
merged 9 commits into from
Mar 20, 2024

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Mar 19, 2024

Summary
The serialization and deserialization (serde) of metadata.orig_elements will be located in unstructured.staging.base alongside elements_to_json() and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for metadata.orig_elements.

Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.

Make test module naming consistent, "test_" + module-name of module
under test.

This eases locating the test module and also allows automatic switching
between the test module and the module under test in IDEs.
@scanny scanny force-pushed the scanny/prep-for-orig-elements-serde branch from ba80c5e to 09d7934 Compare March 19, 2024 22:59
Preserve legacy names to avoid a breaking change. Standardize on the
uniform `elements_to_*` for serialization and `elements_from_*` for
deserialization which is already established.

Change references within the `unstructured` repo. These functions may
still be called from outside the library, but the aliases will prevent
those calls from breaking.
NO LOGIC CHANGES, only moving functions within module.

These functions all cohere around serialization and deserialization of
elements (list[Element]). Because the staging "brick" is deprecated,
these will need a new home and `unstructured/documents/elements` is the
likely location since serde for Elements coheres strongly with elements.

Place all these functions together and sequence in alphabetical order
for easy location.
Addresses the `import file mismatch` error when two have the same name,
like `test_base.py` which is common across multiple subpackages.
return flattened_dict


def _get_table_fieldnames(rows):
def _get_table_fieldnames(rows: list[dict[str, Any]]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using list instead of Sequence because this is a private method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, no, this could definitely be Sequence. The main reason is the type checker wasn't complaining and we know it will always be this exact type and never list[some-subtype-of-dict[str, Any]]. But it wouldn't hurt to change it to Sequence.

The reason i use Sequence for elements is because list is invariant. Invariant means that this function call will type-check:

def f(elements: list[Element]):
    ...

f([Element("abc"), Element("def")])

but not this one:

f([Title("abc"), Text("def")])

The idea being that a list of Element subtypes are not acceptable as a list of Element.

Sequence on the other hand is covariant, which means a sequence of Element subtypes is acceptable as Sequence[Element].

The other kind of variance is contra-variance. More on that business in this article: https://www.playfulpython.com/type-hinting-covariance-contra-variance/

And I like this guy's videos on the subject: https://www.youtube.com/watch?v=7Chd5gPHlDg

A good principle is: "be as permissive as possible for input types and as specific as possible for return types". Following that principle, all we need of rows is Iterable[dict[str, Any]]. So if we were going to change it then probably that would be the best choice.

My intention in these initial mechanical refactorings has just been to get to clean-strict as quickly as possible, so I haven't thought a lot about that, but going forward, as long as we're putting in the effort, it probably makes sense to do as good a job as we can, I'm sure it doesn't take any longer really.

@@ -78,7 +78,7 @@ def convert_to_isd(elements: Sequence[Element]) -> list[dict[str, Any]]:


def convert_to_dict(elements: Sequence[Element]) -> list[dict[str, Any]]:
"""Converts a list of elements into a dictionary."""
"""Convert a list of elements into a list of element-dicts."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but shouldn't this be "Converts" to stay consistent with the other docstrings? I am not confident on what the "correct" tense is for docstrings, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think imperative tense is appropriate for docstrings, like "do this" as opposed to "does this". So I'd be more inclined to change any others that say Converts to Convert.

But my standard with these refactorings is to improve the code rather than to perfect it, mainly because the latter is unachievable and we need to get on with other things. So I'm happy with the Boy Scout rule, which is "leave it cleaner than you found it" :) So I didn't go looking for other cases, although now that you've mentioned it I probably won't be able to stop myself ... :)

Copy link
Collaborator

@Coniferish Coniferish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the organized commits :)

@scanny scanny force-pushed the scanny/prep-for-orig-elements-serde branch from 84af9e9 to 5cbaf06 Compare March 20, 2024 20:01
@scanny scanny added this pull request to the merge queue Mar 20, 2024
Merged via the queue into main with commit 31bef43 Mar 20, 2024
46 checks passed
@scanny scanny deleted the scanny/prep-for-orig-elements-serde branch March 20, 2024 22:02
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.

**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.

**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants