rfctr: prepare to add orig_elements serde #2668

scanny · 2024-03-19T22:43:59Z

Summary
The serialization and deserialization (serde) of metadata.orig_elements will be located in unstructured.staging.base alongside elements_to_json() and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for metadata.orig_elements.

Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.

Make test module naming consistent, "test_" + module-name of module under test. This eases locating the test module and also allows automatic switching between the test module and the module under test in IDEs.

test_unstructured/documents/test_elements.py

test_unstructured/staging/test_base.py

Preserve legacy names to avoid a breaking change. Standardize on the uniform `elements_to_*` for serialization and `elements_from_*` for deserialization which is already established. Change references within the `unstructured` repo. These functions may still be called from outside the library, but the aliases will prevent those calls from breaking.

NO LOGIC CHANGES, only moving functions within module. These functions all cohere around serialization and deserialization of elements (list[Element]). Because the staging "brick" is deprecated, these will need a new home and `unstructured/documents/elements` is the likely location since serde for Elements coheres strongly with elements. Place all these functions together and sequence in alphabetical order for easy location.

Addresses the `import file mismatch` error when two have the same name, like `test_base.py` which is common across multiple subpackages.

test_unstructured/staging/test_base.py

Coniferish · 2024-03-20T15:55:47Z

unstructured/staging/base.py

    return flattened_dict


-def _get_table_fieldnames(rows):
+def _get_table_fieldnames(rows: list[dict[str, Any]]):


Are you using list instead of Sequence because this is a private method?

Hmm, no, this could definitely be Sequence. The main reason is the type checker wasn't complaining and we know it will always be this exact type and never list[some-subtype-of-dict[str, Any]]. But it wouldn't hurt to change it to Sequence.

The reason i use Sequence for elements is because list is invariant. Invariant means that this function call will type-check:

def f(elements: list[Element]): ... f([Element("abc"), Element("def")])

but not this one:

f([Title("abc"), Text("def")])

The idea being that a list of Element subtypes are not acceptable as a list of Element.

Sequence on the other hand is covariant, which means a sequence of Element subtypes is acceptable as Sequence[Element].

The other kind of variance is contra-variance. More on that business in this article: https://www.playfulpython.com/type-hinting-covariance-contra-variance/

And I like this guy's videos on the subject: https://www.youtube.com/watch?v=7Chd5gPHlDg

A good principle is: "be as permissive as possible for input types and as specific as possible for return types". Following that principle, all we need of rows is Iterable[dict[str, Any]]. So if we were going to change it then probably that would be the best choice.

My intention in these initial mechanical refactorings has just been to get to clean-strict as quickly as possible, so I haven't thought a lot about that, but going forward, as long as we're putting in the effort, it probably makes sense to do as good a job as we can, I'm sure it doesn't take any longer really.

Coniferish · 2024-03-20T16:03:47Z

unstructured/staging/base.py

@@ -78,7 +78,7 @@ def convert_to_isd(elements: Sequence[Element]) -> list[dict[str, Any]]:


 def convert_to_dict(elements: Sequence[Element]) -> list[dict[str, Any]]:
-    """Converts a list of elements into a dictionary."""
+    """Convert a list of elements into a list of element-dicts."""


minor, but shouldn't this be "Converts" to stay consistent with the other docstrings? I am not confident on what the "correct" tense is for docstrings, though.

I think imperative tense is appropriate for docstrings, like "do this" as opposed to "does this". So I'd be more inclined to change any others that say Converts to Convert.

But my standard with these refactorings is to improve the code rather than to perfect it, mainly because the latter is unachievable and we need to get on with other things. So I'm happy with the Boy Scout rule, which is "leave it cleaner than you found it" :) So I didn't go looking for other cases, although now that you've mentioned it I probably won't be able to stop myself ... :)

Coniferish

LGTM! Thanks for the organized commits :)

**Summary** The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. **Reviewers:** The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.

scanny added 3 commits March 19, 2024 14:20

rfctr: test_base_staging.py -> test_base.py

654895a

Make test module naming consistent, "test_" + module-name of module under test. This eases locating the test module and also allows automatic switching between the test module and the module under test in IDEs.

rfctr: improve typing

bee9d11

rfctr: improve docstrings

1394d51

scanny commented Mar 19, 2024

View reviewed changes

test_unstructured/documents/test_elements.py Show resolved Hide resolved

scanny commented Mar 19, 2024

View reviewed changes

test_unstructured/documents/test_elements.py Show resolved Hide resolved

scanny commented Mar 19, 2024

View reviewed changes

test_unstructured/staging/test_base.py Show resolved Hide resolved

scanny force-pushed the scanny/prep-for-orig-elements-serde branch from ba80c5e to 09d7934 Compare March 19, 2024 22:59

scanny added 3 commits March 19, 2024 16:04

rfctr: improve expression

662f4a1

scanny force-pushed the scanny/prep-for-orig-elements-serde branch from 09d7934 to d9ca913 Compare March 19, 2024 23:05

scanny temporarily deployed to ci March 19, 2024 23:07 — with GitHub Actions Inactive

chore: bump CHANGELOG + __version__

f0f3ad3

scanny force-pushed the scanny/prep-for-orig-elements-serde branch from d9ca913 to f0f3ad3 Compare March 19, 2024 23:12

scanny temporarily deployed to ci March 19, 2024 23:24 — with GitHub Actions Inactive

scanny temporarily deployed to ci March 19, 2024 23:59 — with GitHub Actions Inactive

fix: add __init__.py files in test sub-packages

33ff418

Addresses the `import file mismatch` error when two have the same name, like `test_base.py` which is common across multiple subpackages.

scanny force-pushed the scanny/prep-for-orig-elements-serde branch from 49eeb31 to 33ff418 Compare March 20, 2024 00:54

scanny temporarily deployed to ci March 20, 2024 00:56 — with GitHub Actions Inactive

scanny requested a review from Coniferish March 20, 2024 03:21

Coniferish reviewed Mar 20, 2024

View reviewed changes

test_unstructured/staging/test_base.py Show resolved Hide resolved

Coniferish reviewed Mar 20, 2024

View reviewed changes

Coniferish approved these changes Mar 20, 2024

View reviewed changes

scanny temporarily deployed to ci March 20, 2024 19:54 — with GitHub Actions Inactive

scanny temporarily deployed to ci March 20, 2024 19:55 — with GitHub Actions Inactive

fix: some additional improvements from CR

5cbaf06

scanny force-pushed the scanny/prep-for-orig-elements-serde branch from 84af9e9 to 5cbaf06 Compare March 20, 2024 20:01

scanny temporarily deployed to ci March 20, 2024 20:02 — with GitHub Actions Inactive

scanny added this pull request to the merge queue Mar 20, 2024

Merged via the queue into main with commit 31bef43 Mar 20, 2024
46 checks passed

scanny deleted the scanny/prep-for-orig-elements-serde branch March 20, 2024 22:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rfctr: prepare to add orig_elements serde #2668

rfctr: prepare to add orig_elements serde #2668

scanny commented Mar 19, 2024 •

edited

Loading

Coniferish Mar 20, 2024

scanny Mar 20, 2024

Coniferish Mar 20, 2024

scanny Mar 20, 2024

Coniferish left a comment

rfctr: prepare to add orig_elements serde #2668

rfctr: prepare to add orig_elements serde #2668

Conversation

scanny commented Mar 19, 2024 • edited Loading

Coniferish Mar 20, 2024

Choose a reason for hiding this comment

scanny Mar 20, 2024

Choose a reason for hiding this comment

Coniferish Mar 20, 2024

Choose a reason for hiding this comment

scanny Mar 20, 2024

Choose a reason for hiding this comment

Coniferish left a comment

Choose a reason for hiding this comment

scanny commented Mar 19, 2024 •

edited

Loading