-
Notifications
You must be signed in to change notification settings - Fork 702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfctr: prepare to add orig_elements serde #2668
Conversation
Make test module naming consistent, "test_" + module-name of module under test. This eases locating the test module and also allows automatic switching between the test module and the module under test in IDEs.
ba80c5e
to
09d7934
Compare
Preserve legacy names to avoid a breaking change. Standardize on the uniform `elements_to_*` for serialization and `elements_from_*` for deserialization which is already established. Change references within the `unstructured` repo. These functions may still be called from outside the library, but the aliases will prevent those calls from breaking.
NO LOGIC CHANGES, only moving functions within module. These functions all cohere around serialization and deserialization of elements (list[Element]). Because the staging "brick" is deprecated, these will need a new home and `unstructured/documents/elements` is the likely location since serde for Elements coheres strongly with elements. Place all these functions together and sequence in alphabetical order for easy location.
09d7934
to
d9ca913
Compare
d9ca913
to
f0f3ad3
Compare
Addresses the `import file mismatch` error when two have the same name, like `test_base.py` which is common across multiple subpackages.
49eeb31
to
33ff418
Compare
return flattened_dict | ||
|
||
|
||
def _get_table_fieldnames(rows): | ||
def _get_table_fieldnames(rows: list[dict[str, Any]]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you using list
instead of Sequence
because this is a private method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, no, this could definitely be Sequence
. The main reason is the type checker wasn't complaining and we know it will always be this exact type and never list[some-subtype-of-dict[str, Any]]
. But it wouldn't hurt to change it to Sequence
.
The reason i use Sequence
for elements is because list
is invariant. Invariant means that this function call will type-check:
def f(elements: list[Element]):
...
f([Element("abc"), Element("def")])
but not this one:
f([Title("abc"), Text("def")])
The idea being that a list of Element
subtypes are not acceptable as a list of Element
.
Sequence
on the other hand is covariant, which means a sequence of Element
subtypes is acceptable as Sequence[Element]
.
The other kind of variance is contra-variance. More on that business in this article: https://www.playfulpython.com/type-hinting-covariance-contra-variance/
And I like this guy's videos on the subject: https://www.youtube.com/watch?v=7Chd5gPHlDg
A good principle is: "be as permissive as possible for input types and as specific as possible for return types". Following that principle, all we need of rows
is Iterable[dict[str, Any]]
. So if we were going to change it then probably that would be the best choice.
My intention in these initial mechanical refactorings has just been to get to clean-strict as quickly as possible, so I haven't thought a lot about that, but going forward, as long as we're putting in the effort, it probably makes sense to do as good a job as we can, I'm sure it doesn't take any longer really.
unstructured/staging/base.py
Outdated
@@ -78,7 +78,7 @@ def convert_to_isd(elements: Sequence[Element]) -> list[dict[str, Any]]: | |||
|
|||
|
|||
def convert_to_dict(elements: Sequence[Element]) -> list[dict[str, Any]]: | |||
"""Converts a list of elements into a dictionary.""" | |||
"""Convert a list of elements into a list of element-dicts.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but shouldn't this be "Converts" to stay consistent with the other docstrings? I am not confident on what the "correct" tense is for docstrings, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think imperative tense is appropriate for docstrings, like "do this" as opposed to "does this". So I'd be more inclined to change any others that say Converts
to Convert
.
But my standard with these refactorings is to improve the code rather than to perfect it, mainly because the latter is unachievable and we need to get on with other things. So I'm happy with the Boy Scout rule, which is "leave it cleaner than you found it" :) So I didn't go looking for other cases, although now that you've mentioned it I probably won't be able to stop myself ... :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the organized commits :)
84af9e9
to
5cbaf06
Compare
**Summary** The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. **Reviewers:** The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.
**Summary** The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. **Reviewers:** The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.
Summary
The serialization and deserialization (serde) of
metadata.orig_elements
will be located inunstructured.staging.base
alongsideelements_to_json()
and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions formetadata.orig_elements
.Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.