Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(expressions): Use compound names in function identifiers #23

Closed

Conversation

wackywendell
Copy link
Contributor

This change addresses a bug where the type signatures in substrait proto files were lost in decoding -> encoding, and only simple names were used.

To address this, I updated the expression builders to always use compound names (e.g. add:i64_i64) and not just simple names (add). The spec currently requires compound names for functions with multiple impls (and may soon require them always), and the signatures were lost in the conversion from protobuf to internal formats. That is the first commit, with tests updated in the second commit.

Secondly, I changed the code so that if the function is not found with the given type signature, a "not found" error is returned, just as it is if the function is not found. That's the third commit. I'm less attached to this, but it helped in testing because I found a number of tests that "passed" incorrectly as (with the change above) they were referring to non-existent functions.

For that second part - I'm not sure it's necessary; we can drop that commit if you prefer. Alternatively - it might be worth removing the NewCustom*Func functions altogether.

@vbarua vbarua requested a review from zeroshade August 8, 2023 20:31
Copy link
Contributor

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this change, the intent for allowing usage of the simple names was so that the builders can be used to auto resolve the compound function names based on the passed in argument types. You can see that resolution issue here when you had to change the types of the outputs. Notice that in the expressions_tests the output is no longer fp32. We want the type resolution to still work.

As for the NewCustom*Func functions, those are necessary for the cases where users utilize custom URNs rather than valid URIs. This is why we don't error out when we can't find the function. You can see more discussion of it here: #14 (comment)

func ExampleExpression_scalarFunction() {
func TestExampleExpressionScalarFunction(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this change? The intent was for this to show up as an example in the docs on pkg.go.dev, by making this change the test will no longer show up as a testable example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That makes sense. I wanted to use assert as used elsewhere; I'm not used to example's like that, so I didn't realize the regression. Thanks for pointing it out!

Comment on lines -120 to -136
// print string represention of the expression
fmt.Println(fromProto)
// print the string representation of our
// manually constructed expression
// Check the string represention of the expression
assert.Equal(t, "add:i8_i8(.field(0), fp64(10)) => i8", fromProto.String())
// Check our manually constructed expression
assert.Equal(t, "add:i8_i8(.field(0) => i8, fp64(10)) => i8", ex.String())
fmt.Println(ex)

// verify that the Equals methods work recursively
fmt.Println(ex.Equals(fromProto))
// verify that the two are equal
assert.True(t, ex.Equals(fromProto))
// confirm our manually constructed expression is the same
// as the one we got from protojson
fmt.Println(pb.Equal(&exprProto, toProto))

// Output:
// add(.field(0), fp64(10)) => i32
// add(.field(0) => i32, fp64(10)) => i32
// true
// true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we want the // Output: and such so that this shows up as a testable example in the documentation to make it easy for consumers to see how to work with the library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, as above 👍

@wackywendell
Copy link
Contributor Author

@zeroshade - great points! Yes, I had some uncertainty going into this, and you've confirmed it.

Perhaps we should start by clarifying goals / requirements, so that we can get an overall improvement?

I see a few that we've gathered:

  1. The protobuf output from the builder should use compound names for functions in the substrait extensions. This is required by the spec for functions with multiple impls, and now for single impls as well. (This was the intent of this PR)
  2. Converting a protobuf input to substrait internal representation should interpret compound function names into the correct SubstraitFunctionVariant with the appropriate types. (Also intended by this PR)
  3. The builder should be able to accept function names without compound types, and determine the correct function variant. (Regression in this PR)
  4. Users should be able to add custom functions specified with custom URNs, the NewCustom*Func functions. (Regression in this PR)
  5. Examples should show up in the Godocs. (Regression in this PR).

So you pointed out 3-5, and that makes sense to me; I can work on that. I might do a bit of looking at the spec and ask about (3) on Slack, as the intention makes sense in a vague way to me, but I don't understand the specifics, and I'd like to.

For 1-2, do those make sense to you as reasonable goals? Also - please let me know if any of the above is incorrect or could be restated better, I'm new to this repo and trying to figure it out. Thanks for your help, I appreciate it!

@zeroshade
Copy link
Contributor

Hey @wackywendell:

1 and 2 make perfect sense as reasonable goals for me. I believe that number 2 should already occur (to my knowledge compound function names should already get interpretted into the correct variant, if not then I must have missed a test somewhere and I'm glad you're fixing it!). Did you find a case where number 2 isn't currently true in the code?

@wackywendell
Copy link
Contributor Author

For (2), it looks like that works most of the time, but has some issues with functions that are (a) specified with compound names, and (b) not found in the YAML, so they end up getting created with the NewCustomScalarFunc machinery.

See example here, where an ID with a compound name add:i32_i32 gets translated to simple name add:i32_i32 and type signature empty, leading to a full compound name add:i32_i32:.

This becomes relevant because if you fix (1), then you get problems related to (2), where parsing a protobuf file into the substrait-go representation and then back to protobuf fails.

🤔 I think this might be solved by changing the NewScalarFuncVariant function to parse compound names into appropriate arguments, if the name is found to be compound?

@zeroshade
Copy link
Contributor

🤔 I think this might be solved by changing the NewScalarFuncVariant function to parse compound names into appropriate arguments, if the name is found to be compound?

Without digging too much into it myself at the moment, that sounds like a reasonable approach. Wanna try it out and see if that fixes the issue?

@wackywendell
Copy link
Contributor Author

Yes - I'm giving it a try. Thanks! Let me see what code I can come up with and get back to you 🙇

@wackywendell
Copy link
Contributor Author

Closing in favor of #24.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants