-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for inline struct creation #4120
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with a couple of suggestions.
Trying not to bikeshed this too hard - but you asked for feedback, so here it is 😉 I'm not generally a fan of adding any extra syntax which depends on additional punctuation characters (the curly braces in this case) - it's always confusing trying to remember which brace-type to use in which situation and then you have to go look it up. Secondly, we have a (not uncommon) pattern of using functions to create non-primitive types (cf. Side note: would be great if we fixed the parser and udf code to allow keywords as function names too, so we could rename these to just The difference with existing fns which create non-primitives is, of course, that we have to name the components of a struct. I think rather than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome @agavra!
Like @blueedgenick I really don't like the use fo {}
to represent a struct constructor. We should go with a more standard SQL approach of STRUCT(col0/col2 AS f0, f1)
, which would result in a schema like STRUCT<F0 INT, F1 STRING>
. We can support this constructor syntax.
Changing the syntax won't mean much code changes.
Aside from that the code LGTM, except that there's some confusion in the naming of new types and methods. Half the time you've named things 'struct', which makes sense, but the other half its 'schema', e.g. addSchema'... I don't get it - is this intentional, or a repeated mistype? I've also left a few suggestions for improve names e.g.
StructExpression->
StructConstructorExpression`, which would help, IMHO, to make the code more readable.
Finally, if we're releasing this, we should look to do #3593 at the same time. :D
Add me back in as a reviewer if you want me to review again.
ksql-parser/src/main/antlr4/io/confluent/ksql/parser/SqlBase.g4
Outdated
Show resolved
Hide resolved
ksql-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenSpec.java
Outdated
Show resolved
Hide resolved
|
||
private int argumentCount = 0; | ||
private int schemaCount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private int schemaCount = 0; | |
private int structConstructorCount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distinction between struct
and schema
is intentional in the code. The expression evaluator takes in as an argument the schemas for the structs, not the struct expressions themselves. If you look at the code, it is used as new Struct(schema0)...
. In this code, schema0
is very much a schema.
It's just like how we have the method addFunction
and not addFunctionName
. The functionName
is the key, but the function
is what is actually being stored.
Resolving the other conversations that are about this to concentrate on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I still disagree.
It's just like how we have the method addFunction and not addFunctionName. The functionName is the key, but the function is what is actually being stored.
Yep, agreed. So what is being stored in this case is not just any schema, but a struct constructor's schema, so the name should reflect that, e.g. structSchemaCount
. Or maybe you're planning on re-using this counter when we add MAP
, DECIMAL
or ARRAY
constructors that also need a schema? In which case it might be called typeConstructorCount
.
I just think losing the connection between the 'struct' and/or 'constructor' part in these method and variable names makes the code hard to grasp.
Maybe getSchemaName
should be getStructSchemaName
or getConstructorSchemaName
if you're planning on using the same for other type constructors.
I know we're only disagreeing over naming here, but naming is important as I know you agree.
Anyway, up to you, but I'm just letting you know I find the current naming very confusing and hence others might too, and I'd find it less confusing if method and var names kept some link back to their 'struct' or 'constructor' heritage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll name it getStructSchemaName
and make those corresponding changes
ksql-execution/src/main/java/io/confluent/ksql/execution/expression/tree/StructExpression.java
Outdated
Show resolved
Hide resolved
ksql-execution/src/main/java/io/confluent/ksql/execution/codegen/CodeGenSpec.java
Outdated
Show resolved
Hide resolved
ksql-execution/src/main/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitor.java
Show resolved
Hide resolved
ksql-execution/src/main/java/io/confluent/ksql/execution/util/ExpressionTypeManager.java
Outdated
Show resolved
Hide resolved
ksql-functional-tests/src/test/resources/query-validation-tests/create-struct.json
Outdated
Show resolved
Hide resolved
ksql-parser/src/main/java/io/confluent/ksql/schema/ksql/DefaultSqlValueCoercer.java
Outdated
Show resolved
Hide resolved
ksql-parser/src/main/java/io/confluent/ksql/schema/ksql/DefaultSqlValueCoercer.java
Show resolved
Hide resolved
Thanks for the reviews @big-andy-coates and @blueedgenick. Glad I asked for syntax feedback :) I've changed it to be @blueedgenick - I didn't actually implement this with the UDF framework, so I could change the name to be exactly With regards to implicit naming (omitting the @big-andy-coates - about decimal literals, I will do that in a different PR. They're not exactly interdependent 😂 |
Thanks @agavra , looking great! No worries about deferring the implicit naming until later - i'd suggest simply concatenating "levels" of nested-name as the nesting level increases e.g. a generated name for a second-level field might be And yes, it would be awesome to move map and array creation to work this same way instead of the ...more expedient... way we had done it previously with UDFs 😁 One thing i realize i forgot to address yesterday is the literal syntax you've proposed for use in the
Thanks for getting this one done! |
Somewhat true - I think while the schema has ordered fields, the struct does not require ordering. You could decide to map More importantly, we could add this feature later but it's a non-trivial amount of work. The way |
I acknowledge that the current code path for |
I'm not sure I understand. You can reorder the columns you are inserting values for (
This makes sense, maybe that isn't the right pattern for creating structs then 😉
I disagree with this as well. Explicit nulls are nice, but that's not how |
|
I don't think your first bullet is a fair comparison 😂If we wanted to take your example, it would be that instead of With regards to nulls, I think the example I'm thinking of is not really an overloaded constructor but rather the As a side note, I think we've come to an agreement about non insert-values syntax. I also think it's nice to have the same syntax for insert-values and CSAS statements. That way you can think of an |
Hummm... The concern that
Good idea and worth doing IMHO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @agavra
A few comments on the impl below. The main one being I think there's definitely a bug in there due to our use of Map
to store the constructor parameters: the parameters should be an ordered list.
Add me back in when you're ready for another review :D
|
||
private int argumentCount = 0; | ||
private int schemaCount = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I still disagree.
It's just like how we have the method addFunction and not addFunctionName. The functionName is the key, but the function is what is actually being stored.
Yep, agreed. So what is being stored in this case is not just any schema, but a struct constructor's schema, so the name should reflect that, e.g. structSchemaCount
. Or maybe you're planning on re-using this counter when we add MAP
, DECIMAL
or ARRAY
constructors that also need a schema? In which case it might be called typeConstructorCount
.
I just think losing the connection between the 'struct' and/or 'constructor' part in these method and variable names makes the code hard to grasp.
Maybe getSchemaName
should be getStructSchemaName
or getConstructorSchemaName
if you're planning on using the same for other type constructors.
I know we're only disagreeing over naming here, but naming is important as I know you agree.
Anyway, up to you, but I'm just letting you know I find the current naming very confusing and hence others might too, and I'd find it less confusing if method and var names kept some link back to their 'struct' or 'constructor' heritage.
...cution/src/main/java/io/confluent/ksql/execution/expression/tree/CreateStructExpression.java
Outdated
Show resolved
Hide resolved
public Void visitStructExpression(CreateStructExpression exp, ExpressionTypeContext context) { | ||
final Builder builder = SqlStruct.builder(); | ||
|
||
for (Entry<String, Expression> field : exp.getStruct().entrySet()) { | ||
process(field.getValue(), context); | ||
builder.field(field.getKey(), context.getSqlType()); | ||
} | ||
|
||
context.setSqlType(builder.build()); | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be dangerous...
This is building a struct schema based on the parameters provided in the constructor... but:
- the list of parameters may not be complete: it may have some fields that will be set to null.
- the parameters in the constructor aren't ordered, yet field order matters to
SqlStruct
.
Hence I don't think you're building the right schema here.
On the first point, I'm guessing it may be OK as when its used in CSAS
/CTAS
statements the struct constructor defines the schema of the struct in the result. However, the same is not true for INSERT INTO
. I guess we can add some coercion of one struct to another. However, I don't think there are currently any tests for such cases for INSERT INTO
. INSERT VALUES
does have tests, and I guess here there is coercion going on.
On the second point, it's worth noting that in both Connect's and KSQL's struct schema type the order of fields is taken into account, so STRUCT<foo INT, bar INT>
is NOT equal to STRUCT<bar INT, foo INT>
. It's potentially valid to argue the order should not matter. Though I think this falls down when we come to converting such a schema into other protocols, e.g. Avro, PB, etc, where order also matters. Hence, I think we need to maintain ordering as being important.
Hence I think its important that the mappings of fieldname & values in the constructor is ordered, i.e. probably a list of tuples. I also think we need to ensure we have test cases, including QTT, to ensure something fails if the ordering is not maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first point I think it's okay as you suggested. The coercion happens in the INSERT INTO
. For the second point, I agree it is better to maintain ordering even if it isn't exposed in KSQL - I have changed all map usages to List<Field>
ksql-execution/src/test/java/io/confluent/ksql/execution/codegen/SqlToJavaVisitorTest.java
Show resolved
Hide resolved
ksql-functional-tests/src/test/resources/query-validation-tests/create-struct.json
Show resolved
Hide resolved
ksql-functional-tests/src/test/resources/rest-query-validation-tests/insert-values.json
Outdated
Show resolved
Hide resolved
...cution/src/main/java/io/confluent/ksql/execution/expression/tree/CreateStructExpression.java
Outdated
Show resolved
Hide resolved
aa9715c
to
577f56a
Compare
577f56a
to
7366d85
Compare
I think there's still the outstanding issue of syntax. As per #4120 (comment), I think its worth considering not using |
Big +1 from me on this. As @big-andy-coates alluded to, with |
7366d85
to
37545cb
Compare
@derekjn @big-andy-coates I've made the change to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great feature @agavra .... just decimal, map and array to do ;) . Might be worth adding tasks for map and array...
"comments": [ | ||
"Tests covering the use of the array returning UDFs." | ||
], | ||
"tests": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some negative test cases? e.g. what happens if there are two fields with the same name, or some name but different case? etc. etc.
Also, a test for case-sensitive field names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the tests are awkwardly split between files - i'll add more here
fixes #2147
Description
I was trying to debug an issue (#3874) but to my chagrin I was incapable of finding anyway to produce a struct with a decimal in it to a kafka topic. So I wrote this. What is this? This is the ability to create
STRUCT
types directly from KSQL - a long awaited feature!The syntax is up for debate (cc @hjafarpour @blueedgenick @rmoff) though I'd rather not go too far on that tangent as it's a long awaited feature. At the moment I chose
{(identifier expression (, identifier expression)*)?}
STRUCT((expression AS identifier (, expression AS identifier)*)?
Here's how it's used (note that the examples use the old syntax):
INSERT VALUES:
CSAS:
Implementation
Ah... so you care about how it's implemented, eh? It's actually surprisingly similar to the
FunctionCall
andSubscriptExpression
implementations combined:ExpressionTypeManager
can figure out the type of the struct by recursively visiting each field's expressionCodeGenRunner
creates the schema once, and passes it as an argument to theExpressionEvaluator
SqlToJavaVisitor
uses the param that was passed in (see theSqlToJavaVisitorTest
for an example) to construct the struct in the code that it generatesSqlValueCoercer
helps with some edge cases inINSERT VALUES
clauses (namely to insert decimals)That's all folk!
Testing done
Reviewer checklist