
feat: add support for inline struct creation #4120

Merged
merged 6 commits into confluentinc:master from inline_struct on Dec 19, 2019

Conversation

agavra
Contributor

@agavra agavra commented Dec 11, 2019

fixes #2147

Description

I was trying to debug an issue (#3874) but, to my chagrin, I was unable to find any way to produce a struct with a decimal in it to a Kafka topic. So I wrote this. What is this? This is the ability to create STRUCT types directly from KSQL - a long-awaited feature!

The syntax is up for debate (cc @hjafarpour @blueedgenick @rmoff), though I'd rather not go too far on that tangent as it's a long-awaited feature. I originally chose {(identifier expression (, identifier expression)*)?}, which was later changed to STRUCT((expression AS identifier (, expression AS identifier)*)?).

Here's how it's used (note that the examples use the old syntax):

INSERT VALUES:

ksql> CREATE STREAM decimals (col0 STRUCT<val DECIMAL(2,1)>) WITH (kafka_topic='decimals', partitions=1, value_format='AVRO');

 Message
----------------
 Stream created
----------------
ksql> INSERT INTO decimals (col0) VALUES ({VAL '2.1'});
ksql> SELECT COL0->val FROM DECIMALS EMIT CHANGES;
+----------------------------------------------------------------------------------------------+
|COL0__VAL                                                                                     |
+----------------------------------------------------------------------------------------------+
|2.1                                                                                           |

CSAS:

ksql> CREATE STREAM decimals2 as SELECT {times2 col0->val * 2, times3 col0->val * 3} FROM DECIMALS;

 Message
-----------------------------------------------------------------------------------------
 Stream DECIMALS2 created and running. Created by query with query ID: CSAS_DECIMALS2_12
-----------------------------------------------------------------------------------------
ksql> SELECT KSQL_COL_0->TIMES2, KSQL_COL_0->TIMES3 FROM DECIMALS2 EMIT CHANGES;
+----------------------------------------------+----------------------------------------------+
|KSQL_COL_0__TIMES2                            |KSQL_COL_0__TIMES3                            |
+----------------------------------------------+----------------------------------------------+
|4.2                                           |6.3                                           |
ksql> DESCRIBE DECIMALS2;

Name                 : DECIMALS2
 Field      | Type
-----------------------------------------------------
 ROWTIME    | BIGINT           (system)
 ROWKEY     | VARCHAR(STRING)  (system)
 KSQL_COL_0 | STRUCT<TIMES3 DECIMAL, TIMES2 DECIMAL>
-----------------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;

Implementation

Ah... so you care about how it's implemented, eh? It's actually surprisingly similar to the FunctionCall and SubscriptExpression implementations combined:

  • The ExpressionTypeManager can figure out the type of the struct by recursively visiting each field's expression
  • The CodeGenRunner creates the schema once, and passes it as an argument to the ExpressionEvaluator
  • The SqlToJavaVisitor uses the param that was passed in (see the SqlToJavaVisitorTest for an example) to construct the struct in the code that it generates
  • The SqlValueCoercer helps with some edge cases in INSERT VALUES clauses (namely to insert decimals)
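The first two bullets can be sketched in plain Java. This is an illustrative toy model only - the class and method names below are invented and are not the real ksqlDB `ExpressionTypeManager` API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Toy sketch of deriving a struct constructor's SQL type by visiting each
// field's expression in declaration order. Names are illustrative, not the
// actual ksqlDB classes.
public class StructTypeSketch {

    /** Stands in for an expression whose SQL type the recursive visit
     *  has already resolved. */
    public static class TypedExpression {
        final String sqlType;
        public TypedExpression(String sqlType) { this.sqlType = sqlType; }
    }

    /** One (name, expression) pair of the constructor, in declaration order. */
    public static class Field {
        final String name;
        final TypedExpression value;
        public Field(String name, TypedExpression value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Visit each field in order and assemble the resulting STRUCT type. */
    public static String deriveStructType(List<Field> fields) {
        StringJoiner sj = new StringJoiner(", ", "STRUCT<", ">");
        for (Field f : fields) {
            sj.add(f.name + " " + f.value.sqlType);
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        List<Field> fields = new ArrayList<>();
        fields.add(new Field("TIMES2", new TypedExpression("DECIMAL")));
        fields.add(new Field("TIMES3", new TypedExpression("DECIMAL")));
        System.out.println(deriveStructType(fields));
        // prints: STRUCT<TIMES2 DECIMAL, TIMES3 DECIMAL>
    }
}
```

In the real implementation the derived schema is then built once by the CodeGenRunner and passed to the generated evaluator, rather than being recomputed per row.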

That's all, folks!

Testing done

  • Unit tests
  • QTT Testing
  • E2E Integration testing

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra requested a review from a team as a code owner December 11, 2019 23:35
Member

@JimGalasyn JimGalasyn left a comment


LGTM, with a couple of suggestions.

@blueedgenick
Contributor

Trying not to bikeshed this too hard - but you asked for feedback, so here it is 😉

I'm not generally a fan of adding any extra syntax which depends on additional punctuation characters (the curly braces in this case) - it's always confusing trying to remember which brace type to use in which situation, and then you have to go look it up. Secondly, we have a (not uncommon) pattern of using functions to create non-primitive types (cf. as_array() and as_map()) - seems like we could follow that here too? as_struct() looks like a candidate :)

Side note: it would be great if we fixed the parser and UDF code to allow keywords as function names too, so we could rename these to just array(), map() and struct(), but I guess that's for another day...

The difference with existing fns which create non-primitives is, of course, that we have to name the components of a struct. I think rather than (identifier expression (, identifier expression)*)? we could follow something like (expression (as identifier)?, (expression (as identifier)?)*)? instead - again for consistency with how we assign names to new attributes in projections.
In conclusion, I'm leaning towards something like this in usage: select as_struct(foo as bar, abc * 3) as my_new_struct_col from ...;. Note that this also leaves open the option to not explicitly name the members of the new struct and have them auto-provided (KSQL_COL_0... etc) as necessary (as it would be for the result of abc * 3 in the example).

Contributor

@big-andy-coates big-andy-coates left a comment


Awesome @agavra!

Like @blueedgenick I really don't like the use of {} to represent a struct constructor. We should go with a more standard SQL approach of STRUCT(col0/col2 AS f0, f1), which would result in a schema like STRUCT<F0 INT, F1 STRING>. We can support this constructor syntax.

Changing the syntax shouldn't require many code changes.

Aside from that the code LGTM, except that there's some confusion in the naming of new types and methods. Half the time you've named things 'struct', which makes sense, but the other half it's 'schema', e.g. addSchema... I don't get it - is this intentional, or a repeated mistype? I've also left a few suggestions for improved names, e.g. StructExpression -> StructConstructorExpression, which would help, IMHO, to make the code more readable.

Finally, if we're releasing this, we should look to do #3593 at the same time. :D

Add me back in as a reviewer if you want me to review again.


private int argumentCount = 0;
private int schemaCount = 0;
Contributor


Suggested change
private int schemaCount = 0;
private int structConstructorCount = 0;

Contributor Author


The distinction between struct and schema is intentional in the code. The expression evaluator takes in as an argument the schemas for the structs, not the struct expressions themselves. If you look at the code, it is used as new Struct(schema0).... In this code, schema0 is very much a schema.

It's just like how we have the method addFunction and not addFunctionName. The functionName is the key, but the function is what is actually being stored.

Resolving the other conversations that are about this to concentrate on this one.

Contributor


I think I still disagree.

It's just like how we have the method addFunction and not addFunctionName. The functionName is the key, but the function is what is actually being stored.

Yep, agreed. So what is being stored in this case is not just any schema, but a struct constructor's schema, so the name should reflect that, e.g. structSchemaCount. Or maybe you're planning on re-using this counter when we add MAP, DECIMAL or ARRAY constructors that also need a schema? In which case it might be called typeConstructorCount.

I just think losing the connection between the 'struct' and/or 'constructor' part in these method and variable names makes the code hard to grasp.

Maybe getSchemaName should be getStructSchemaName or getConstructorSchemaName if you're planning on using the same for other type constructors.

I know we're only disagreeing over naming here, but naming is important as I know you agree.

Anyway, up to you, but I'm just letting you know I find the current naming very confusing and hence others might too, and I'd find it less confusing if method and var names kept some link back to their 'struct' or 'constructor' heritage.

Contributor Author


I'll name it getStructSchemaName and make those corresponding changes

@big-andy-coates big-andy-coates self-assigned this Dec 12, 2019
@big-andy-coates big-andy-coates requested review from a team, rmoff, hjafarpour, blueedgenick, rodesai, vcrfxia, vpapavas, stevenpyzhang and apurvam and removed request for a team December 12, 2019 09:51
@agavra
Contributor Author

agavra commented Dec 12, 2019

Thanks for the reviews @big-andy-coates and @blueedgenick. Glad I asked for syntax feedback :) I've changed it to be STRUCT((expression AS identifier (, expression AS identifier)*)?). I agree this feels more SQLy.

@blueedgenick - I didn't actually implement this with the UDF framework, so I could change the name to be exactly STRUCT() instead of AS_STRUCT. I might change array creation and map creation to be implemented inline. I don't think the UDF framework is necessarily the right place to be doing those type of type construction because schema inference is limited when you have arbitrary code generation and java generics.

With regards to implicit naming (omitting the AS), that's backwards compatible to add later, so I will do that in a future PR. I didn't want to think about edge cases (i.e. what happens if I create nested structs with the same names?) in this PR - though I don't think there are any problems.

@big-andy-coates - about decimal literals, I will do that in a different PR. They're not exactly interdependent 😂

@blueedgenick
Contributor

Thanks @agavra, looking great! No worries about deferring the implicit naming until later - I'd suggest simply concatenating "levels" of nested name as the nesting level increases, e.g. a generated name for a second-level field might be FIRST_LEVEL_NAME + COL_0, which may even turn out to be KSQL_COL_0_COL_0 if you don't name anything. Bonus points if the fields inside the struct get names with a FIELD_0 suffix instead of COL_0.

And yes, it would be awesome to move map and array creation to work this same way instead of the ...more expedient... way we had done it previously with UDFs 😁

One thing I realize I forgot to address yesterday is the literal syntax you've proposed for use in the INSERT...VALUES statement. It seems like re-stating the name of the field which exists inside the struct should be unnecessary here, e.g. in this example from the original PR description:

ksql> CREATE STREAM decimals (col0 STRUCT<val DECIMAL(2,1)>) WITH (kafka_topic='decimals', partitions=1, value_format='AVRO');
...
ksql> INSERT INTO decimals (col0) VALUES ({VAL '2.1'});

  • here the VALUES ({VAL '2.1'}) also seems not to fit our desired pattern. Perhaps you already covered this in your changes, but I wanted to explicitly call it out to avoid unnecessary revisiting. This could simply be VALUES (STRUCT('2.1')), right? The fields of the struct should be in fixed positions, so the naming is not needed. Or am I missing something?

Thanks for getting this one done!

@agavra
Contributor Author

agavra commented Dec 12, 2019

It seems like re-stating the name of the field which exists inside the struct should be unnecessary here

Somewhat true - I think while the schema has ordered fields, the struct does not require ordering. You could decide to map STRUCT<foo VARCHAR, bar VARCHAR> to STRUCT('val1' as BAR, 'val2' as FOO) (rearranging the fields). So technically a struct value should be paired with a name. You could also entirely omit fields (thus setting them to null) by constructing STRUCT('val1' as BAR).

More importantly, we could add this feature later but it's a non-trivial amount of work. The way INSERT VALUES works is to use the same code path as the rest of the expression evaluators; which don't necessarily have schema information.

@blueedgenick
Contributor

I acknowledge that the current code path for INSERT...VALUES is not without its challenges. Perhaps we can at least converge on what we would like to have happen, so that anything 'dangling' from this PR can be collected/pasted into a follow-up issue?
I disagree with using the STRUCT('val1' as BAR, 'val2' as FOO) pattern on the grounds that you can't "re-order" the other columns to which you are inserting values in this way, so why special-case just the struct internals? Also, intuitively, the x AS y pattern is for assigning new names to the result of something, not for assigning specific values to specific extant parameters? Further, if you want to set a sub-component of a struct to null while inserting, you should say so explicitly or, at the very least, insert enough commas to skip the sub-fields in question (I strongly prefer explicit nulls though!).

@agavra
Contributor Author

agavra commented Dec 12, 2019

I disagree with using the STRUCT('val1' as BAR, 'val2' as FOO) pattern on the grounds that you can't "re-order" the other columns to which you are inserting values in this way, so why special-case just the struct internals ?

I'm not sure I understand. You can reorder the columns you are inserting values for (INSERT INTO foo (col1, col2) ... vs INSERT INTO foo (col2, col1)). Also, logically a struct is an unordered mapping of key-value pairs - the fact that the schema has an ordering is an implementation detail and not exposed anywhere else (you can't, for example, do my_struct[0]).

Also, intuitively, the x AS y pattern is for assigning new names to the result of something, not for assigning specific values to specific extant parameters ?

This makes sense, maybe that isn't the right pattern for creating structs then 😉

Further if you want to set a sub-component of a struct to null while inserting, you should say so explicitly or, at the very least, insert enough commas to skip the sub-fields in question (i strongly prefer explicit nulls though!).

I disagree with this as well. Explicit nulls are nice, but that's not how INSERT VALUES works today with columns either. If I have CREATE STREAM foo (col1 VARCHAR, col2 VARCHAR) and I INSERT INTO foo (col1) VALUES ('foo'), then col2 will be null.

@blueedgenick
Contributor

I disagree with using the STRUCT('val1' as BAR, 'val2' as FOO) pattern on the grounds that you can't "re-order" the other columns to which you are inserting values in this way, so why special-case just the struct internals ?

I'm not sure I understand. You can reorder the columns you are inserting values for (INSERT INTO foo (col1, col2) ... vs INSERT INTO foo (col2, col1)). Also, logically a struct is an unordered mapping of key-value pairs - the fact that the schema has an ordering is an implementation detail and not exposed anywhere else (you can't, for example, do my_struct[0]).

  • not well explained on my part, apologies. I meant to say you don't write something like insert into foo(col2, col1, col3) values ('x' as col1, 'y' as col3, 1 as col2) - see how unnatural it looks to use '.. as ..' inside the VALUES section?

  • re explicit nulls: I know this is how it works today (unfortunately), but I suggest that's only because we don't yet have explicit nullability controls on the schemas, i.e. which columns can/cannot be null. I suppose it's something of a gray area how to handle the "sub-fields" of a struct in this case, because there's not much precedent for having them in column-based schemas at all. If, for example, you could declare in your schema that struct1 was non-null - which I suppose you are considering to be the case for INSERT...VALUES anyway, in the event that you list struct1 as one of the columns you are inserting to - then what implication does/should that have for the sub-components of struct1? Do we imagine that in the future they will be individually nullable? If so, I could be convinced that you may want to explicitly name the ones to which you are assigning values and omit the others in an insert statement; otherwise they have to be declared as null. Either way, I'd argue that the ability to omit them and have them default to null is an optimization of sorts (you're thinking of it like an overloaded object constructor, I suspect ;) ), which we could layer in later as sugar. WDYT?

@agavra
Contributor Author

agavra commented Dec 12, 2019

I don't think your first bullet is a fair comparison 😂 If we wanted to take your example, it would be that instead of INSERT INTO foo (col1, col2) VALUES ('foo1', 'foo2'); you could write INSERT INTO foo VALUES ('foo2' AS col2, 'foo1' AS col1);. What makes the example you gave unnatural is not (in my opinion) the .. as .. syntax, but rather needing to specify the columns twice. The meta-point here is that when you're creating a row in a table, you do need to specify the column names unless you want to specify all columns in order (which I'm proposing is an extension of this work).

With regards to nulls, I think the example I'm thinking of is not really an overloaded constructor but rather the new Struct(schema).put(fieldX, valueX).put(fieldY, valueY)... pattern. The schema defines which fields exist, but whether or not you decide to put a value in a struct is orthogonal to the schema.
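That put-based pattern can be sketched in plain Java. This is a toy stand-in for Connect's Struct, not the real API: the schema fixes which fields exist, while put() decides which of them actually receive values; the rest stay null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not the Connect API): the schema defines the set of
// legal fields, and omitting a put() for a field simply leaves it null.
public class SimpleStruct {
    private final Set<String> schema;                 // declared field names
    private final Map<String, Object> values = new HashMap<>();

    public SimpleStruct(Set<String> schema) { this.schema = schema; }

    public SimpleStruct put(String field, Object value) {
        if (!schema.contains(field)) {
            throw new IllegalArgumentException("Unknown field: " + field);
        }
        values.put(field, value);
        return this;   // chainable, like new Struct(schema).put(...).put(...)
    }

    public Object get(String field) {
        if (!schema.contains(field)) {
            throw new IllegalArgumentException("Unknown field: " + field);
        }
        return values.get(field);   // null when the field was never put
    }

    public static void main(String[] args) {
        SimpleStruct s = new SimpleStruct(Set.of("FOO", "BAR"))
            .put("BAR", "val1");           // FOO is deliberately omitted
        System.out.println(s.get("BAR"));  // prints: val1
        System.out.println(s.get("FOO"));  // prints: null
    }
}
```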

As a side note, I think we've come to an agreement about non insert-values syntax. I also think it's nice to have the same syntax for insert-values and CSAS statements. That way you can think of an INSERT VALUES statement as creating a struct independent of the column schema, and then fitting it into the column.

@big-andy-coates
Contributor

big-andy-coates commented Dec 16, 2019

Hummm...

The concern that ... AS ... should be for aliasing, not passing parameters, is valid. I think there is some precedent in other SQL systems to use syntax similar to STRUCT(foo => 'x', bar => col0) or STRUCT(foo := 'x', bar := col0). For example, see Postgres 'Using Named Notation' here: https://www.postgresql.org/docs/9.1/sql-syntax-calling-funcs.html.

INSERT VALUES matches most, if not all, SQL systems I've used in the past in the way it doesn't require values to be provided for any nullable or defaulted column. IMHO, it's fine / preferable to have STRUCT() follow the same pattern. I'm also with @agavra in that the order should not be important when named notation is used. However, we could allow non-named notation as well, in which case the field order can be the same as the STRUCT's definition (which has an order to its field names). Here we can either require explicit nulls or use the double-comma pattern @blueedgenick mentioned. INSERT VALUES supports something similar to named and non-named notation, depending on whether the set of columns is provided or not. If I had to choose, I'd go with named first.

@blueedgenick - I didn't actually implement this with the UDF framework, so I could change the name to be exactly STRUCT() instead of AS_STRUCT. I might change array creation and map creation to be implemented inline.

Good idea and worth doing IMHO.

Contributor

@big-andy-coates big-andy-coates left a comment


Thanks @agavra

A few comments on the impl below. The main one being I think there's definitely a bug in there due to our use of Map to store the constructor parameters: the parameters should be an ordered list.

Add me back in when you're ready for another review :D

docs/developer-guide/syntax-reference.rst (outdated)


Comment on lines 327 to 336
public Void visitStructExpression(CreateStructExpression exp, ExpressionTypeContext context) {
final Builder builder = SqlStruct.builder();

for (Entry<String, Expression> field : exp.getStruct().entrySet()) {
process(field.getValue(), context);
builder.field(field.getKey(), context.getSqlType());
}

context.setSqlType(builder.build());
return null;
}
Contributor


This might be dangerous...

This is building a struct schema based on the parameters provided in the constructor... but:

  1. the list of parameters may not be complete: it may have some fields that will be set to null.
  2. the parameters in the constructor aren't ordered, yet field order matters to SqlStruct.

Hence I don't think you're building the right schema here.

On the first point, I'm guessing it may be OK, as when it's used in CSAS/CTAS statements the struct constructor defines the schema of the struct in the result. However, the same is not true for INSERT INTO. I guess we can add some coercion of one struct to another. However, I don't think there are currently any tests for such cases for INSERT INTO. INSERT VALUES does have tests, and I guess here there is coercion going on.

On the second point, it's worth noting that in both Connect's and KSQL's struct schema type the order of fields is taken into account, so STRUCT<foo INT, bar INT> is NOT equal to STRUCT<bar INT, foo INT>. It's potentially valid to argue the order should not matter. Though I think this falls down when we come to converting such a schema into other protocols, e.g. Avro, PB, etc, where order also matters. Hence, I think we need to maintain ordering as being important.

Hence I think it's important that the mapping of field names to values in the constructor is ordered, i.e. probably a list of tuples. I also think we need to ensure we have test cases, including QTT, to ensure something fails if the ordering is not maintained.
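The ordering concern can be sketched as follows. This toy example (invented names, not the ksqlDB code) shows why an ordered list of (name, type) pairs is the safe representation: a list preserves the declaration order that the struct schema depends on, while a plain HashMap makes no ordering guarantee at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of why the constructor's (name, expression) pairs should be an
// ordered list rather than a Map: struct schemas compare field order, so
// STRUCT<FOO INT, BAR INT> is NOT the same schema as STRUCT<BAR INT, FOO INT>.
public class FieldOrderSketch {

    /** Extract field names; a List preserves declaration order. */
    static List<String> fieldNamesFromList(List<Map.Entry<String, String>> fields) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> f : fields) {
            names.add(f.getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        // STRUCT(foo := ..., bar := ...) declared in this order:
        List<Map.Entry<String, String>> fields = List.of(
            Map.entry("FOO", "INT"),
            Map.entry("BAR", "INT"));
        System.out.println(fieldNamesFromList(fields)); // prints: [FOO, BAR]

        // A plain HashMap makes no such guarantee: iteration order depends
        // on hashing, so the derived schema could silently reorder fields.
        Map<String, String> unordered = new HashMap<>(Map.of("FOO", "INT", "BAR", "INT"));
        System.out.println(unordered.keySet()); // order not guaranteed
    }
}
```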

Contributor Author


For the first point I think it's okay as you suggested. The coercion happens in the INSERT INTO. For the second point, I agree it is better to maintain ordering even if it isn't exposed in KSQL - I have changed all map usages to List<Field>.

@agavra agavra force-pushed the inline_struct branch 2 times, most recently from aa9715c to 577f56a Compare December 16, 2019 18:28
@big-andy-coates
Contributor

I think there's still the outstanding issue of syntax. As per #4120 (comment), I think it's worth considering not using AS, which is for aliasing. Rather than STRUCT('x' AS foo, col0 AS bar) I propose: STRUCT(foo := 'x', bar := col0)

@derekjn
Contributor

derekjn commented Dec 18, 2019

Rather than STRUCT('x' AS foo, col0 AS bar) I propose: STRUCT(foo := 'x', bar := col0)

Big +1 from me on this. As @big-andy-coates alluded to, with STRUCT('x' AS foo, col0 AS bar) we'd be repurposing the meaning of AS, and it's not immediately clear which side of the AS is what (Struct key or value). PostgreSQL uses := for named parameters, and I believe Oracle uses =>. The := operator is more intuitive to me personally, as it represents assignment in pretty much all contexts whereas => can mean different things.

@agavra
Contributor Author

agavra commented Dec 19, 2019

@derekjn @big-andy-coates I've made the change to use :=!
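For reference, restating the earlier examples (which used the original curly-brace syntax) with the final STRUCT(field := expression, ...) form would presumably look like this - a hedged sketch, since only the := change is confirmed above:

```sql
-- Earlier INSERT VALUES example, restated with the := constructor syntax:
INSERT INTO decimals (col0) VALUES (STRUCT(val := '2.1'));

-- Earlier CSAS example, restated likewise:
CREATE STREAM decimals2 AS
  SELECT STRUCT(times2 := col0->val * 2, times3 := col0->val * 3)
  FROM DECIMALS;
```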

Contributor

@big-andy-coates big-andy-coates left a comment


LGTM, great feature @agavra... just decimal, map and array to do ;). Might be worth adding tasks for map and array...

"comments": [
"Tests covering the use of the array returning UDFs."
],
"tests": [
Contributor


Can we add some negative test cases? E.g. what happens if there are two fields with the same name, or the same name in a different case? etc.

Also, a test for case-sensitive field names?

Contributor Author


I guess the tests are awkwardly split between files - I'll add more here.

@agavra agavra merged commit 6e558da into confluentinc:master Dec 19, 2019
@agavra agavra deleted the inline_struct branch December 19, 2019 20:05
@agavra agavra mentioned this pull request Jan 7, 2020
Successfully merging this pull request may close these issues.

Support constructing new STRUCTs (i.e., writing nested data)