feat: Implement EXPLODE(ARRAY) for single table function in SELECT #3589

purplefox · 2019-10-16T08:55:19Z

Description

Tracked by: #3505

This is the first task in the implementation of table functions for KSQL.

It uses the approach outlined in KLIP-9 (waiting for approval but we seem to have consensus)

This PR implements the table function EXPLODE(ARRAY)
Only one table function per SELECT is currently supported
No complex expressions as table function arguments
Documentation in later tasks

Testing done

Unit tests added/modified as appropriate
New QTT tests

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

big-andy-coates

Thanks @purplefox

Lot to review here. Lot's of nits, but also some architectural / structural improvements requested to. Though, given you're new to the code base this is a blooming awesome first stab.

Feel free to suggest picking up any of my comments in follow up PRs where you think it makes sense.

Let me know when this is ready for another review.

ksql-common/src/main/java/io/confluent/ksql/function/FunctionRegistry.java

ksql-common/src/main/java/io/confluent/ksql/function/KsqlTableFunction.java

ksql-functional-tests/src/test/resources/query-validation-tests/simple_agg.json

ksql-streams/src/main/java/io/confluent/ksql/execution/streams/ExecutionStepFactory.java

ksql-udf/src/main/java/io/confluent/ksql/function/udtf/Udtf.java

ksql-udf/src/main/java/io/confluent/ksql/function/udtf/UdtfDescription.java

ksql-udf/src/main/java/io/confluent/ksql/function/udtf/UdtfFactory.java

agavra

First pass with mostly surface level comments - there's a lot of unused code that andy pointed out. It'd be much easier to review without that! Overall looks pretty solid (cool to see the functioning QTT tests for this!)

ksql-common/src/main/java/io/confluent/ksql/function/FunctionRegistry.java

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/array/ExplodeFunctionFactory.java

ksql-common/src/main/java/io/confluent/ksql/function/KsqlTableFunction.java

ksql-common/src/main/java/io/confluent/ksql/function/TableFunctionFactory.java

ksql-engine/src/main/java/io/confluent/ksql/function/InternalFunctionRegistry.java

ksql-engine/src/main/java/io/confluent/ksql/planner/LogicalPlanner.java

ksql-engine/src/main/java/io/confluent/ksql/util/TableFunctionExpressionRewriter.java

ksql-engine/src/test/java/io/confluent/ksql/function/InternalFunctionRegistryTest.java

apurvam · 2019-10-19T00:30:13Z

This PR has 108 comments over several days. Would it make sense to have a meeting to quickly resolve the big talking points? I see at least one about the apparently leaky abstraction TableFunctionExpressionRewriter.

Essentially, it would be good to see if there is a way to make the review more efficient.

purplefox · 2019-10-19T17:20:26Z

This PR has 108 comments over several days. Would it make sense to have a meeting to quickly resolve the big talking points? I see at least one about the apparently leaky abstraction TableFunctionExpressionRewriter.

Essentially, it would be good to see if there is a way to make the review more efficient.

I think there are four main types of comments in this review.

Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.
There are bunch of comments along the lines of 'I know you're just following the way it was done in <the other classes|methods|tests> that already exist, but don't you think you should refactor them all while you're there?'. On the whole, I think these comments point to valid refactorings that could/should be done, but are usually off topic for this review, and just add clutter. In these cases I'd suggest if you spot an opportunity for further refactoring while doing a review just add a new issue to track it. There's probably not even a need to mention it in the review. Or if you do mention it, create the issue first then add a comment with a link to it, then job done.
Review of WIP code. Parts of the code are very much WIP (e.g. the flat mapper). The TODOs in there were designed to make that obvious. When reviewing those parts, that should be taken in mind.
Discussions on the architectural/design approach taken. There are some awesome points that have been made!

Imho 1, 2 and 3 detract from the focus of the review and probably make up a large majority of the traffic on this review. Once we subtract them we're left with a small number of totally valid, and interesting points that we can discuss and that would make the review way more targeted and manageable. Ideally we'd only have 4 type comments in reviews but we don't live in an ideal world yet, but I think that's what we should aim towards ;)

purplefox · 2019-10-20T08:15:25Z

This PR has 108 comments over several days. Would it make sense to have a meeting to quickly resolve the big talking points? I see at least one about the apparently leaky abstraction TableFunctionExpressionRewriter.

Essentially, it would be good to see if there is a way to make the review more efficient.

Sorry @apurvam I accidentally edited your comment (clicked on the wrong comment while editing), but I think I put it back to how it was before! Actually I'm kind of surprised that GitHub lets you edit other users comments.

big-andy-coates · 2019-10-21T15:27:59Z

Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.

I think there was general consensus to better tooling to capture more things. However, until that is in place there'll still need to be such things picked up in reviews.

There are also likely some things we can't easily pick up with existing tooling, e.g. checking parameters to constructors are valid. I'm a strong believer in the matra don't allow your objects to be created in an invalid state, hence tending to submit nits on param null checks and other invalid parameter issue.

I'm not aware of any existing tooling that will pick up on missing null checks, though we could look to write our own checkstyle rule or similar - and I agree there is worth in doing so as we grow. How ever there are other instances of parameter validation that a code analysis tool can't so readily pick up on, e.g. a string parameter that shouldn't have trailing or leading whitespace or similar. Such things will ultimately rely on the originating engineer defensively coding, or the reviews pointing this out.

There are bunch of comments along the lines of 'I know you're just following the way it was done in <the other classes|methods|tests> that already exist, but don't you think you should refactor them all while you're there?'. On the whole, I think these comments point to valid refactorings that could/should be done, but are usually off topic for this review, and just add clutter. In these cases I'd suggest if you spot an opportunity for further refactoring while doing a review just add a new issue to track it. There's probably not even a need to mention it in the review. Or if you do mention it, create the issue first then add a comment with a link to it, then job done.

Just because an existing piece of code does something does not implicitly make it something we want to emulate or a pattern we want to see replicated. We know we have areas in the code that need recactoring. May of these comments are simply requesting that we write new code that matches how we want the code to be and not replicate bad patterns that we know we want to fix.

For minor things it sometimes makes sense to fix up the existing code as we go. Otherwise such things often never happen. That said, it's generally a value call. Too much 'tidy up' in a PR that's main job is to introduce new functionally, makes the PR harder to review.

For example, I added a note that a new class in this PR was in a util package, but wasn't a utility class in any way, shape or form. It was a class used during query analysis. I therefore suggested it would be better placed in the analysis package. I was aware why you'd put it there - there was a similar file in the same package that you'd copied. I didn't originally ask you to move the old file, which is also in the wrong package. I asked you to move the new file, because otherwise we're replicating a bad pattern.

Review of WIP code. Parts of the code are very much WIP (e.g. the flat mapper). The TODOs in there were designed to make that obvious. When reviewing those parts, that should be taken in mind.

IMHO we should never check todo's into the code base. Either do them, raise Github issues or some how track them on your own 'todo' list.

In general, I think we should look to get beyond the WIP before we look to commit. This means those doing the reviewing can see how all the parts of the code/design hang together and make meaningful commons. If we commit WIP code, there is always the risk that it's never completed.

Discussions on the architectural/design approach taken. There are some awesome points that have been made!

No arguments here ;)

In summary, my thoughts are:

Code style / nits: Improving our tooling should be an ongoing thing. Until its improves we'll need to continue to pick up these things in reviews. Even with perfect tooling there will always be causes where tooling can't detecting minor things, e.g. around parameter validation, and these will continue to, hopefully, be caught in reviews.
Point out replication of bad patterns should definitely be in scope for reviews. Fixing existing instances of bad patterns is generally better done in a separate PR, or if its a chunk of work, tracked by a Github issue. Only if it's trivial should the existing refactor be done in the current PR.
We should avoid commiting TODOs into the code base: they have a tendency to remain in the code for months/years. We should aim to raise PRs with complete pieces of functionality. Where pieces of the puzzle are to be extended in follow up PRs, this is best called out in a reviewing notes section of the PR description.
Arch/design discussions are a given.

apurvam · 2019-10-22T06:14:32Z

Thanks for taking the time to write up your thoughts on the PR review process it self, @big-andy-coates and @purplefox . I think this merits a face to face discussion.

Some concrete thoughts on each of the points @purplefox raised:

Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.

... On the whole, I think these comments point to valid refactorings that could/should be done, but are usually off topic for this review, and just add clutter. ...

IMO, both of these I think should be approached with moderation. Unless there is something egregious (major deviation from the documented coding style, obvious bugs, introduction of significant tech debt etc.), a rule of thumb I have used in the past is is to have a 1:1 ratio (or something close to it) of these types of comments relative to the comments raising questions on the architectural/design approach taken (category 4 in Tim's list).

I think in general it should be acceptable to file followup issues for refactorings rather than bundling them into a PR adding functionality.

I would also say that most of the contributors on this code base have the attitude of leaving things better than they found it. If most PRs (including planned followups) follow this ethos, then I think we should favor review efficiency at the cost of perfecting each PR.

Numerous iterations on single PRs take a lot of energy and I think we have an opportunity to improve our efficiency with this process.

Review of WIP code. Parts of the code are very much WIP (e.g. the flat mapper). The TODOs in there were designed to make that obvious. When reviewing those parts, that should be taken in mind.

I am not quite sure what this means. Is it review of code that isn't meant to be checked in? Or is it review of code which has planned follow ups which is taking up time?

Discussions on the architectural/design approach taken. There are some awesome points that have been made!

Cool! It would be good to get a sense of what percentage of the discussion in this case was of this class. That is the true metric of our efficiency here IMO.

apurvam · 2019-10-22T06:24:19Z

Meta comments of the PR review aside, I think it would be good to get a sense of how far this is from merging. It seems to me that the biggest point is whether to model UDTFs on UDFs or on UDAFs (current implementation). I think this is probably worth resolving in a face to face conversation.

purplefox · 2019-10-22T13:23:28Z

Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.

I think there was general consensus to better tooling to capture more things. However, until that is in place there'll still need to be such things picked up in reviews.

I think that's true but we should be transparent here. Whether not they are checked automatically I think we should state, clearly and unambiguously, (in a wiki page for example), exactly what the style rules are.

These should be easy to catalogue - I already know 5 or 6 that I have inferred from reviews already (I will make a stab on this today). Being transparent means the contributors can make an effort to comply the rules before they submit PR so they don't receive any surprises during review which is a waste of everyone's time. This is especially important for external contributors where we risk alienating them and not getting a second PR if we're too critical (style wise) of their work after they've spent time on it.

So, sure, some will slip through the net, and will be picked up in reviews, but making expectations clear up front should minimise these.

Making clear what the style rules are also clarifies what are rules and what are just opinions. There's an important difference here - every contributor should be expected to conform to the rules whether they like them or not. Rules are enforceable in review, opinions are not, and the opinion of everyone on the team should be equally valued.

purplefox · 2019-10-22T13:48:05Z

hence tending to submit nits on param null checks and other invalid parameter issue.

Other than being transparent about what are style rules actually are, I think there is a whole other debate that should be had, at some point (no need to do it now) about whether these rules are actually adding value.

Picking up briefly on two classic ones (and I am happy to go into much more detail about these and my real world experiences around them): Null parameter checking and final parameters.

Null parameter checking - this seems to have come over from C++ programming. In C/C++ if you dereference an invalid pointer the results are undefined - you might get a segfault, but your program might continue, now silently referencing a garbage object, potentially causing all sorts of nasty knock-on effects. But in Java that can't happen - the only kind of invalid pointer that's possible is a null pointer, and an attempt to dereference it is guaranteed to throw a NullPointerException. When this happens (hopefully when running your test suite!) you look at the stack trace and fix the problem. Adding an Objects.requireNotNull() also throws a NullPointerException. So you're just trading one NPE for another, either of which will allow you to solve the issue. So you're just adding extra unsightly code for no actual gain. Without the check you will get the NPE whenever there's an attempt to dereference, so it's going to happen in your test suite run if this happens by any execution path. So assuming you have written tests it's going to fire. If you haven't written tests you have a bigger issue to solve first! The fact that it will fire in your test suite makes the argument "I want to throw early" kind of meaningless. There is one exception to this where i would recommend param checking - and that's in public APIs - here you want to throw exceptions on invalid params immediately as you don't want to expose a stack trace from internal implementation classes through your API.
Final parameters. Well, I drunk the koolaid myself on these years ago with one of my previous projects (HornetQ aka ActiveMQ Artemis). I had heard that making params final would help catch bugs and just assumed this was a good thing. So in that project we used final params everywhere. In later projects (e.g. Vert.x) we didn't do this, and were we deluged with lots of bugs due to non final args? No. I couldn't see the slightest difference between the two projects in terms of these kinds of bugs. Most probably because these kinds of bugs are rare in the first place. My conclusion - final params are a waste of time, and I'm not alone here. Most projects don't use final params, indeed the standard google style guidelines (which is becoming the defacto style standard in Java) doesn't mandate them, and for good reason IMHO.

My 2c - always question "best practice" and don't take it as read. I've been developing in Java since 1996 (and before that I was doing C++, amongst other things), and I've seen a lot of fads come and go. Be wary of "rules" that are applied without judgement.

[Yes this is totally off topic, and I'm sorry ;) ]

purplefox · 2019-10-22T14:02:33Z

I am not quite sure what this means. Is it review of code that isn't meant to be checked in? Or is it review of code which has planned follow ups which is taking up time?

Apologies, I was unclear here! The overall epic task that tracks table functions is made up of several sub tasks. And each one of those subtasks will probably have its own PR. By WIP I don't mean this PR is incomplete, but the overall epic is incomplete. So all this work will be WIP until the entire epic is complete.

So, when I have added TODOs to the code, they are referring to work that will be completed in later subtasks.

BTW, I quite agree with Andy's point that TODOs should not exist in completed code. But in this case they are not a substitute to using issues, as all the work is already tracked by issues. They are merely giving a little bit of a hint to the reader of the code that the overall work is incomplete. As a concrete example, the code here only supports a single table function, but KLIP-9 says that multiple table functions will be supported. A reader who has read the KLIP and is now reading the code might be confused that the code only supports a single table function. A little TODO saying "TODO support multiple table functions in later PRs" goes a long way to helping the reader understand without embarking on a 15 min goose chase to figure out what is going on.

Of course (and this goes for any style rules that we agree on). I am happy to follow any style rules the project has, whether or not I agree with them, as the team comes first. But this need to be made transparent.

…ewriter to analyzer package

…separate methods

agavra

I've had time to digest my previous comment (#3589 (comment)), and I think what was bothering me there has matured into some concrete ideas. See inline comments below.

ksql-common/src/main/java/io/confluent/ksql/name/ColumnName.java

ksql-engine/src/main/java/io/confluent/ksql/analyzer/TableFunctionExpressionRewriter.java

ksql-engine/src/main/java/io/confluent/ksql/planner/plan/FlatMapNode.java

…able functions are now gathered in Analyzer and select expression rewriting occurs in FlatMapNode

agavra

I've learned a ton from reviewing this PR and our discussions - thanks @purplefox for the quick iterations and the tons of changes you've made. I think this PR is nearly ready to go! A few of my comments inline are important (around the column naming/bad imports), most are follow-ups that can be done later.

diff.txt

ksql-engine/src/main/java/io/confluent/ksql/analyzer/QueryAnalyzer.java

ksql-engine/src/main/java/io/confluent/ksql/function/BaseAggregateFunction.java

ksql-engine/src/main/java/io/confluent/ksql/function/BaseTableFunction.java

ksql-engine/src/main/java/io/confluent/ksql/planner/plan/FlatMapNode.java

ksql-engine/src/main/java/io/confluent/ksql/planner/plan/PlanNode.java

ksql-engine/src/main/java/io/confluent/ksql/planner/plan/FlatMapNode.java

ksql-functional-tests/src/test/resources/query-validation-tests/table-functions.json

…lumns to avoid confusion with select column aliases

agavra

super stoked for this to go through. let's wait on @big-andy-coates' stamp as well and maybe it even sneaks into 5.4, how awesome would that be?

ksql-engine/src/main/java/io/confluent/ksql/function/InternalFunctionRegistry.java

ksql-engine/src/main/java/io/confluent/ksql/analyzer/Analysis.java

ksql-engine/src/main/java/io/confluent/ksql/function/InternalFunctionRegistry.java

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/array/ExplodeTableFunction.java

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/array/ExplodeFunctionFactory.java

big-andy-coates

for comment...

ksql-engine/src/main/java/io/confluent/ksql/planner/PlanSourceExtractorVisitor.java

ksql-engine/src/main/java/io/confluent/ksql/planner/plan/FlatMapNode.java

big-andy-coates · 2019-10-25T15:27:28Z

ksql-engine/src/test/java/io/confluent/ksql/function/InternalFunctionRegistryTest.java

@@ -419,4 +354,125 @@ public void shouldNotAllowModificationViaListFunctions() {
  private void givenUdfFactoryRegistered() {
    functionRegistry.ensureFunctionFactory(UdfLoaderUtil.createTestUdfFactory(func));
  }
+
+  private static AggregateFunctionFactory createAggregateFunctionFactory() {
+    return new AggregateFunctionFactory("my_aggregate") {


@agavra suggested using a mock - is there any reason that won't work, it would certainly be a whole lot less code!

ksql-engine/src/test/java/io/confluent/ksql/function/tf/ExplodeArrayTableFunctionTest.java

ksql-execution/src/main/java/io/confluent/ksql/execution/function/udtf/KudtfFlatMapper.java

ksql-engine/src/main/java/io/confluent/ksql/planner/plan/DataSourceNode.java

ksql-engine/src/main/java/io/confluent/ksql/structured/SchemaKStream.java

...-execution/src/main/java/io/confluent/ksql/execution/function/udtf/TableFunctionApplier.java

ksql-common/src/main/java/io/confluent/ksql/function/KsqlTableFunction.java

ksql-functional-tests/src/test/resources/query-validation-tests/explode.json

purplefox requested a review from a team as a code owner October 16, 2019 08:55

purplefox requested review from agavra and big-andy-coates October 16, 2019 09:06

purplefox changed the title ~~DRAFT feat: Implement EXPLODE(ARRAY) for single table function in SELECT~~ feat: Implement EXPLODE(ARRAY) for single table function in SELECT Oct 17, 2019

big-andy-coates reviewed Oct 18, 2019

View reviewed changes

agavra reviewed Oct 18, 2019

View reviewed changes

agavra requested a review from a team October 18, 2019 20:16

purplefox mentioned this pull request Oct 21, 2019

refactor: FunctionRegistry.getAggregateFunction() can return generic type #3632

Open

Tim Fox added 13 commits October 22, 2019 08:32

Apply patch from prototype

a503028

Apply patch from prototype2

0a726ca

Remove unnecessary files

42b1a6d

renamed some functions

4af2b53

Some improvements to explode factory

39d1880

Added test for ExplodeArrayTableFunction

3cb190d

Some refactoring

324862b

Another test

5edcff7

feat: Some improvements to tests

ccfffee

chore: resolved merge conflict

1d6a015

fix: remove check for optional return value

8184307

chore: updates after review

c3f5923

refactor: Simplify synchronization of InternalFunctionRegistry

f9b9560

Tim Fox added 9 commits October 22, 2019 12:48

chore: provide Description and version for ExplodeFunctionFactory

1310db6

fix: Traverse FlatMapNode in PlanSourceExtractorVisitor

8195bba

refactor: Move TableFunctionExpressionWriter and AggregateExpressionR…

fd7c9fb

…ewriter to analyzer package

chore: removed TODOs

214b4ef

chore: moved creation of Aggregate and table function factories into …

7c6b9bc

…separate methods

chore: Refactored test

f1f3d5c

chore: null check

4dada0d

chore: combined udtf.json test cases into one

dc03181

fix: Add check and test for more than one table function

28b3208

agavra reviewed Oct 22, 2019

View reviewed changes

Tim Fox added 4 commits October 22, 2019 17:56

fix: remove unused import

a668aa5

fix: Use passed through select expressions plus fix some tests

814a8b9

chore: merged with master

7c9b6b9

refactor: Removed TableFunctionAnalyzer and TableFunctionAnalysis - t…

e0cfc64

…able functions are now gathered in Analyzer and select expression rewriting occurs in FlatMapNode

agavra reviewed Oct 23, 2019

View reviewed changes

Tim Fox added 2 commits October 23, 2019 13:49

refactor: Use different naming for synthesised intermediate schema co…

6331a35

…lumns to avoid confusion with select column aliases

fix: changes after review comments

9bcb561

purplefox requested review from big-andy-coates and agavra October 23, 2019 21:55

agavra approved these changes Oct 23, 2019

View reviewed changes

agavra requested a review from a team October 23, 2019 21:59

big-andy-coates reviewed Oct 25, 2019

View reviewed changes

Tim Fox added 5 commits October 25, 2019 11:30

fix: Added check for non aggregates with table functions and test

6d4547f

chore: Added null check

11ccfa8

fix: Add check if udtf with same name already exists

e38222b

chore: revie updates

a208b63

chore: not null check

99cbff6

purplefox merged commit 8b52aa8 into confluentinc:master Oct 25, 2019

purplefox deleted the udtf_1 branch October 28, 2019 00:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement EXPLODE(ARRAY) for single table function in SELECT #3589

feat: Implement EXPLODE(ARRAY) for single table function in SELECT #3589

purplefox commented Oct 16, 2019

big-andy-coates left a comment

agavra left a comment

apurvam commented Oct 19, 2019 •

edited by purplefox

Loading

purplefox commented Oct 19, 2019 •

edited

Loading

purplefox commented Oct 20, 2019

big-andy-coates commented Oct 21, 2019

apurvam commented Oct 22, 2019 •

edited

Loading

apurvam commented Oct 22, 2019 •

edited

Loading

purplefox commented Oct 22, 2019 •

edited

Loading

purplefox commented Oct 22, 2019 •

edited

Loading

purplefox commented Oct 22, 2019

agavra left a comment

agavra left a comment

agavra left a comment

big-andy-coates left a comment

big-andy-coates Oct 25, 2019

feat: Implement EXPLODE(ARRAY) for single table function in SELECT #3589

feat: Implement EXPLODE(ARRAY) for single table function in SELECT #3589

Conversation

purplefox commented Oct 16, 2019

Description

Testing done

Reviewer checklist

big-andy-coates left a comment

Choose a reason for hiding this comment

agavra left a comment

Choose a reason for hiding this comment

apurvam commented Oct 19, 2019 • edited by purplefox Loading

purplefox commented Oct 19, 2019 • edited Loading

purplefox commented Oct 20, 2019

big-andy-coates commented Oct 21, 2019

apurvam commented Oct 22, 2019 • edited Loading

apurvam commented Oct 22, 2019 • edited Loading

purplefox commented Oct 22, 2019 • edited Loading

purplefox commented Oct 22, 2019 • edited Loading

purplefox commented Oct 22, 2019

agavra left a comment

Choose a reason for hiding this comment

agavra left a comment

Choose a reason for hiding this comment

agavra left a comment

Choose a reason for hiding this comment

big-andy-coates left a comment

Choose a reason for hiding this comment

big-andy-coates Oct 25, 2019

Choose a reason for hiding this comment

apurvam commented Oct 19, 2019 •

edited by purplefox

Loading

purplefox commented Oct 19, 2019 •

edited

Loading

apurvam commented Oct 22, 2019 •

edited

Loading

apurvam commented Oct 22, 2019 •

edited

Loading

purplefox commented Oct 22, 2019 •

edited

Loading

purplefox commented Oct 22, 2019 •

edited

Loading