Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement EXPLODE(ARRAY) for single table function in SELECT #3589

Merged
merged 44 commits into from
Oct 25, 2019

Conversation

purplefox
Copy link
Contributor

Description

Tracked by: #3505

This is the first task in the implementation of table functions for KSQL.

It uses the approach outlined in KLIP-9 (waiting for approval but we seem to have consensus)

  • This PR implements the table function EXPLODE(ARRAY)
  • Only one table function per SELECT is currently supported
  • No complex expressions as table function arguments
  • Documentation in later tasks

Testing done

  • Unit tests added/modified as appropriate
  • New QTT tests

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@purplefox purplefox requested a review from a team as a code owner October 16, 2019 08:55
@purplefox purplefox changed the title DRAFT feat: Implement EXPLODE(ARRAY) for single table function in SELECT feat: Implement EXPLODE(ARRAY) for single table function in SELECT Oct 17, 2019
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @purplefox

Lot to review here. Lot's of nits, but also some architectural / structural improvements requested to. Though, given you're new to the code base this is a blooming awesome first stab.

Feel free to suggest picking up any of my comments in follow up PRs where you think it makes sense.

Let me know when this is ready for another review.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass with mostly surface level comments - there's a lot of unused code that andy pointed out. It'd be much easier to review without that! Overall looks pretty solid (cool to see the functioning QTT tests for this!)

@agavra agavra requested a review from a team October 18, 2019 20:16
@apurvam
Copy link
Contributor

apurvam commented Oct 19, 2019

This PR has 108 comments over several days. Would it make sense to have a meeting to quickly resolve the big talking points? I see at least one about the apparently leaky abstraction TableFunctionExpressionRewriter.

Essentially, it would be good to see if there is a way to make the review more efficient.

@purplefox
Copy link
Contributor Author

purplefox commented Oct 19, 2019

This PR has 108 comments over several days. Would it make sense to have a meeting to quickly resolve the big talking points? I see at least one about the apparently leaky abstraction TableFunctionExpressionRewriter.

Essentially, it would be good to see if there is a way to make the review more efficient.

I think there are four main types of comments in this review.

  1. Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.
  2. There are bunch of comments along the lines of 'I know you're just following the way it was done in <the other classes|methods|tests> that already exist, but don't you think you should refactor them all while you're there?'. On the whole, I think these comments point to valid refactorings that could/should be done, but are usually off topic for this review, and just add clutter. In these cases I'd suggest if you spot an opportunity for further refactoring while doing a review just add a new issue to track it. There's probably not even a need to mention it in the review. Or if you do mention it, create the issue first then add a comment with a link to it, then job done.
  3. Review of WIP code. Parts of the code are very much WIP (e.g. the flat mapper). The TODOs in there were designed to make that obvious. When reviewing those parts, that should be taken in mind.
  4. Discussions on the architectural/design approach taken. There are some awesome points that have been made!

Imho 1, 2 and 3 detract from the focus of the review and probably make up a large majority of the traffic on this review. Once we subtract them we're left with a small number of totally valid, and interesting points that we can discuss and that would make the review way more targeted and manageable. Ideally we'd only have 4 type comments in reviews but we don't live in an ideal world yet, but I think that's what we should aim towards ;)

@purplefox
Copy link
Contributor Author

This PR has 108 comments over several days. Would it make sense to have a meeting to quickly resolve the big talking points? I see at least one about the apparently leaky abstraction TableFunctionExpressionRewriter.

Essentially, it would be good to see if there is a way to make the review more efficient.

Sorry @apurvam I accidentally edited your comment (clicked on the wrong comment while editing), but I think I put it back to how it was before! Actually I'm kind of surprised that GitHub lets you edit other users comments.

@big-andy-coates
Copy link
Contributor

  1. Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.

I think there was general consensus to better tooling to capture more things. However, until that is in place there'll still need to be such things picked up in reviews.

There are also likely some things we can't easily pick up with existing tooling, e.g. checking parameters to constructors are valid. I'm a strong believer in the matra don't allow your objects to be created in an invalid state, hence tending to submit nits on param null checks and other invalid parameter issue.

I'm not aware of any existing tooling that will pick up on missing null checks, though we could look to write our own checkstyle rule or similar - and I agree there is worth in doing so as we grow. How ever there are other instances of parameter validation that a code analysis tool can't so readily pick up on, e.g. a string parameter that shouldn't have trailing or leading whitespace or similar. Such things will ultimately rely on the originating engineer defensively coding, or the reviews pointing this out.

  1. There are bunch of comments along the lines of 'I know you're just following the way it was done in <the other classes|methods|tests> that already exist, but don't you think you should refactor them all while you're there?'. On the whole, I think these comments point to valid refactorings that could/should be done, but are usually off topic for this review, and just add clutter. In these cases I'd suggest if you spot an opportunity for further refactoring while doing a review just add a new issue to track it. There's probably not even a need to mention it in the review. Or if you do mention it, create the issue first then add a comment with a link to it, then job done.

Just because an existing piece of code does something does not implicitly make it something we want to emulate or a pattern we want to see replicated. We know we have areas in the code that need recactoring. May of these comments are simply requesting that we write new code that matches how we want the code to be and not replicate bad patterns that we know we want to fix.

For minor things it sometimes makes sense to fix up the existing code as we go. Otherwise such things often never happen. That said, it's generally a value call. Too much 'tidy up' in a PR that's main job is to introduce new functionally, makes the PR harder to review.

For example, I added a note that a new class in this PR was in a util package, but wasn't a utility class in any way, shape or form. It was a class used during query analysis. I therefore suggested it would be better placed in the analysis package. I was aware why you'd put it there - there was a similar file in the same package that you'd copied. I didn't originally ask you to move the old file, which is also in the wrong package. I asked you to move the new file, because otherwise we're replicating a bad pattern.

  1. Review of WIP code. Parts of the code are very much WIP (e.g. the flat mapper). The TODOs in there were designed to make that obvious. When reviewing those parts, that should be taken in mind.

IMHO we should never check todo's into the code base. Either do them, raise Github issues or some how track them on your own 'todo' list.

In general, I think we should look to get beyond the WIP before we look to commit. This means those doing the reviewing can see how all the parts of the code/design hang together and make meaningful commons. If we commit WIP code, there is always the risk that it's never completed.

  1. Discussions on the architectural/design approach taken. There are some awesome points that have been made!

No arguments here ;)

In summary, my thoughts are:

  1. Code style / nits: Improving our tooling should be an ongoing thing. Until its improves we'll need to continue to pick up these things in reviews. Even with perfect tooling there will always be causes where tooling can't detecting minor things, e.g. around parameter validation, and these will continue to, hopefully, be caught in reviews.
  2. Point out replication of bad patterns should definitely be in scope for reviews. Fixing existing instances of bad patterns is generally better done in a separate PR, or if its a chunk of work, tracked by a Github issue. Only if it's trivial should the existing refactor be done in the current PR.
  3. We should avoid commiting TODOs into the code base: they have a tendency to remain in the code for months/years. We should aim to raise PRs with complete pieces of functionality. Where pieces of the puzzle are to be extended in follow up PRs, this is best called out in a reviewing notes section of the PR description.
  4. Arch/design discussions are a given.

@apurvam
Copy link
Contributor

apurvam commented Oct 22, 2019

Thanks for taking the time to write up your thoughts on the PR review process it self, @big-andy-coates and @purplefox . I think this merits a face to face discussion.

Some concrete thoughts on each of the points @purplefox raised:

  1. Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.
  2. ... On the whole, I think these comments point to valid refactorings that could/should be done, but are usually off topic for this review, and just add clutter. ...

IMO, both of these I think should be approached with moderation. Unless there is something egregious (major deviation from the documented coding style, obvious bugs, introduction of significant tech debt etc.), a rule of thumb I have used in the past is is to have a 1:1 ratio (or something close to it) of these types of comments relative to the comments raising questions on the architectural/design approach taken (category 4 in Tim's list).

I think in general it should be acceptable to file followup issues for refactorings rather than bundling them into a PR adding functionality.

I would also say that most of the contributors on this code base have the attitude of leaving things better than they found it. If most PRs (including planned followups) follow this ethos, then I think we should favor review efficiency at the cost of perfecting each PR.

Numerous iterations on single PRs take a lot of energy and I think we have an opportunity to improve our efficiency with this process.

  1. Review of WIP code. Parts of the code are very much WIP (e.g. the flat mapper). The TODOs in there were designed to make that obvious. When reviewing those parts, that should be taken in mind.

I am not quite sure what this means. Is it review of code that isn't meant to be checked in? Or is it review of code which has planned follow ups which is taking up time?

  1. Discussions on the architectural/design approach taken. There are some awesome points that have been made!

Cool! It would be good to get a sense of what percentage of the discussion in this case was of this class. That is the true metric of our efficiency here IMO.

@apurvam
Copy link
Contributor

apurvam commented Oct 22, 2019

Meta comments of the PR review aside, I think it would be good to get a sense of how far this is from merging. It seems to me that the biggest point is whether to model UDTFs on UDFs or on UDAFs (current implementation). I think this is probably worth resolving in a face to face conversation.

@purplefox
Copy link
Contributor Author

purplefox commented Oct 22, 2019

  1. Code style criticisms. Imho these shouldn't happen, they should be covered by checkstyle rules or they don't exist.

I think there was general consensus to better tooling to capture more things. However, until that is in place there'll still need to be such things picked up in reviews.

I think that's true but we should be transparent here. Whether not they are checked automatically I think we should state, clearly and unambiguously, (in a wiki page for example), exactly what the style rules are.

These should be easy to catalogue - I already know 5 or 6 that I have inferred from reviews already (I will make a stab on this today). Being transparent means the contributors can make an effort to comply the rules before they submit PR so they don't receive any surprises during review which is a waste of everyone's time. This is especially important for external contributors where we risk alienating them and not getting a second PR if we're too critical (style wise) of their work after they've spent time on it.

So, sure, some will slip through the net, and will be picked up in reviews, but making expectations clear up front should minimise these.

Making clear what the style rules are also clarifies what are rules and what are just opinions. There's an important difference here - every contributor should be expected to conform to the rules whether they like them or not. Rules are enforceable in review, opinions are not, and the opinion of everyone on the team should be equally valued.

@purplefox
Copy link
Contributor Author

purplefox commented Oct 22, 2019

hence tending to submit nits on param null checks and other invalid parameter issue.

Other than being transparent about what are style rules actually are, I think there is a whole other debate that should be had, at some point (no need to do it now) about whether these rules are actually adding value.

Picking up briefly on two classic ones (and I am happy to go into much more detail about these and my real world experiences around them): Null parameter checking and final parameters.

  1. Null parameter checking - this seems to have come over from C++ programming. In C/C++ if you dereference an invalid pointer the results are undefined - you might get a segfault, but your program might continue, now silently referencing a garbage object, potentially causing all sorts of nasty knock-on effects. But in Java that can't happen - the only kind of invalid pointer that's possible is a null pointer, and an attempt to dereference it is guaranteed to throw a NullPointerException. When this happens (hopefully when running your test suite!) you look at the stack trace and fix the problem. Adding an Objects.requireNotNull() also throws a NullPointerException. So you're just trading one NPE for another, either of which will allow you to solve the issue. So you're just adding extra unsightly code for no actual gain. Without the check you will get the NPE whenever there's an attempt to dereference, so it's going to happen in your test suite run if this happens by any execution path. So assuming you have written tests it's going to fire. If you haven't written tests you have a bigger issue to solve first! The fact that it will fire in your test suite makes the argument "I want to throw early" kind of meaningless. There is one exception to this where i would recommend param checking - and that's in public APIs - here you want to throw exceptions on invalid params immediately as you don't want to expose a stack trace from internal implementation classes through your API.
  2. Final parameters. Well, I drunk the koolaid myself on these years ago with one of my previous projects (HornetQ aka ActiveMQ Artemis). I had heard that making params final would help catch bugs and just assumed this was a good thing. So in that project we used final params everywhere. In later projects (e.g. Vert.x) we didn't do this, and were we deluged with lots of bugs due to non final args? No. I couldn't see the slightest difference between the two projects in terms of these kinds of bugs. Most probably because these kinds of bugs are rare in the first place. My conclusion - final params are a waste of time, and I'm not alone here. Most projects don't use final params, indeed the standard google style guidelines (which is becoming the defacto style standard in Java) doesn't mandate them, and for good reason IMHO.

My 2c - always question "best practice" and don't take it as read. I've been developing in Java since 1996 (and before that I was doing C++, amongst other things), and I've seen a lot of fads come and go. Be wary of "rules" that are applied without judgement.

[Yes this is totally off topic, and I'm sorry ;) ]

@purplefox
Copy link
Contributor Author

I am not quite sure what this means. Is it review of code that isn't meant to be checked in? Or is it review of code which has planned follow ups which is taking up time?

Apologies, I was unclear here! The overall epic task that tracks table functions is made up of several sub tasks. And each one of those subtasks will probably have its own PR. By WIP I don't mean this PR is incomplete, but the overall epic is incomplete. So all this work will be WIP until the entire epic is complete.

So, when I have added TODOs to the code, they are referring to work that will be completed in later subtasks.

BTW, I quite agree with Andy's point that TODOs should not exist in completed code. But in this case they are not a substitute to using issues, as all the work is already tracked by issues. They are merely giving a little bit of a hint to the reader of the code that the overall work is incomplete. As a concrete example, the code here only supports a single table function, but KLIP-9 says that multiple table functions will be supported. A reader who has read the KLIP and is now reading the code might be confused that the code only supports a single table function. A little TODO saying "TODO support multiple table functions in later PRs" goes a long way to helping the reader understand without embarking on a 15 min goose chase to figure out what is going on.

Of course (and this goes for any style rules that we agree on). I am happy to follow any style rules the project has, whether or not I agree with them, as the team comes first. But this need to be made transparent.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had time to digest my previous comment (#3589 (comment)), and I think what was bothering me there has matured into some concrete ideas. See inline comments below.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've learned a ton from reviewing this PR and our discussions - thanks @purplefox for the quick iterations and the tons of changes you've made. I think this PR is nearly ready to go! A few of my comments inline are important (around the column naming/bad imports), most are follow-ups that can be done later.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: super stoked for this to go through. let's wait on @big-andy-coates' stamp as well and maybe it even sneaks into 5.4, how awesome would that be?

@agavra agavra requested a review from a team October 23, 2019 21:59
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for comment...

@@ -419,4 +354,125 @@ public void shouldNotAllowModificationViaListFunctions() {
private void givenUdfFactoryRegistered() {
functionRegistry.ensureFunctionFactory(UdfLoaderUtil.createTestUdfFactory(func));
}

private static AggregateFunctionFactory createAggregateFunctionFactory() {
return new AggregateFunctionFactory("my_aggregate") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agavra suggested using a mock - is there any reason that won't work, it would certainly be a whole lot less code!

@purplefox purplefox merged commit 8b52aa8 into confluentinc:master Oct 25, 2019
@purplefox purplefox deleted the udtf_1 branch October 28, 2019 00:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants