Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new UDFs for set-like operations on Arrays #5548

Merged
merged 3 commits into from
Jun 5, 2020

Conversation

blueedgenick
Copy link
Contributor

Four new UDFs, implementing set-like operations on Array fields,inspired by the similarly-named functions in Presto (and others):

  • array_distinct(array)
  • array_except(array1, array2)
  • array_intersect(array1, array2)
  • array_union(array1, array2)

This is a revised and updated version of these functions and their tests, which were originally proposed long ago in PR #1611 and blocked at that time by limited scope for handling collections of generics in the UDF invocation framework.

Review feedback from the original PR incorporated herein, along with QTTs and historical plans.

@blueedgenick blueedgenick requested review from JimGalasyn and a team as code owners June 4, 2020 05:53
import java.util.List;
import org.junit.Test;

public class ArrayExceptTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are missing a test case where the arrays contain null values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argh, good catch - thanks! will add

final List<String> input1 = Arrays.asList(null, null);
final List<String> input2 = Arrays.asList(null, null, null);
final List<String> result = udf.union(input1, input2);
assertThat(result, contains(nullValue()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing behavior for me. Why does the test right above contain a null value in the resulting array but this test returns null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a subtlety here - in this case the udf doesn't really return null, instead it returns an array containing a single null - which i think is what the function should do in this case, on the grounds that the only distinct value found in either of the input arrays is a null. This is different than the case that we pass in an actual null, as compared to array-of-nulls which is what we have here, and that other case is tested below.
Does that make sense, or am i misunderstanding the question ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I just noticed the contains. I thought it was equals that's why I asked. Your answer and the code totally make sense. Sorry about that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this new matcher-naming throws me off regularly too - i think i liked "contains" much better as something like "isArrayContaining", especially when you're scanning through a whole set of tests :)

@vpapavas vpapavas self-assigned this Jun 4, 2020
```

Returns an array of all the distinct values, including NULL if present, from the input array.
The output array elements will be in order of their first occurrence in the input.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The output array elements will be in order of their first occurrence in the input.
The output array elements are in order of their first occurrence in the input.

ARRAY_EXCEPT(array1, array2)
```

Returns an array of all the distinct elements from an array except for those also present in a second array. The order of entries in the first array is preserved although any duplicates are removed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Returns an array of all the distinct elements from an array except for those also present in a second array. The order of entries in the first array is preserved although any duplicates are removed.
Returns an array of all the distinct elements from an array, except for those also present in a second array. The order of entries in the first array is preserved, but duplicates are removed.

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a few tiny suggestions.

Copy link
Member

@vpapavas vpapavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@blueedgenick blueedgenick merged commit 50428c7 into confluentinc:master Jun 5, 2020
@blueedgenick blueedgenick deleted the array_set_functions branch June 11, 2020 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants