Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap timestamps in ROWTIME expressions with STRINGTOTIMESTAMP #3160

Merged
merged 5 commits into from
Aug 8, 2019

Conversation

jzaralim
Copy link
Contributor

@jzaralim jzaralim commented Aug 1, 2019

Description

Because the ROWTIME is stored as a Long, the user either has to pass in Long values or use STRINGTOTIMESTAMP function to make queries using ROWTIME, for example:

ROWTIME > STRINGTOTIMESTAMP('1970-01-01', 'yyyy-mm-dd')
ROWTIME BETWEEN 156469222000 AND 1564692253275

This change wraps datestrings with STRINGTOTIMESTAMP so the user can directly pass datestrings for querying:

ROWTIME > '1970-01-01'
ROWTIME BETWEEN '1970-01-01 08:00:00' AND ''1970-01-01 10:00:00

Testing done

Wrote tests for several types of expressions.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@jzaralim jzaralim requested review from JimGalasyn and a team as code owners August 1, 2019 20:49
AND ROWTIME <= '2017-11-17 04:53:48';

If the datestring is inexact, the rest of the timestamp is assumed to be 0.
For example, `ROWTIME = `2019-07-30 11:00` is equivalent to `ROWTIME = `2019-07-30 11:00:00`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example, `ROWTIME = `2019-07-30 11:00` is equivalent to `ROWTIME = `2019-07-30 11:00:00`.
For example, ``ROWTIME = 2019-07-30 11:00`` is equivalent to ``ROWTIME = 2019-07-30 11:00:00``.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i might explicitly call out here that all unspecified elements of the full pattern are right-padded with zeros

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhere in here i would expect to see a mention of timezone handling too, as this always trips people up.... i.e. in what TZ are your iso-8601 strings interpreted ? that of teh running jvm or utc or ....?

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a couple of suggestions.

@blueedgenick
Copy link
Contributor

can we feature-flag this somewhere, so that we can easily turn it off once we have proper datetime data-type support ? i assume that we will eventually change the datatype of the ROWTIME pseudo-column to dateitme, once such a thing exists, and this kind of "auto-cast" can then get terribly confusing

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature @jzaralim - suggestion below about improving it to support timezones and also I think one of the use cases is a bit risky.

Can you also add a few JSON based QueryTranslationTests to cover this implicit conversion, please?

Please add me again as a reviewer if you want a second pass from me.

docs/developer-guide/syntax-reference.rst Outdated Show resolved Hide resolved

@Test
public void shouldHandleNestedExpressions() {
final String simpleQuery = "SELECT * FROM orders where FOO(ROWTIME + 6000) > '2017-01-01 00:00:00.000' + 35;";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a bit dangerous? We're assuming FOO is returning a long and that represents a datetime, but this might not be the case, e.g. SELECT * FROM orders where IS_LEAP_YEAR(ROWTIME + 6000) > '2017-01-01 00:00:00.000' + 35; wouldn't make sense!

I don't think we can make any such assumption on the return value from a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts were that this is purely a query rewriter, so those kinds of cases would get caught later on. So for example, IS_LEAP_YEAR(ROWTIME + 6000) > 35 passes this step, but would still eventually get caught during evaluation. I did change the overall logic of the rewriter as per @hjafarpour's comment though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They may get caught later on, but the UX for the user is not great. There's going to get some message that Foo can not be compared to Long or something, which is very confusing for the user.

@big-andy-coates big-andy-coates requested review from a team, rodesai and hjafarpour and removed request for a team August 5, 2019 20:52
@big-andy-coates
Copy link
Contributor

@blueedgenick

can we feature-flag this somewhere, so that we can easily turn it off once we have proper datetime data-type support

This is all hidden from the user anyway. So once we have a datetime type if/when we are able to flip ROWTIME to be of type datetime we'll need to make sure all the same comparisons work. So I don't think we need a feature flag, just to remember to tidy up old code once ROWTIME is a datetime.

Copy link
Contributor

@hjafarpour hjafarpour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jzaralim . Left a comment.

}

public static boolean requiresRewrite(final Expression expression) {
return expression.toString().contains("ROWTIME");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will match any expression with ROWTIME in it. For instance, it will match the following expression where you should not indeed rewrite the expression:
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss') = '2019-08-08 12:12:12'.
You need to change this such that it only rewrites the cases you have mentioned in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular statement remained the same, but the rewriter itself has been updated to only rewrite timestrings when one side of the comparison is exactly ROWTIME.

Copy link
Contributor

@hjafarpour hjafarpour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jzaralim jzaralim merged commit 42acd78 into confluentinc:master Aug 8, 2019
@jzaralim jzaralim deleted the rowtime branch August 8, 2019 22:35
@blueedgenick
Copy link
Contributor

@big-andy-coates

This is all hidden from the user anyway. So once we have a datetime type if/when we are able to flip ROWTIME to be of type datetime we'll need to make sure all the same comparisons work. So I don't think we need a feature flag, just to remember to tidy up old code once ROWTIME is a datetime.

this is kind of my point - what's really happening here is that we are introducing some limited form of "auto-casting" from one datatype (long-of-a-timestamp) to another (string).
This kind of automated attempt to be helpful with datatypes rarely works out well in practice as it only obfuscates to the user what the true types of things are, is incomplete (not covering all possible conversions), and can have unpredictable or surprising consequences. I'd recommend against pursuing any further form of this pattern. You'll notice that very few, if any, rdbms products do this without forcing you to do a proper cast() or convert() in your sql.

@jzaralim
Copy link
Contributor Author

jzaralim commented Aug 8, 2019

Hey all, I realized that I've hit "merge" prematurely. If anyone has any more feedback, I can submit another PR. Sorry about this!

@big-andy-coates
Copy link
Contributor

@blueedgenick ah, that's what you were getting at... then I whole heartily agree!

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jzaralim,

Couple of concerns about this remain:

1. Rewriting function calls.

I still feel handling functions is too risky. There are too many edge cases and bad UX.

SELECT * FROM orders where FuncThatReturnsLong(ROWTIME) > '2017-01-01 00:00:00.000';
-- Or
SELECT * FROM orders where ROWTIME > FuncThatReturnsLong('2017-01-01 00:00:00.000');

The PR will rewrite it to:

 SELECT * FROM orders where FuncThatReturnsLong(ROWTIME) > STRINGTOTIMESTAMP('2018-01-01T00:00:00.000', 'yyyy-MM-dd''T''HH:mm:ss.SSS');
-- OR
SELECT * FROM orders where ROWTIME > FuncThatReturnsLong(STRINGTOTIMESTAMP('2018-01-01T00:00:00.000', 'yyyy-MM-dd''T''HH:mm:ss.SSS'));

Which is valid SQL, but may not make any sense. ROWTIME is a BIGINT, but should actually be a datetime, which makes the auto-conversion from date string to long OK. You can't safely assume a BIGINT return value from any function to also be a pseudo timestamp.

As noted inline, the PR is, I think, not rewriting function calls due to a bug in the code, but if it where, it would incorrectly rewrite this:

SELECT * FROM orders where ROWTIME > FuncThatReturnsDateTime('not a date');

to

-- You can't assume a string parameter to a function is a date string!
SELECT * FROM orders where ROWTIME > FuncThatReturnsDateTime(STRINGTOTIMESTAMP('not a date01T00:00:00.000', 'yyyy-MM-dd''T''HH:mm:ss.SSS'));

2. Error handling needs improving

The following should generate an error message but doesn't, it just silently filters all rows out:

-- should generate error:
SELECT source FROM TEST WHERE ROWTIME < 'not a date';

3. (MINOR) Does handle casts

e.g.

SELECT * FROM orders where cast(ROWTIME as BIGING) > '2017-01-01 00:00:00.000';

The above doesn't work, which is a minor thing, but is an indication there are probably a lot of other similar edge cases to this 'magic'.

What worries me is that there may be many more edge cases that we're not thought of. I would urge you to drop the (broken?) function call handling. I'd even go as far as saying only rewrite if one side is a pure ROWTIME and the other is string. This minimises the surface area for edge cases.

I'd then really think how this can be misused and add tests. What are the edge cases? e.g. string that is not a date, and then ensure the UX is good.

Good luck :D


@Override
public Expression visitFunctionCall(final FunctionCall node, final Object context) {
return (Expression) new StatementRewriter().process(node, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this line do anything other than return a duplicate of the function call node? I think StatementRewriter should be abstract as its meant as a base class for rewrites. In itself it doesn't actually change the query, right?

Copy link
Contributor Author

@jzaralim jzaralim Aug 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It stops the rewriter from further processing anything within a function. So, if the expression looks like ROWTIME > FOO('2017-01-01 ' + 100) + '2019-01-01', then '2019-01-01' will get wrapped but '2017-01-01' doesn't. This actually also addresses the first point above, since function calls don't get rewritten at all.

@jzaralim
Copy link
Contributor Author

jzaralim commented Aug 12, 2019

@big-andy-coates The decision flow for a rewrite are as follows:

  1. If a filter contains the string ROWTIME anywhere in it, pass it to the rewriter.
  2. If the rewriter ever reaches a Comparision expression, check if one of the sides is exactly ROWTIME. If one of them is, pass its subnodes to another rewriter (TimestampRewriter).
  3. In TimestampRewriter, if a string is reached, wrap it in STRINGTOTIMESTAMP. However, if a function is reached, don't process the arguments of it.

So, the following statements do not get rewritten:
ROWTIME > FOO(5 + '2017-01-01')
FOO(ROWTIME) > '2017-01-01'
5 + ROWTIME>'2017-01-01'

As for error handling, ideally STRINGTOTIMESTAMP would catch them. However, it seems that STRINGTOTIMESTAMP not catching parsing errors is not just a simple bug, so the next PR will have the following changes:

  1. Instead of wrapping the datestrings, rewrite them by directly converting them to Long. That way, the errors caught by the datestring parser could be passed on to the user.
  2. Add more tests to go over more edge cases.

EDIT: After thinking about it more, I agree that downscaling the rewrite conditions to only rewriting if one side is ROWTIME and the other a string is the safest solution and will be changing that in the next PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants