Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DROP [STREAM|TABLE] should support termination of query started during creation. #2177

Closed
big-andy-coates opened this issue Nov 21, 2018 · 22 comments · Fixed by #6143
Closed
Assignees
Labels
engine operability Issues pertaining to running and operating KSQL, notably in production popular Issues with significant user demand user-experience
Milestone

Comments

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Nov 21, 2018

If I create a stream that starts a query in the background, e.g.

CREATE STREAM S AS SELECT c1, c2 FROM T;

Then I should be able to drop this stream without needing to find the query Id and terminate the query. e.g. I should be able to do:

DROP STREAM S;

This currently complains, due to inter-stream referential integrity checking. (The query is seen as a source to stream 'S', so 'S' can't be dropped while the query is running).

The issue here is that the query id is non-deterministic, being based on an internal counter. So its not possible to submit a script to the rest api that drops and recreates a stream, as you can't add suitable TERMINATE call.

The complexity comes about when we have INSERT INTO S queries running into the same stream/table. I think it is OK/correct in such a situation that the DROP STREAM S to complain. However, we should have a way of overriding this so that even that query is stopped.

I would propose that DROP STREAM S should automatically stop any query that was started as part of the original CREATE STREAM S AS statement. But the presence of other queries that insert into the stream should cause it to fail, unless the user requests us to stop them, e.g. we could support something like:

-- use if you want any other source queries to automatically be stopped
FORCE DROP STREAM S;  

The DROP command will also fail if this stream is used as a source for other queries. This should continue to cause the DROP to fail.

I guess an alternative would be some syntax to allow your to terminate a queries based on the stream/table they output too, e.g.

-- terminate all queries that output to stream S.
TERMINATE ALL FOR STREAM S;

This approach differs from having a straight DROP STREAM S automatically stopping the query started by its CSAS statement in that with the above TERMINATE FOR STREAM syntax the user can't choose to drop only if the only query running is the one auto created during the CSAS statement.

I prefer the implicit terminate approach because it differentiates between those queries created during the CSAS/CTAS and those that weren't. So you can do this:

Straight create / drop example:
CREATE STREAM S1 AS SELECT * FROM FOO;
-- This should work, automatically stopping the query created in the line above
DROP STREAM S1;
With insert into example:
CREATE STREAM S2 AS SELECT * FROM FOO;
INSERT INTO S2 SELECT * FROM BAR;

-- Drop should fail as there are other queries that write to S2.
DROP STREAM S2;

-- a forced drop should still succeed:
FORCE DROP STREAM S2;
With downstream dependencies example:
CREATE STREAM S2 AS SELECT * FROM FOO;
CREATE STREAM S3 AS SELECT * FROM S2;

-- Drop should fail as as this would break the query behind S3;
DROP STREAM S2;

-- Drop should _still_ fail as as this would break the query behind S3;
FORCE DROP STREAM S2;

Ephemeral queries

Ephemeral / temporary queries, (e.g. from a user issuing a SELECT * FROM BLAH; statement in the CLI), should not be included when determining if an entity can be dropped, i.e. an ephemeral query should not stop an entity being dropped.

To do so would be the same as not allowing a DB Admin of a RDBMS system to not be able to drop a table if any system was currently selecting from it.

Such ephemeral queries should terminate and, ideally, report a suitable message to the user.

@big-andy-coates
Copy link
Contributor Author

Duplicate of #1317?

@big-andy-coates
Copy link
Contributor Author

Related to #2112, which is more about a CASCADE DROP command to drop a hierarchy of statements.

@big-andy-coates big-andy-coates added user-experience operability Issues pertaining to running and operating KSQL, notably in production labels Nov 21, 2018
@miguno
Copy link
Contributor

miguno commented Nov 21, 2018

@big-andy-coates: Thanks for putting this together. At large, this make sense to me.

Impact on dependencies?

Could you please clarify the behavior for a stream's/table's dependencies:

  1. What's the desired behavior for queries/streams/tables that write to the [stream/table]-to-be-dropped when using a normal DROP? In comparison, what's the desired behavior for a forced DROP? This is, as you already alluded to above, not just about the [stream/table]-to-be-dropped itself but also what happens to these related queries/streams/tables.

    • From what I understand: For this scenario there's only one case today (apart from a stream's/table's "own" query as part of CTAS), which is a query, and that query is INSERT INTO, right? And, fwiw, INSERT INTO is a DML statement.
  2. What's the desired behavior for streams/tables/queries that read from the [stream/table]-to-be-dropped? In comparison, what's the desired behavior for a forced DROP? This is, as you already alluded to above, not just about the [stream/table]-to-be-dropped itself but also what happens to these related queries/streams/tables.

    • From what I understand: This scenario seems to be DDL only, and that would be CTAS and CSAS statements.
    • But what about temporary queries (SELECT * FROM streamOrTableToBeDropped) in case user A runs a temporary query on the stream/table while, in parallel, user B issues the DROP statement?

Relation of this ticket to other tickets

As you pointed out above, we also have the related ticket #2112 (~ DROP CASCADE).

My questions here:

  • Can we really differentiate between the two tickets?
  • Should we really differentiate?

I think your argument is that any "collateral damage" of this ticket (beyond terminating its own query from a CSAS/CTAS) should be limited to what I listed as (1) further above in this comment, whereas #2112 is more about (2)?

Regarding existing RDBMS: What's their behavior for DROP TABLE in the face of (1) and (2)? Will a DROP TABLE ... CASCADE, if supported (see below), cover both (1) and (2), or just either of the two?

Syntax feedback

Regarding the syntax: I'd prefer an approach based on DROP, too, rather than sth like TERMINATE ALL FOR STREAM S.

For what it's worth, here's what some RDBMS have chosen to implement.

PostgreSQL (docs):

Uses DROP TABLE myTable and DROP TABLE myTable CASCADE. DROP TABLE myTable is actually an alias for DROP TABLE myTable RESTRICT.

CASCADE
Automatically drop objects that depend on the table (such as views), and in turn all objects that depend on those objects (see Section 5.13).

RESTRICT
Refuse to drop the table if any objects depend on it. This is the default.

MySQL (docs):

Considers DROP TABLE ... [RESTRICT | CASCADE] as valid syntax to make migration from other DBs like PostgreSQL easier, but itself it doesn't support this functionality.

The RESTRICT and CASCADE keywords do nothing. They are permitted to make porting easier from other database systems.

SQL Server:

From what I understand it doesn't support the DROP syntax directly. Instead, it says:

DROP TABLE cannot be used to drop a table that is referenced by a FOREIGN KEY constraint. The referencing FOREIGN KEY constraint or the referencing table must first be dropped. If both the referencing table and the table that holds the primary key are being dropped in the same DROP TABLE statement, the referencing table must be listed first.

Multiple tables can be dropped in any database. If a table being dropped references the primary key of another table that is also being dropped, the referencing table with the foreign key must be listed before the table holding the primary key that is being referenced.

When a table is dropped, rules or defaults on the table lose their binding, and any constraints or triggers associated with the table are automatically dropped. If you re-create a table, you must rebind the appropriate rules and defaults, re-create any triggers, and add all required constraints.

Note: There's also a related option, which is the ON DELETE CASCADE clause when you are defining the table (DDL), but its use case is different because it deletes records from any child tables but it doesn't cause SQL Server to delete the child tables when doing a DROP.

@big-andy-coates
Copy link
Contributor Author

big-andy-coates commented Nov 21, 2018

@miguno

I thought I'd made that pretty clear in my description - apparently not!

In answer to Q1:

  • desired behaviour is that DROP will only stop the query that was originally created with the stream/table. See 'Straight create / drop example' example above. It will fail if any other queries read or write to the stream/table.
  • desired behaviour is that FORCE DROP will stop any query that write to the stream/table, i.e. it will also stop any queries started by INSERT INTO statements. See 'With insert into example' example above.

In answer to Q2:

  • Both DROP and FORCE DROP will fail if there are downstream queries that read from the stream. (We'd need to support CASCADE DROP for that).
  • An INSERT INTO X SELECT * FROM Y reads from Y. So it's not just CTAS / CSAS statements.
  • The bit about ephemeral queries is a good point. Personally, I don't think they should stop DROP working. (I don't think they do today). Otherwise, if as a DB admin, you can't drop a table if a user is selecting from it... that would be weird. (I'll add this to the description above).

Regarding Cascade drop:

  • Can we really differentiate between the two tickets? This ticket is about controlling what queries are terminated as part of the drop command. The other is about what streams or tables are dropped. Very different.
  • Should we really differentiate? Yes. They are complementary.

I think your argument is that any "collateral damage" of this ticket (beyond terminating its own query from a CSAS/CTAS) should be limited to what I listed as (1) further above in this comment, whereas #2112 is more about (2)?

Yeah, [FORCE] DROP only affects things writing to this stream/table. CASCADE DROP would include other streams/tables built from this one. But as I said, they are complementary.

Regarding existing RDBMS: What's their behavior for DROP TABLE in the face of (1) and (2)? Will a DROP TABLE ... CASCADE, if supported (see below), cover both (1) and (2), or just either of the two?

I guess the closest thing an RDBMS will have is the cascade dropping of views built of a table. Which I guess is similar to having a hierarchy of streams and tables. So, like KSQL, they'll allow users to drop the leaves of the dependency graph, (.i.e. those things that have nothing reading from them), and will use a cascade drop to delete everything downstream of the entity being dropped. This is the same as (2).

RDMBS's obviously don't have long running queries like KSQL, so (1) doesn't apply. Because (1) is all about queries, not tables/streams.

From what I understand it [SQL Server] doesn't support the DROP syntax directly

I read that differently: it is supported, but you need to start at the leaves first. i.e. if you have table b with a foreign key to table a, then you need to drop table b before you can drop table a. This is just referential integrity of the dbs metadata.

@big-andy-coates
Copy link
Contributor Author

big-andy-coates commented Nov 21, 2018

Just having a chat with @hjafarpour about this and it struck me that this solution doesn't help when it comes to INSERT INTO commands.

Affectively, what most people want to do in a script is drop and recreate on each statement. The above discussion focuses on drop and recreate of CSAS/CSAT statements. But won't work for INSERT INTO statements. To drop and recreate the insert into statement I can't drop the INSERT INTO as it doesn't exist. I need to terminate the underlying query and then reissue my insert statement. But in a script I don't know the query Id so can't so it.

So how does a user terminate and restart an insert into query?

INSERT INTO D SELECT a, b, func(c) FROM S;

???

  • can't use the query id in a script as they are none-deterministic.
  • could terminate based on the sql statement, but that's brittle as that can change over time.

Maybe what we need is a way for users to query KSQL for query ids!

SELECT QueryId FROM RUNNING_QUERIES WHERE SOURCE = 'S' AND DESTINATION = 'D' AND SQL CONTAINS 'Insert Into';

Then we can add the ability to select into a variable and/or subqueries:

TERMINATE ALL (SELECT QueryId FROM RUNNING_QUERIES WHERE SOURCE = 'S' AND DESTINATION = 'D' AND SQL CONTAINS 'Insert Into');

Thoughts?

@miguno
Copy link
Contributor

miguno commented Nov 21, 2018

Thanks for the follow-up, Andy.

RDMBS's obviously don't have long running queries like KSQL, so (1) doesn't apply. Because (1) is all about queries, not tables/streams.

Yeah. Handling INSERT INTO when DROP'ing a table wouldn't be an issue for RDBMS because the INSERT INTO isn't continuously running like KSQL's INSERT INTO.

And that's why then, in this ticket:

  • For KSQL we do have to worry about (1) above, as you said.
  • For the proposed KSQL syntax: There's no prior art or SQL grammar from the RDBMS world that we could or should leverage / re-use.

From what I understand it [SQL Server] doesn't support the DROP syntax directly

I read that differently: it is supported, but you need to start at the leaves first. i.e. if you have table b with a foreign key to table a, then you need to drop table b before you can drop table a. This is just referential integrity of the dbs metadata.

Oh, yes, of course you can do the manual work of first DROPing the dependencies, and then DROP'ing the actual table. I was talking about the cascading DROP variant, which isn't supported by SQL Server (and also not supported by MySQL).

@miguno
Copy link
Contributor

miguno commented Nov 21, 2018

Affectively, what most people want to do in a script is drop and recreate on each statement. The above discussion focuses on drop and recreate of CSAS/CSAT statements. But won't work for INSERT INTO statements. To drop and recreate the insert into statement I can't drop the INSERT INTO as it doesn't exist. I need to terminate the underlying query and then reissue my insert statement. But in a script I don't know the query Id so can't so it.

I think this is a valid concern, but: This ticket is about dropping tables/streams (CTAS/CSAS), and that we are making the case that the user should optionally be able to force the termination of any INSERT INTO queries that are writing into the table/stream.

Your new concern above is about how users could stop/restart (drop and recreate) an INSERT INTO statement in isolation, i.e. without touching the target stream or table.

Right?

@big-andy-coates big-andy-coates mentioned this issue Nov 23, 2018
2 tasks
@big-andy-coates
Copy link
Contributor Author

Your new concern above is about how users could stop/restart (drop and recreate) an INSERT INTO statement in isolation, i.e. without touching the target stream or table.

I'm just realising that any solution we come up with for CSAS / CTAS should also be generic enough to work with INSERT INTO. Otherwise we're only solving half the problem.

The only solution that I can think of right now that can handle CSAS and CTAS and INSERT INTO is some queryable metadata that allows users to get at the queryids.

@hjafarpour @rodesai @dguy @apurvam @vcrfxia @rmoff can you think of any other statements we need to worry about? What's your opinion?

@big-andy-coates big-andy-coates self-assigned this Nov 26, 2018
@rmoff
Copy link
Contributor

rmoff commented Nov 26, 2018

Late to the convo but here're my thoughts:

FORCE DROP STREAM S;  

I think it would be better to retain DROP as the verb, and have FORCE / CASCADE / WITH EXTREME PREJUDICE operators after the object name

TERMINATE ALL FOR STREAM S;

This would still leave the stream object and make it a two-command process. I can't see a scenario where you'd want to do this, and not just the whole thing in one go

CREATE STREAM S2 AS SELECT * FROM FOO;
CREATE STREAM S3 AS SELECT * FROM S2;

We should still support an option to drop a stream and all child dependencies. This would happen when redeploying a pipeline. So maybe we want CASCADE here? I realise that this is touching on #2112 but we don't want to miss a trick by arbitrarily separating them just because it's two GH issues to start with.

Thinking about it, what's the reason not to automagically clean up queries if a drop is requested against a stream? I mean, should we default to "FORCE DROP" (i.e. drop a stream also terminates dependent queries)? Keep the simplest syntax for the most common operation, and then introduce a NO TERMINATE flag if we want to retain the query?

@big-andy-coates
Copy link
Contributor Author

@rmoff

Thinking about it, what's the reason not to automagically clean up queries if a drop is requested against a stream? I mean, should we default to "FORCE DROP" (i.e. drop a stream also terminates dependent queries)? Keep the simplest syntax for the most common operation, and then introduce a NO TERMINATE flag if we want to retain the query?

I think we should default to automatically terminating the implicit query created when the stream was created, when it is later dropped. However, if you're suggesting we should also terminate downstream queries, then I'd disagree - the reason we added referential integrity in the first place was to stop users dropping a query a down stream query relies on.

@miguno
Copy link
Contributor

miguno commented Dec 4, 2018

What about the following flow then:

  1. DROP STREAM s and DROP TABLE t (optional syntax: DROP STREAM s RESTRICT): Deletes the stream/table and its "implicit" query, and nothing more.
  2. DROP STREAM s CASCADE and DROP TABLE t CASCADE: Same as DROP RESTRICT above, but additionally also deletes any downstream streams/tables, including their respective "implicit" queries.

The one open question would then be how to handle any INSERT INTO statements for both (1) and (2), notably because for (2) the downstream streams/tables might have their own INSERT INTOs.

@big-andy-coates
Copy link
Contributor Author

It seems we have the following things to worry about when dropping a stream or table:

  1. The query that was created, if any, when the table/stream was created.
  2. Any other queries writing into this stream/table, i.e. INSERT INTO statements.
  3. Downstream tables/streams/insert-into statements and their underlying queries.

If any of the above exist then, at the moment, the DROP will fail.

So maybe it's enough to just have CASCADE handle both 2 & 3?

@miguno
Copy link
Contributor

miguno commented Dec 10, 2018

@big-andy-coates :

So maybe it's enough to just have CASCADE handle both 2 & 3?

As I mentioned in my previous comment, I think CASCADE should certainly cover (3), and my question was whether it should also cover (2). Personally, I think CASCADE should cover (2) as well, and it should also cover the recursive scenario where any downstream streams/tables have their own INSERT INTOs.

Now, regarding UX, it's obvious that a DROP CASCADE is rather impactful and potentially dangerous -- a human mistake can be costly. Should we provide any functionality like a dry-run option to explain to the user what exactly would happen if the DROP CASCADE would be carried out? Think: "First, do a dry-run DROP CASCADE. Then, if the results look good to you, you can run the real DROP CASCADE."

How is this handled in RBDMS that support CASCADE? Is there any such safety measure?

@rmoff
Copy link
Contributor

rmoff commented Dec 10, 2018 via email

@miguno
Copy link
Contributor

miguno commented Dec 11, 2018

Maybe we could have a dry run flag available for people who are scared? ;)

That's what I meant. A normal DROP CASCADE would "just do it". But cautious people might want to have the option to do a dry run first (via an optional flag).

IMHO, the minimum work would be DROP CASCADE without such a dry-run option, and the dry-run option would be the extra.

@rmoff
Copy link
Contributor

rmoff commented Dec 11, 2018 via email

@big-andy-coates
Copy link
Contributor Author

big-andy-coates commented Dec 12, 2018

Sounds sensible. What would the dry run syntax look like?

DROP STREAM x CASCADE DRY RUN

???

@miguno
Copy link
Contributor

miguno commented Dec 12, 2018

+1 @big-andy-coates

DROP STREAM <name> [CASCADE | RESTRICT] [DRY RUN]
DROP TABLE  <name> [CASCADE | RESTRICT] [DRY RUN]

@spena
Copy link
Member

spena commented Aug 31, 2020

#5987 looks to address this issue partially by terminating only the persistent query that belongs to the CAS statement. If other queries (inserts, cas, transient) are reading the stream/table then DROP will fail until any reference is terminated/dropped manually.

Marking #5987 as duplicate. I will use this issue to track the work as seems it is most popular feature to implement.

@mingfang
Copy link

mingfang commented Sep 7, 2020

DROP STREAM already supports an optional DELETE TOPIC
https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/drop-stream/

Why not add another optional DELETE QUERY

@spena
Copy link
Member

spena commented Sep 10, 2020

@mingfang That would be useful too, but DROP STREAM is more natural and SQL standard than adding the DELETE QUERY.

The query running in the background for such stream/table is only used by the stream. A user shouldn't have to interact with the query. That's the idea of terminating the query automatically so users interact with the STREAM only (no queries).

@jens-guenther
Copy link

jens-guenther commented Sep 18, 2020

I know I'm pretty late to the party, have no clues about the internals, not much knowledge about ksql itself, but an outsider's opinion :). I've read almost all tickets & PRs about TERMINATE ALL, DROP ... CASCADE, INSERT vs. UNION, and I have a personal issue with DROP S/T while a query is still running, which brought me here in the first place. So please, bear with me if I miss something.

IMHO you should take a step back and look from a higher level: The persistent query is not a first-class citizen in the KSQL language like streams or tables. I think this is the main reason why you have all these issues.
However, your queries are the R.E.A.L. powerhouse of ksql, giving us the ability to define our stream processors easily. To me, queries are the serverless functions of Kafka.

We have CREATE STREAM ...
We have CREATE TABLE ...
We should have CREATE QUERY

Imagine the possibilities:

CREATE QUERY name AS SELECT
CREATE QUERY name AS INSERT
CREATE QUERY name AS SCRIPT
...

To me, an e.g. CREATE STREAM AS SELECT is nice as it consolidates 2 steps in one (create stream, create query) but I wouldn't bother so much about writing those two statements if I'd have to.

Btw. I personally think that DROP ... CASCADE is a bad style (also in SQL) and a dinosaur. If you want to drop entire functional areas in a production database you would do this rather carefully and step by step. If you want to do this in your dev database you start it from scratch and apply your migrations.

@spena spena modified the milestones: 0.13.0, 0.14.0 Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
engine operability Issues pertaining to running and operating KSQL, notably in production popular Issues with significant user demand user-experience
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants