Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: surface error to user when command topic deleted while server running #6240

Merged
merged 12 commits into from
Sep 23, 2020

Conversation

stevenpyzhang
Copy link
Member

@stevenpyzhang stevenpyzhang commented Sep 16, 2020

Description

Stacked on top of #6164
If the command topic is deleted currently, the command topic consumer in the CommandRunner thread continues to poll the non-existent topic, spamming logs with

WARN [Consumer clientId=consumer-null-1, groupId=null] Received unknown topic or partition error in fetch for partition _confluent-ksql-default__command_topic-0

PR changes it so that if the command topic doesn't exist, the server enters DEGRADED state and adds warnings to all api requests sent to the /ksql that the command topic was deleted.

Also added a new CommandRunnerDegradedReason category to expose this reason in the server metrics.

Testing done

Started up the server, deleted the command topic

ksql> show streams;

 Stream Name         | Kafka Topic                 | Format 
------------------------------------------------------------
 KSQL_PROCESSING_LOG | default_ksql_processing_log | JSON   
------------------------------------------------------------
WARNING: The server has detected corruption in the command topic due to modifications performed on it.
DDL statements will not be processed any further.
If a backup of the command topic is available, restore the command topic using the backup file.
A server restart is required to restore full functionality

Used jconsole to verify the command runner status is DEGRADED and the reason is COMMAND_TOPIC_DELETED

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Some minor comments inline

INCOMPATIBLE_COMMAND(Errors:: commandRunnerDegradedIncompatibleCommandsErrorMessage);
CORRUPTED(Errors::commandRunnerDegradedBackupCorruptedErrorMessage),
INCOMPATIBLE_COMMAND(Errors::commandRunnerDegradedIncompatibleCommandsErrorMessage),
COMMAND_TOPIC_DELETED(Errors::commandRunnerDegradedCommandTopicDeletedErrorMessage);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember a discussion on the design doc indicated that maybe we don't want to differentiate between corrupted and deleted. What was the decision on that? (also good to document it here for posterity)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main reason is right now, backups are optional, so if backups aren't enabled, then it wouldn't really make sense to have the degraded reason be CORRUPTED.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking it over with Rohan a bit more, I've combined CORRUPTED and COMMAND_TOPIC_DELETED into CORRUPTED. The COMMAND_TOPIC_DELETED could show up as CORRUPTED after a restart and having different states for the same external action (deleting the command topic) is strange.

I've made the error message returned from the API generic enough so it address both the backup corruption and the command topic deleted/modified case

The server has detected corruption in the command topic due to modifications performed on it. DDL statements will not be processed any further.
If a backup of the command topic is available, restore the command topic using the backup file.
A server restart is required to restore full functionality

@@ -289,6 +302,9 @@ void fetchAndRunCommands() {
lastPollTime.set(clock.instant());
final List<QueuedCommand> commands = commandStore.getNewCommands(NEW_CMDS_TIMEOUT);
if (commands.isEmpty()) {
if (!commandTopicExists.get()) {
commandTopicDeleted = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can save this for another PR, but I think it makes sense to have this method return some status then to set a ton of flags and check them in the main loop

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll file a follow up issue for that refactoring improvement.

LOG.warn("The command topic offset was reset. CommandRunner thread exiting.");
state = new Status(
CommandRunnerStatus.DEGRADED,
CommandRunnerDegradedReason.COMMAND_TOPIC_DELETED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does offset reset always mean command topic deleted? it could mean the retention was configured incorrectly perhaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time we start up the server, we'd check/override the command topic properties such that the retention.ms=-1 although I guess it is possible for the user themselves to override the retention config on the command topic which could trigger this. I can enhance the error message so that it mentions the degraded state is a result of the command topic being deleted or the configurations being modified for it

The REASON can be updated to COMMAND_TOPIC_DELETED_OR_MODIFIED

INCOMPATIBLE_COMMAND(Errors:: commandRunnerDegradedIncompatibleCommandsErrorMessage);
CORRUPTED(Errors::commandRunnerDegradedBackupCorruptedErrorMessage),
INCOMPATIBLE_COMMAND(Errors::commandRunnerDegradedIncompatibleCommandsErrorMessage),
COMMAND_TOPIC_DELETED_OR_MODIFIED(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different than corrupted? We're leaking an implementation detail of how we detect corruption here. The underlying condition is the same - the command topic isn't in the state that the system expects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've combined the two now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants