Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runaway: Mitigation tidb_runaway_queries flooding #55664

Merged
merged 11 commits into from
Sep 13, 2024

Conversation

HuSharp
Copy link
Contributor

@HuSharp HuSharp commented Aug 26, 2024

What problem does this PR solve?

Issue Number: ref #54434

Problem Summary: There could be a large number of runaway queries that are identified in a short period.
Currently, the TiDB logs each occurrence which flood the log table. Since it's likely the most queries are identical, we shall introduce a way to compress the logs.

What changed and how does it work?

changed

  • Include sql_digest of this query in this table. If users add manual rules with sql_digest, they can easily identify the hits by filtering sql_digest.
  • There's a potential risk that tabletidb_runaway_queries could be flooded if massive queries are identified as runaway queries in a short period. For example, SQL injections. Since this table is informational, we can afford to lose details.
    1. modify flushes the records to KV at an interval of 10 seconds
    2. Provide a record map with hash (resouce_name, sql_digest, plan_digest, match_type). This map will be flushed to tidb_runaway_queries table every 10s. (This is because flush inserts all the records, so the map is added to minimize the number of inserts)
    3. Add an UNIQUE KEY runaway_task(resource_group_name, sql_digest, plan_digest), time represents the timestamp of the first occurrence. repeat represents the times that the message repeats since the last flush.

result

mysql> ALTER RESOURCE GROUP default QUERY_LIMIT=(EXEC_ELAPSED='1s', ACTION=KILL,  WATCH=EXACT DURATION='10m');
Query OK, 0 rows affected (0.22 sec)

// now is empty
mysql> select * from mysql.tidb_runaway_queries limit 3 \G;
Empty set (0.00 sec)

mysql> select sleep(2) from t;
ERROR 8253 (HY000): Query execution was interrupted, identified as runaway query

mysql> select * from mysql.tidb_runaway_queries limit 3 \G;
*************************** 1. row ***************************
resource_group_name: default
         start_time: 2024-09-05 11:59:57
            repeats: 1
         match_type: identify
             action: kill
         sample_sql: select sleep(2) from t
         sql_digest: 4adbc838b86c573265d4b39a3979d0a362b5f0336c91c26930c83ab187701a55
        plan_digest: 5d094f78efbce44b2923733b74e1d09233cb446318293492901c5e5d92e27dbc
        tidb_server: 127.0.0.1:4000
1 row in set (0.00 sec)

And then execute select sleep(3) from t; which has same UNIQUE KEY runaway_task(resource_group_name, sql_digest, plan_digest) as select sleep(3) from t;

mysql> select sleep(3) from t;
ERROR 8253 (HY000): Query execution was interrupted, identified as runaway query

// We can find time and repeats have been updated
mysql> select * from mysql.tidb_runaway_queries limit 3 \G;
*************************** 1. row ***************************
resource_group_name: default
         start_time: 2024-09-05 11:59:57
            repeats: 2
         match_type: identify
             action: kill
         sample_sql: select sleep(2) from t
         sql_digest: 4adbc838b86c573265d4b39a3979d0a362b5f0336c91c26930c83ab187701a55
        plan_digest: 5d094f78efbce44b2923733b74e1d09233cb446318293492901c5e5d92e27dbc
        tidb_server: 127.0.0.1:4000

And then execute select sleep(2) from t; which will be regarded as watch type rather than kill.

mysql> select sleep(2) from t;
ERROR 8253 (HY000): Query execution was interrupted, identified as runaway query

// We can find time and repeats have been updated
mysql> select * from mysql.tidb_runaway_queries limit 3 \G;
*************************** 1. row ***************************
resource_group_name: default
         start_time: 2024-09-05 11:59:57
            repeats: 2
         match_type: identify
             action: kill
         sample_sql: select sleep(2) from t
         sql_digest: 4adbc838b86c573265d4b39a3979d0a362b5f0336c91c26930c83ab187701a55
        plan_digest: 5d094f78efbce44b2923733b74e1d09233cb446318293492901c5e5d92e27dbc
        tidb_server: 127.0.0.1:4000
*************************** 2. row ***************************
resource_group_name: default
         start_time: 2024-09-05 12:00:13
            repeats: 1
         match_type: watch
             action: kill
         sample_sql: select sleep(2) from t
         sql_digest: 4adbc838b86c573265d4b39a3979d0a362b5f0336c91c26930c83ab187701a55
        plan_digest: 5d094f78efbce44b2923733b74e1d09233cb446318293492901c5e5d92e27dbc
        tidb_server: 127.0.0.1:4000
2 rows in set (0.00 sec)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

runaway: Mitigation  tidb_runaway_queries flooding

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 26, 2024
Copy link

tiprow bot commented Aug 26, 2024

Hi @HuSharp. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 94.33962% with 3 lines in your changes missing coverage. Please review.

Project coverage is 56.8043%. Comparing base (fef43c5) to head (35d31aa).
Report is 58 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #55664         +/-   ##
=================================================
- Coverage   72.8403%   56.8043%   -16.0360%     
=================================================
  Files          1599       1731        +132     
  Lines        444670     635469     +190799     
=================================================
+ Hits         323899     360974      +37075     
- Misses       100835     250512     +149677     
- Partials      19936      23983       +4047     
Flag Coverage Δ
integration 37.3851% <5.6603%> (?)
unit 72.0964% <94.3396%> (+0.1343%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9567% <ø> (ø)
parser ∅ <ø> (∅)
br 52.7356% <ø> (+6.9623%) ⬆️

Copy link
Member

@nolouch nolouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add some tests?

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 4, 2024
@HuSharp HuSharp force-pushed the add_union_key_runaway branch 2 times, most recently from 2523bdb to d8fbe0d Compare September 4, 2024 03:53
@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

/retest-required

Copy link

tiprow bot commented Sep 4, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

can u add some tests?

added map-related and flood-related tests :)

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

/retest-required

Copy link

tiprow bot commented Sep 4, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

/retest-required

Copy link

tiprow bot commented Sep 4, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

/retest-required

Copy link

tiprow bot commented Sep 4, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Member

@nolouch nolouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

@@ -270,6 +270,13 @@ func revertVersionAndVariables(t *testing.T, se sessiontypes.Session, ver int) {
// for version <= version195, tidb_enable_dist_task should be disabled before upgrade
MustExec(t, se, "update mysql.global_variables set variable_value='off' where variable_name='tidb_enable_dist_task'")
}
if ver < version212 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only for tests? if in real environment, do we have a way or need a way to revert?

Copy link
Contributor Author

@HuSharp HuSharp Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for testing purposes only. We will not encounter calling unsetStoreBootstrapped and supporting fallback/revert in the real world. PTAL, thx~ @tangenta

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

/retest-required

Copy link

tiprow bot commented Sep 4, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 4, 2024

/retest-required

Copy link

tiprow bot commented Sep 4, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: husharp <ihusharp@gmail.com>
Signed-off-by: husharp <ihusharp@gmail.com>
Signed-off-by: husharp <ihusharp@gmail.com>
Signed-off-by: husharp <ihusharp@gmail.com>
@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 12, 2024

/test unit-test

Copy link

tiprow bot commented Sep 12, 2024

@HuSharp: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@glorv
Copy link
Contributor

glorv commented Sep 12, 2024

/ok-to-test
/retest

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Sep 12, 2024
@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 12, 2024

/retest-required

Signed-off-by: husharp <ihusharp@gmail.com>
Signed-off-by: husharp <ihusharp@gmail.com>
@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 12, 2024

/retest-required

Copy link
Collaborator

@lcwangchao lcwangchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for session part

Copy link

@yudongusa yudongusa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

ti-chi-bot bot commented Sep 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: glorv, JmPotato, lcwangchao, nolouch, tangenta, yudongusa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Sep 13, 2024
@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 13, 2024

/retest-required

Copy link

ti-chi-bot bot commented Sep 13, 2024

@HuSharp: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-lightning-integration-test f7ca2d1 link true /test pull-lightning-integration-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@HuSharp
Copy link
Contributor Author

HuSharp commented Sep 13, 2024

/retest-required

Copy link

tiprow bot commented Sep 13, 2024

@HuSharp: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
fast_test_tiprow 35d31aa link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot merged commit bb3456b into pingcap:master Sep 13, 2024
22 of 23 checks passed
@HuSharp HuSharp deleted the add_union_key_runaway branch September 13, 2024 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants