Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky EventListenerTest.DisableBGCompaction #9400

Closed

Conversation

pdillinger
Copy link
Contributor

Summary: Wasn't able to easily reproduce error, but easy to see a race
condition between TestFlushListener::OnFlushCompleted and
DBTestBase::Close(), which frees CF handles before closing DB.

Test Plan: CI etc.

Summary: Wasn't able to easily reproduce error, but easy to see a race
condition between TestFlushListener::OnFlushCompleted and
DBTestBase::Close(), which frees CF handles before closing DB.

Test Plan: CI etc.
@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except a minor comment.

@@ -515,6 +515,9 @@ TEST_F(EventListenerTest, DisableBGCompaction) {
db_->GetColumnFamilyMetaData(handles_[1], &cf_meta);
}
ASSERT_GE(listener->slowdown_count, kSlowdownTrigger * 9);
// We don't want the listener executing during DBTestBase::Close() due to
// race on handles_.
dbfull()->TEST_WaitForBackgroundWork();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about also asserting on error_handler_.GetBGError() is OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better, I'll make TEST_WaitForBackgroundWork() return error_handler_.GetBGError() (like other similar functions) and ASSERT_OK on that.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Feb 22, 2022
Summary:
We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, #9084, #9400, #9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race condition-coercing patch and fixed  issues it found besides old BlobDB.

```
 diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 0e1864788..aaba68c4a 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted(
        mutable_cf_options.level0_stop_writes_trigger);
   // release lock while notifying events
   mutex_.Unlock();
+  bg_cv_.SignalAll();
+  sleep(1);
   {
     for (auto& info : *flush_jobs_info) {
       info->triggered_writes_slowdown = triggered_writes_slowdown;
```

The reason I did not fix old BlobDB issues is because it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file.

Pull Request resolved: #9617

Test Plan: `make check` including the race condition coercing patch

Reviewed By: hx235

Differential Revision: D34384022

Pulled By: ajkr

fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants