
Fix TSAN data race in EventListenerTest.MultiCF #9528

Closed
hx235 wants to merge 1 commit

Conversation

@hx235 (Contributor) commented Feb 8, 2022

Context:
EventListenerTest.MultiCF occasionally failed with a TSAN data race report like the one below:

```
WARNING: ThreadSanitizer: data race (pid=2047633)
  Read of size 8 at 0x7b6000001440 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size()  (listener_test+0x52337c)
    #1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() db/listener_test.cc:384:7 (listener_test+0x52337c)

  Previous write of size 8 at 0x7b6000001440 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&) (listener_test+0x550654)
    #1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&)  (listener_test+0x550654)
    #2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) db/listener_test.cc:255:18 (listener_test+0x550654)
```

After investigation, it is due to the following:
(1) The ASSERT_OK(Flush(i)); that precedes the read std::vector::size() is supposed to block on DBImpl::bg_cv_ until the memtable flush finishes, and to be signaled at the end of the background flush, which happens after the write std::vector::push_back(). The sequence of execution should therefore have been synchronized as call flush() -> write -> return from flush() -> read, and would not have caused any TSAN data race.

(2) However, there are multiple places in the code that can signal this DBImpl::bg_cv_ and mistakenly wake up ASSERT_OK(Flush(i)); (or ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());) too early (with the lock available to them), resulting in a non-synchronized read and write and thus a TSAN data race.
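For reference, the wait behind TEST_WaitForFlushMemTable() is a standard condition-variable predicate loop. The following is a minimal, self-contained sketch of that pattern (illustrative only, not the actual RocksDB source; the stand-in names are assumptions), showing why the predicate only orders the waiter after the memtable state change, not after the listener callbacks:

```
#include <condition_variable>
#include <mutex>

struct DbSketch {
  std::mutex mutex_;               // stands in for DBImpl::mutex_
  std::condition_variable bg_cv_;  // stands in for DBImpl::bg_cv_
  int num_not_flushed_ = 0;        // stands in for cfd->imm()->NumNotFlushed()

  // The predicate only covers memtable state: once the flushed memtable is
  // installed (num_not_flushed_ reaches 0), ANY notify on bg_cv_ lets this
  // return -- even if NotifyOnFlushCompleted() (and thus the test listener's
  // push_back) has not run yet.
  void WaitForFlushMemTable() {
    std::unique_lock<std::mutex> l(mutex_);
    bg_cv_.wait(l, [&] { return num_not_flushed_ == 0; });
  }
};
```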

  • Reproduced by the following, suggested by @ajkr:
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 4ff87c1e4..52492e9cf 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -22,7 +22,7 @@
 #include "test_util/sync_point.h"
 #include "util/cast_util.h"
 #include "util/concurrent_task_limiter_impl.h"
 namespace ROCKSDB_NAMESPACE {

 bool DBImpl::EnoughRoomForCompaction(
@@ -855,6 +855,7 @@ void DBImpl::NotifyOnFlushCompleted(
        mutable_cf_options.level0_stop_writes_trigger);
   // release lock while notifying events
   mutex_.Unlock();
+  bg_cv_.SignalAll();
```

Summary:

  • Added synchronization between the read and the write via the ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency() mechanism (sketched below)
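To illustrate the mechanism (the marker names below are hypothetical; the actual patch defines its own), LoadDependency() establishes a happens-before edge between two named sync points, so the test read cannot start until the listener write has finished:

```
#include "test_util/sync_point.h"

// Hypothetical marker names, for illustration only.
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
    {{"TestFlushListener::OnFlushCompleted:PostPushBack",  // predecessor
      "EventListenerTest.MultiCF:PreVerifyState"}});       // successor
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();

// In TestFlushListener::OnFlushCompleted(), after flushed_dbs_.push_back(db):
TEST_SYNC_POINT("TestFlushListener::OnFlushCompleted:PostPushBack");

// In the test body, just before reading listener->flushed_dbs_.size():
TEST_SYNC_POINT("EventListenerTest.MultiCF:PreVerifyState");
```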

Test: ./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10

  • pre-fix:
```
Repeating all tests (iteration 3)
Note: Google Test filter = EventListenerTest.MultiCF
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EventListenerTest
[ RUN      ] EventListenerTest.MultiCF
==================
WARNING: ThreadSanitizer: data race (pid=3377137)
  Read of size 8 at 0x7b6000000840 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size()
    #1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() db/listener_test.cc:384 (listener_test+0x4bb300)

  Previous write of size 8 at 0x7b6000000840 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&)
    #1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&)
    #2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) db/listener_test.cc:255 (listener_test+0x4e820f)
```
  • post-fix: All passed

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr (Contributor) commented Feb 9, 2022, on the following snippet under review:

```
ASSERT_OK(Flush(i));
ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());
ASSERT_EQ(listener->flushed_dbs_.size(), i);
ASSERT_EQ(listener->flushed_column_family_names_.size(), i);
ASSERT_TRUE(call_back);
```
I would have thought the problem is that TEST_WaitForFlushMemTable does not wait for the event listener to finish. What did you think the problem is?

Try this patch, for example:

```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 4ff87c1e4..8afb1c782 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -855,6 +855,8 @@ void DBImpl::NotifyOnFlushCompleted(
        mutable_cf_options.level0_stop_writes_trigger);
   // release lock while notifying events
   mutex_.Unlock();
+  bg_cv_.SignalAll();
+  usleep(100000);
   {
     for (auto& info : *flush_jobs_info) {
       info->triggered_writes_slowdown = triggered_writes_slowdown;
```

@hx235 (Contributor, Author) commented Feb 9, 2022

Yeah, that's the problem, but I am thinking of fixing it in the way we write the test. The reason is that the design of "// release lock while notifying events" seems valid for optimizing performance, and TEST_WaitForFlushMemTable's underlying bg_cv_ seems to be waiting on that same lock.

And technically, TEST_WaitForFlushMemTable waits for the memtable to be flushed but does not necessarily need to cover the listeners.

Maybe we could have another TEST_WaitForFlushNotificationFinish, which would essentially be the same as using a sync point here, without using another cv (see the hypothetical sketch after this comment).

(Intended to write more context in the PR but got distracted...)

(I am also open to more discussion on fixing these flaky cases, including #9516, as they are not urgent to merge.)
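For concreteness, a hypothetical sketch of such a helper (it does not exist in RocksDB; the counter name and placement are assumptions). Per the comment above, it reuses bg_cv_ rather than adding a second condition variable:

```
#include <condition_variable>
#include <mutex>

struct DbSketch {
  std::mutex mutex_;                     // stands in for DBImpl::mutex_
  std::condition_variable bg_cv_;        // reused; no second cv needed
  int pending_flush_notifications_ = 0;  // hypothetical counter: ++ before the
                                         // listeners run, -- (then notify_all)
                                         // after they return, both under mutex_

  // Returns only after all in-flight flush-completed notifications
  // (including user listeners) have finished.
  void TEST_WaitForFlushNotificationFinish() {
    std::unique_lock<std::mutex> l(mutex_);
    bg_cv_.wait(l, [&] { return pending_flush_notifications_ == 0; });
  }
};
```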

@ajkr (Contributor) commented:

I think it is close to fixing it in the test, as you said, but the reason I commented on this line in particular is that call_back might not have been set to true by the time this assertion executes, because TEST_WaitForFlushMemTable() doesn't wait long enough.

@hx235 (Contributor, Author) commented Feb 9, 2022

> I think it is close to fixing it in the test, as you said, but the reason I commented on this line in particular is that call_back might not have been set to true by the time this assertion executes, because TEST_WaitForFlushMemTable() doesn't wait long enough.

Oh ..... yep, I somehow overlooked that and repeated the very mistake behind the flaky test …

@hx235 hx235 marked this pull request as draft February 9, 2022 01:12
@hx235 hx235 changed the title Attempt to fix TSAN data race in EventListenerTest.MultiCF [Draft] Attempt to fix TSAN data race in EventListenerTest.MultiCF Feb 9, 2022
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 changed the title [Draft] Attempt to fix TSAN data race in EventListenerTest.MultiCF Fix TSAN data race in EventListenerTest.MultiCF Feb 10, 2022
@ajkr (Contributor) left a comment:

LGTM. I was thinking you could assert the callback was called 8 times after Close()
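A hypothetical version of that suggested check (the count and member names are assumptions based on the test context, not shipped code): after Close(), no background flush or listener can still be running, so the listener state can be read without racing:

```
// Hypothetical assertion after Close(); names assumed from the test context.
Close();
ASSERT_EQ(listener->flushed_dbs_.size(), 8);  // one entry per completed flush
```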

@hx235 (Contributor, Author) commented Feb 10, 2022

> LGTM. I was thinking you could assert the callback was called 8 times after Close()

I was going to, until I realized there is something called SyncPoint::LoadDependency(). And by the way, still writing my context....

@ajkr (Contributor) commented Feb 10, 2022

> SyncPoint::LoadDependency()

Works for me.

@facebook-github-bot: @hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 marked this pull request as ready for review February 10, 2022 03:08
facebook-github-bot pushed a commit that referenced this pull request Feb 22, 2022
Summary:
We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, #9084, #9400, #9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race-condition-coercing patch and fixed the issues it found, except for old BlobDB.

```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 0e1864788..aaba68c4a 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted(
        mutable_cf_options.level0_stop_writes_trigger);
   // release lock while notifying events
   mutex_.Unlock();
+  bg_cv_.SignalAll();
+  sleep(1);
   {
     for (auto& info : *flush_jobs_info) {
       info->triggered_writes_slowdown = triggered_writes_slowdown;
```

The reason I did not fix the old BlobDB issues is that it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file.

Pull Request resolved: #9617

Test Plan: `make check` including the race condition coercing patch

Reviewed By: hx235

Differential Revision: D34384022

Pulled By: ajkr

fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e
facebook-github-bot pushed a commit that referenced this pull request Sep 12, 2022
…race condition (#10291)

Summary:
**Context/Summary:**
Previous experience with bugs and flaky tests taught us that there exist features in RocksDB vulnerable to race conditions caused by acquiring the db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeups and sleeps so that the db mutex is acquired at various timings, in order to expose such race conditions.
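To make the injection idea concrete, here is a minimal, self-contained sketch (the function name and injection site are assumptions for illustration, not the actual #10291 implementation): before acquiring the db mutex, wake all bg_cv_ waiters and sleep briefly, so that other threads acquire the mutex at unusual timings.

```
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Sketch only: wraps db-mutex acquisition with a spurious wakeup plus a
// short sleep when COERCE_CONTEXT_SWITCH is defined.
void CoercedLock(std::mutex& db_mutex, std::condition_variable& bg_cv) {
#ifdef COERCE_CONTEXT_SWITCH
  bg_cv.notify_all();  // spurious wakeup for any predicate loop on bg_cv
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
#endif
  db_mutex.lock();
}
```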

**Testing:**
- `COERCE_CONTEXT_SWITCH=1 make -j56 check` / `make -j56 db_stress` should reveal
    - flaky tests caused by db-mutex-related race conditions
       - Reverted #9528
       - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1`, followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
       - `COERCE_CONTEXT_SWITCH=1` can cause the expected test failure (i.e., expose the target TSAN data race error) within 10 runs, while the other couldn't.
       - This proves our injection can expose flaky tests caused by db-mutex-related race conditions faster.
    - known or new race-condition-type internal bugs, by continuously running this PR
- Performance
   - High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed to about 4 times slower (2:01.16 vs 0:22.10 elapsed). This PR will be run as a separate CI job, so this regression won't affect any existing job.
```
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
--write_buffer_size=524288 --target_file_size_base=524288 --ingest_external_file_one_in=100 --compact_files_one_in=1000 --compact_range_one_in=1000
```
  - Start-up time: COERCE_CONTEXT_SWITCH=1 did not regress start-up time significantly (0:01.51 vs 0:01.29 elapsed, under 25%)
```
TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress

TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
```

Pull Request resolved: #10291

Reviewed By: ajkr

Differential Revision: D39231182

Pulled By: hx235

fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6