Improve explicit user readahead performance (#5246)
Summary:
Improve iterator performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.

1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer`, instead of `ReadaheadRandomAccessFileReader`, to handle user readahead requests (for both buffered and direct I/O cases).
3. Add `readahead_size` to db_bench.
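As a usage sketch, here is how an application opts into explicit readahead (public RocksDB API; the database path and loop body are illustrative only, not taken from the benchmark setup):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;
  // Illustrative path, not from the benchmark setup.
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/readahead_demo", &db);
  assert(s.ok());

  rocksdb::ReadOptions read_options;
  // Explicit user readahead: with this change the iterator serves sequential
  // reads through an internal FilePrefetchBuffer instead of creating a new
  // table reader per iterator.
  read_options.readahead_size = 1024 * 1024;  // 1 MB

  rocksdb::Iterator* it = db->NewIterator(read_options);
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Forward range scan; sequential block reads hit the prefetch buffer.
  }
  assert(it->status().ok());
  delete it;
  delete db;
  return 0;
}
```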

**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737

For a 1 MB readahead, buffered I/O performance improves by 28% and direct I/O performance improves by 50%.
For a 512 KB readahead, buffered I/O performance improves by 30% and direct I/O performance improves by 67%.
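A representative invocation with the new flag might look like this (a sketch only; `seekrandom`, `use_existing_db`, and `use_direct_reads` are pre-existing db_bench flags, and the exact parameters behind the numbers above are in the gist):

```
./db_bench --benchmarks=seekrandom --use_existing_db=true \
    --use_direct_reads=false --readahead_size=1048576
```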

**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- No new table readers are created for iterators when `ReadOptions.readahead_size` is set.
- At least `readahead_size` bytes are actually read on each iterator read.
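Outside the test harness, the same invariant can be checked with the public statistics API (a sketch; `NO_FILE_OPENS` is a real ticker and the statistics object would come from `rocksdb::CreateDBStatistics()` set on `Options::statistics`, but the helper itself is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/statistics.h"

// Hypothetical helper mirroring the test's assertion: creating an iterator
// with an explicit readahead_size should not open any additional table files.
void CheckNoExtraFileOpens(rocksdb::DB* db,
                           const std::shared_ptr<rocksdb::Statistics>& stats) {
  uint64_t opens_before = stats->getTickerCount(rocksdb::NO_FILE_OPENS);

  rocksdb::ReadOptions ro;
  ro.readahead_size = 64 * 1024;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
  }

  uint64_t opens_after = stats->getTickerCount(rocksdb::NO_FILE_OPENS);
  assert(opens_before == opens_after);  // no new table readers were created
}
```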

TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing `ReadaheadRandomAccessFile` later.
Pull Request resolved: #5246

Differential Revision: D15107946

Pulled By: sagar0

fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
sagar0 authored and facebook-github-bot committed Apr 27, 2019
1 parent 8c7eb59 commit 3548e42
Showing 8 changed files with 73 additions and 42 deletions.
1 change: 1 addition & 0 deletions HISTORY.md
@@ -7,6 +7,7 @@
* Block-based table index now contains exact highest key in the file, rather than an upper bound. This may improve Get() and iterator Seek() performance in some situations, especially when direct IO is enabled and block cache is disabled. A setting BlockBasedTableOptions::index_shortening is introduced to control this behavior. Set it to kShortenSeparatorsAndSuccessor to get the old behavior.
* When reading from option file/string/map, customized envs can be filled according to object registry.
* Add an option `snap_refresh_nanos` (default to 0.5s) to periodically refresh the snapshot list in compaction jobs. Assign to 0 to disable the feature.
+* Improve range scan performance when using explicit user readahead by not creating new table readers for every iterator.

### Public API Change
* Change the behavior of OptimizeForPointLookup(): move away from hash-based block-based-table index, and use whole key memtable filtering.
4 changes: 2 additions & 2 deletions db/db_iterator_test.cc
@@ -1943,8 +1943,8 @@ TEST_P(DBIteratorTest, ReadAhead) {
  delete iter;
  int64_t num_file_closes_readahead =
      TestGetTickerCount(options, NO_FILE_CLOSES);
-  ASSERT_EQ(num_file_opens + 3, num_file_opens_readahead);
-  ASSERT_EQ(num_file_closes + 3, num_file_closes_readahead);
+  ASSERT_EQ(num_file_opens, num_file_opens_readahead);
+  ASSERT_EQ(num_file_closes, num_file_closes_readahead);
  ASSERT_GT(bytes_read_readahead, bytes_read);
  ASSERT_GT(bytes_read_readahead, read_options.readahead_size * 3);

6 changes: 6 additions & 0 deletions db/db_test_util.h
@@ -438,6 +438,12 @@ class SpecialEnv : public EnvWrapper {
      return s;
    }

+    virtual Status Prefetch(uint64_t offset, size_t n) override {
+      Status s = target_->Prefetch(offset, n);
+      *bytes_read_ += n;
+      return s;
+    }
+
   private:
    std::unique_ptr<RandomAccessFile> target_;
    anon::AtomicCounter* counter_;
3 changes: 0 additions & 3 deletions db/table_cache.cc
@@ -213,9 +213,6 @@ InternalIterator* TableCache::NewIterator(
      readahead = env_options.compaction_readahead_size;
      create_new_table_reader = true;
    }
-  } else {
-    readahead = options.readahead_size;
-    create_new_table_reader = readahead > 0;
  }

  auto& fd = file_meta.fd;
11 changes: 8 additions & 3 deletions include/rocksdb/options.h
@@ -1131,9 +1131,14 @@ struct ReadOptions {
  // Default: nullptr
  const Slice* iterate_upper_bound;

-  // If non-zero, NewIterator will create a new table reader which
-  // performs reads of the given size. Using a large size (> 2MB) can
-  // improve the performance of forward iteration on spinning disks.
+  // RocksDB does auto-readahead for iterators on noticing more than two reads
+  // for a table file. The readahead starts at 8KB and doubles on every
+  // additional read up to 256KB.
+  // This option can help if most of the range scans are large, and if it is
+  // determined that a larger readahead than that enabled by auto-readahead is
+  // needed.
+  // Using a large readahead size (> 2MB) can typically improve the performance
+  // of forward iteration on spinning disks.
  // Default: 0
  size_t readahead_size;

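As an aside, the auto-readahead progression described in the comment above (8KB doubling up to 256KB) can be sketched as follows; the constants mirror `kInitAutoReadaheadSize` and `kMaxAutoReadaheadSize` from the reader changes below:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Prints the implicit auto-readahead schedule: 8, 16, 32, 64, 128, 256 KB.
int main() {
  size_t readahead = 8 * 1024;     // kInitAutoReadaheadSize
  const size_t kMax = 256 * 1024;  // kMaxAutoReadaheadSize
  while (readahead < kMax) {
    std::printf("%zu KB\n", readahead / 1024);
    readahead = std::min(kMax, readahead * 2);
  }
  std::printf("%zu KB (cap)\n", readahead / 1024);
  return 0;
}
```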
76 changes: 47 additions & 29 deletions table/block_based_table_reader.cc
@@ -2167,10 +2167,6 @@ BlockBasedTable::PartitionedIndexIteratorState::PartitionedIndexIteratorState(
      index_key_includes_seq_(index_key_includes_seq),
      index_key_is_full_(index_key_is_full) {}

-template <class TBlockIter, typename TValue>
-const size_t BlockBasedTableIterator<TBlockIter, TValue>::kMaxReadaheadSize =
-    256 * 1024;
-
InternalIteratorBase<BlockHandle>*
BlockBasedTable::PartitionedIndexIteratorState::NewSecondaryIterator(
    const BlockHandle& handle) {
@@ -2453,6 +2449,13 @@ void BlockBasedTableIterator<TBlockIter, TValue>::Prev() {
  FindKeyBackward();
}

+// Found that 256 KB readahead size provides the best performance, based on
+// experiments, for auto readahead. Experiment data is in PR #3282.
+template <class TBlockIter, typename TValue>
+const size_t
+    BlockBasedTableIterator<TBlockIter, TValue>::kMaxAutoReadaheadSize =
+        256 * 1024;
+
template <class TBlockIter, typename TValue>
void BlockBasedTableIterator<TBlockIter, TValue>::InitDataBlock() {
  BlockHandle data_block_handle = index_iter_->value();
@@ -2465,32 +2468,47 @@ void BlockBasedTableIterator<TBlockIter, TValue>::InitDataBlock() {
  }
  auto* rep = table_->get_rep();

-  // Automatically prefetch additional data when a range scan (iterator) does
-  // more than 2 sequential IOs. This is enabled only for user reads and when
-  // ReadOptions.readahead_size is 0.
-  if (!for_compaction_ && read_options_.readahead_size == 0) {
-    num_file_reads_++;
-    if (num_file_reads_ > 2) {
-      if (!rep->file->use_direct_io() &&
-          (data_block_handle.offset() +
-               static_cast<size_t>(data_block_handle.size()) +
-               kBlockTrailerSize >
-           readahead_limit_)) {
-        // Buffered I/O
-        // Discarding the return status of Prefetch calls intentionally, as we
-        // can fallback to reading from disk if Prefetch fails.
-        rep->file->Prefetch(data_block_handle.offset(), readahead_size_);
-        readahead_limit_ =
-            static_cast<size_t>(data_block_handle.offset() + readahead_size_);
-        // Keep exponentially increasing readahead size until
-        // kMaxReadaheadSize.
-        readahead_size_ = std::min(kMaxReadaheadSize, readahead_size_ * 2);
-      } else if (rep->file->use_direct_io() && !prefetch_buffer_) {
-        // Direct I/O
-        // Let FilePrefetchBuffer take care of the readahead.
-        prefetch_buffer_.reset(new FilePrefetchBuffer(
-            rep->file.get(), kInitReadaheadSize, kMaxReadaheadSize));
+  // Prefetch additional data for range scans (iterators). Enabled only for
+  // user reads.
+  // Implicit auto readahead:
+  //   Enabled after 2 sequential IOs when ReadOptions.readahead_size == 0.
+  // Explicit user requested readahead:
+  //   Enabled from the very first IO when ReadOptions.readahead_size is set.
+  if (!for_compaction_) {
+    if (read_options_.readahead_size == 0) {
+      // Implicit auto readahead
+      num_file_reads_++;
+      if (num_file_reads_ > kMinNumFileReadsToStartAutoReadahead) {
+        if (!rep->file->use_direct_io() &&
+            (data_block_handle.offset() +
+                 static_cast<size_t>(data_block_handle.size()) +
+                 kBlockTrailerSize >
+             readahead_limit_)) {
+          // Buffered I/O
+          // Discarding the return status of Prefetch calls intentionally, as
+          // we can fallback to reading from disk if Prefetch fails.
+          rep->file->Prefetch(data_block_handle.offset(), readahead_size_);
+          readahead_limit_ = static_cast<size_t>(data_block_handle.offset() +
+                                                 readahead_size_);
+          // Keep exponentially increasing readahead size until
+          // kMaxAutoReadaheadSize.
+          readahead_size_ =
+              std::min(kMaxAutoReadaheadSize, readahead_size_ * 2);
+        } else if (rep->file->use_direct_io() && !prefetch_buffer_) {
+          // Direct I/O
+          // Let FilePrefetchBuffer take care of the readahead.
+          prefetch_buffer_.reset(
+              new FilePrefetchBuffer(rep->file.get(), kInitAutoReadaheadSize,
+                                     kMaxAutoReadaheadSize));
        }
      }
+    } else if (!prefetch_buffer_) {
+      // Explicit user requested readahead
+      // The actual condition is:
+      // if (read_options_.readahead_size != 0 && !prefetch_buffer_)
+      prefetch_buffer_.reset(new FilePrefetchBuffer(
+          rep->file.get(), read_options_.readahead_size,
+          read_options_.readahead_size));
+    }
  }

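The buffering idea behind the explicit-readahead path can be illustrated with a simplified, self-contained sketch (this is not RocksDB's actual `FilePrefetchBuffer`; the class and its members are invented for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified model of a prefetch buffer: reads that fall inside the buffered
// window are served from memory; a miss triggers one large read of
// `readahead` bytes starting at the missed offset.
class SimplePrefetchBuffer {
 public:
  SimplePrefetchBuffer(const std::vector<char>& file, size_t readahead)
      : file_(file), readahead_(readahead), buffer_offset_(0) {}

  bool Read(size_t offset, size_t n, char* out) {
    if (offset + n > file_.size()) return false;  // past EOF
    if (offset < buffer_offset_ ||
        offset + n > buffer_offset_ + buffer_.size()) {
      // Miss: refill the window with one large read.
      size_t len = std::min(std::max(readahead_, n), file_.size() - offset);
      buffer_.assign(file_.begin() + offset, file_.begin() + offset + len);
      buffer_offset_ = offset;
    }
    std::memcpy(out, buffer_.data() + (offset - buffer_offset_), n);
    return true;
  }

 private:
  const std::vector<char>& file_;  // stands in for the underlying SST file
  size_t readahead_;               // fixed window size (explicit readahead)
  size_t buffer_offset_;           // file offset of buffer_[0]
  std::vector<char> buffer_;       // prefetched window
};
```

A sequential forward scan then pays roughly one file read per `readahead_size` bytes instead of one per block, which is where the buffered and direct I/O gains above come from.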
12 changes: 7 additions & 5 deletions table/block_based_table_reader.h
@@ -717,13 +717,15 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
  bool for_compaction_;
  BlockHandle prev_index_value_;

-  static const size_t kInitReadaheadSize = 8 * 1024;
+  // All the below fields control iterator readahead
+  static const size_t kInitAutoReadaheadSize = 8 * 1024;
  // Found that 256 KB readahead size provides the best performance, based on
-  // experiments.
-  static const size_t kMaxReadaheadSize;
-  size_t readahead_size_ = kInitReadaheadSize;
+  // experiments, for auto readahead. Experiment data is in PR #3282.
+  static const size_t kMaxAutoReadaheadSize;
+  static const int kMinNumFileReadsToStartAutoReadahead = 2;
+  size_t readahead_size_ = kInitAutoReadaheadSize;
  size_t readahead_limit_ = 0;
-  int num_file_reads_ = 0;
+  int64_t num_file_reads_ = 0;
  std::unique_ptr<FilePrefetchBuffer> prefetch_buffer_;
};

2 changes: 2 additions & 0 deletions tools/db_bench_tool.cc
@@ -1172,6 +1172,7 @@ DEFINE_int32(skip_list_lookahead, 0, "Used with skip_list memtablerep; try "
             "position");
DEFINE_bool(report_file_operations, false, "if report number of file "
            "operations");
+DEFINE_int32(readahead_size, 0, "Iterator readahead size");

static const bool FLAGS_soft_rate_limit_dummy __attribute__((__unused__)) =
    RegisterFlagValidator(&FLAGS_soft_rate_limit, &ValidateRateLimit);
@@ -4987,6 +4988,7 @@ void VerifyDBFromDB(std::string& truth_db_name) {
    options.total_order_seek = FLAGS_total_order_seek;
    options.prefix_same_as_start = FLAGS_prefix_same_as_start;
    options.tailing = FLAGS_use_tailing_iterator;
+    options.readahead_size = FLAGS_readahead_size;

    Iterator* single_iter = nullptr;
    std::vector<Iterator*> multi_iters;
