
HBASE-28184 Addendum PR #5521

Status: Open · wants to merge 2 commits into base: branch-2.5
@@ -259,10 +259,11 @@ private boolean readNextEntryAndRecordReaderPosition() throws IOException {
      Entry readEntry = reader.next();
      long readerPos = reader.getPosition();
      OptionalLong fileLength;
-     if (logQueue.getQueueSize(walGroupId) > 1) {
+     if (logQueue.getQueueSize(walGroupId) > 2) {
Contributor Author:
We check for less than or equal to 2 log files in the queue because it is possible that we are performing this check just as the WAL is rolled, and the reader has not yet read the trailer bytes of the old WAL.
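As a minimal, standalone sketch of the condition this addendum changes (the class and parameters below are stand-ins for illustration, not the real HBase objects):

```java
import java.util.OptionalLong;

// Standalone sketch of the queue-size check from the diff above. The real
// logic lives in WALEntryStream#readNextEntryAndRecordReaderPosition;
// queueSize and lengthIfBeingWritten stand in for the HBase objects.
public class QueueSizeCheck {
  static OptionalLong fileLength(int queueSize, OptionalLong lengthIfBeingWritten) {
    if (queueSize > 2) {
      // Deep queue: the current file was closed long ago, skip the length check.
      return OptionalLong.empty();
    }
    // 1 or 2 files queued: the old WAL's trailer may still be in flight,
    // so consult the length provider.
    return lengthIfBeingWritten;
  }

  public static void main(String[] args) {
    System.out.println(fileLength(3, OptionalLong.of(100)));
    System.out.println(fileLength(2, OptionalLong.of(100)));
  }
}
```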

Contributor Author:
TestBasicWALEntryStreamFSHLog#testCleanClosedWALs is failing in branch-2.5 with the original PR #5505.
This test is not failing on the master branch; there has been a lot of code refactoring in master/branch-2 compared to branch-2.5.
@Apache9 @sunhelly Can you please review? Thank you!

Contributor:
Changing to greater than 2 can fix the failing tests? A bit strange, could you please explain more on this?

Contributor:
Any updates here?

shahrs87 (Contributor Author) · Jan 17, 2024:
> Any updates here?

@Apache9 Sorry, I couldn't update this thread for a long time. I got distracted and it fell off my radar.

> Changing to greater than 2 can fix the failing tests? A bit strange, could you please explain more on this?

Actually, changing to greater than 2 fixes the failing test, but it looks like it is not the right fix.
The test is doing the following:

  1. Creating WAL named wal1
  2. Appending some entries to wal1
  3. Calling entryStream.next to read from wal1
  4. Roll the WAL to wal2
  5. Append some entries to wal2
  6. Call entryStream.next to read from wal2
  7. Test that there are NO uncleanlyClosedLogs metric.

The test is failing at #6 above. When it is calling entryStream.next on wal2, the replication code needs to switch the reader to the new WAL file. During rollWriter, we add it to AbstractFSWAL#inflightWALClosures map and close the old WAL file asynchronously here.

In closeWriter method, we append the trailer to the WAL and then close it.
During the closeWriter execution there will be 2 WALs in the logQueue.
Now, in the WALEntryStream#next method, after this change we don't read the file length if the logQueue size is greater than 1. Hence WALEntryStream is unaware of the trailer bytes, and while switching from wal1 to wal2 it gets the following exception:

2024-01-17T10:05:47,247 DEBUG [Listener at localhost/52964] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218
	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?]
	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?]
	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?]
	at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?]
	at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?]
	at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:726) ~[test-classes/:?]
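A toy illustration of why the file-length check matters here (not HBase code; it only mirrors the `fileLength.isPresent() && readerPos > fileLength.getAsLong()` condition at the end of the diff):

```java
import java.util.OptionalLong;

// Toy sketch: when the reader knows the writer's reported valid length, it can
// detect that bytes past that length (e.g. a trailer still being appended) are
// not yet safe to parse, and seek back. With an empty length it cannot.
public class TrailerBoundary {
  static boolean positionBeyondKnownLength(long readerPos, OptionalLong fileLength) {
    // mirrors: if (fileLength.isPresent() && readerPos > fileLength.getAsLong())
    return fileLength.isPresent() && readerPos > fileLength.getAsLong();
  }

  public static void main(String[] args) {
    // Reader advanced to 218 while the provider reports 210 valid bytes,
    // matching the positions in the EOFException above.
    System.out.println(positionBeyondKnownLength(218, OptionalLong.of(210)));
    // With an empty length (queue size > 1), the overrun is undetectable.
    System.out.println(positionBeyondKnownLength(218, OptionalLong.empty()));
  }
}
```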

I think I know how to fix this. From PR #5505, we have the below check:

    OptionalLong fileLength;
    if (logQueue.getQueueSize(walGroupId) > 1) {
      fileLength = OptionalLong.empty();
    } else {
      // if there is only one file in queue, check whether it is still being written to
      fileLength = walFileLengthProvider.getLogFileSizeIfBeingWritten(currentPath);
    }

Along with checking the queue size, we also have to check whether the currently replicated WAL is in the AbstractFSWAL#inflightWALClosures map; only if it is not there is it safe to skip reading the file size.
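A hypothetical, self-contained sketch of that proposed check (the inflight-closures lookup is NOT available to WALEntryStream today, which is exactly the open question below; the Set parameter stands in for AbstractFSWAL#inflightWALClosures):

```java
import java.util.OptionalLong;
import java.util.Set;

// Hypothetical sketch of the proposed fix: skip the length check only when the
// queue is deep AND the current WAL is not still closing (i.e. not in the
// inflight-closures map, whose trailer may not be fully written yet).
public class InflightAwareCheck {
  static OptionalLong fileLength(int queueSize, String currentPath,
      Set<String> inflightClosures, OptionalLong lengthIfBeingWritten) {
    if (queueSize > 1 && !inflightClosures.contains(currentPath)) {
      // Fully closed file with more files behind it: safe to skip the length.
      return OptionalLong.empty();
    }
    // File may still be written, or its trailer is pending: ask for the length.
    return lengthIfBeingWritten;
  }

  public static void main(String[] args) {
    Set<String> inflight = Set.of("wal1");
    // wal1 is still closing: consult the length provider even with 2 queued.
    System.out.println(fileLength(2, "wal1", inflight, OptionalLong.of(210)));
    // wal2 is not closing: the length check can be skipped.
    System.out.println(fileLength(2, "wal2", inflight, OptionalLong.of(210)));
  }
}
```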

But currently there is NO way to access the AbstractFSWAL object from WALEntryStream. I found one class named WALFileLengthProvider, but that is a functional interface and, as the name suggests, it only provides the log file size if that file is currently being written.

@Apache9 Any ideas on how can we expose AbstractFSWAL object to WALEntryStream?

Contributor Author:
@Apache9 Can you please take a look at my previous comment?

      fileLength = OptionalLong.empty();
    } else {
-     // if there is only one file in queue, check whether it is still being written to
+     // if there are less than or equal 2 files in the queue,
+     // check whether it is still being written to.
      fileLength = walFileLengthProvider.getLogFileSizeIfBeingWritten(currentPath);
    }
    if (fileLength.isPresent() && readerPos > fileLength.getAsLong()) {