Improve execution payload retrying #5941
Conversation
…ool. Only retries a single block at a time and retries less often after repeated failures.
import tech.pegasys.teku.spec.logic.common.statetransition.results.BlockImportResult;
import tech.pegasys.teku.spec.util.DataStructureUtil;

class FailedExecutionPoolTest {
Check notice (Code scanning / CodeQL): Unused classes and interfaces
LGTM. With a nit and a reflection.
.map(
    error ->
        ExceptionUtil.hasCause(error, TimeoutException.class)
            || ExceptionUtil.hasCause(error, SocketTimeoutException.class))
Checking the timeout behaviour of the underlying OkHttp, I see there is another potential exception we can receive: InterruptedIOException (https://www.baeldung.com/okhttp-timeouts#call).
We should not hit it, because I don't think Web3j is setting the call timeout (in our custom restClient for the builder endpoint, we do).
So for completeness I'd suggest adding it. It might be good to have a hasCause accepting a varargs of classes and have the OR there.
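A varargs overload like the one suggested could look roughly like this. This is a hypothetical sketch, not Teku's actual ExceptionUtil (the class name here is changed to make that clear); it assumes only that the existing helper walks the cause chain for a single class.

```java
import java.util.Arrays;

// Hypothetical sketch of a varargs hasCause overload, as suggested in the
// review. Not Teku's real ExceptionUtil; names and behaviour are assumptions.
final class ExceptionUtilSketch {
  private ExceptionUtilSketch() {}

  // Varargs variant: true if the cause chain contains any of the given types.
  @SafeVarargs
  static boolean hasCause(
      final Throwable error, final Class<? extends Throwable>... causeTypes) {
    return Arrays.stream(causeTypes).anyMatch(type -> hasCause(error, type));
  }

  // Single-type variant: walk the cause chain looking for the given type.
  static boolean hasCause(
      final Throwable error, final Class<? extends Throwable> causeType) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (causeType.isInstance(t)) {
        return true;
      }
    }
    return false;
  }
}
```

The varargs overload delegates to the single-type one, so the OR lives in one place instead of being repeated at every call site.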
If I understand correctly, it could also be websockets and IPC there. For WebSockets it looks like the timeout will be an IOException. Anyway, this method is detached and could be improved later.
Hmm, interesting. The exception I got from OkHttp was an InterruptedIOException with a SocketTimeoutException as the cause. It's probably reasonable to look for InterruptedIOException directly though.
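One detail worth noting: java.net.SocketTimeoutException is a subclass of java.io.InterruptedIOException, so checking the cause chain for the parent type covers both shapes OkHttp can produce. A small illustration (the hasCause helper here is a local stand-in written for this example, not Teku's actual utility):

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Demonstrates that checking for InterruptedIOException also catches
// SocketTimeoutException, since the latter extends the former.
// hasCause is a local stand-in for Teku's ExceptionUtil.hasCause.
class TimeoutCheck {
  static boolean hasCause(final Throwable error, final Class<? extends Throwable> type) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (type.isInstance(t)) {
        return true;
      }
    }
    return false;
  }

  static boolean isTimeout(final Throwable error) {
    // One check covers both the direct InterruptedIOException case and the
    // SocketTimeoutException-as-cause case seen from OkHttp.
    return hasCause(error, InterruptedIOException.class);
  }
}
```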
private static final Logger LOG = LogManager.getLogger();
static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(30);
static final Duration SHORT_DELAY = Duration.ofSeconds(2);
private final Queue<SignedBeaconBlock> awaitingExecution = new ArrayBlockingQueue<>(10);
nit: I feel like naming it awaitingExecutionQueue would help readability.
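The PR description says the pool "retries less often after repeated failures", and the constants above bound that cadence between SHORT_DELAY and MAX_RETRY_DELAY. A minimal model of one plausible backoff (doubling on each failure is an assumption here; only the two bounds come from the snippet):

```java
import java.time.Duration;

// Toy model of the backoff implied by the constants above: delays grow after
// repeated failures, capped at MAX_RETRY_DELAY, and reset on success.
// The doubling step is an assumption, not taken from Teku's code.
class RetryDelayCalculator {
  static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(30);
  static final Duration SHORT_DELAY = Duration.ofSeconds(2);

  private Duration currentDelay = SHORT_DELAY;

  // Called after a failed retry; returns the delay before the next attempt.
  Duration onFailure() {
    currentDelay = min(currentDelay.multipliedBy(2), MAX_RETRY_DELAY);
    return currentDelay;
  }

  // A successful import resets the cadence back to the short delay.
  void onSuccess() {
    currentDelay = SHORT_DELAY;
  }

  private static Duration min(final Duration a, final Duration b) {
    return a.compareTo(b) <= 0 ? a : b;
  }
}
```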
@@ -74,8 +74,7 @@ public SafeFuture<BatchImportResult> importBatch(final Batch batch) {
         lastBlockImportResult -> {
           if (lastBlockImportResult.isSuccessful()) {
             return BatchImportResult.IMPORTED_ALL_BLOCKS;
-          } else if (lastBlockImportResult.getFailureReason()
-              == BlockImportResult.FailureReason.FAILED_EXECUTION_PAYLOAD_EXECUTION) {
+          } else if (lastBlockImportResult.hasFailedExecutingExecutionPayload()) {
So we do the batch retry (in 5s) also when we receive SYNCING but aren't in a position to optimistically import. There might be some misleading logs, though.
It also means that the batch retry will overlap with the FailedExecutionPool retry.
I also think that hasFailedExecutingExecutionPayload() is more like hasNotExecutedExecutionPayload().
Interestingly the BatchImporter never goes through the BlockManager, so when things fail here they never wind up in the FailedExecutionPool at all and it's just the batch sync process that does the retry (which appears to work very well).
In terms of naming, I think hasNotExecutedExecutionPayload sounds more like a SYNCING response where we'd optimistically import. It's a bit of a stretch that a SYNCING response counts as a failure in the period when we aren't allowed to optimistically sync, but it's still the best name I can think of...
Nice, I'm still waiting for the day when I stop confusing BlockManager and BlockImporter.
final SignedBeaconBlock nextBlock = awaitingExecution.remove();
awaitingExecution.add(block);
retryingBlock = Optional.of(nextBlock);
scheduleNextRetry(nextBlock);
Maybe modify the interface to scheduleNextRetry() (no param) and get retryingBlock on the first line of the method, to reduce complexity and avoid the possibility of extra mistakes.
Good idea.
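The rotation in the snippet above (take the next waiting block, push the current one to the back of the queue) can be modelled in isolation. This is a toy sketch with Strings standing in for SignedBeaconBlock and the actual scheduling omitted; it is not Teku's implementation.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Toy model of the FailedExecutionPool rotation: one block is retried at a
// time; on a non-timeout failure the current block moves to the back of the
// queue and the head of the queue becomes the retrying block.
class RetryQueue {
  private final Queue<String> awaitingExecutionQueue = new ArrayDeque<>();
  private Optional<String> retryingBlock = Optional.empty();

  void add(final String block) {
    if (retryingBlock.isEmpty()) {
      retryingBlock = Optional.of(block); // nothing in flight, retry this one
    } else {
      awaitingExecutionQueue.add(block);
    }
  }

  // Called when the current retry failed and we should move on to another
  // block; if nothing else is waiting we keep retrying the same block.
  void rotate() {
    final String current = retryingBlock.orElseThrow();
    if (awaitingExecutionQueue.isEmpty()) {
      return;
    }
    final String nextBlock = awaitingExecutionQueue.remove();
    awaitingExecutionQueue.add(current);
    retryingBlock = Optional.of(nextBlock);
  }

  Optional<String> currentBlock() {
    return retryingBlock;
  }
}
```

With blocks a, b, c added in order, repeated rotations cycle the retrying block a → b → c → a, which matches the fairness goal of retrying a single block at a time without abandoning the others.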
currentDelay = MAX_RETRY_DELAY;
}
if (awaitingExecution.isEmpty() || isTimeout(importResult)) {
  scheduleNextRetry(block);
So if we have a block with something like a resource-exhaustion attack, or any other kind of failure that looks like a timeout, we will continue to DoS our EL without any chance to swap the payload. Maybe it's better not to special-case timeouts, for safety and simplicity.
I can see the logic there, but that's not really how it works out. The EL won't stop executing a block when the CL times out, and ELs can generally only import one block at a time, so if we try to execute a different block it will just be queued behind the first one and also time out. Whereas if we execute the same block it can be easily deduplicated.
Building up a queue of other blocks to try increases memory usage and makes it even harder for the EL to continue. Plus, by retrying the same block we're most likely to get a cached response once the block finally does finish executing. We should be able to get that cached response even if we wind up executing another block in the meantime, but there will be limits to how many import results fit in a cache and how long they're held for.
Looking at the Geth code, it looks like it will not forget a payload check result even if we query something else afterwards (which sounds reasonable; we rarely build a cache for a single result). My point was about the case of self-recovery: when the client has an internal timeout for execution (though the Geth code doesn't contain one), or external self-recovery which kills and restarts the app when it shows no signs of life. If there exists a very bad payload with a very long execution, we could reproduce that EL torture again and again. But at that point it would definitely be a network-wide issue, not only Teku's.
But the EL can't really time out, or it could be rejecting an invalid block, so it has to keep processing until the block is done. We can't be sure whether it's still processing or not (the connection may have dropped without sending a proper FIN or something like that), but if the EL is alive it will have to keep processing until it completes the block. And if it does time out, it should remember it as an invalid block (otherwise in PoW it would execute it again when it next sees it on gossip).
So we can't make it any worse, but by retrying the same block we open up some simple optimisations for the EL to avoid building up a list of blocks to execute next.
And as you say, if it's feasible to create a block that takes that long to execute, then the entire network will be DoSed anyway.
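The outcome of this thread, as reflected in the earlier snippet (awaitingExecution.isEmpty() || isTimeout(importResult)), is a two-way decision: keep retrying the same block on timeout or when nothing else is waiting, otherwise rotate. A tiny sketch of just that branch; the method and enum names are illustrative, not Teku's:

```java
// Sketch of the retry decision discussed above. On a timeout the EL is
// probably still executing the block, so retrying the same block lets the
// EL deduplicate the request instead of queuing more work behind it.
class RetryStrategy {
  enum Decision {
    RETRY_SAME_BLOCK,
    ROTATE_TO_NEXT_BLOCK
  }

  static Decision onRetryFailed(final boolean queueEmpty, final boolean wasTimeout) {
    return (queueEmpty || wasTimeout)
        ? Decision.RETRY_SAME_BLOCK
        : Decision.ROTATE_TO_NEXT_BLOCK;
  }
}
```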
PR Description
Replaces ReexecutingExecutionPayloadBlockManager with FailedExecutionPool. This simplifies the design a bit by using delegation instead of inheritance but also changes the algorithm for retrying.

Fixed Issue(s)
fixes #5915
at least a start of #5914 and possibly enough...

Documentation
Add the doc-change-required label to this PR if updates are required.

Changelog