
Improve execution payload retrying #5941

Merged
7 commits merged on Jul 18, 2022

Conversation

ajsutton (Contributor)

PR Description

Replaces ReexecutingExecutionPayloadBlockManager with FailedExecutionPool. This simplifies the design a bit by using delegation instead of inheritance but also changes the algorithm for retrying:

  • Delay between retries now uses exponential back-off, up to a maximum of 30 seconds (a rough sketch of the retry policy follows this list)
  • Only a single block is re-executed at a time instead of all pending blocks
  • The same block is retried on timeout (assuming the EL will complete execution of that payload even though the call timed out, and will then be able to respond from a cached result)
  • A different block is retried on error or SYNCING, to give the best chance of finding a fork we can continue to follow and not get stuck on a side-fork that doesn't have EL data available or that the EL consistently hits an error importing
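
A rough sketch of the retry policy described above (illustrative only, not the actual FailedExecutionPool code; the class name, method names, and the String block type are placeholders, while the delay constants match the ones shown later in this PR):

import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Queue;

class RetryPolicySketch {
  static final Duration SHORT_DELAY = Duration.ofSeconds(2);
  static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(30);

  private final Queue<String> awaitingExecution = new ArrayDeque<>();
  private Duration currentDelay = SHORT_DELAY;

  // Exponential back-off: double the delay after each failed attempt, capped at 30 seconds.
  private void increaseDelay() {
    currentDelay = currentDelay.multipliedBy(2);
    if (currentDelay.compareTo(MAX_RETRY_DELAY) > 0) {
      currentDelay = MAX_RETRY_DELAY;
    }
  }

  // Pick the block to retry after a failed execution. Only one block is ever in flight.
  String selectNextBlock(final String failedBlock, final boolean wasTimeout) {
    increaseDelay();
    if (awaitingExecution.isEmpty() || wasTimeout) {
      // On timeout, retry the same block so the EL can answer from a cached result
      // once it eventually finishes executing the payload.
      return failedBlock;
    }
    // On error or SYNCING, rotate to a different pending block to avoid getting stuck
    // on a side-fork the EL cannot import.
    final String nextBlock = awaitingExecution.remove();
    awaitingExecution.add(failedBlock);
    return nextBlock;
  }
}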

Fixed Issue(s)

fixes #5915
at least a start of #5914 and possibly enough...

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • I thought about adding a changelog entry, and added one if I deemed it necessary.

import tech.pegasys.teku.spec.logic.common.statetransition.results.BlockImportResult;
import tech.pegasys.teku.spec.util.DataStructureUtil;

class FailedExecutionPoolTest {

Check notice (Code scanning / CodeQL): Unused classes and interfaces

Unused class: FailedExecutionPoolTest is not referenced within this codebase. If not used as an external API it should be removed.
@tbenr (Contributor) left a comment

LGTM. With a nit and a reflection.

.map(
error ->
ExceptionUtil.hasCause(error, TimeoutException.class)
|| ExceptionUtil.hasCause(error, SocketTimeoutException.class))
Contributor

Checking the timeout behaviour of the underlying OkHttp, I see there is another potential exception we can receive: InterruptedIOException (https://www.baeldung.com/okhttp-timeouts#call).
We should not hit that because I don't think Web3j is setting the call timeout. (In our custom restClient for the builder endpoint, we do.)
So for completeness I'd suggest adding it. It might be good to have a hasCause accepting a varargs of classes and have the OR there.
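
A sketch of the suggested varargs helper (hypothetical; not the existing ExceptionUtil API, just the shape of the idea):

import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

final class ExceptionUtilSketch {
  // Walks the cause chain and returns true if any cause matches one of the given types.
  @SafeVarargs
  static boolean hasCause(final Throwable error, final Class<? extends Throwable>... types) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      for (final Class<? extends Throwable> type : types) {
        if (type.isInstance(t)) {
          return true;
        }
      }
    }
    return false;
  }

  // Usage corresponding to the snippet above, with InterruptedIOException added.
  // (SocketTimeoutException is a subclass of InterruptedIOException, so listing both
  // is redundant but harmless.)
  static boolean isTimeout(final Throwable error) {
    return hasCause(
        error, TimeoutException.class, SocketTimeoutException.class, InterruptedIOException.class);
  }
}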

Contributor

If I understand correctly, it could also be websockets and IPC there. For websockets it looks like the timeout will be an IOException. Anyway, this method is detached and could be improved later.

Contributor Author

hmm, interesting. The exception I got from OkHttp was an InterruptedIOException with a SocketTimeoutException as the cause. Probably reasonable to look for InterruptedIOException directly though.

private static final Logger LOG = LogManager.getLogger();
static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(30);
static final Duration SHORT_DELAY = Duration.ofSeconds(2);
private final Queue<SignedBeaconBlock> awaitingExecution = new ArrayBlockingQueue<>(10);
Contributor

nit: I feel like naming it awaitingExecutionQueue would help readability

@@ -74,8 +74,7 @@ public SafeFuture<BatchImportResult> importBatch(final Batch batch) {
        lastBlockImportResult -> {
          if (lastBlockImportResult.isSuccessful()) {
            return BatchImportResult.IMPORTED_ALL_BLOCKS;
-          } else if (lastBlockImportResult.getFailureReason()
-              == BlockImportResult.FailureReason.FAILED_EXECUTION_PAYLOAD_EXECUTION) {
+          } else if (lastBlockImportResult.hasFailedExecutingExecutionPayload()) {
Contributor

So we do the batch retry (in 5s) also when we receive SYNCING but we aren't in a position to optimistically import. There might be some misleading logs, though.
It also means that the batch retry will overlap with the FailedExecutionPool retry.

I also think that hasFailedExecutingExecutionPayload() is more like hasNotExecutedExecutionPayload()

Contributor Author

Interestingly the BatchImporter never goes through the BlockManager so when things fail here they never wind up in the FailedExecutionPool at all and it's just the batch sync process that does the retry (which appears to work very well).

In terms of naming, I think hasNotExecutedExecutionPayload sounds more like a SYNCING response where we'd optimistically import. It's a little bit of a stretch that a SYNCING response is a failure in the period where we aren't allowed to optimistically sync, but it's the best name I can think of still...

Contributor

Nice, I'm still waiting for the day when I stop confusing BlockManager and BlockImporter

final SignedBeaconBlock nextBlock = awaitingExecution.remove();
awaitingExecution.add(block);
retryingBlock = Optional.of(nextBlock);
scheduleNextRetry(nextBlock);
@zilm13 (Contributor), Jul 18, 2022

Maybe modify the interface to scheduleNextRetry() (no param) and get retryingBlock on the first line of the method, to reduce complexity and avoid the possibility of extra mistakes
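
A rough sketch of the suggested shape (illustrative only, reusing the field names from the snippet above; not the actual FailedExecutionPool code, and the scheduling call itself is elided):

  private void swapAndRetry(final SignedBeaconBlock block) {
    final SignedBeaconBlock nextBlock = awaitingExecution.remove();
    awaitingExecution.add(block);
    retryingBlock = Optional.of(nextBlock);
    scheduleNextRetry();
  }

  // No parameter: the block to retry is read from retryingBlock, so the caller cannot
  // accidentally schedule a block that differs from the one recorded as retrying.
  private void scheduleNextRetry() {
    final SignedBeaconBlock nextBlock =
        retryingBlock.orElseThrow(() -> new IllegalStateException("No block selected for retry"));
    // ... schedule execution of nextBlock after currentDelay ...
  }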

Contributor Author

Good idea.

currentDelay = MAX_RETRY_DELAY;
}
if (awaitingExecution.isEmpty() || isTimeout(importResult)) {
scheduleNextRetry(block);
Contributor

So if we have a block with something like a resource-exhaustion attack, or any other kind of failure that looks like a timeout, we will continue to DoS our EL without any chance to swap the payload. Maybe it's better not to handle timeouts specially, for safety and simplicity.

Contributor Author

I can see the logic there, but that's not really how it works out. The EL won't stop executing a block when the CL times out, and ELs can generally only import one block at a time, so if we try to execute a different block it will just be queued behind the first one and also time out. Whereas if we execute the same block it can be easily deduplicated. Building up a queue of other blocks to try increases memory usage and makes it even harder for the EL to continue. Plus, by retrying the same block we're most likely to get a cached response once the block finally does finish executing. We should be able to get that cached response even if we wind up executing another block in the meantime, but there will be limits to how many import results fit in a cache and how long they're held for.

Contributor

Looking at the Geth code, it looks like it will not forget a payload check result even if we query something afterwards (and that sounds reasonable; we rarely build a cache for just one result). My point was about the case of self-recovery, when the client has an internal timeout for execution (though the Geth code doesn't contain one), or external self-recovery which kills and restarts the app when it shows no signs of life. If there exists a very bad payload with a very long execution time, we could reproduce the EL torture with it again and again. But it would definitely be a network issue at that point, not only for Teku.

Contributor Author

But the EL can't really time out, or it would effectively be rejecting the block as invalid, so it has to keep processing until the block is done. We can't be sure whether it's still processing or not (the connection may have dropped without sending a proper FIN or something like that), but if the EL is alive it will have to keep processing until it completes the block. And if it does time out it should remember it as an invalid block (otherwise in PoW it would execute it again when it next sees it on gossip).

So we can't make it any worse, but by retrying the same block we open up some simple optimisations for the EL to avoid building up a list of blocks to execute next.

And as you say, if it's feasible to create a block that takes that long to execute then the entire network will be DoSed anyway.


Successfully merging this pull request may close these issues.

Avoid unintended reexecution of expired blocks in ReexecutingExecutionPayloadBlockManager
3 participants