Improve execution payload retrying #5941
Conversation
…ool. Only retries a single block at a time and retries less often after repeated failures.
import tech.pegasys.teku.spec.logic.common.statetransition.results.BlockImportResult;
import tech.pegasys.teku.spec.util.DataStructureUtil;

class FailedExecutionPoolTest {
Check notice (Code scanning / CodeQL): Unused classes and interfaces
LGTM. With a nit and a reflection.
.map(
    error ->
        ExceptionUtil.hasCause(error, TimeoutException.class)
            || ExceptionUtil.hasCause(error, SocketTimeoutException.class))
Checking the timeout behaviour of the underlying OkHttp, I see there is another potential exception we can receive: InterruptedIOException (https://www.baeldung.com/okhttp-timeouts#call).
We should not hit it, because I don't think Web3j is setting the call timeout (in our custom restClient for the builder endpoint, we do).
So for completeness I'd suggest adding it. It might be good to have a hasCause accepting a varargs of classes and have the OR there.
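A varargs overload like the one suggested could look roughly like this. This is a hypothetical sketch, not Teku's actual ExceptionUtil (the class name here is changed to make that clear); it assumes only that the existing helper walks the cause chain for a single class.

```java
import java.util.Arrays;

// Hypothetical sketch of a varargs hasCause overload, as suggested in the
// review. Not Teku's real ExceptionUtil; names and behaviour are assumptions.
final class ExceptionUtilSketch {
  private ExceptionUtilSketch() {}

  // Varargs variant: true if the cause chain contains any of the given types.
  @SafeVarargs
  static boolean hasCause(
      final Throwable error, final Class<? extends Throwable>... causeTypes) {
    return Arrays.stream(causeTypes).anyMatch(type -> hasCause(error, type));
  }

  // Single-type variant: walk the cause chain looking for the given type.
  static boolean hasCause(
      final Throwable error, final Class<? extends Throwable> causeType) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (causeType.isInstance(t)) {
        return true;
      }
    }
    return false;
  }
}
```

The varargs overload delegates to the single-type one, so the OR lives in one place instead of being repeated at every call site.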
If I understand correctly, it could also be websockets and IPC there. For WebSockets it looks like the timeout will be an IOException. Anyway, this method is detached and could be improved later.
Hmm, interesting. The exception I got from OkHttp was an InterruptedIOException with a SocketTimeoutException as the cause. It's probably reasonable to look for InterruptedIOException directly though.
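One detail worth noting: java.net.SocketTimeoutException is a subclass of java.io.InterruptedIOException, so checking the cause chain for the parent type covers both shapes OkHttp can produce. A small illustration (the hasCause helper here is a local stand-in written for this example, not Teku's actual utility):

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Demonstrates that checking for InterruptedIOException also catches
// SocketTimeoutException, since the latter extends the former.
// hasCause is a local stand-in for Teku's ExceptionUtil.hasCause.
class TimeoutCheck {
  static boolean hasCause(final Throwable error, final Class<? extends Throwable> type) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (type.isInstance(t)) {
        return true;
      }
    }
    return false;
  }

  static boolean isTimeout(final Throwable error) {
    // One check covers both the direct InterruptedIOException case and the
    // SocketTimeoutException-as-cause case seen from OkHttp.
    return hasCause(error, InterruptedIOException.class);
  }
}
```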
private static final Logger LOG = LogManager.getLogger();
static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(30);
static final Duration SHORT_DELAY = Duration.ofSeconds(2);
private final Queue<SignedBeaconBlock> awaitingExecution = new ArrayBlockingQueue<>(10);
nit: I feel like naming it awaitingExecutionQueue would help readability.
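The PR description says the pool "retries less often after repeated failures", and the constants above bound that cadence between SHORT_DELAY and MAX_RETRY_DELAY. A minimal model of one plausible backoff (doubling on each failure is an assumption here; only the two bounds come from the snippet):

```java
import java.time.Duration;

// Toy model of the backoff implied by the constants above: delays grow after
// repeated failures, capped at MAX_RETRY_DELAY, and reset on success.
// The doubling step is an assumption, not taken from Teku's code.
class RetryDelayCalculator {
  static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(30);
  static final Duration SHORT_DELAY = Duration.ofSeconds(2);

  private Duration currentDelay = SHORT_DELAY;

  // Called after a failed retry; returns the delay before the next attempt.
  Duration onFailure() {
    currentDelay = min(currentDelay.multipliedBy(2), MAX_RETRY_DELAY);
    return currentDelay;
  }

  // A successful import resets the cadence back to the short delay.
  void onSuccess() {
    currentDelay = SHORT_DELAY;
  }

  private static Duration min(final Duration a, final Duration b) {
    return a.compareTo(b) <= 0 ? a : b;
  }
}
```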
@@ -74,8 +74,7 @@ public SafeFuture<BatchImportResult> importBatch(final Batch batch) {
         lastBlockImportResult -> {
           if (lastBlockImportResult.isSuccessful()) {
             return BatchImportResult.IMPORTED_ALL_BLOCKS;
-          } else if (lastBlockImportResult.getFailureReason()
-              == BlockImportResult.FailureReason.FAILED_EXECUTION_PAYLOAD_EXECUTION) {
+          } else if (lastBlockImportResult.hasFailedExecutingExecutionPayload()) {
So we do the batch retry (in 5s) also when we receive SYNCING but aren't in a position to optimistically import. There might be some misleading logs, though.
It also means that the batch retry will overlap with the FailedExecutionPool retry.
I also think that hasFailedExecutingExecutionPayload() is more like hasNotExecutedExecutionPayload().
Interestingly the BatchImporter never goes through the BlockManager, so when things fail here they never wind up in the FailedExecutionPool at all and it's just the batch sync process that does the retry (which appears to work very well).
In terms of naming, I think hasNotExecutedExecutionPayload sounds more like a SYNCING response where we'd optimistically import. It's a bit of a stretch that a SYNCING response counts as a failure in the period when we aren't allowed to optimistically sync, but it's still the best name I can think of...
Nice, I'm still waiting for the day when I stop confusing BlockManager and BlockImporter.
final SignedBeaconBlock nextBlock = awaitingExecution.remove();
awaitingExecution.add(block);
retryingBlock = Optional.of(nextBlock);
scheduleNextRetry(nextBlock);
Maybe modify the interface to scheduleNextRetry() (no param) and get retryingBlock on the first line of the method, to reduce complexity and avoid the possibility of extra mistakes.
Good idea.
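The rotation in the snippet above (take the next waiting block, push the current one to the back of the queue) can be modelled in isolation. This is a toy sketch with Strings standing in for SignedBeaconBlock and the actual scheduling omitted; it is not Teku's implementation.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Toy model of the FailedExecutionPool rotation: one block is retried at a
// time; on a non-timeout failure the current block moves to the back of the
// queue and the head of the queue becomes the retrying block.
class RetryQueue {
  private final Queue<String> awaitingExecutionQueue = new ArrayDeque<>();
  private Optional<String> retryingBlock = Optional.empty();

  void add(final String block) {
    if (retryingBlock.isEmpty()) {
      retryingBlock = Optional.of(block); // nothing in flight, retry this one
    } else {
      awaitingExecutionQueue.add(block);
    }
  }

  // Called when the current retry failed and we should move on to another
  // block; if nothing else is waiting we keep retrying the same block.
  void rotate() {
    final String current = retryingBlock.orElseThrow();
    if (awaitingExecutionQueue.isEmpty()) {
      return;
    }
    final String nextBlock = awaitingExecutionQueue.remove();
    awaitingExecutionQueue.add(current);
    retryingBlock = Optional.of(nextBlock);
  }

  Optional<String> currentBlock() {
    return retryingBlock;
  }
}
```

With blocks a, b, c added in order, repeated rotations cycle the retrying block a → b → c → a, which matches the fairness goal of retrying a single block at a time without abandoning the others.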
currentDelay = MAX_RETRY_DELAY;
}
if (awaitingExecution.isEmpty() || isTimeout(importResult)) {
  scheduleNextRetry(block);
So if we have a block with something like a resource-exhaustion attack, or any other kind of failure that looks like a timeout, we will continue to DoS our EL without any chance to swap the payload. Maybe it's better not to special-case timeouts, for safety and simplicity.
I can see the logic there, but that's not really how it works out. The EL won't stop executing a block when the CL times out, and ELs can generally only import one block at a time, so if we try to execute a different block it will just be queued behind the first one and also time out. Whereas if we execute the same block it can be easily deduplicated.
Building up a queue of other blocks to try increases memory usage and makes it even harder for the EL to continue. Plus, by retrying the same block we're most likely to get a cached response once the block finally does finish executing. We should be able to get that cached response even if we wind up executing another block in the meantime, but there will be limits to how many import results fit in a cache and how long they're held for.
Looking at the Geth code, it looks like it will not forget a payload check result even if we query something else afterwards (which sounds reasonable; we rarely build a cache for a single result). My point was about the case of self-recovery: when the client has an internal timeout for execution (though the Geth code doesn't contain one), or external self-recovery which kills and restarts the app when it shows no signs of life. If there exists a very bad payload with a very long execution, we could reproduce that EL torture again and again. But at that point it would definitely be a network-wide issue, not only Teku's.
But the EL can't really time out, or it could be rejecting an invalid block, so it has to keep processing until the block is done. We can't be sure whether it's still processing or not (the connection may have dropped without sending a proper FIN or something like that), but if the EL is alive it will have to keep processing until it completes the block. And if it does time out, it should remember it as an invalid block (otherwise in PoW it would execute it again when it next sees it on gossip).
So we can't make it any worse, but by retrying the same block we open up some simple optimisations for the EL to avoid building up a list of blocks to execute next.
And as you say, if it's feasible to create a block that takes that long to execute, then the entire network will be DoSed anyway.
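The outcome of this thread, as reflected in the earlier snippet (awaitingExecution.isEmpty() || isTimeout(importResult)), is a two-way decision: keep retrying the same block on timeout or when nothing else is waiting, otherwise rotate. A tiny sketch of just that branch; the method and enum names are illustrative, not Teku's:

```java
// Sketch of the retry decision discussed above. On a timeout the EL is
// probably still executing the block, so retrying the same block lets the
// EL deduplicate the request instead of queuing more work behind it.
class RetryStrategy {
  enum Decision {
    RETRY_SAME_BLOCK,
    ROTATE_TO_NEXT_BLOCK
  }

  static Decision onRetryFailed(final boolean queueEmpty, final boolean wasTimeout) {
    return (queueEmpty || wasTimeout)
        ? Decision.RETRY_SAME_BLOCK
        : Decision.ROTATE_TO_NEXT_BLOCK;
  }
}
```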
PR Description
Replaces ReexecutingExecutionPayloadBlockManager with FailedExecutionPool. This simplifies the design a bit by using delegation instead of inheritance but also changes the algorithm for retrying.

Fixed Issue(s)
fixes #5915
at least a start of #5914 and possibly enough...

Documentation
Add the doc-change-required label to this PR if updates are required.

Changelog