Skip to content

Commit

Permalink
[hotfix][docs] Clarify semantic of tolerable checkpoint failure number
Browse files Browse the repository at this point in the history
  • Loading branch information
pnowojski committed Mar 18, 2022
1 parent ab08b52 commit e9028ca
Show file tree
Hide file tree
Showing 4 changed files with 14 additions and 5 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,8 @@ Checkpoint 其他的属性包括:

- *checkpoint 可容忍连续失败次数*:该属性定义可容忍多少次连续的 checkpoint 失败。超过这个阈值之后会触发作业错误 fail over。
默认次数为“0”,这意味着不容忍 checkpoint 失败,作业将在第一次 checkpoint 失败时fail over。
可容忍的checkpoint失败仅适用于下列情形:Job Manager的IOException,TaskManager做checkpoint时异步部分的失败,
checkpoint超时等。TaskManager做checkpoint时同步部分的失败会直接触发作业fail over。其它的checkpoint失败(如一个checkpoint被另一个checkpoint包含)会被忽略掉。

- *并发 checkpoint 的数目*: 默认情况下,在上一个 checkpoint 未完成(失败或者成功)的情况下,系统不会触发另一个 checkpoint。这确保了拓扑不会在 checkpoint 上花费太多时间,从而影响正常的处理流程。
不过允许多个 checkpoint 并行进行是可行的,对于有确定的处理延迟(例如某方法所调用比较耗时的外部服务),但是仍然想进行频繁的 checkpoint 去最小化故障后重跑的 pipelines 来说,是有意义的。
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -67,9 +67,13 @@ Other parameters for checkpointing include:

Note that this value also implies that the number of concurrent checkpoints is *one*.

- *tolerable checkpoint failure number*: This defines how many consecutive checkpoint failures will be tolerated,
before the whole job is failed over. The default value is `0`, which means no checkpoint failures will be tolerated,
and the job will fail on first reported checkpoint failure.
- *tolerable checkpoint failure number*: This defines how many consecutive checkpoint failures will
be tolerated, before the whole job is failed over. The default value is `0`, which means no
checkpoint failures will be tolerated, and the job will fail on first reported checkpoint failure.
This only applies to the following failure reasons: IOException on the Job Manager, failures in
the async phase on the Task Managers and checkpoint expiration due to a timeout. Failures
originating from the sync phase on the Task Managers are always forcing failover of an affected
task. Other types of checkpoint failures (such as checkpoint being subsumed) are being ignored.

- *number of concurrent checkpoints*: By default, the system will not trigger another checkpoint while one is still in progress.
This ensures that the topology does not spend too much time on checkpoints and not make progress with processing the streams.
Expand Down
2 changes: 1 addition & 1 deletion docs/static/generated/rest_v1_dispatcher.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ info:
license:
name: Apache 2.0
url: https://www.apache.org/licenses/LICENSE-2.0.html
version: v1/1.15-SNAPSHOT
version: v1/1.16-SNAPSHOT
paths:
/cluster:
delete:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,10 @@ public class ExecutionCheckpointingOptions {
.noDefaultValue()
.withDescription(
"The tolerable checkpoint consecutive failure number. If set to 0, that means "
+ "we do not tolerance any checkpoint failure.");
+ "we do not tolerance any checkpoint failure. This only applies to the following failure reasons: IOException on the "
+ "Job Manager, failures in the async phase on the Task Managers and checkpoint expiration due to a timeout. Failures "
+ "originating from the sync phase on the Task Managers are always forcing failover of an affected task. Other types of "
+ "checkpoint failures (such as checkpoint being subsumed) are being ignored.");

public static final ConfigOption<CheckpointConfig.ExternalizedCheckpointCleanup>
EXTERNALIZED_CHECKPOINT =
Expand Down

0 comments on commit e9028ca

Please sign in to comment.