Fix broken error handling around Idempotent producer + Ensure strict ordering when Net.MaxOpenRequests = 1 #2943

Open · wants to merge 14 commits into base: main

Conversation

richardartoul (Contributor)

There have been numerous issues filed lately reporting that:

  1. Strict ordering no longer works after request pipelining was introduced, even when Net.MaxOpenRequests is set to 1.
  2. Requests can still end up failing or being mis-sequenced when the idempotent producer is enabled.

I was able to reproduce these issues both locally and in a staging environment, and this PR fixes both of them. The PR has many comments explaining the changes.

Relevant issues:

  1. AsyncProducer produces messages in out-of-order when retries happen #2619
  2. Sarama Async Producer Encounters 'Out of Order' Error: what are the reasons? #2803
  3. Does sarama still guarantee message ordering? #2860
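
For context, a minimal configuration sketch of the setup the issues above describe (idempotent producer with Net.MaxOpenRequests = 1). This mirrors Sarama's documented requirements for idempotent production; the broker address is a placeholder.

package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	config := sarama.NewConfig()
	config.Version = sarama.V0_11_0_0                // idempotent producer requires Kafka >= 0.11
	config.Producer.Idempotent = true                // enable the idempotent producer
	config.Producer.RequiredAcks = sarama.WaitForAll // required when Idempotent is enabled
	config.Producer.Retry.Max = 5
	config.Net.MaxOpenRequests = 1 // the strict-ordering setting discussed in this PR

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
}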

Signed-off-by: Richard Artoul <richardartoul@gmail.com>
richardartoul (Contributor, Author) commented Jul 20, 2024

Hmm... I think it may still be possible for concurrent requests to be issued: sendResponse writes to a channel, and once the response is picked up off that channel there could be an in-flight request for the next batch plus an in-flight retry. I'm not sure how to resolve that other than somehow "waiting" for a response to be fully processed (either failing as an error or being retried).
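
A minimal sketch of the "waiting" idea described above, not Sarama's actual internals; the pendingResponse type and the channel names are hypothetical and purely illustrative.

package main

import "fmt"

// pendingResponse is a hypothetical wrapper that lets the sender block until
// the receiver has finished processing (retrying or failing) the response.
type pendingResponse struct {
	payload string
	done    chan struct{} // closed once the response has been fully handled
}

func main() {
	responses := make(chan *pendingResponse)

	// Receiver: processes each response, then signals completion.
	go func() {
		for resp := range responses {
			fmt.Println("processing:", resp.payload)
			close(resp.done) // mark the response as fully processed
		}
	}()

	// Sender: does not issue the next request until the previous
	// response has been fully handled, preserving strict ordering.
	for i := 0; i < 3; i++ {
		resp := &pendingResponse{
			payload: fmt.Sprintf("response %d", i),
			done:    make(chan struct{}),
		}
		responses <- resp
		<-resp.done // wait before issuing the next request
	}
	close(responses)
}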

puellanivis (Contributor) left a comment

I’m somewhat concerned by the number of panic()s in the code. Are these really so serious that they need to potentially crash the whole program, and/or kill off a processing goroutine without any information up the chain that processing has terminated?

@@ -249,6 +250,19 @@ func (pe ProducerError) Unwrap() error {
type ProducerErrors []*ProducerError

func (pe ProducerErrors) Error() string {
	if len(pe) > 0 {

Should this ever actually be produced with zero messages?

If it’s unlikely to ever happen with len(pe) == 0, then that should be the guard condition, and the complex error message should be on the unindented path.
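
A sketch of the guard-first structure being suggested, not the PR's actual change; the error strings are placeholders, and it assumes the enclosing file already imports "fmt".

func (pe ProducerErrors) Error() string {
	// Unlikely case first, so the detailed message stays on the unindented path.
	if len(pe) == 0 {
		return "kafka: no producer errors"
	}

	// Common path at function-body level (wording here is a placeholder).
	return fmt.Sprintf("kafka: Failed to deliver %d messages, first error: %v", len(pe), pe[0].Err)
}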

@@ -695,6 +709,9 @@ func (pp *partitionProducer) dispatch() {
// All messages being retried (sent or not) have already had their retry count updated
// Also, ignore "special" syn/fin messages used to sync the brokerProducer and the topicProducer.
if pp.parent.conf.Producer.Idempotent && msg.retries == 0 && msg.flags == 0 {
	if msg.hasSequence {
		panic("assertion failure: reassigning producer epoch and sequence number to message that already has them")

https://go.dev/wiki/CodeReviewComments#dont-panic

Is the condition here so bad that we need to panic? (That is, is it entirely unrecoverable?)
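
One non-fatal alternative, sketched only to illustrate the point. It assumes pp.parent.returnError (the async producer's per-message error path) is usable here and that the "errors" package is imported; the error value is a placeholder.

// Log the violated invariant and fail only this message instead of crashing the process.
if msg.hasSequence {
	err := errors.New("sarama: internal error: message already has a producer epoch/sequence assigned")
	Logger.Println(err)
	pp.parent.returnError(msg, err) // surfaces on the Errors() channel instead of panicking
	continue
}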

Comment on lines +89 to +103
Logger.Println(
"assertion failed: message out of sequence added to batch",
"producer_id",
ps.producerID,
set.recordsToSend.RecordBatch.ProducerID,
"producer_epoch",
ps.producerEpoch,
set.recordsToSend.RecordBatch.ProducerEpoch,
"sequence_number",
msg.sequenceNumber,
set.recordsToSend.RecordBatch.FirstSequence,
"buffer_count",
ps.bufferCount,
"msg_has_sequence",
msg.hasSequence)

I would recommend leaving the log message on the same line as the method call, so that it’s easily findable via grep; it also isolates well on its own line.

Additionally, if the line is long enough to break up, then the likelihood of adding even more fields later is high, so each entry should end with a comma and a newline; that way, adding new fields to the end of the call doesn’t produce unnecessary line changes whose only content is punctuation required by the syntax.

Then I would pair up each log field name with the log field value, all together:

			Logger.Println("assertion failed: message out of sequence added to batch",
				"producer_id", ps.producerID, set.recordsToSend.RecordBatch.ProducerID,
				"producer_epoch", ps.producerEpoch, set.recordsToSend.RecordBatch.ProducerEpoch,
				"sequence_number", msg.sequenceNumber, set.recordsToSend.RecordBatch.FirstSequence,
				"buffer_count", ps.bufferCount,
				"msg_has_sequence", msg.hasSequence,
			)

}

if !succeeded {
	Logger.Printf("Failed retrying batch for %v-%d because of %v while looking up for new leader, no more retries\n", topic, partition)

[nitpick] Newlines at the end should be unnecessary for loggers? (I mean, this is generally the case, but I don’t know if that is specifically true here.)

Three % verbs are specified but only two arguments are given.
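
A corrected call would pass one argument per verb; the name of the error variable (err here) is an assumption about the surrounding code, and the trailing "\n" is kept since whether it is needed depends on the configured sarama.Logger.

Logger.Printf("Failed retrying batch for %v-%d because of %v while looking up for new leader, no more retries\n", topic, partition, err)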

// as expected. This retry loop is very important since prematurely (and unnecessarily) failing
// an idempotent batch is ~equivalent to data loss.
succeeded := false
for i := 0; i < p.conf.Producer.Retry.Max; i++ {

Suggest using a different variable name if we’re counting retries/tries rather than indices.

[off-by-one smell] Are we counting retries, or tries? That is, if I’ve asked for 5 retries max, then that’s 6 total tries.
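
A sketch of the distinction, assuming Retry.Max is meant to count retries after the initial attempt; tryPublish is a hypothetical stand-in for the real retry body.

// Retry.Max = 5 then means 1 initial try plus 5 retries (6 total attempts),
// so the loop runs Max+1 times and the counter is named for attempts.
succeeded := false
for attempt := 0; attempt <= p.conf.Producer.Retry.Max; attempt++ {
	if tryPublish() {
		succeeded = true
		break
	}
}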
