
[BEAM-2660] Set PubsubIO batch size using builder #3619

Merged
5 commits merged on Aug 9, 2018

Conversation

cjmcgraw

BEAM-2660 asks for the ability to control the batch size through the PubsubIO.Write builder.

This PR adds two values configurable through the PubsubIO.Write builder:

  • maxBatchSize - controls the maximum number of messages in a single bulk publish request
  • maxBatchByteSize - controls the maximum total payload size, in bytes, of a single bulk publish request

This PR also modifies PubsubIO.Write.PubsubBoundedWriter: the writer now tracks the total number of bytes buffered across all pending messages, and if adding a message would exceed the threshold, it publishes the current batch before buffering more messages.

If a single message exceeds maxBatchByteSize, an exception is thrown.

An example use of the new parameters:

PubsubIO.writeMessages()
    .withMaxBatchSize(100)
    .withMaxBatchByteSize(100000)
    .to("my-topic")

@reuvenlax
Contributor

R: @reuvenlax

@coveralls

Coverage Status

Coverage decreased (-0.6%) to 69.968% when pulling c2abeb9 on cjmcgraw:update-pubsubIO into f398748 on apache:master.

@jkff
Contributor

jkff commented Dec 16, 2017

Is this PR still relevant, @reuvenlax?

@cjmcgraw
Author

I am still interested in merging this pull request if possible. My company still has a use case for Beam with tuples exceeding the default byte size.

@BenFradet

Any updates on moving this forward?

@kennknowles
Member

We have turned on autoformatting of the codebase, which causes small conflicts across the board. You can probably safely rebase and just keep your changes. Like this:

$ git rebase
... see some conflicts
$ git diff
... confirmed that the conflicts are just autoformatting
... so we can just keep our changes and do our own autoformat
$ git checkout --theirs --
$ git add -u
$ git rebase --continue
$ ./gradlew spotlessJavaApply

Please ping me if you run into any difficulty.

@@ -732,9 +734,20 @@ private PubsubIO() {}
/** Implementation of {@link #write}. */
@AutoValue
public abstract static class Write<T> extends PTransform<PCollection<T>, PDone> {
private static final int MAX_PUBLISH_BATCH_BYTE_SIZE_DEFAULT = 1000000;
Contributor

I think, more precisely, it should be 10 MB, i.e. 10 * 1024 * 1024 bytes.
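
In code, the suggestion amounts to changing the default along these lines (a sketch of the reviewer's suggestion, not necessarily the value that was merged):

// Suggested default per the comment above: 10 MB rather than 1,000,000 bytes.
private static final int MAX_PUBLISH_BATCH_BYTE_SIZE_DEFAULT = 10 * 1024 * 1024;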

@aromanenko-dev
Contributor

@reuvenlax @jkff could you take a look at this one?
This PR looks good to me and I would like to have it merged. I have updated it to resolve the merge conflicts.

@reuvenlax
Contributor

@aromanenko-dev which runner do you need this for?

@aromanenko-dev
Contributor

aromanenko-dev commented Jul 19, 2018

@reuvenlax Hmm, I guess it should work regardless of which runner is used. No?

@cjmcgraw
Author

@aromanenko-dev Pub/Sub is Google Cloud specific, but this change is not runner specific.

@reuvenlax
Contributor

@aromanenko-dev it somewhat is. It turns out that Dataflow has its own implementation of the Pub/Sub source, so this PR will not change any behavior for Dataflow - only for non-Dataflow runners. I believe this PR is still a good one, but I want to make sure that you know it will not affect the Dataflow runner.

@aromanenko-dev
Contributor

Run Dataflow ValidatesRunner

@dadrian

dadrian commented Aug 7, 2018

It turns out that Dataflow has its own implementation of the Pub/Sub source

What? For one, this PR doesn't touch the source, just the sink. Second, if that's the case, how do we get this fixed in the Dataflow runner? I currently have code running in prod that rolls its own Pub/Sub client to compensate for this size limitation, and I'd really like to get rid of it.

@reuvenlax
Contributor

@dadrian true of both the source and the sink, at least for Dataflow streaming. Dataflow's batch runner does use this code.

I can go ahead and merge this PR. If you have a use case where this is needed for Dataflow streaming, you will need to contact Google with a bug report.

@aromanenko-dev
Contributor

aromanenko-dev commented Aug 8, 2018

@cjmcgraw

pubsub is google cloud specific. But this change is not runner specific

Yes, that is why I was wondering how it's related to any specific runner, and @reuvenlax explained that the Dataflow runner happens to have its own implementation of Pub/Sub support.

@reuvenlax

however I want to make sure that you know it will not affect the Dataflow runner.

As @dadrian mentioned above, this PR affects only the sink part of PubsubIO, not the source. To be honest, I don't know whether the Dataflow runner uses the PubsubIO sink or not (I'm not familiar with this part of the code), so I can't guarantee this.
Do you think that running the Dataflow_ValidatesRunner job is not enough? Do we need to run any other tests for that?

In general, this LGTM except for the concerns about the Dataflow runner raised below. If it's covered by the Dataflow_ValidatesRunner job (and that passed), then I'd like to have this merged.


if (output.size() >= MAX_PUBLISH_BATCH_SIZE) {
if (payload.length > maxPublishBatchByteSize) {
Contributor

I think it's better to default to success here and simply publish this message. Otherwise this could be considered a backwards-incompatible change, as it changes IO semantics.

@reuvenlax
Contributor

I have one requested change on the code to make it backwards compatible, and then I will merge this PR. I just wanted to make sure @cjmcgraw understood that this PR will not affect the behavior of Dataflow streaming (in general, runners are allowed to replace IOs with their own internal versions).

@cjmcgraw
Author

cjmcgraw commented Aug 8, 2018

Currently my company is using this in batch mode for loading prediction tuples quickly. We are running it on Dataflow as we speak, and have been since this fork was created. Our use case most likely won't need streaming, so the change works for my problem.

That being said, I am not fully grokking the issue here. I'd like to get some clarity in case someone stumbles across this in the future.

@dadrian

What? For one, this PR doesn't touch the source, just the sink. Second, if that's the case, how do we get this fixed in the Dataflow runner? I currently have code running in prod that rolls its own Pub/Sub client to compensate for this size limitation, and I'd really like to get rid of it.

@reuvenlax

@dadrian true of both the source and the sink, at least for Dataflow streaming. Dataflow's batch runner does use this code.

@aromanenko-dev

Yes, that is why I was wondering how it's related to any specific runner, and @reuvenlax explained that the Dataflow runner happens to have its own implementation of Pub/Sub support.

If I recall correctly, the limitation with the sink was that it was using the gcloud SDK to submit a gRPC request. There was a hard-coded default for the maximum number of bytes that one bulk request could contain. I simply made that hard-coded value configurable.

Since the implementation was in the builder for the sink, I applied the values to both the bounded and unbounded sinks.

The source request didn't have a maximum message size API parameter, so that limit will be enforced by Pub/Sub instead of Beam.

If I am understanding this all correctly, this means it can be used in both the bounded and unbounded cases.
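
As a rough illustration of how such builder values typically flow through to the writers (the names below are hypothetical, reusing the BatchingWriterSketch from the earlier sketch, and are not the exact code in this PR):

// Illustrative sketch of wiring configured limits from a builder-style
// transform into the writer; not the actual PubsubIO.Write implementation.
class WriteSketch {
  private final int maxBatchSize;
  private final int maxBatchByteSize;

  WriteSketch(int maxBatchSize, int maxBatchByteSize) {
    this.maxBatchSize = maxBatchSize;
    this.maxBatchByteSize = maxBatchByteSize;
  }

  // Mirrors withMaxBatchSize / withMaxBatchByteSize: each call returns a new
  // instance carrying the updated limit.
  WriteSketch withMaxBatchSize(int n) {
    return new WriteSketch(n, maxBatchByteSize);
  }

  WriteSketch withMaxBatchByteSize(int n) {
    return new WriteSketch(maxBatchSize, n);
  }

  // Both the bounded and the unbounded writers are constructed with the same limits.
  BatchingWriterSketch newWriter() {
    return new BatchingWriterSketch(maxBatchSize, maxBatchByteSize);
  }
}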

@reuvenlax
Contributor

@cjmcgraw This PR will work for you then, if you are using batch. As I mentioned above, I only have one comment - throwing SizeLimitExceededException is backwards-incompatible behavior that might cause existing pipelines to stop working - and then I am happy to merge this PR. This change looks perfectly good to me otherwise.

@dadrian

dadrian commented Aug 9, 2018

It's not a backwards-incompatible change (assuming the default max size actually matches the max size of a request you can send to Pub/Sub). The behavior right now is to throw an exception from within the Google Cloud SDK internals, rather than from the Beam SDK.

The whole point of this PR is to prevent Beam from submitting batches that are larger than the underlying libraries support.

@reuvenlax
Contributor

@dadrian good point. I was having trouble finding Pub/Sub documentation about these limits, but I've validated that these are the limits. I'll go ahead and merge now.
