HDDS-4119. Improve performance of the BufferPool management of Ozone client #1336
Conversation
Thanks @elek and @lokeshj1703 for working on this. Is it intentionally created with target branch apache:HDDS-4119 instead of apache:master?
Files with resolved review comments:
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/common/ChunkBuffer.java
...lient/src/test/java/org/apache/hadoop/hdds/scm/storage/TestBlockOutputStreamCorrectness.java
@@ -44,9 +45,6 @@ static ChunkBuffer allocate(int capacity) {
   * When increment <= 0, entire buffer is allocated in the beginning.
   */
  static ChunkBuffer allocate(int capacity, int increment) {
    if (increment > 0 && increment < capacity) {
      return new IncrementalChunkBuffer(capacity, increment, false);
Can you please also change TestChunkBuffer#runTestIncrementalChunkBuffer to explicitly create IncrementalChunkBuffer? Currently it uses this factory method, and so with this patch it really tests ChunkBufferImplWithByteBuffer.
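To make the concern concrete, here is a minimal, self-contained stand-in for the factory dispatch. The class names mirror the Ozone ones, but the bodies are hypothetical stubs; only the selection logic is taken from the diff above:

```java
// Simplified stand-in: pre-patch, ChunkBuffer.allocate(capacity, increment)
// picks the incremental implementation when 0 < increment < capacity.
public class AllocateDispatchDemo {

  interface ChunkBuffer {}

  static class IncrementalChunkBuffer implements ChunkBuffer {
    IncrementalChunkBuffer(int capacity, int increment, boolean duplicated) {}
  }

  static class ChunkBufferImplWithByteBuffer implements ChunkBuffer {
    ChunkBufferImplWithByteBuffer(int capacity) {}
  }

  // Mirrors the pre-patch dispatch in the diff above.
  static ChunkBuffer allocate(int capacity, int increment) {
    if (increment > 0 && increment < capacity) {
      return new IncrementalChunkBuffer(capacity, increment, false);
    }
    return new ChunkBufferImplWithByteBuffer(capacity);
  }

  public static void main(String[] args) {
    System.out.println(allocate(8, 2).getClass().getSimpleName());
    System.out.println(allocate(8, 8).getClass().getSimpleName());
  }
}
```

A test that allocates through the factory silently switches implementations when the factory changes; constructing IncrementalChunkBuffer directly pins the type under test.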
Wow, nice catch.
IncrementalChunkBuffer was added to address cases where the Ozone client was running into OOM with keys smaller than the chunk size, because without it the smallest buffer allocated is always equal to the chunk size (4MB by default).
Please see https://issues.apache.org/jira/browse/HDDS-2331 for more details.
I would prefer not to remove the incremental chunk buffer logic, and maybe hide it behind an internal config.
@elek, how much of a perf gain will we have if we still do incremental buffer allocation?
> @elek, how much of a perf gain will we have if we still do incremental buffer allocation?

I can repeat the test to get exact numbers, but I couldn't get good performance without removing the incremental buffer. You can easily test it with the new unit test: if you write a lot of data with byte=1, performance is still low.

> IncrementalChunkBuffer was added to address cases where the Ozone client was running into OOM with keys smaller than the chunk size, because without it the smallest buffer allocated is always equal to the chunk size (4MB by default).

I think it's a valid (and important) question, but as far as I see it's safe to remove the IncrementalChunkBuffer. The situation is slightly different since HDDS-2331. I tried to test this patch with the commands from HDDS-2331:

ozone freon rk --numOfThreads 1 --numOfVolumes 1 --numOfBuckets 1 --replicationType RATIS --factor ONE --keySize 1048576 --numOfKeys 5120 --bufferSize 65536
I couldn't reproduce the OOM.
Based on my understanding:
- We already have an increment via the ByteBuffer, but the size of the increment is 4MB (adding one more buffer when required). 4MB seems to be acceptable even with many clients in the same JVM, especially if we get acceptable performance in exchange. Let's say I have 100 Ozone clients (in the same JVM!) which write 1kb keys. I will have (4MB-1kb) * 100 overhead without the IncrementalChunkBuffer (as far as I understood). It's still <400MB in exchange for a 30-100% performance gain. Sounds like a good deal. But let me know if you see any problems here.
- Let's say the 400MB overhead is unacceptable (or my calculation was wrong and the overhead is higher ;-) ). As far as I see, the BufferPool is created per key. I think it would be possible to set the buffer size to min(keySize, bufferSize). With this approach the first and only buffer of the BufferPool can have exactly the required size (which covers all the cases where the key size is < 4MB).
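A back-of-the-envelope sketch of the two points above. The 4MB chunk size, 100 clients, and 1kb keys are the numbers from this comment; the method names are hypothetical, not the actual BufferPool API:

```java
// Sketch only: worst-case overhead of chunk-sized buffers for small keys,
// and the proposed min(keySize, bufferSize) cap. Names are hypothetical.
public class BufferSizingSketch {

  // Extra memory when every small key holds a full chunk-sized buffer.
  static long overheadBytes(long chunkSize, long keySize, int clients) {
    return (chunkSize - keySize) * clients;
  }

  // The proposed mitigation: cap the first (and only) buffer at the key size.
  static int firstBufferSize(long keySize, int configuredBufferSize) {
    return (int) Math.min(keySize, (long) configuredBufferSize);
  }

  public static void main(String[] args) {
    // 100 clients, 4MB chunk buffers, 1kb keys: just under 400MB overhead.
    long overhead = overheadBytes(4L << 20, 1L << 10, 100);
    System.out.println("overhead bytes: " + overhead);

    // Small key gets a small buffer; large key is still capped at 4MB.
    System.out.println("1kb key:  " + firstBufferSize(1L << 10, 4 << 20));
    System.out.println("16MB key: " + firstBufferSize(16L << 20, 4 << 20));
  }
}
```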
> the situation is slightly different since HDDS-2331
Note that default chunk size was 16MB at the time when HDDS-2331 was reported. The benefit from IncrementalChunkBuffer is less now with 4MB default size.
I would suggest separating the two parts of the PR:
- reorganize the position calculation and allocation
- remove the usage of the Incremental buffer
While we can continue searching for the safest way to do (2) (or do something instead of the removal), we can merge the first part where we already have an agreement.
See #1374 about the 2nd.
Further resolved review comments on:
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
if (currentBuffer.hasRemaining()) {
  writeChunk(currentBuffer);
Note the condition is different here; it might not be safe to replace with writeChunkIfNeeded().
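To illustrate why such conditions can diverge, here is a small java.nio example (independent of the Ozone code): hasRemaining() is true for any buffer that is not yet full, while a buffer-is-full check only fires at capacity, so a write guarded by one is not equivalent to a write guarded by the other.

```java
import java.nio.ByteBuffer;

public class FlushConditionDemo {

  // A "buffer is completely filled" style check.
  static boolean isFull(ByteBuffer buf) {
    return buf.position() == buf.limit();
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(4);
    buf.put((byte) 1); // partially filled: position=1, limit=4

    // hasRemaining() is true for every partially filled buffer...
    System.out.println("hasRemaining: " + buf.hasRemaining()); // true
    // ...while the full-buffer check is still false at this point.
    System.out.println("isFull: " + isFull(buf));              // false
  }
}
```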
Thanks for the help. I fully reverted these lines in #23ba2d1 and the build is green again.
See HDDS-4186 about the incremental chunk buffer.
Are there any more objections/comments? All the comments are addressed. Can we merge this?
Thanks for the review, @bshashikant and @adoroszlai. I am merging it now...
* master: (47 commits)
  HDDS-4104. Provide a way to get the default value and key of java-based-configuration easily (apache#1369)
  HDDS-4250. Fix wrong logger name (apache#1429)
  HDDS-4244. Container deleted wrong replica cause mis-replicated. (apache#1423)
  HDDS-4053. Volume space: add quotaUsageInBytes and update it when write and delete key. (apache#1296)
  HDDS-4210. ResolveBucket during checkAcls fails. (apache#1398)
  HDDS-4075. Retry request on different OM on AccessControlException (apache#1303)
  HDDS-4166. Documentation index page redirects to the wrong address (apache#1372)
  HDDS-4039. Reduce the number of fields in hdds.proto to improve performance (apache#1289)
  HDDS-4155. Directory and filename can end up with same name in a path. (apache#1361)
  HDDS-3927. Rename Ozone OM,DN,SCM runtime options to conform to naming conventions (apache#1401)
  HDDS-4119. Improve performance of the BufferPool management of Ozone client (apache#1336)
  HDDS-4217. Remove test TestOzoneContainerRatis (apache#1408)
  HDDS-4218. Remove test TestRatisManager (apache#1409)
  HDDS-4129. change MAX_QUOTA_IN_BYTES to Long.MAX_VALUE. (apache#1337)
  HDDS-4228. add field 'num' to ALLOCATE_BLOCK of scm audit log. (apache#1413)
  HDDS-4196. Add an endpoint in Recon to query Prometheus (apache#1390)
  HDDS-4211. [OFS] Better owner and group display for listing Ozone volumes and buckets (apache#1397)
  HDDS-4150. recon.api.TestEndpoints test is flaky (apache#1396)
  HDDS-4170. Fix typo in method description. (apache#1406)
  HDDS-4064. Show container verbose info with verbose option (apache#1290)
  ...
What changes were proposed in this pull request?
Teragen was reported to be slow with a low number of mappers compared to HDFS.
In my test (one pipeline, 3 YARN nodes) a 10g teragen with HDFS took ~3 minutes, but with Ozone it took 6 minutes. It could be fixed by using more mappers, but when I investigated the execution I found a few problems regarding the BufferPool management.
In the flamegraphs it's clearly visible that with a low number of mappers the client is busy with buffer operations. After the patch, the RPC call and the checksum calculation take the majority of the time.
Overall write performance is improved by at least 30% when a minimal number of threads/mappers is used.
Special thanks to @lokeshj1703, who helped me find the small mistakes in the original version of the patch.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-4119
How was this patch tested?
Teragen 10/100g with 2/30 mappers.
(https://github.com/elek/ozone-perf-env/tree/master/teragen-hdfs)