
HDDS-4119. Improve performance of the BufferPool management of Ozone client #1336

Merged: elek merged 21 commits into apache:master on Sep 11, 2020

Conversation

@elek (Member) commented Aug 17, 2020

What changes were proposed in this pull request?

Teragen was reported to be slow with a low number of mappers compared to HDFS.

In my test (one pipeline, 3 YARN nodes) a 10 GB teragen took ~3 minutes with HDFS but 6 minutes with Ozone. It could be worked around by using more mappers, but when I investigated the execution I found a few problems regarding the BufferPool management:

  1. IncrementalChunkBuffer is slow, and it might not be required as the BufferPool itself is incremental.
  2. bufferPool.allocateBufferIfNeeded is called for every write operation, which can be slow (buffer positions have to be calculated).
  3. There is no explicit support for write(byte) operations (see the sketch after this list).
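
To illustrate points 2 and 3, here is a minimal, hypothetical sketch (not the actual Ozone BufferPool API; the class and method names are illustrative): cache the current buffer and go back to allocation only when it is exhausted, so single-byte writes stay on a cheap fast path.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not the real client code: only the idea of caching the
// current buffer so that allocateBufferIfNeeded-style work is not done per write.
class CachedBufferWriter {
  private final int bufferSize;      // e.g. the 4 MB chunk size
  private ByteBuffer currentBuffer;  // cached; refreshed only when exhausted

  CachedBufferWriter(int bufferSize) {
    this.bufferSize = bufferSize;
  }

  // Single-byte write path: no position recalculation and no pool lookup
  // unless the cached buffer is full.
  void write(byte b) {
    if (currentBuffer == null || !currentBuffer.hasRemaining()) {
      currentBuffer = allocateNextBuffer();  // slow path, taken rarely
    }
    currentBuffer.put(b);                    // fast path
  }

  private ByteBuffer allocateNextBuffer() {
    // In the real client this would come from the BufferPool.
    return ByteBuffer.allocate(bufferSize);
  }
}
```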

In the flame graphs it is clearly visible that with a low number of mappers the client is busy with buffer operations. After the patch, the RPC call and the checksum calculation take the majority of the time.

Overall write performance is improved by at least 30% when a minimal number of threads/mappers is used.

Thanks

Special thanks to @lokeshj1703, who helped me find the small mistakes in the original version of the patch.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4119

How was this patch tested?

Teragen 10/100 GB with 2/30 mappers.

(https://github.com/elek/ozone-perf-env/tree/master/teragen-hdfs)

@adoroszlai (Contributor) left a comment

Thanks @elek and @lokeshj1703 for working on this. Is it intentionally created with target branch apache:HDDS-4119 instead of apache:master?

@@ -44,9 +45,6 @@ static ChunkBuffer allocate(int capacity) {
* When increment <= 0, entire buffer is allocated in the beginning.
*/
static ChunkBuffer allocate(int capacity, int increment) {
if (increment > 0 && increment < capacity) {
return new IncrementalChunkBuffer(capacity, increment, false);

Contributor:
Can you please also change TestChunkBuffer#runTestIncrementalChunkBuffer to explicitly create IncrementalChunkBuffer? Currently it uses this factory method, and so with this patch it really tests ChunkBufferImplWithByteBuffer.
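
A rough sketch of the suggested change (the actual TestChunkBuffer code may differ; the constructor arguments are taken from the diff above): construct the incremental implementation directly instead of going through the factory.

```java
// Sketch only — before: the factory method, which after this patch returns
// ChunkBufferImplWithByteBuffer rather than IncrementalChunkBuffer:
//   ChunkBuffer impl = ChunkBuffer.allocate(capacity, increment);
// After: create IncrementalChunkBuffer explicitly so it stays under test:
ChunkBuffer impl = new IncrementalChunkBuffer(capacity, increment, false);
```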

@elek (Member Author):
Wow, nice catch.

Contributor:
IncrementalChunkBuffer was added to address cases where the Ozone client was running into OOM with keys smaller than the chunk size; without it, the smallest buffer that will be allocated is always equal to the chunk size (4MB by default).

Please see https://issues.apache.org/jira/browse/HDDS-2331 for more details.
I would prefer not to remove the incremental chunk buffer logic, and maybe hide it behind an internal config.

@elek, how much perf gain will we have if we still do incremental buffer allocation?

@elek (Member Author):
> @elek, how much perf gain will we have if we still do incremental buffer allocation?

I can repeat the test to get exact numbers, but I couldn't get good performance without removing the incremental buffer. You can easily test it with the new unit test: if you write a lot of data with byte=1, performance is still low.
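
For illustration only (the stream parameter is a placeholder, not the actual test code), the kind of single-byte write loop meant here looks roughly like this; with buffer management work in the per-write hot path, this is where the slowdown becomes visible:

```java
import java.io.IOException;
import java.io.OutputStream;

// Illustration only: push totalBytes through the stream one byte at a time
// (value byte=1, as mentioned above) and report the elapsed milliseconds.
final class SingleByteWriteBench {
  static long elapsedMillis(OutputStream out, long totalBytes) throws IOException {
    long start = System.nanoTime();
    for (long i = 0; i < totalBytes; i++) {
      out.write(1);                     // write(byte) hot path
    }
    out.flush();
    return (System.nanoTime() - start) / 1_000_000;
  }
}
```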

> IncrementalChunkBuffer was added to address cases where the Ozone client was running into OOM with keys smaller than the chunk size; without it, the smallest buffer that will be allocated is always equal to the chunk size (4MB by default).

I think it's a valid (and important) question, but as far as I can see it's safe to remove the IncrementalChunkBuffer. The situation is slightly different since HDDS-2331. I tried to test this patch with the command from HDDS-2331:

ozone freon rk --numOfThreads 1 --numOfVolumes 1 --numOfBuckets 1 --replicationType RATIS --factor ONE --keySize 1048576 --numOfKeys 5120 --bufferSize 65536

I couldn't reproduce the OOM.

Based on my understanding:

  1. The BufferPool itself already grows incrementally, but the size of each increment is 4MB (one more buffer is added when required).
  2. A 4MB increment seems acceptable even with many clients in the same JVM, especially if it buys acceptable performance.

Let's say I have 100 Ozone clients (in the same JVM!!!) which write 1kb keys. Without the IncrementalChunkBuffer I will have (4MB - 1kb) * 100 of overhead (as far as I understood). That's still <400MB in exchange for a 30-100% performance gain. Sounds like a good deal.

But let me know if you see any problems here.

  1. Let's say the 400MB overhead is unacceptable (or my calculation was wrong and the overhead is higher ;-) )

As far as I see the BufferPool is created per key. I think it would be possible to set the buffer size to min(keySize, bufferSize). With this approach the first and only buffer of the BufferPool can have exactly the required size, which covers all the cases where the key size is < 4MB.
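
A minimal sketch of that idea, assuming the buffer size can be chosen when the per-key BufferPool is created (the real constructor may differ):

```java
// Sketch only: size the first (and for small keys, the only) buffer as
// min(keySize, bufferSize), so a 1kb key never forces a full 4MB allocation
// even without IncrementalChunkBuffer.
static int initialBufferSize(long keySize, int configuredBufferSize) {
  return (int) Math.min(keySize, (long) configuredBufferSize);
}
```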

Contributor:
> the situation is slightly different since HDDS-2331

Note that the default chunk size was 16MB when HDDS-2331 was reported. The benefit from IncrementalChunkBuffer is smaller now with the 4MB default size.

@elek (Member Author):
I would suggest separating the two parts of the PR:

  1. reorganize the position calculation and allocation
  2. remove the usage of the Incremental buffer

While we continue searching for the safest way to do 2 (or do something other than removal), we can merge the first part, where we already have agreement.

@elek (Member Author):
See #1374 for the second part.

@elek elek changed the base branch from HDDS-4119 to master August 25, 2020 13:26
Comment on lines 486 to 487
if (currentBuffer.hasRemaining()) {
writeChunk(currentBuffer);

Contributor:
Note the condition is different here; it might not be safe to replace it with writeChunkIfNeeded().
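
To illustrate the concern (assuming writeChunkIfNeeded() only flushes a buffer once it is full — the exact implementation may differ), the call site above flushes even a partially filled buffer:

```java
// Original call site: flush whatever data is in the buffer, even if it is
// only partially filled (e.g. the tail of a key).
if (currentBuffer.hasRemaining()) {
  writeChunk(currentBuffer);
}
// Assumption: writeChunkIfNeeded() would skip a buffer that is not yet full,
// so substituting it here could leave the last partial chunk unwritten.
```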

@elek (Member Author):
Thanks for the help. I fully reverted these lines in #23ba2d1 and the build is green again.

@elek (Member Author) commented Sep 1, 2020

See HDDS-4186 about the incremental chunk buffer.

@elek (Member Author) commented Sep 7, 2020

Are there any more objections/comments? All the comments are addressed. Can we merge this?

@elek (Member Author) commented Sep 11, 2020

Thanks for the review @bshashikant and @adoroszlai.

I am merging it now...

@elek elek merged commit 72e3215 into apache:master Sep 11, 2020
llemec pushed a commit to llemec/hadoop-ozone that referenced this pull request Sep 15, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Sep 18, 2020
* master: (47 commits)
  HDDS-4104. Provide a way to get the default value and key of java-based-configuration easily (apache#1369)
  HDDS-4250. Fix wrong logger name (apache#1429)
  HDDS-4244. Container deleted wrong replica cause mis-replicated. (apache#1423)
  HDDS-4053. Volume space: add quotaUsageInBytes and update it when write and delete key. (apache#1296)
  HDDS-4210. ResolveBucket during checkAcls fails. (apache#1398)
  HDDS-4075. Retry request on different OM on AccessControlException (apache#1303)
  HDDS-4166. Documentation index page redirects to the wrong address (apache#1372)
  HDDS-4039. Reduce the number of fields in hdds.proto to improve performance (apache#1289)
  HDDS-4155. Directory and filename can end up with same name in a path. (apache#1361)
  HDDS-3927. Rename Ozone OM,DN,SCM runtime options to conform to naming conventions (apache#1401)
  HDDS-4119. Improve performance of the BufferPool management of Ozone client (apache#1336)
  HDDS-4217.Remove test TestOzoneContainerRatis (apache#1408)
  HDDS-4218.Remove test TestRatisManager (apache#1409)
  HDDS-4129. change MAX_QUOTA_IN_BYTES to Long.MAX_VALUE. (apache#1337)
  HDDS-4228: add field 'num' to ALLOCATE_BLOCK of scm audit log. (apache#1413)
  HDDS-4196. Add an endpoint in Recon to query Prometheus (apache#1390)
  HDDS-4211. [OFS] Better owner and group display for listing Ozone volumes and buckets (apache#1397)
  HDDS-4150. recon.api.TestEndpoints test is flaky (apache#1396)
  HDDS-4170 - Fix typo in method description. (apache#1406)
  HDDS-4064. Show container verbose info with verbose option (apache#1290)
  ...
errose28 added a commit to errose28/ozone that referenced this pull request Sep 18, 2020
…ponse

* HDDS-4122-quota-attempt2: (51 commits)
  Remove redundant check status calls in children of AbstractOMOpenKeyDeleteRequest
  Remove unused inports and fix super constructor calls
  Move common volume byte usage update code to AbstractOMKeyDeleteResponse
  Add volume quota update to open key delete response, and group duplicate code
  HDDS-4104. Provide a way to get the default value and key of java-based-configuration easily (apache#1369)
  HDDS-4250. Fix wrong logger name (apache#1429)
  HDDS-4244. Container deleted wrong replica cause mis-replicated. (apache#1423)
  HDDS-4053. Volume space: add quotaUsageInBytes and update it when write and delete key. (apache#1296)
  HDDS-4210. ResolveBucket during checkAcls fails. (apache#1398)
  HDDS-4075. Retry request on different OM on AccessControlException (apache#1303)
  HDDS-4166. Documentation index page redirects to the wrong address (apache#1372)
  HDDS-4039. Reduce the number of fields in hdds.proto to improve performance (apache#1289)
  HDDS-4155. Directory and filename can end up with same name in a path. (apache#1361)
  HDDS-3927. Rename Ozone OM,DN,SCM runtime options to conform to naming conventions (apache#1401)
  HDDS-4119. Improve performance of the BufferPool management of Ozone client (apache#1336)
  HDDS-4217.Remove test TestOzoneContainerRatis (apache#1408)
  HDDS-4218.Remove test TestRatisManager (apache#1409)
  HDDS-4129. change MAX_QUOTA_IN_BYTES to Long.MAX_VALUE. (apache#1337)
  HDDS-4228: add field 'num' to ALLOCATE_BLOCK of scm audit log. (apache#1413)
  HDDS-4196. Add an endpoint in Recon to query Prometheus (apache#1390)
  ...