HDDS-4119. Improve performance of the BufferPool management of Ozone client #1336
Conversation
Thanks @elek and @lokeshj1703 for working on this. Is it intentionally created with target branch apache:HDDS-4119 instead of apache:master?
Files with resolved review comments:
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/common/ChunkBuffer.java
...lient/src/test/java/org/apache/hadoop/hdds/scm/storage/TestBlockOutputStreamCorrectness.java
@@ -44,9 +45,6 @@ static ChunkBuffer allocate(int capacity) {
   * When increment <= 0, entire buffer is allocated in the beginning.
   */
  static ChunkBuffer allocate(int capacity, int increment) {
    if (increment > 0 && increment < capacity) {
      return new IncrementalChunkBuffer(capacity, increment, false);
Can you please also change TestChunkBuffer#runTestIncrementalChunkBuffer to explicitly create IncrementalChunkBuffer? Currently it uses this factory method, and so with this patch it really tests ChunkBufferImplWithByteBuffer.
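To make the concern concrete, here is a minimal, self-contained stand-in for the factory dispatch. The class names mirror the Ozone ones, but the bodies are hypothetical stubs; only the selection logic is taken from the diff above:

```java
// Simplified stand-in: pre-patch, ChunkBuffer.allocate(capacity, increment)
// picks the incremental implementation when 0 < increment < capacity.
public class AllocateDispatchDemo {

  interface ChunkBuffer {}

  static class IncrementalChunkBuffer implements ChunkBuffer {
    IncrementalChunkBuffer(int capacity, int increment, boolean duplicated) {}
  }

  static class ChunkBufferImplWithByteBuffer implements ChunkBuffer {
    ChunkBufferImplWithByteBuffer(int capacity) {}
  }

  // Mirrors the pre-patch dispatch in the diff above.
  static ChunkBuffer allocate(int capacity, int increment) {
    if (increment > 0 && increment < capacity) {
      return new IncrementalChunkBuffer(capacity, increment, false);
    }
    return new ChunkBufferImplWithByteBuffer(capacity);
  }

  public static void main(String[] args) {
    System.out.println(allocate(8, 2).getClass().getSimpleName());
    System.out.println(allocate(8, 8).getClass().getSimpleName());
  }
}
```

A test that allocates through the factory silently switches implementations when the factory changes; constructing IncrementalChunkBuffer directly pins the type under test.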
Wow, nice catch.
IncrementalChunkBuffer was added to address cases where the Ozone client was running into OOM with keys smaller than the chunk size, because without it the smallest buffer allocated is always equal to the chunk size (4MB by default).
Please see https://issues.apache.org/jira/browse/HDDS-2331 for more details.
I would prefer not to remove the incremental chunk buffer logic, and maybe hide it behind an internal config.
@elek, how much of a perf gain will we have if we still do incremental buffer allocation?
> @elek, how much of a perf gain will we have if we still do incremental buffer allocation?

I can repeat the test to get exact numbers, but I couldn't get good performance without removing the incremental buffer. You can easily test it with the new unit test: if you write a lot of data with byte=1, performance is still low.

> IncrementalChunkBuffer was added to address cases where the Ozone client was running into OOM with keys smaller than the chunk size, because without it the smallest buffer allocated is always equal to the chunk size (4MB by default).

I think it's a valid (and important) question, but as far as I see it's safe to remove the IncrementalChunkBuffer. The situation is slightly different since HDDS-2331. I tried to test this patch with the commands from HDDS-2331:

ozone freon rk --numOfThreads 1 --numOfVolumes 1 --numOfBuckets 1 --replicationType RATIS --factor ONE --keySize 1048576 --numOfKeys 5120 --bufferSize 65536
I couldn't reproduce the OOM.
Based on my understanding:
- We already have an increment via the ByteBuffer, but the size of the increment is 4MB (adding one more buffer when required). 4MB seems to be acceptable even with many clients in the same JVM, especially if we get acceptable performance in exchange. Let's say I have 100 Ozone clients (in the same JVM!) which write 1kb keys. I will have (4MB-1kb) * 100 overhead without the IncrementalChunkBuffer (as far as I understood). It's still <400MB in exchange for a 30-100% performance gain. Sounds like a good deal. But let me know if you see any problems here.
- Let's say the 400MB overhead is unacceptable (or my calculation was wrong and the overhead is higher ;-) ). As far as I see, the BufferPool is created per key. I think it would be possible to set the buffer size to min(keySize, bufferSize). With this approach the first and only buffer of the BufferPool can have exactly the required size (which covers all the cases where the key size is < 4MB).
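A back-of-the-envelope sketch of the two points above. The 4MB chunk size, 100 clients, and 1kb keys are the numbers from this comment; the method names are hypothetical, not the actual BufferPool API:

```java
// Sketch only: worst-case overhead of chunk-sized buffers for small keys,
// and the proposed min(keySize, bufferSize) cap. Names are hypothetical.
public class BufferSizingSketch {

  // Extra memory when every small key holds a full chunk-sized buffer.
  static long overheadBytes(long chunkSize, long keySize, int clients) {
    return (chunkSize - keySize) * clients;
  }

  // The proposed mitigation: cap the first (and only) buffer at the key size.
  static int firstBufferSize(long keySize, int configuredBufferSize) {
    return (int) Math.min(keySize, (long) configuredBufferSize);
  }

  public static void main(String[] args) {
    // 100 clients, 4MB chunk buffers, 1kb keys: just under 400MB overhead.
    long overhead = overheadBytes(4L << 20, 1L << 10, 100);
    System.out.println("overhead bytes: " + overhead);

    // Small key gets a small buffer; large key is still capped at 4MB.
    System.out.println("1kb key:  " + firstBufferSize(1L << 10, 4 << 20));
    System.out.println("16MB key: " + firstBufferSize(16L << 20, 4 << 20));
  }
}
```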
> the situation is slightly different since HDDS-2331
Note that default chunk size was 16MB at the time when HDDS-2331 was reported. The benefit from IncrementalChunkBuffer is less now with 4MB default size.
I would suggest separating the two parts of the PR:
- reorganize the position calculation and allocation
- remove the usage of the Incremental buffer
While we can continue searching for the safest way to do (2) (or do something instead of the removal), we can merge the first part where we already have an agreement.
See #1374 about the 2nd.
Further resolved review comments on:
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockOutputStream.java
if (currentBuffer.hasRemaining()) {
  writeChunk(currentBuffer);
Note the condition is different here; it might not be safe to replace with writeChunkIfNeeded().
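To illustrate why such conditions can diverge, here is a small java.nio example (independent of the Ozone code): hasRemaining() is true for any buffer that is not yet full, while a buffer-is-full check only fires at capacity, so a write guarded by one is not equivalent to a write guarded by the other.

```java
import java.nio.ByteBuffer;

public class FlushConditionDemo {

  // A "buffer is completely filled" style check.
  static boolean isFull(ByteBuffer buf) {
    return buf.position() == buf.limit();
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(4);
    buf.put((byte) 1); // partially filled: position=1, limit=4

    // hasRemaining() is true for every partially filled buffer...
    System.out.println("hasRemaining: " + buf.hasRemaining()); // true
    // ...while the full-buffer check is still false at this point.
    System.out.println("isFull: " + isFull(buf));              // false
  }
}
```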
Thanks for the help. I fully reverted these lines in #23ba2d1 and the build is green again.
See HDDS-4186 about the incremental chunk buffer.
Are there any more objections/comments? All the comments are addressed. Can we merge this?
Thanks for the review, @bshashikant and @adoroszlai. I am merging it now...
* master: (47 commits)
  HDDS-4104. Provide a way to get the default value and key of java-based-configuration easily (apache#1369)
  HDDS-4250. Fix wrong logger name (apache#1429)
  HDDS-4244. Container deleted wrong replica cause mis-replicated. (apache#1423)
  HDDS-4053. Volume space: add quotaUsageInBytes and update it when write and delete key. (apache#1296)
  HDDS-4210. ResolveBucket during checkAcls fails. (apache#1398)
  HDDS-4075. Retry request on different OM on AccessControlException (apache#1303)
  HDDS-4166. Documentation index page redirects to the wrong address (apache#1372)
  HDDS-4039. Reduce the number of fields in hdds.proto to improve performance (apache#1289)
  HDDS-4155. Directory and filename can end up with same name in a path. (apache#1361)
  HDDS-3927. Rename Ozone OM,DN,SCM runtime options to conform to naming conventions (apache#1401)
  HDDS-4119. Improve performance of the BufferPool management of Ozone client (apache#1336)
  HDDS-4217. Remove test TestOzoneContainerRatis (apache#1408)
  HDDS-4218. Remove test TestRatisManager (apache#1409)
  HDDS-4129. change MAX_QUOTA_IN_BYTES to Long.MAX_VALUE. (apache#1337)
  HDDS-4228. add field 'num' to ALLOCATE_BLOCK of scm audit log. (apache#1413)
  HDDS-4196. Add an endpoint in Recon to query Prometheus (apache#1390)
  HDDS-4211. [OFS] Better owner and group display for listing Ozone volumes and buckets (apache#1397)
  HDDS-4150. recon.api.TestEndpoints test is flaky (apache#1396)
  HDDS-4170. Fix typo in method description. (apache#1406)
  HDDS-4064. Show container verbose info with verbose option (apache#1290)
  ...
What changes were proposed in this pull request?
Teragen was reported to be slow with a low number of mappers compared to HDFS.
In my test (one pipeline, 3 YARN nodes) a 10g teragen with HDFS took ~3 minutes, but with Ozone it took 6 minutes. It could be fixed by using more mappers, but when I investigated the execution I found a few problems regarding the BufferPool management.
In the flamegraphs it's clearly visible that with a low number of mappers the client is busy with buffer operations. After the patch, the RPC call and the checksum calculation take the majority of the time.
Overall write performance is improved by at least 30% when a minimal number of threads/mappers is used.
Special thanks to @lokeshj1703, who helped me find the small mistakes in the original version of the patch.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-4119
How was this patch tested?
Teragen 10/100g with 2/30 mappers.
(https://github.com/elek/ozone-perf-env/tree/master/teragen-hdfs)