Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44088: [Java] Fix copyFrom in BaseVariableWidthViewVector #44078

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

ViggoC
Copy link
Contributor

@ViggoC ViggoC commented Sep 12, 2024

Fix bugs in copyFromSafe and handleSafe

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@ViggoC ViggoC changed the title [Java] Fix BaseVariableWidthViewVector GH-44088: [Java] Fix copyFrom in BaseVariableWidthViewVector Sep 12, 2024
Copy link

⚠️ GitHub issue #44088 has been automatically assigned in GitHub to PR creator.

@ViggoC ViggoC marked this pull request as ready for review September 12, 2024 14:52
@lidavidm
Copy link
Member

CC @vibhatha

@vibhatha
Copy link
Collaborator

On it @lidavidm

@@ -261,6 +261,15 @@ public void testDataBufferBasedAllocationInOtherBuffer() {
}
}

@Test
public void testSetSafe() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha I wrote a test case for setSafe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case with long string as well? and a mix of short and long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I can do that. But let's fix this one first. What do you think about the modification I mentioned in comment of reallocViewBuffer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I am still checking it. Please give an hour to verify this. I am thinking what I have already missed and what this is doing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue here is that the getValueCapacity depends on the capacity of the viewBuffer and validityBuffer, so in the while loop, when just one is changed, and the other value is kept the same, the minimum will always be the unchanged value. So I think the main issue is the error in that function. Thank you for this test case, I don't think the existing test cases covered this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix in 15f78da

@@ -1478,7 +1479,7 @@ protected final void handleSafe(int index, int dataLength) {
reallocViewBuffer(Math.max(writePosition, targetCapacity));
}

while (index >= getValueCapacity()) {
while (index >= getValidityBufferValueCapacity()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha I try to fix testSetSafe, but the logic here makes me confused. The writePosition can only be larger than targetCapacity when lastSet < 0. But I don't think there is a need for special treatment in this situation. Is there a problem if I change it to

final long targetCapacity = roundUpToMultipleOf16((long) index * ELEMENT_SIZE + dataLength);
if (viewBuffer.capacity() < targetCapacity) {
  reallocViewBuffer(targetCapacity);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ViggoC I have only skimmed through the PR. Please give me some time. I will verify.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think changing to this won't hurt much. But the actual issue is just changing the validity buffer capacity, it must be changed along with the viewBuffer capacity as the getCapacity is a function of both changes. So just changing one is the issue in the original code. What do you think? Am I missing your point?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 14, 2024
}
}

// test target vector with different initialCapacity
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also failed prior to this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, prior to this change, the dataBufferIndex and offset is just copied to viewBuffer by

      from.getDataBuffer().getBytes(start, viewBuffer, copyStart, ELEMENT_SIZE);

But they might be difference from fromVector's.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

Comment on lines +1547 to +1553
viewBuffer.setBytes(start, from.getDataBuffer(), copyStart, LENGTH_WIDTH + PREFIX_WIDTH);
int writePosition = start + LENGTH_WIDTH + PREFIX_WIDTH;
// set buf id
viewBuffer.setInt(writePosition, dataBuffers.size() - 1);
writePosition += BUF_INDEX_WIDTH;
// set offset
viewBuffer.setInt(writePosition, (int) thisDataBuf.writerIndex());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have to do this, why not copy the 16 bytes from the from.getDataBuffer() from the copyStart till it reads 16 bytes. And I think the existing logic covers that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the buf id and offset will be different if the lengths of dataBuffers is different from the from vector.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I have missed that. Good catch. Can we add a test case for that as well. I mean create vector 1 some content and then vector 2 with a different content. Copy content from vector 1 to vector 2, and validate this?

@vibhatha
Copy link
Collaborator

@ViggoC what do you think about the following diff on top of your changes

diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthViewVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthViewVector.java
index 327548e0bc..0434024dec 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthViewVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthViewVector.java
@@ -1252,6 +1252,7 @@ public abstract class BaseVariableWidthViewVector extends BaseValueVector
     // We need to check and reallocate the validity buffer
     while (index >= getValueCapacity()) {
       reallocValidityBuffer();
+      reallocViewBuffer();
     }
     BitVectorHelper.unsetBit(validityBuffer, index);
   }
@@ -1461,22 +1462,9 @@ public abstract class BaseVariableWidthViewVector extends BaseValueVector
   }
 
   protected final void handleSafe(int index, int dataLength) {
-    final long lastSetCapacity = lastSet < 0 ? 0 : (long) index * ELEMENT_SIZE;
-    final long targetCapacity = roundUpToMultipleOf16(lastSetCapacity + dataLength);
-    // for views, we need each buffer with 16 byte alignment, so we need to check the last written
-    // index
-    // in the viewBuffer and allocate a new buffer which has 16 byte alignment for adding new
-    // values.
-    long writePosition = (long) index * ELEMENT_SIZE;
-    if (viewBuffer.capacity() <= writePosition || viewBuffer.capacity() < targetCapacity) {
-      /*
-       * Everytime we want to increase the capacity of the viewBuffer, we need to make sure that the new capacity
-       * meets 16 byte alignment.
-       * If the targetCapacity is larger than the writePosition, we may not necessarily
-       * want to allocate the targetCapacity to viewBuffer since when it is >={@link #INLINE_SIZE} either way
-       * we are writing to the dataBuffer.
-       */
-      reallocViewBuffer(Math.max(writePosition, targetCapacity));
+    final long targetCapacity = roundUpToMultipleOf16((long) index * ELEMENT_SIZE + dataLength);
+    if (viewBuffer.capacity() < targetCapacity) {
+      reallocViewBuffer(targetCapacity);
     }
 
     while (index >= getValidityBufferValueCapacity()) {

@vibhatha
Copy link
Collaborator

@danepitkin could you please help us to run the workflows?

@ViggoC
Copy link
Contributor Author

ViggoC commented Sep 18, 2024

@vibhatha It LGTM, I'll apply this patch and add more test cases you commented.

Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ViggoC for working on this. PR LGTM.

@vibhatha
Copy link
Collaborator

@lidavidm could you please take a look as well?

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@@ -1251,6 +1252,7 @@ public void setNull(int index) {
// We need to check and reallocate the validity buffer
while (index >= getValueCapacity()) {
reallocValidityBuffer();
reallocViewBuffer();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do we need to call reallocViewBuffer here? Could it be possible to have null values not backed by a view buffer and only realloc when we add non-null values?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants