
GH-43956: [C++][Format] Add initial Decimal32/Decimal64 implementations #43957

Open

zeroshade wants to merge 49 commits into main
Conversation

zeroshade (Member) commented Sep 4, 2024

Rationale for this change

Widening the Decimal128/256 types to allow bit widths of 32 and 64 enables more interoperability with other libraries and utilities that already support these types. This provides even more opportunities for zero-copy interaction with projects such as libcudf and various databases.

What changes are included in this PR?

This PR contains the basic C++ implementations for the Decimal32/Decimal64 types, arrays, builders, and scalars. It also includes the minimum necessary to get everything compiling and tests passing without also extending the Acero kernels and Parquet handling (both of which will be handled in follow-up PRs).

Are these changes tested?

Yes, tests were extended where applicable to add decimal32/decimal64 cases.

Are there any user-facing changes?

Currently, if a user uses decimal(precision, scale) rather than decimal128(precision, scale), they get a Decimal128Type if the precision is <= 38 (the maximum precision for Decimal128) and a Decimal256Type if the precision is higher. Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 factory functions results in the following behavior:

  • for precisions [1 : 9] => Decimal32Type
  • for precisions [10 : 18] => Decimal64Type
  • for precisions [19 : 38] => Decimal128Type
  • for precisions [39 : 76] => Decimal256Type

Many of our tests assumed that a decimal with low precision would be Decimal128 and had to be updated; users making the same assumption may be initially surprised by this change.
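The mapping above can be sketched as a small standalone helper. Note that `SmallestDecimalName` is a hypothetical illustration for this discussion, not the actual Arrow factory, which returns `std::shared_ptr<DataType>` instances:

```cpp
#include <string>

// Hypothetical helper mirroring the precision ranges listed above.
// It just names the type that decimal(precision, scale) would now produce.
std::string SmallestDecimalName(int precision) {
  if (precision >= 1 && precision <= 9) return "Decimal32Type";
  if (precision <= 18) return "Decimal64Type";
  if (precision <= 38) return "Decimal128Type";
  if (precision <= 76) return "Decimal256Type";
  return "invalid precision";
}
```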


github-actions bot commented Sep 4, 2024

⚠️ GitHub issue #43956 has been automatically assigned in GitHub to PR creator.

Resolved review threads (outdated):
cpp/src/arrow/type.cc
cpp/src/arrow/array/builder_dict.h
cpp/src/arrow/compute/kernels/codegen_internal.h
@@ -171,7 +171,8 @@ using PrimitiveArrowTypes =
using TemporalArrowTypes =
::testing::Types<Date32Type, Date64Type, TimestampType, Time32Type, Time64Type>;

-using DecimalArrowTypes = ::testing::Types<Decimal128Type, Decimal256Type>;
+using DecimalArrowTypes =
+    ::testing::Types</*Decimal32Type, Decimal64Type,*/ Decimal128Type, Decimal256Type>;
Member

Ditto here. (Should we file issues to come back to these?)

Member Author

These are commented out because we didn't implement casting for the new decimal types. This is mentioned in the issue as check boxes to do rather than as an entirely separate issue currently.

Member

But it's going to be a separate PR, right?

Member Author

Yes, I didn't want to make this already large PR even larger. I'll implement the cast kernels and so on in a follow-up PR.

pitrou (Member) commented Sep 5, 2024

> Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality

I'm afraid this may massively break user code. I would suggest another approach:

  • deprecate the decimal() factory while keeping its current behavior of always returning at least decimal128
  • introduce a new smallest_decimal() factory that is documented to return the smallest possible type, and explicitly makes no guarantees about the stability of the return type

wgtmac (Member) commented Sep 5, 2024

> Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality
>
> I'm afraid this may massively break user code. I would suggest another approach:
>
>   • deprecate the decimal() factory while keeping its current behavior of always returning at least decimal128
>   • introduce a new smallest_decimal() factory that is documented to return the smallest possible type, and explicitly makes no guarantees about the stability of the return type

I just have the same concern. +1 on the proposed workaround.

zeroshade (Member Author)

@pitrou @bkietz @wgtmac I've updated this based on the suggestion: created a smallest_decimal function and added a deprecation note to the docstring for decimal.

zeroshade (Member Author)

@pitrou can you take another pass here? I believe I've addressed all of your comments. Thanks!

pitrou (Member) left a comment

Some more comments. Thanks for the updates!

}

#endif // __MINGW32__

-TEST(Decimal128Test, TestFromBigEndian) {
+TEST(Decimal32Test, TestFromBigEndian) {
Member

It seems a number of tests are mostly or exactly identical between the Decimal tests, perhaps you can write a generic version and call it for each concrete test type?

Member Author

Added generic versions for most of them. The LeftShift/RightShift and Negate tests are sufficiently different in their values that it's harder to make them generic.

Five resolved review threads on cpp/src/arrow/util/decimal_test.cc (two outdated).
// ceil(log10(2 ^ kMantissaBits))
static constexpr int kMantissaDigits = 8;
// log10(2 ^ kMantissaBits) ~= 7.2, let's be conservative to ensure more accuracy
// with our conversions for Decimal values
Member

Did it come up in some of the tests otherwise?

Member Author

Yes, we had a persistent off-by-one rounding issue with Decimal32 and Decimal64 in the FromReal tests, which this fixed with no detrimental effect on Decimal128/256.

Member

As I mentioned elsewhere, float->Decimal32 probably needs to fall back on the approx algorithm.

I'm a bit surprised for Decimal64, though. Is the RoundedRightShift right?

Member Author

I reran the tests to confirm, and yes, it's only the Decimal32 case which runs into this problem (and only for float, since the double case falls back to the approx algorithm).

In the float case, Decimal32 won't hit the condition to fall back on the approx algorithm, since MaxPrecision == 9 and MantissaDigits is 8 (or 7 if I keep this change). In either case, using 7 for the mantissa digits is likely more accurate given that log10(2 ^ 24) ≈ 7.2; doubles, by contrast, use 16 because log10(2 ^ 53) ≈ 15.9, so it makes sense to round that up to 16.

The only alternative would be to just have Decimal32 always go to the approx algorithm even for the float case despite maxprecision being larger than the mantissa digits.
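For reference, the mantissa-digit figures quoted above can be checked directly. This is a sketch assuming the standard 24-bit float and 53-bit double significands:

```cpp
#include <cmath>

// Number of decimal digits representable in a binary significand of the
// given width: log10(2^bits) = bits * log10(2).
// float has a 24-bit significand, double has 53 bits.
double MantissaDecimalDigits(int significand_bits) {
  return significand_bits * std::log10(2.0);
}
// MantissaDecimalDigits(24) ~= 7.22, MantissaDecimalDigits(53) ~= 15.95
```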

Member Author

I've switched this back to 8 and just unconditionally send Decimal32 to the approx algorithm. Let me know if this is sufficient or if I should put it back.

Resolved review thread (outdated): cpp/src/arrow/util/decimal.cc
Comment on lines 165 to 167
constexpr int is_dec32_or_dec64 =
DecimalType::kByteWidth <= BasicDecimal64::kByteWidth;
const int mul_step = std::max(1, kMaxPrecision - precision - is_dec32_or_dec64);
Member

Hmm, why?

Member Author

With decimal32 and decimal64 I needed to reduce the mul_step by 1 in order to eliminate off-by-one rounding errors. I didn't want to unconditionally add extra operations to decimal128/256 by lowering their mul_step so I only did it for decimal32/64

Member

That's several things that need to be arbitrarily lowered, and sounds a bit unexpected.
Random hacks like this are a bad smell, especially if there's no detailed explanation other than "it works better".

Member

That said, if you don't want to debug these now, I would favor removing the hacks, adding temporary workarounds in the tests, and opening an issue together with value(s) reproducing the problem. Then this can be investigated (by you, me or @bkietz for example :-)).

Member Author

It's not really that unexpected honestly. The algorithm explicitly states that it can cause off-by-1 rounding errors because of the interleaved multiplication and rounded division that we're doing.

// NOTE: if precision is the full precision then the algorithm will
// lose the last digit. If precision is almost the full precision,
// there can be an off-by-one error due to rounding.

In the case where we need this, we are using a precision of 16 for Decimal64, which is "almost" the full precision of 18 and thus hits the aforementioned "off-by-one error due to rounding" mentioned in the existing comment. By forcing the multiplication step to 1 instead of 2 in this case, we eliminate the off-by-one error (though we'd likely still run into it at precision 17 or 18, which would already be using the minimum multiplication step of 1).
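To make the effect concrete, here is a sketch of the quoted mul_step computation, assuming kMaxPrecision is 18 for Decimal64 and the is_dec32_or_dec64 flag evaluates to 1 for the narrow types:

```cpp
#include <algorithm>

// Sketch of the mul_step computation from the quoted diff: for the narrow
// decimal types (32/64-bit) the step is reduced by one, clamped to a
// minimum of 1.
int MulStep(int max_precision, int precision, int is_dec32_or_dec64) {
  return std::max(1, max_precision - precision - is_dec32_or_dec64);
}
// For Decimal64 at precision 16: MulStep(18, 16, 1) == 1, whereas without
// the adjustment MulStep(18, 16, 0) == 2, which triggers the off-by-one
// rounding error described above.
```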

That said, the issue that @bkietz filed to possibly replace a lot of this logic with fewer steps (using multiplications by 5 etc.) could potentially alleviate this issue directly also.

If we really don't like this and are okay with the off-by-one rounding error mentioned in the comment, then I can adjust the tests to simply not test this case which has the off-by-one error for Decimal64.

What do you think?

Member Author

I've adjusted the tests to simply not hit this case anymore with the off-by-one rounding issue for decimal64.

Let me know if you prefer that to the above hack with explanation.
