
GH-43956: [C++][Format] Add initial Decimal32/Decimal64 implementations #43957

Open

zeroshade wants to merge 49 commits into main
Conversation

zeroshade (Member) commented Sep 4, 2024

Rationale for this change

Widening the Decimal128/256 types to allow bit widths of 32 and 64 enables more interoperability with other libraries and utilities that already support these types. This provides even more opportunities for zero-copy interaction with projects such as libcudf and various databases.

What changes are included in this PR?

This PR contains the basic C++ implementations for the Decimal32/Decimal64 types, arrays, builders, and scalars. It also includes the minimum necessary to get everything compiling and tests passing without also extending the Acero kernels and Parquet handling (both of which will be handled in follow-up PRs).

Are these changes tested?

Yes, tests were extended where applicable to add decimal32/decimal64 cases.

Are there any user-facing changes?

Currently, if a user uses decimal(precision, scale) rather than decimal128(precision, scale), they get a Decimal128Type if the precision is <= 38 (the maximum precision for Decimal128) and a Decimal256Type if the precision is higher. Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 factory functions results in the following behavior:

  • for precisions [1 : 9] => Decimal32Type
  • for precisions [10 : 18] => Decimal64Type
  • for precisions [19 : 38] => Decimal128Type
  • for precisions [39 : 76] => Decimal256Type

Many of our tests assumed that a decimal with low precision would be Decimal128 and had to be updated; users making the same assumption may be initially surprised by this change.
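The mapping above can be sketched as a small standalone helper. Note that `SmallestDecimalName` is a hypothetical illustration for this discussion, not the actual Arrow factory, which returns `std::shared_ptr<DataType>` instances:

```cpp
#include <string>

// Hypothetical helper mirroring the precision ranges listed above.
// It just names the type that decimal(precision, scale) would now produce.
std::string SmallestDecimalName(int precision) {
  if (precision >= 1 && precision <= 9) return "Decimal32Type";
  if (precision <= 18) return "Decimal64Type";
  if (precision <= 38) return "Decimal128Type";
  if (precision <= 76) return "Decimal256Type";
  return "invalid precision";
}
```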


github-actions bot commented Sep 4, 2024

⚠️ GitHub issue #43956 has been automatically assigned in GitHub to PR creator.

Resolved review threads (outdated):
cpp/src/arrow/type.cc
cpp/src/arrow/array/builder_dict.h
cpp/src/arrow/compute/kernels/codegen_internal.h
@@ -171,7 +171,8 @@ using PrimitiveArrowTypes =
using TemporalArrowTypes =
::testing::Types<Date32Type, Date64Type, TimestampType, Time32Type, Time64Type>;

-using DecimalArrowTypes = ::testing::Types<Decimal128Type, Decimal256Type>;
+using DecimalArrowTypes =
+    ::testing::Types</*Decimal32Type, Decimal64Type,*/ Decimal128Type, Decimal256Type>;
Member

Ditto here. (Should we file issues to come back to these?)

Member Author

These are commented out because we didn't implement casting for the new decimal types. This is mentioned in the issue as check boxes to do rather than as an entirely separate issue currently.

Member

But it's going to be a separate PR, right?

Member Author

Yes, I didn't want to make this already large PR even larger. I'll implement the cast kernels and so on in a follow-up PR.

pitrou (Member) commented Sep 5, 2024

> Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality

I'm afraid this may massively break user code. I would suggest another approach:

  • deprecate the decimal() factory while keeping its current behavior of always returning at least decimal128
  • introduce a new smallest_decimal() factory that is documented to return the smallest possible type, and explicitly makes no guarantees about the stability of the return type

wgtmac (Member) commented Sep 5, 2024

> Following the same pattern, this change means that using decimal(precision, scale) instead of the specific decimal32/decimal64/decimal128/decimal256 functions results in the following functionality
>
> I'm afraid this may massively break user code. I would suggest another approach:
>
>   • deprecate the decimal() factory while keeping its current behavior of always returning at least decimal128
>   • introduce a new smallest_decimal() factory that is documented to return the smallest possible type, and explicitly makes no guarantees about the stability of the return type

I just have the same concern. +1 on the proposed workaround.

zeroshade (Member Author)

@pitrou @bkietz @wgtmac I've updated this based on the suggestion: created a smallest_decimal function and added a deprecation note to the docstring for decimal.

zeroshade (Member Author)

@pitrou can you take another pass here? I believe I've addressed all of your comments. Thanks!

pitrou (Member) left a comment

Some more comments. Thanks for the updates!

}

#endif // __MINGW32__

-TEST(Decimal128Test, TestFromBigEndian) {
+TEST(Decimal32Test, TestFromBigEndian) {
Member

It seems a number of tests are mostly or exactly identical between the Decimal tests, perhaps you can write a generic version and call it for each concrete test type?

Member Author

Added generic versions for most of them. The LeftShift/RightShift and Negate tests are sufficiently different in their values that it's harder to make them generic.

Five resolved review threads on cpp/src/arrow/util/decimal_test.cc (two outdated).
// ceil(log10(2 ^ kMantissaBits))
static constexpr int kMantissaDigits = 8;
// log10(2 ^ kMantissaBits) ~= 7.2, let's be conservative to ensure more accuracy
// with our conversions for Decimal values
Member

Did it come up in some of the tests otherwise?

Member Author

Yes, we had a persistent off-by-one rounding issue with Decimal32 and Decimal64 in the FromReal tests, which this fixed with no detrimental effect on Decimal128/256.

Member

As I mentioned elsewhere, float->Decimal32 probably needs to fall back on the approx algorithm.

I'm a bit surprised for Decimal64, though. Is the RoundedRightShift right?

Member Author

I reran the tests to confirm, and yes, it's only the Decimal32 case which runs into this problem (and only for float, since the double case falls back to the approx algorithm).

In the float case, Decimal32 won't hit the condition to fall back on the approx algorithm, since MaxPrecision == 9 and MantissaDigits is 8 (or 7 if I keep this change). In either case, using 7 for the mantissa digits is likely more accurate given that log10(2 ^ 24) ≈ 7.2; doubles, by contrast, use 16 because log10(2 ^ 53) ≈ 15.9, so it makes sense to round that up to 16.

The only alternative would be to just have Decimal32 always go to the approx algorithm even for the float case despite maxprecision being larger than the mantissa digits.
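For reference, the mantissa-digit figures quoted above can be checked directly. This is a sketch assuming the standard 24-bit float and 53-bit double significands:

```cpp
#include <cmath>

// Number of decimal digits representable in a binary significand of the
// given width: log10(2^bits) = bits * log10(2).
// float has a 24-bit significand, double has 53 bits.
double MantissaDecimalDigits(int significand_bits) {
  return significand_bits * std::log10(2.0);
}
// MantissaDecimalDigits(24) ~= 7.22, MantissaDecimalDigits(53) ~= 15.95
```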

Member Author

I've switched this back to 8 and just unconditionally send Decimal32 to the approx algorithm. Let me know if this is sufficient or if I should put it back.

Resolved review thread (outdated): cpp/src/arrow/util/decimal.cc
Comment on lines 165 to 167
constexpr int is_dec32_or_dec64 =
DecimalType::kByteWidth <= BasicDecimal64::kByteWidth;
const int mul_step = std::max(1, kMaxPrecision - precision - is_dec32_or_dec64);
Member

Hmm, why?

Member Author

With decimal32 and decimal64 I needed to reduce the mul_step by 1 in order to eliminate off-by-one rounding errors. I didn't want to unconditionally add extra operations to decimal128/256 by lowering their mul_step so I only did it for decimal32/64

Member

That's several things that need to be arbitrarily lowered, and sounds a bit unexpected.
Random hacks like this are a bad smell, especially if there's no detailed explanation other than "it works better".

Member

That said, if you don't want to debug these now, I would favor removing the hacks, adding temporary workarounds in the tests, and opening an issue together with value(s) reproducing the problem. Then this can be investigated (by you, me or @bkietz for example :-)).

Member Author

It's not really that unexpected honestly. The algorithm explicitly states that it can cause off-by-1 rounding errors because of the interleaved multiplication and rounded division that we're doing.

// NOTE: if precision is the full precision then the algorithm will
// lose the last digit. If precision is almost the full precision,
// there can be an off-by-one error due to rounding.

In the case where we need this, we are using a precision of 16 for Decimal64, which is "almost" the full precision of 18 and thus hits the aforementioned "off-by-one error due to rounding" mentioned in the existing comment. By forcing the multiplication step to 1 instead of 2 in this case, we eliminate the off-by-one error (though we'd likely still run into it at precision 17 or 18, which would already be using the minimum multiplication step of 1).
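To make the effect concrete, here is a sketch of the quoted mul_step computation, assuming kMaxPrecision is 18 for Decimal64 and the is_dec32_or_dec64 flag evaluates to 1 for the narrow types:

```cpp
#include <algorithm>

// Sketch of the mul_step computation from the quoted diff: for the narrow
// decimal types (32/64-bit) the step is reduced by one, clamped to a
// minimum of 1.
int MulStep(int max_precision, int precision, int is_dec32_or_dec64) {
  return std::max(1, max_precision - precision - is_dec32_or_dec64);
}
// For Decimal64 at precision 16: MulStep(18, 16, 1) == 1, whereas without
// the adjustment MulStep(18, 16, 0) == 2, which triggers the off-by-one
// rounding error described above.
```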

That said, the issue that @bkietz filed to possibly replace a lot of this logic with fewer steps (using multiplications by 5 etc.) could potentially alleviate this issue directly also.

If we really don't like this and are okay with the off-by-one rounding error mentioned in the comment, then I can adjust the tests to simply not test this case which has the off-by-one error for Decimal64.

What do you think?

Member Author

I've adjusted the tests to simply not hit this case anymore with the off-by-one rounding issue for decimal64.

Let me know if you prefer that to the above hack with explanation.
