GH-43956: [C++][Format] Add initial Decimal32/Decimal64 implementations #43957
Conversation
```diff
@@ -171,7 +171,8 @@ using PrimitiveArrowTypes =
 using TemporalArrowTypes =
     ::testing::Types<Date32Type, Date64Type, TimestampType, Time32Type, Time64Type>;
 
-using DecimalArrowTypes = ::testing::Types<Decimal128Type, Decimal256Type>;
+using DecimalArrowTypes =
+    ::testing::Types</*Decimal32Type, Decimal64Type,*/ Decimal128Type, Decimal256Type>;
```
Ditto here. (Should we file issues to come back to these?)
These are commented out because casting hasn't been implemented for the new decimal types yet. That work is tracked as checkboxes in the issue rather than as an entirely separate issue currently.
But it's going to be a separate PR, right?
Yes, I didn't want to make this already-large PR even larger. I'll implement the cast kernels and so on in a follow-up PR.
I'm afraid this may massively break user code. I would suggest another approach:
I have the same concern. +1 on the proposed workaround.
@pitrou can you take another pass here? I believe I've addressed all of your comments. Thanks!
Some more comments. Thanks for the updates!
cpp/src/arrow/util/decimal_test.cc
```diff
 }
 
 #endif  // __MINGW32__
 
-TEST(Decimal128Test, TestFromBigEndian) {
+TEST(Decimal32Test, TestFromBigEndian) {
```
It seems a number of tests are mostly or exactly identical between the Decimal tests. Perhaps you can write a generic version and call it for each concrete test type?
Added generic versions for most of them. The LeftShift/RightShift and Negate tests are sufficiently different in their values that it's harder to make them generic.
```cpp
// ceil(log10(2 ^ kMantissaBits))
static constexpr int kMantissaDigits = 8;
// log10(2 ^ kMantissaBits) ~= 7.2, let's be conservative to ensure more accuracy
// with our conversions for Decimal values
```
Did it come up in some of the tests otherwise?
Yeah, we had a persistent off-by-one rounding issue with Decimal32 and Decimal64 for the `FromReal` tests which was fixed by this, and it had no detrimental effect on Decimal128/256.
As I mentioned elsewhere, float->Decimal32 probably needs to fall back on the approx algorithm.
I'm a bit surprised for Decimal64, though. Is the RoundedRightShift right?
I reran the tests to confirm, and yeah, it's only the Decimal32 case which runs into this problem (and only for float, since the double case falls back to the approx algorithm).
In the float case, Decimal32 won't hit the condition to fall back on the approx algorithm, as MaxPrecision == 9 and MantissaDigits is 8 (or 7 if I keep this change). In either case, using 7 for the mantissa digits is likely more accurate given that log10(2 ^ 24) is ~7.2, as opposed to doubles, which use 16 because log10(2 ^ 53) ~= 15.9, so it makes sense to round that up to 16.
The only alternative would be to just have Decimal32 always go to the approx algorithm even for the float case despite maxprecision being larger than the mantissa digits.
I've switched this back to 8 and just unconditionally send Decimal32 to the approx algorithm. Let me know if this is sufficient or if I should put it back.
cpp/src/arrow/util/decimal.cc
```cpp
constexpr int is_dec32_or_dec64 =
    DecimalType::kByteWidth <= BasicDecimal64::kByteWidth;
const int mul_step = std::max(1, kMaxPrecision - precision - is_dec32_or_dec64);
```
Hmm, why?
With decimal32 and decimal64 I needed to reduce the `mul_step` by 1 in order to eliminate off-by-one rounding errors. I didn't want to unconditionally add extra operations to decimal128/256 by lowering their `mul_step`, so I only did it for decimal32/64.
That's several things that need to be arbitrarily lowered, and it sounds a bit unexpected.
Random hacks like this are a bad smell, especially if there's no detailed explanation other than "it works better".
That said, if you don't want to debug these now, I would favor removing the hacks, adding temporary workarounds in the tests, and opening an issue together with value(s) reproducing the problem. Then this can be investigated (by you, me or @bkietz for example :-)).
It's not really that unexpected honestly. The algorithm explicitly states that it can cause off-by-1 rounding errors because of the interleaved multiplication and rounded division that we're doing.
```cpp
// NOTE: if `precision` is the full precision then the algorithm will
// lose the last digit. If `precision` is almost the full precision,
// there can be an off-by-one error due to rounding.
```
In the case where we need this, we're using a precision of 16 for Decimal64, which is "almost" the full precision of 18, and thus hitting the aforementioned "off-by-one error due to rounding" mentioned in the existing comment. By forcing the multiplication step to 1 instead of 2 in this case, we eliminate the off-by-one error (though we'd likely still run into it if we were using precision 17 or 18, which would already use the minimum multiplication step of 1).
That said, the issue that @bkietz filed to possibly replace a lot of this logic with fewer steps (using multiplications by 5 etc.) could potentially alleviate this issue directly also.
If we really don't like this and are okay with the off-by-one rounding error mentioned in the comment, then I can adjust the tests to simply not test this case which has the off-by-one error for Decimal64.
What do you think?
I've adjusted the tests to simply not hit this case anymore with the off-by-one rounding issue for decimal64.
Let me know if you prefer that to the above hack with explanation.
Rationale for this change
Widening the Decimal128/256 type to allow for bitwidths of 32 and 64 allows for more interoperability with other libraries and utilities which already support these types. This provides even more opportunities for zero-copy interactions between things such as libcudf and various databases.
What changes are included in this PR?
This PR contains the basic C++ implementations for the Decimal32/Decimal64 types, arrays, builders and scalars. It also includes the minimum necessary to get everything compiling and tests passing without also extending the Acero kernels and Parquet handling (both of which will be handled in follow-up PRs).
Are these changes tested?
Yes, tests were extended where applicable to add decimal32/decimal64 cases.
Are there any user-facing changes?
Currently, if a user is using `decimal(precision, scale)` rather than `decimal128(precision, scale)`, they will get a `Decimal128Type` if the precision is <= 38 (the max precision for Decimal128) and a `Decimal256Type` if the precision is higher. Following the same pattern, this change means that using `decimal(precision, scale)` instead of the specific `decimal32`/`decimal64`/`decimal128`/`decimal256` functions results in the following:

- precision <= 9: `Decimal32Type`
- precision 10 to 18: `Decimal64Type`
- precision 19 to 38: `Decimal128Type`
- precision 39 and up: `Decimal256Type`

While many of our tests currently assume that `decimal` with a low precision would be `Decimal128` and had to be updated, this may cause an initial surprise for users making the same assumption.