bpo-37802: Slightly improve performance of PyLong_FromSize_t() #15192
Conversation
Misc/NEWS.d/next/Core and Builtins/2019-08-09-18-28-57.bpo-37802.pKxcAW.rst
LGTM
BTW, I think the behavior of … That's certainly not a reason not to merge this, though. I'll happily rebase my changes on top of this once it's in.
(Posted; see https://bugs.python.org/issue37812 and #15203.)
Force-pushed from 0ba7b3c to 9c52079.
This looks good to me... except it's not ambitious enough! 🙂 Details below.
(Happily the more ambitious version is only a line or two more.)
@mdickinson is there anything that should be done to get this merged?
Force-pushed from 9c52079 to b4c286f.
I decided to deduplicate …
Neat! This simplifies the code and it's faster (based on the measurements you posted in the issue thread) -- that's a nice combination. 😄
I'd be glad to see the signed versions get a similar treatment. (As a separate PR.) Might not deliver a performance win at the same level, since your hypothesis is that the speedup here is from cutting out the sign logic...
but it should be performance-neutral at worst, and it's plausible it could be an improvement by generating less code so it's friendlier to the cache. Also could speed things up just as a natural consequence of deduplicating the code, by making a single place to apply all the optimizations we have in one function or another, plus any new ones you think of while staring at it.
(And even if it does turn out only performance-neutral: I think opinions will vary, but for code deduplication at this scale I'd certainly be in favor.)
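To illustrate the deduplication idea discussed above, here is a hedged standalone sketch (the names, `fake_int` type, and 30-bit digit size are simplified stand-ins for illustration, not CPython's actual code): one core routine handles the unsigned conversion, the unsigned entry points become thin wrappers around it, and a signed variant only adds the sign handling on top.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type standing in for PyLongObject. */
typedef struct { int negative; int ndigits; } fake_int;

/* Core: convert an unsigned value; all callers funnel through here. */
static fake_int from_unsigned_core(unsigned long long v)
{
    fake_int r = {0, 1};
    while (v >>= 30)   /* 30-bit digits, as on 64-bit CPython builds */
        r.ndigits++;
    return r;
}

/* "PyLong_FromUnsignedLong"-style wrapper. */
static fake_int from_ulong(unsigned long v)
{
    return from_unsigned_core(v);
}

/* "PyLong_FromSize_t"-style wrapper: identical body, shared core. */
static fake_int from_size_t(size_t v)
{
    return from_unsigned_core(v);
}

/* A signed variant adds only the sign handling around the same core. */
static fake_int from_long(long long v)
{
    unsigned long long a = v < 0 ? 0ULL - (unsigned long long)v
                                 : (unsigned long long)v;
    fake_int r = from_unsigned_core(a);
    r.negative = (v < 0);
    return r;
}
```

The point is structural: every optimization applied to the core benefits all entry points at once, which is the "single place to apply all the optimizations" property mentioned above.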
Objects/longobject.c
Outdated
@@ -410,6 +414,30 @@ PyLong_FromUnsignedLong(unsigned long ival)
     return (PyObject *)v;
 }
+
+/* Same as above, but for unsigned long. */
Nit: better to avoid "Same as above". Partly because the "above" is already a few dozen lines away and so potentially a bit unclear... and more so because this kind of reference from a distance is highly vulnerable to getting broken by future changes to the code 🙂 .
If the other thing's description were long, a good solution would be to refer to it by name: "Like long_from_uint, but ...".
In this case the other description is only a line long in itself, so it's simplest just to copy the common part: "Create a new int object from a C unsigned long".
(Maybe even skip the word "C"? I think that really makes sense only for PyLong_FromLong, where there's a Python type that (used to, and in some ways still) has the same name as the C type.)
Might make sense to simply remove these comments, because it's quite obvious that they do the same thing as long_from_uint().
That's true, but then you have to go look up what long_from_uint does. :-)
These functions are part of the API exposed outside this module, so in general they should have a doc-comment to explain what their intended meaning is. In this case the meaning is pretty simple, and almost self-explanatory from the name... but that just means that a very short doc-comment is enough. Which leaves little to be gained from skipping it.
Non-existing comments never lie and don't require updates 🙂.
> Non-existing comments never lie and don't require updates 🙂.
Between the choice of keeping the comment as is or removing it, I would support removing it more. But, I think it would be useful to readers to include @gnprice's suggestion (removing the "C" part as well):
Suggested change:
- /* Same as above, but for unsigned long. */
+ /* Create a new int object from an unsigned long. */
I think it's fairly common for developers to look at the implementation for a single function in isolation rather than reading the entire file at once, so it's not always optimal to rely on comments in earlier sections for context. To some degree, each function should aim to explain itself.
Wise words!
Yet I think for doc-comments the risk is worth it. If the comment simply describes the interface of the function, i.e. the things a caller needs to know about it... then if that ever needs an update, there will be much bigger burdens than updating the comment. Especially for functions in the public API like these.
But that is why I would like the comments to describe the interface directly (like they do in master), rather than have a dependency on something more fluid like the arrangement of this code. :-)
The current approach using an inline function has the same problems as described here: https://bugs.python.org/issue38015. I'll rework it once #15718 is merged.
Hmm, you measured this as being a significant speedup in your microbenchmark, right? That hardly seems like a problem 😉 even if there's a way to make the microbenchmark go even faster still.
I like this PR as it is 🙂 and I'd be glad to see it merged. It actually simplifies the code at the same time as optimizing it, so there's no need to weigh one value against another.
As I commented over on #15718, turning … If that is your plan, then it's also basically additive upon the change in this PR. So I think it'd still be good to merge this PR, even if a subsequent PR goes on to turn the function into a macro.
It induces the same problems as your is_small_int() function on 32-bit platforms.
Right, but that "problem" is that a few extra instructions are emitted, right? And so the microbenchmark is a few percent slower than it would be. If the microbenchmark is nevertheless faster than before, then this is still an optimization.
https://bugs.python.org/msg351052 I leave it to you to benchmark it on a 32-bit platform.
Hmm -- do you mean you haven't tried it on a 32-bit platform? Because you said that it had a problem there, I assumed that meant you'd seen something empirically. If you have, I think it would be helpful to say what results you got.
I mean I checked how it compiles on godbolt.org, and that was sufficient for me not to benchmark it.
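For what it's worth, the empirical route need not be heavy. Here is a hedged sketch of a minimal timing harness of the sort that could settle the 32-bit question by measurement rather than by reading assembly (the measured function is an illustrative stand-in, not the PR's code):

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Global sink so the compiler cannot discard the loop's work. */
static unsigned long long sink;

/* Stand-in for the code under test. */
static unsigned long long work(unsigned long long v)
{
    return v * 2654435761ULL + 1;
}

/* Time `iters` calls and return elapsed CPU seconds. */
static double time_loop(unsigned long long iters)
{
    clock_t t0 = clock();
    unsigned long long acc = 0;
    for (unsigned long long i = 0; i < iters; i++)
        acc += work(i);
    sink = acc;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

For real comparisons one would pin the CPU frequency, repeat runs, and take the minimum (as pyperf does), but even a crude harness like this gives a number to argue about instead of instruction counts.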
Cool, got it. Well -- as I said upthread, if this PR were to end up making something as big as … If you can find a way to get all the desired speedups without that kind of complexity in the code, that'd be great. As I see it, I would also be glad to see a PR version merged that did something like … because that's good for the source code and no worse than a wash on performance. (Note also that anyone who's working to squeeze out the last drop of performance in running their code is much more likely to be using a 64-bit platform already.)
Agreed, I don't think anyone using a 32-bit platform is going to be overly concerned about a 5% drop in performance. A more significant performance loss may be an issue, but not when it's that small. IMO, an equivalent or larger increase in the performance for 64-bit outweighs an equal or smaller loss for 32-bit (within a reasonable amount).
I agree it'd be great to write some generic function, but C doesn't allow writing type-generic functions, and using an inline function here induces unwanted type casts that make the emitted code less efficient in unpredictable ways. Such an inline function creates a false feeling that it works like a type-generic function (as it was with …). There are platforms besides x86, and I don't want to spend time assessing how an inline function degrades performance on each of these platforms -- do you? (update):
Where did you get this ~5%? Did you run a benchmark?
Here's an inline function vs. macro demo: https://godbolt.org/z/K5tbF2.
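To make the type-cast point concrete, here is a hedged standalone sketch (illustrative names and threshold, not CPython's actual is_small_int): an inline function fixes its parameter type, so narrower arguments are widened at every call site, whereas a function-like macro is expanded in the operand's own type.

```c
#include <assert.h>

/* Inline helper: the parameter type is fixed, so a 32-bit argument is
 * converted to unsigned long long at every call site. On a 32-bit
 * target that conversion can cost extra instructions for the 64-bit
 * compare, even when the caller's value is known to be 32 bits wide. */
static inline int is_small_inline(unsigned long long v)
{
    return v < 257ULL;
}

/* Macro: expanded with the caller's own operand type, so no widening
 * conversion is forced on the caller. The trade-off is the usual macro
 * hygiene risk (multiple evaluation of `v`, no type checking). */
#define IS_SMALL_MACRO(v) ((v) < 257)
```

This is the shape of the disagreement above: the macro reliably avoids the widening, while a sufficiently smart compiler may (but is not required to) optimize the inline version down to the same code.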
Force-pushed from 0d33a8e to d0fdead.
Do you mean x86 (32-bit), or x32 (64-bit integers, but 32-bit pointers)?
@vstinner x86 (32-bit).
(That's from here: … Just as you quoted me saying: it's what you found for ….)
Consider this the other way around: I don't think it's a good use of time to investigate what small unnecessary bits of work each compiler on each platform emits, or to go to great lengths in our code to coax them into doing a bit better. (Remember that there's absolutely nothing in the language that means a compiler can't emit the exact same code in the static-function case as you're seeing in the macro case -- it's purely a matter of how smart the compiler gets. See e.g. this demo: #15216 (comment).)
That can be well worth it when the payoff is large -- when there's an opportunity for a significant win on ordinary Python code. (Or, as a good proxy for that: a large win on code that shows up in profiles of ordinary Python code.) But many-line macros and other code complications have a real cost, and the payoff has to be worth the cost.
Compilers will emit less-than-optimal code. That's a fact of life, no matter what we do to our code. We have to keep going anyway and write the code in a way that helps it make sense to us as humans. When the compiler emits code that's much slower than it could be, in a place where that significantly matters, that's one thing... but if we insisted on squeezing the last cycle out of every function, we'd never get anything else done.
This 5% difference is the cost of 2 extra instructions on x86_64, and if you check the demo, you can see that in this case there is a more substantial difference (53 vs. 38 instructions, some of them in a loop).
This code is trivial; making it a macro doesn't make it less trivial. It makes it less "beautiful", but this code was modified once in the last 10 years, so I doubt there are many people interested in reading it.
My PR provides a speed-up by eliminating these lines (Objects/longobject.c, lines 390 to 391 at c59295a):
I guess the person who wrote these lines many years ago was thinking the same way as you; years have passed, but compilers are still not so smart.
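The shape of the eliminated work can be sketched as follows. This is a hedged, simplified stand-in (not the actual longobject.c source): the generic signed conversion pays for a sign test and a negation that a dedicated unsigned entry point never needs.

```c
#include <assert.h>

/* Signed path: must test the sign and take the absolute value before
 * splitting the value into 30-bit digits. */
static int count_digits_signed(long long ival)
{
    unsigned long long abs_ival;
    int negative = 0;
    if (ival < 0) {                 /* the test and negation that the  */
        negative = 1;               /* unsigned path skips entirely    */
        abs_ival = 0ULL - (unsigned long long)ival;
    }
    else {
        abs_ival = (unsigned long long)ival;
    }
    (void)negative;
    int n = 1;
    while (abs_ival >>= 30)
        n++;
    return n;
}

/* Unsigned path: no sign test, no negation -- just the digit loop. */
static int count_digits_unsigned(unsigned long long ival)
{
    int n = 1;
    while (ival >>= 30)
        n++;
    return n;
}
```

Routing unsigned callers through the signed logic makes the compiler carry dead sign handling; giving them their own path is where the measured speedup plausibly comes from.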
LGTM, but I have one last request.
Force-pushed from d0fdead to 62d3ba8.
Thanks @sir-sigurd, that's a nice micro-optimization ;-)
https://bugs.python.org/issue37802