
bpo-37802: Slightly improve performance of PyLong_FromSize_t() #15192

Merged: 1 commit into python:master on Sep 12, 2019

Conversation

@sir-sigurd (Contributor) commented Aug 9, 2019

@mdickinson (Member) left a comment:

LGTM

@gnprice (Contributor) commented Aug 10, 2019

BTW, I think the behavior of CHECK_SMALL_INT with its implicit return (and similarly the new CHECK_SMALL_UINT) is pretty surprising when reading the code. I wrote up a patch the other day that replaces it with an explicit return; I'll take this as a prompt to go make an issue for it and send a PR.

That's certainly not a reason not to merge this, though. I'll happily rebase my changes on top of this once it's in.

(Posted; see https://bugs.python.org/issue37812 and #15203 .)
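
For readers following along, this is roughly the shape of the macro being discussed (a simplified sketch, not a verbatim copy of Objects/longobject.c): the return statement buried inside the expansion is what lets a bare `CHECK_SMALL_INT(ival);` at a call site silently exit the enclosing function.

```c
/* Simplified sketch of the pattern under discussion; the names follow
   Objects/longobject.c, but details may differ from the exact source.
   Note the `return` hidden inside the expansion: the macro can make the
   *calling* function return a cached small int. */
#define CHECK_SMALL_INT(ival) \
    do if (-NSMALLNEGINTS <= (ival) && (ival) < NSMALLPOSINTS) { \
        return get_small_int((sdigit)(ival)); \
    } while (0)
```

The explicit-return alternative mentioned above would spell the condition and the return out at each call site (for example, `if (is_small_int(ival)) { return get_small_int((sdigit)ival); }`), so the early exit is visible where it actually happens.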

@gnprice (Contributor) left a comment:

This looks good to me... except it's not ambitious enough! 🙂 Details below.

(Happily the more ambitious version is only a line or two more.)

@sir-sigurd (Contributor, Author) commented:

@mdickinson is there anything that needs to be done to get this merged?

@aeros added the performance (Performance or resource usage) label on Aug 25, 2019
@sir-sigurd (Contributor, Author) commented:
I decided to deduplicate PyLong_FromUnsigned* functions in this PR, to apply this optimization to all of them at once.
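
To make "deduplicate" concrete, here is an illustrative sketch of the approach (not the exact diff in this PR; the thread below refers to the shared helper as long_from_uint, and the precise signature and internals in the PR may differ). The three public unsigned constructors funnel into one routine, so the small-int fast path and the digit-writing loop live in a single place:

```c
/* Illustrative sketch only -- not the code that was merged.  One shared
   helper builds the int object; the public functions become thin wrappers. */
static PyObject *
long_from_uint(unsigned long long ival)
{
    PyLongObject *v;
    unsigned long long t;
    int ndigits = 0;

    CHECK_SMALL_UINT(ival);     /* returns a cached small int when it applies */

    /* Count the PyLong_SHIFT-bit digits needed, then fill them in. */
    t = ival;
    while (t) {
        ++ndigits;
        t >>= PyLong_SHIFT;
    }
    v = _PyLong_New(ndigits);
    if (v != NULL) {
        digit *p = v->ob_digit;
        while (ival) {
            *p++ = (digit)(ival & PyLong_MASK);
            ival >>= PyLong_SHIFT;
        }
    }
    return (PyObject *)v;
}

PyObject *
PyLong_FromUnsignedLong(unsigned long ival)
{
    return long_from_uint(ival);
}

PyObject *
PyLong_FromUnsignedLongLong(unsigned long long ival)
{
    return long_from_uint(ival);
}

PyObject *
PyLong_FromSize_t(size_t ival)
{
    return long_from_uint(ival);
}
```

The catch debated later in the thread is the fixed unsigned long long parameter: on 32-bit targets, narrower arguments have to be widened before the loop runs, which is the inline-function-vs-macro question discussed below.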

@gnprice (Contributor) left a comment:

Neat! This simplifies the code and it's faster (based on the measurements you posted in the issue thread) -- that's a nice combination. 😄

I'd be glad to see the signed versions get a similar treatment (as a separate PR). It might not deliver a performance win at the same level, since your hypothesis is that the speedup here comes from cutting out the sign logic... but it should be performance-neutral at worst, and it's plausible it could be an improvement by generating less code and so being friendlier to the cache. It could also speed things up simply as a natural consequence of deduplicating the code, by creating a single place to apply all the optimizations we currently have scattered across the individual functions, plus any new ones you think of while staring at it.

(And even if it does turn out only performance-neutral: I think opinions will vary, but for code deduplication at this scale I'd certainly be in favor.)

@@ -410,6 +414,30 @@ PyLong_FromUnsignedLong(unsigned long ival)
return (PyObject *)v;
}

/* Same as above, but for unsigned long. */
@gnprice (Contributor) commented:

Nit: better to avoid "Same as above". Partly because the "above" is already a few dozen lines away and so potentially a bit unclear... and more so because this kind of reference from a distance is highly vulnerable to getting broken by future changes to the code 🙂 .

If the other thing's description were long, a good solution would be to refer to it by name: "Like long_from_uint, but ...".

In this case the other description is only a line long in itself, so it's simplest just to copy the common part: "Create a new int object from a C unsigned long".

(Maybe even skip the word "C"? I think that really makes sense only for PyLong_FromLong, where there's a Python type that (used to, and in some ways still) has the same name as the C type.)

@sir-sigurd (Contributor, Author) commented:

Might make sense to simply remove these comments, because it's quite obvious that they do the same thing as long_from_uint().

@gnprice (Contributor) commented:

That's true, but then you have to go look up what long_from_uint does. :-)

These functions are part of the API exposed outside this module, so in general they should have a doc-comment to explain what their intended meaning is. In this case the meaning is pretty simple, and almost self-explanatory from the name... but that just means that a very short doc-comment is enough. Which leaves little to be gained from skipping it.

@sir-sigurd (Contributor, Author) commented:

Non-existing comments never lie and don't require updates 🙂 .

@aeros (Contributor) commented Sep 9, 2019

@sir-sigurd:

> Non-existing comments never lie and don't require updates 🙂.

Between keeping the comment as is and removing it, I would lean toward removing it. But I think it would be useful to readers to include @gnprice's suggestion (removing the "C" part as well):

Suggested change:
- /* Same as above, but for unsigned long. */
+ /* Create a new int object from an unsigned long. */

I think it's fairly common for developers to look at the implementation for a single function in isolation rather than reading the entire file at once, so it's not always optimal to rely on comments in earlier sections for context. To some degree, each function should aim to explain itself.

@gnprice (Contributor) commented Aug 29, 2019 via email

@sir-sigurd (Contributor, Author) commented:

The current approach using an inline function has the same problems as described in https://bugs.python.org/issue38015. I'll rework it once #15718 is merged.

@gnprice (Contributor) commented Sep 9, 2019

> The current approach using an inline function has the same problems as described in https://bugs.python.org/issue38015. I'll rework it once #15718 is merged.

Hmm, you measured this as being a significant speedup in your microbenchmark, right? That hardly seems like a problem 😉 even if there's a way to make the microbenchmark go even faster still.

I like this PR as it is 🙂 and I'd be glad to see it merged. It actually simplifies the code at the same time as optimizing it, so there's no need to weigh one value against another.

As I commented over on #15718, turning get_small_int into a macro makes the code inside it quite a bit more complex to read. For this one... is the idea that you would write long_from_uint as a macro? That's a 20-line function, so that sounds like it would probably be worse :-/. I think that'd be much too big a cost to pay in the code for the payoff of this microbenchmark speedup.

If that is your plan, then it's also basically additive upon the change in this PR. So I think it'd still be good to merge this PR, even if a subsequent PR goes on to turn the function into a macro.

@sir-sigurd (Contributor, Author) commented:

> Hmm, you measured this as being a significant speedup in your microbenchmark, right? That hardly seems like a problem 😉 even if there's a way to make the microbenchmark go even faster still.

It induces the same problems as your is_small_int() function on 32-bit platforms.

@gnprice (Contributor) commented Sep 9, 2019

> It induces the same problems as your is_small_int() function on 32-bit platforms.

Right, but that "problem" is that a few extra instructions are emitted, right? And so the microbenchmark is a few percent slower than it otherwise would be.

If the microbenchmark is nevertheless faster than before, then this is still an optimization.

@sir-sigurd (Contributor, Author) commented Sep 9, 2019

> It induces the same problems as your is_small_int() function on 32-bit platforms.
>
> Right, but that "problem" is that a few extra instructions are emitted, right? And so the microbenchmark is a few percent slower than it otherwise would be.
>
> If the microbenchmark is nevertheless faster than before, then this is still an optimization.

https://bugs.python.org/msg351052

I leave it to you to benchmark it on a 32-bit platform.

@gnprice (Contributor) commented Sep 9, 2019

> I leave it to you to benchmark it on a 32-bit platform.

Hmm -- do you mean you haven't tried it on a 32-bit platform? Because you said that it had a problem there, I assumed that meant you'd seen something empirically.

If you have, I think it would be helpful to say what results you got.

@sir-sigurd (Contributor, Author) commented:

I mean I checked how it compiles on godbolt.org, and that was sufficient for me not to benchmark it.

@gnprice (Contributor) commented Sep 9, 2019

> I mean I checked how it compiles on godbolt.org, and that was sufficient for me not to benchmark it.

Cool, got it.

Well -- as I said upthread, if this PR were to end up making something as big as long_from_uint into a macro, I think that would be much more of a cost than the micro-optimization is worth.

If you can find a way to get all the desired speedups without that kind of complexity in the code, that'd be great.

As I see it, I would also be glad to see a version of this PR merged that did something like:

  • a nice simplification to the code (like your current version here)
  • ~10% speedup in microbenchmark on x86_64 (like you quoted for this version on the issue thread)
  • ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)

because that's good for the source code and no worse than a wash on performance.

(Note also that anyone who's working to squeeze out the last drop of performance in running their code is much more likely to be using a 64-bit platform already.)

@aeros (Contributor) commented Sep 9, 2019

@gnprice:

> Note also that anyone who's working to squeeze out the last drop of performance in running their code is much more likely to be using a 64-bit platform already.

Agreed; I don't think anyone using a 32-bit platform is going to be overly concerned about a 5% drop in performance. A more significant performance loss might be an issue, but not one that small. IMO, an equivalent or larger performance gain for 64-bit outweighs an equal or smaller loss for 32-bit (within reason).

@sir-sigurd (Contributor, Author) commented Sep 9, 2019

> If you can find a way to get all the desired speedups without that kind of complexity in the code, that'd be great.

I agree it'd be great to write some generic function, but C doesn't allow writing type-generic functions, and using an inline function here introduces unwanted type casts that make the emitted code less efficient in unpredictable ways. Such an inline function creates a false impression that it works like a type-generic function (as was the case with is_small_int()), but it doesn't.

There are platforms besides x86, and I don't want to spend time assessing how an inline function degrades performance on each of them. Do you?

(Update:)

> ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)

Where did you get this ~5%? Did you run a benchmark?

@sir-sigurd (Contributor, Author) commented:

Here's an inline function vs. macro demo: https://godbolt.org/z/K5tbF2.
On x86-32:

  • inline function version: 53 instructions
  • macro version: 38 instructions
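
For readers who don't want to open the godbolt link, here is a minimal self-contained sketch of the same idea (it is not the code behind that link): the inline function pins its parameter type to unsigned long long, so on a 32-bit target a size_t argument is widened and the loop is typically compiled as 64-bit arithmetic, while the macro keeps whatever type the caller passed in.

```c
#include <stddef.h>

/* Inline-function version: the operand type is fixed at
   unsigned long long, so a 32-bit size_t argument is widened first and
   the shift/mask loop is usually compiled as 64-bit arithmetic on
   x86-32 (a sufficiently clever compiler could still narrow it, which
   is part of the disagreement in this thread). */
static inline int
count_digits_func(unsigned long long v)
{
    int n = 0;
    while (v) {
        n++;
        v >>= 15;   /* stand-in for PyLong_SHIFT */
    }
    return n;
}

/* Macro version: `v` keeps the caller's type, so with a 32-bit size_t
   the same loop stays 32-bit. */
#define COUNT_DIGITS_MACRO(v, n)    \
    do {                            \
        (n) = 0;                    \
        while (v) {                 \
            (n)++;                  \
            (v) >>= 15;             \
        }                           \
    } while (0)

int digits_via_func(size_t x)  { return count_digits_func(x); }

int digits_via_macro(size_t x) { int n; COUNT_DIGITS_MACRO(x, n); return n; }
```

Whether that difference matters in practice is exactly the 32-bit cost/benefit question being argued in this thread.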

@vstinner (Member) commented Sep 9, 2019

> On x86-32

Do you mean x86 (32 bits) or x32 (64-bit integers, but 32-bit pointers)?

@sir-sigurd (Contributor, Author) commented:

@vstinner x86 (32 bits).

@gnprice (Contributor) commented Sep 10, 2019

> ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)
>
> Where did you get this ~5%? Did you run a benchmark?

(That's from here:
https://bugs.python.org/issue38015#msg351255

Just as you quoted me saying: it's what you found for get_small_ints as a function vs. as a macro.)

@gnprice (Contributor) commented Sep 10, 2019

> There are platforms besides x86, and I don't want to spend time assessing how an inline function degrades performance on each of them. Do you?

Consider this the other way around: I don't think it's a good use of time to investigate what small, unnecessary bits of work each compiler on each platform emits, or to go to great lengths in our code to coax them into doing a bit better.

(Remember that there's absolutely nothing in the language that means a compiler can't emit the exact same code in the static-function case as you're seeing in the macro case -- it's purely a matter of how smart the compiler gets. See e.g. this demo: #15216 (comment) )

That can be well worth it when the payoff is large -- when there's an opportunity for a significant win on ordinary Python code. (Or as a good proxy for that: a large win on code that shows up in profiles of ordinary Python code.) But many-line macros and other code complications have a real cost, and the payoff has to be worth the cost.

Compilers will emit less-than-optimal code. That's a fact of life, no matter what we do to our code. We have to keep going anyway and write the code in a way that makes sense to each other as humans.

When the compiler emits code that's much slower than it could be, in a place where that significantly matters, that's one thing... but if we insisted on squeezing the last cycle out of every function, we'd never get anything else done.

@sir-sigurd (Contributor, Author) commented:

> ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)
>
> Where did you get this ~5%? Did you run a benchmark?
>
> (That's from here:
> https://bugs.python.org/issue38015#msg351255
>
> Just as you quoted me saying: it's what you found for get_small_ints as a function vs. as a macro.)

That 5% difference is the cost of 2 extra instructions on x86_64, and if you check the demo, you can see that in this case the difference is more substantial (53 vs. 38 instructions, and some of them are in a loop).

@sir-sigurd (Contributor, Author) commented Sep 10, 2019

> But many-line macros and other code complications have a real cost, and the payoff has to be worth the cost.

This code is trivial, and making it a macro doesn't make it less trivial. It makes it less "beautiful", but this code has been modified once in the last 10 years, so I doubt there are many people interested in reading it.

> (Remember that there's absolutely nothing in the language that means a compiler can't emit the exact same code in the static-function case as you're seeing in the macro case -- it's purely a matter of how smart the compiler gets. See e.g. this demo: #15216 (comment) )

My PR provides a speed-up by eliminating these lines:

    if (ival < PyLong_BASE)
        return PyLong_FromLong(ival);

I guess the person who wrote these lines many years ago was thinking the same way as you. Years have passed, but compilers are still not that smart.
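
For context, the eliminated branch sat in a function shaped roughly like this (a simplified reconstruction of the pre-patch PyLong_FromSize_t, not an exact copy; see the file history for the real code). Small values took a detour through the signed constructor, which repeats sign handling that an unsigned input never needs, and the digit loop below is essentially the same one duplicated in each PyLong_FromUnsigned* function, which is the duplication this PR collapses:

```c
/* Roughly the pre-patch shape (simplified reconstruction). */
PyObject *
PyLong_FromSize_t(size_t ival)
{
    PyLongObject *v;
    size_t t;
    int ndigits = 0;

    if (ival < PyLong_BASE) {
        /* Detour through the signed constructor, which re-checks the
           sign and the small-int range even though ival is unsigned;
           removing this branch is where the measured speed-up comes
           from. */
        return PyLong_FromLong((long)ival);
    }
    /* Otherwise count PyLong_SHIFT-bit digits and fill them in. */
    t = ival;
    while (t) {
        ++ndigits;
        t >>= PyLong_SHIFT;
    }
    v = _PyLong_New(ndigits);
    if (v != NULL) {
        digit *p = v->ob_digit;
        while (ival) {
            *p++ = (digit)(ival & PyLong_MASK);
            ival >>= PyLong_SHIFT;
        }
    }
    return (PyObject *)v;
}
```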

@vstinner (Member) left a comment:

LGTM, but I have one last request.

@gpshead merged commit c6734ee into python:master on Sep 12, 2019.
@sir-sigurd deleted the long-from-sizet-2 branch on September 12, 2019 at 15:06.
@vstinner (Member) commented:

Thanks @sir-sigurd, that's a nice micro-optimization ;-)

Labels: performance (Performance or resource usage)

8 participants