-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-36502: Correct documentation of str.isspace
.
#15019
Conversation
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the `isspace` test a thorough check that the implementation agrees with the intended definition.
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. You can check yourself to see if the CLA has been received. Thanks again for your contribution, we look forward to reviewing it! |
I think this would require an issue in the tracker. There are several issues regarding definition of whitespace and usage. Some isspace related issues : |
Ah, thanks for those links! Indeed this looks like exactly the documentation bug this patch fixes. This is a proposal to change the behavior of That will represent a change in behavior; I'm not sure what that means for releasing such a change. It's also significant work (as we don't currently parse that part of the Unicode database at all), which perhaps partly explains why it hasn't been done in the 6 years since it was proposed. For any release before the change to use |
Lib/test/test_unicode.py
Outdated
@@ -617,7 +618,14 @@ def test_isspace(self): | |||
self.checkequalnofix(True, '\u2000', 'isspace') | |||
self.checkequalnofix(True, '\u200a', 'isspace') | |||
self.checkequalnofix(False, '\u2014', 'isspace') | |||
# apparently there are no non-BMP spaces chars in Unicode 6 | |||
for i in range(0x10000): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have a constant declared at the top level of the module.
Ah, good idea. I think in fact perhaps I'll make the range itself a constant, and then be able to write:
for codepoint in BMP_CODEPOINTS:
char = chr(codepoint)
because that seems to get at the meaning very well.
(Or perhaps for char in bmp_chars():
. OTOH I just spent a few minutes trying to see a clean way to make it for char in BMP_CHARS:
, and couldn't find one.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not testing the full range of the UCS? range(0x110000)
Well, the most causal answer to "why" 🙂 is that I first added this as a test in test_unicodedata.py
before seeing the existing test_isspace
here -- and two of the central tests there (running through the whole database and computing overall checksums) have a loop much like this one, so I followed their lead.
Those two tests (test_method_checksum
and test_function_checksum
) look to be substantially unchanged since they were first added in 6a20ee7 back in 2000. They probably should be updated to go through the full range of codepoints; I think back in 2000 this was the full range of codepoints.
(Hmm... as I look closer at what test_method_checksum
in particular does, it's really all about methods on str
, and doesn't actually mention unicodedata
. Perhaps it should move to test_unicode.py
.)
The one real cost of looping through all planes (vs. just the BMP) is test runtime -- the test takes about 50ms for just the BMP (on my fast desktop) but 900ms for everything. Fast test suites are precious, so in test_isspace
it seems a bit of a shame to spend almost a second looping through values where nothing at all is expected to happen. Well worth it for the loops in test_unicodedata
, though, where there's plenty happening the whole way along.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)
Just pushed an update making both of these changes. Thanks for the review!
I was about to edit the existing two loops in test_unicodedata
to run the full range too, but then realized that that requires updating the checksum. I feel like that should be a separate change and not mixed in with this one. So I'll keep that around as a commit on a branch on top of this one; I can send it right after this PR is complete, or also happy to squash it in here if you think that'd be preferable.
Lib/test/test_unicodedata.py
Outdated
@@ -102,6 +106,14 @@ def test_function_checksum(self): | |||
result = h.hexdigest() | |||
self.assertEqual(result, self.expectedchecksum) | |||
|
|||
def test_isspace_invariant(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().
You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test with:
@support.requires_resource('cpu')
So it will be skipped on slow CIs but only run on fast CIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().
Sure; I've waffled on that myself. Will move it back to there.
(Note that the existing test at the top of test_unicodedata.py similarly doesn't mention the unicodedata module. Perhaps it'd be good to move it to test_unicode.py too, though I think not as part of this PR.)
You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test [...]
Ah, thanks! Will do.
Lib/test/test_unicodedata.py
Outdated
@@ -14,6 +14,10 @@ | |||
encoding = 'utf-8' | |||
errors = 'surrogatepass' | |||
|
|||
def all_chars(): | |||
'''Each Unicode codepoint, as a one-character string.''' | |||
for codepoint in range(0x110000): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, I only expected a constant, not a whole function. I'm not sure that such function is needed, since it's only called once. I suggest to remove the function, and use sys.maxunicode rather than 0x110000. So the test will be updated automatically if sys.maxunicode is increased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it a generator because I think that makes the abstraction a bit crisper: the test just says for char in all_chars()
, and the meaning is it iterates through all possible characters. But open-coding the loop works too; I can change it.
So the test will be updated automatically if sys.maxunicode is increased.
I doubt very much that sys.maxunicode will ever be changed in the future. 😉 Expanding Unicode beyond the original 16 bits was traumatic enough -- it's still a widespread cause of bugs, decades later -- and when they did, they gave themselves quite a lot of safety margin.
If you'd find the code clearer to read that way, though, that'd certainly be a good reason.
FWIW I find the hex constant somewhat clearer. One reason is that I don't have to wonder about the fencepost -- note that the correct version would be range(sys.maxunicode + 1)
.
As one form of data, a quick grep finds the existing usage in the codebase leans toward the explicit hex:
$ git grep maxunicode | wc -l
31
$ git grep -ie 0x10ffff -e 0x110000 | wc -l
100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt very much that sys.maxunicode will ever be changed in the future.
It's not just about being future-proof: a named constant also helps to better self-explain the code, it helps the readability. I always wanted to replace hardcoded constants with named constants: I just did it in PR #15067.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. In this case, as I said I find the explicit literal somewhat clearer; but I definitely think that's the important criterion here, and I don't mind changing it.
@vstinner I just pushed another update:
Thanks again for the review! I wrote the explicit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart the hardcoded 0x110000 constant (so I don't approve the change yet, sorry!).
Lib/test/test_unicode.py
Outdated
for ch in ['\U00010401', '\U00010427', '\U00010429', '\U0001044E', | ||
'\U0001F40D', '\U0001F46F']: | ||
self.assertFalse(ch.isspace(), '{!a} is not space.'.format(ch)) | ||
|
||
@support.requires_resource('cpu') | ||
def test_isspace_invariant(self): | ||
for codepoint in range(0x110000): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to use sys.maxunicode + 1 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK; done!
str.isspace
.str.isspace
.
@vstinner Ping -- I believe this will be quick for you to act on, because I made the small change you requested in the last round. I see you're back online this week, so replying just to make sure this returns to your inbox 🙂 |
I'm having trouble backporting to |
Sorry @gnprice and @vstinner, I had trouble checking out the |
@gnprice: The bot failed to automatically backport the change to Python 3.8. Can you please try to create a PR for Python 3.8? |
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.
GH-15296 is a backport of this pull request to the 3.8 branch. |
Thanks @vstinner for the reviews and the merge!
Done. Turned out to be an easy merge conflict in the imports block. |
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.
…ythonGH-15296) The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition. (cherry picked from commit 8c1c426) Co-authored-by: Greg Price <gnprice@gmail.com>
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition. (cherry picked from commit 8c1c426) Co-authored-by: Greg Price <gnprice@gmail.com>
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.
The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.
The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.
Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.
Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so. The
unicodedata
module documentation is anappropriate central place for our references to Unicode's own copious
documentation, so point there.
Also add to the
isspace
test a thorough check that theimplementation agrees with the intended definition.
https://bugs.python.org/issue36502