bpo-36502: Correct documentation of `str.isspace`. #15019

gnprice · 2019-07-30T07:59:50Z

The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so. The unicodedata module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace test a thorough check that the
implementation agrees with the intended definition.

https://bugs.python.org/issue36502

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the `isspace` test a thorough check that the implementation agrees with the intended definition.

the-knights-who-say-ni · 2019-07-30T07:59:53Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

tirkarthi · 2019-07-30T11:29:49Z

I think this would require an issue in the tracker. There are several issues regarding definition of whitespace and usage. Some isspace related issues :

mangrisano · 2019-07-30T11:30:09Z

/cc @vstinner @ezio-melotti @malemburg @benjaminp

gnprice · 2019-07-30T12:31:22Z

There are several issues regarding definition of whitespace and usage. Some isspace related issues :

Ah, thanks for those links!

https://bugs.python.org/issue36502

Indeed this looks like exactly the documentation bug this patch fixes.

https://bugs.python.org/issue18236

This is a proposal to change the behavior of isspace to use the Unicode White_Space property, which didn't yet exist when Python first gained Unicode support. That sounds like very much the right thing to me.

That will represent a change in behavior; I'm not sure what that means for releasing such a change. It's also significant work (as we don't currently parse that part of the Unicode database at all), which perhaps partly explains why it hasn't been done in the 6 years since it was proposed.

For any release before the change to use White_Space gets done, I think a doc fix like this one would be an improvement. The "Other" category is very far from either the actual or desired behavior.

vstinner · 2019-07-30T14:10:48Z

Lib/test/test_unicode.py

@@ -617,7 +618,14 @@ def test_isspace(self):
        self.checkequalnofix(True, '\u2000', 'isspace')
        self.checkequalnofix(True, '\u200a', 'isspace')
        self.checkequalnofix(False, '\u2014', 'isspace')
-        # apparently there are no non-BMP spaces chars in Unicode 6
+        for i in range(0x10000):


I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)

I would prefer to have a constant declared at the top level of the module.

Ah, good idea. I think in fact perhaps I'll make the range itself a constant, and then be able to write:

for codepoint in BMP_CODEPOINTS: char = chr(codepoint)

because that seems to get at the meaning very well.

(Or perhaps for char in bmp_chars():. OTOH I just spent a few minutes trying to see a clean way to make it for char in BMP_CHARS:, and couldn't find one.)

Why not testing the full range of the UCS? range(0x110000)

Well, the most causal answer to "why" 🙂 is that I first added this as a test in test_unicodedata.py before seeing the existing test_isspace here -- and two of the central tests there (running through the whole database and computing overall checksums) have a loop much like this one, so I followed their lead.

Those two tests (test_method_checksum and test_function_checksum) look to be substantially unchanged since they were first added in 6a20ee7 back in 2000. They probably should be updated to go through the full range of codepoints; I think back in 2000 this was the full range of codepoints.

(Hmm... as I look closer at what test_method_checksum in particular does, it's really all about methods on str, and doesn't actually mention unicodedata. Perhaps it should move to test_unicode.py.)

The one real cost of looping through all planes (vs. just the BMP) is test runtime -- the test takes about 50ms for just the BMP (on my fast desktop) but 900ms for everything. Fast test suites are precious, so in test_isspace it seems a bit of a shame to spend almost a second looping through values where nothing at all is expected to happen. Well worth it for the loops in test_unicodedata, though, where there's plenty happening the whole way along.

I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)

Just pushed an update making both of these changes. Thanks for the review!

I was about to edit the existing two loops in test_unicodedata to run the full range too, but then realized that that requires updating the checksum. I feel like that should be a separate change and not mixed in with this one. So I'll keep that around as a commit on a branch on top of this one; I can send it right after this PR is complete, or also happy to squash it in here if you think that'd be preferable.

vstinner · 2019-07-31T09:40:46Z

Lib/test/test_unicodedata.py

@@ -102,6 +106,14 @@ def test_function_checksum(self):
        result = h.hexdigest()
        self.assertEqual(result, self.expectedchecksum)

+    def test_isspace_invariant(self):


It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().

You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test with:

@support.requires_resource('cpu')

So it will be skipped on slow CIs but only run on fast CIs.

It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().

Sure; I've waffled on that myself. Will move it back to there.

(Note that the existing test at the top of test_unicodedata.py similarly doesn't mention the unicodedata module. Perhaps it'd be good to move it to test_unicode.py too, though I think not as part of this PR.)

You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test [...]

Ah, thanks! Will do.

vstinner · 2019-07-31T09:43:39Z

Lib/test/test_unicodedata.py

@@ -14,6 +14,10 @@
 encoding = 'utf-8'
 errors = 'surrogatepass'

+def all_chars():
+    '''Each Unicode codepoint, as a one-character string.'''
+    for codepoint in range(0x110000):


Hum, I only expected a constant, not a whole function. I'm not sure that such function is needed, since it's only called once. I suggest to remove the function, and use sys.maxunicode rather than 0x110000. So the test will be updated automatically if sys.maxunicode is increased.

I made it a generator because I think that makes the abstraction a bit crisper: the test just says for char in all_chars(), and the meaning is it iterates through all possible characters. But open-coding the loop works too; I can change it.

So the test will be updated automatically if sys.maxunicode is increased.

I doubt very much that sys.maxunicode will ever be changed in the future. 😉 Expanding Unicode beyond the original 16 bits was traumatic enough -- it's still a widespread cause of bugs, decades later -- and when they did, they gave themselves quite a lot of safety margin.

If you'd find the code clearer to read that way, though, that'd certainly be a good reason.

FWIW I find the hex constant somewhat clearer. One reason is that I don't have to wonder about the fencepost -- note that the correct version would be range(sys.maxunicode + 1).

As one form of data, a quick grep finds the existing usage in the codebase leans toward the explicit hex:

$ git grep maxunicode | wc -l 31 $ git grep -ie 0x10ffff -e 0x110000 | wc -l 100

I doubt very much that sys.maxunicode will ever be changed in the future.

It's not just about being future-proof: a named constant also helps to better self-explain the code, it helps the readability. I always wanted to replace hardcoded constants with named constants: I just did it in PR #15067.

Sounds good. In this case, as I said I find the explicit literal somewhat clearer; but I definitely think that's the important criterion here, and I don't mind changing it.

gnprice · 2019-08-01T05:00:39Z

@vstinner I just pushed another update:

* 578335f44 Move back to test_unicode; open-code loop; mark as uses-CPU.

Thanks again for the review!

I wrote the explicit range(0x110000) rather than range(sys.maxunicode + 1), for the reasons mentioned above. Happy to change it if you'd prefer, though.

vstinner

LGTM apart the hardcoded 0x110000 constant (so I don't approve the change yet, sorry!).

vstinner · 2019-08-01T10:26:43Z

Lib/test/test_unicode.py

        for ch in ['\U00010401', '\U00010427', '\U00010429', '\U0001044E',
                   '\U0001F40D', '\U0001F46F']:
            self.assertFalse(ch.isspace(), '{!a} is not space.'.format(ch))

+    @support.requires_resource('cpu')
+    def test_isspace_invariant(self):
+        for codepoint in range(0x110000):


I would prefer to use sys.maxunicode + 1 here.

gnprice · 2019-08-14T00:32:43Z

@vstinner Ping -- I believe this will be quick for you to act on, because I made the small change you requested in the last round.

I see you're back online this week, so replying just to make sure this returns to your inbox 🙂

miss-islington · 2019-08-14T11:05:23Z

Thanks @gnprice for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington · 2019-08-14T11:05:24Z

I'm having trouble backporting to 3.8. Reason: 'Error 110 while writing to socket. Connection timed out.'. Please retry by removing and re-adding the needs backport to 3.8 label.

miss-islington · 2019-08-14T11:05:38Z

Thanks @gnprice for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

miss-islington · 2019-08-14T11:05:41Z

Sorry @gnprice and @vstinner, I had trouble checking out the 3.8 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 6bccbe7dfb998af862a183f2c36f0d4603af2c29 3.8

vstinner · 2019-08-14T11:07:00Z

@gnprice: The bot failed to automatically backport the change to Python 3.8. Can you please try to create a PR for Python 3.8?

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

bedevere-bot · 2019-08-15T03:17:10Z

GH-15296 is a backport of this pull request to the 3.8 branch.

gnprice · 2019-08-15T03:18:09Z

Thanks @vstinner for the reviews and the merge!

Can you please try to create a PR for Python 3.8?

Done. Turned out to be an easy merge conflict in the imports block.

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

…ythonGH-15296) The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition. (cherry picked from commit 8c1c426) Co-authored-by: Greg Price <gnprice@gmail.com>

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition. (cherry picked from commit 8c1c426) Co-authored-by: Greg Price <gnprice@gmail.com>

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace. Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs. Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The `unicodedata` module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there. Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

the-knights-who-say-ni added the CLA not signed label Jul 30, 2019

bedevere-bot added tests Tests in the Lib/test dir awaiting review labels Jul 30, 2019

the-knights-who-say-ni added CLA signed and removed CLA not signed labels Jul 30, 2019

vstinner reviewed Jul 30, 2019

View reviewed changes

Cover all characters, using a name; move to test_unicodedata.

feeb2c6

vstinner reviewed Jul 31, 2019

View reviewed changes

Move back to test_unicode; open-code loop; mark as uses-CPU.

578335f

vstinner mentioned this pull request Aug 1, 2019

Use sys.maxunicode and MAX_UNICODE constants #15067

Closed

vstinner reviewed Aug 1, 2019

View reviewed changes

vstinner added needs backport to 3.8 only security fixes skip issue skip news labels Aug 1, 2019

Use sys.maxunicode + 1 rather than explicit 0x110000.

d8c347d

gnprice changed the title ~~Correct documentation of str.isspace.~~ bpo-36502: Correct documentation of str.isspace. Aug 3, 2019

vstinner removed the skip issue label Aug 14, 2019

vstinner merged commit 6bccbe7 into python:master Aug 14, 2019

bedevere-bot removed the awaiting review label Aug 14, 2019

vstinner added needs backport to 3.8 only security fixes and removed needs backport to 3.8 only security fixes labels Aug 14, 2019

miss-islington assigned vstinner Aug 14, 2019

gnprice deleted the isspace-doc branch August 15, 2019 03:15

bedevere-bot removed the needs backport to 3.8 only security fixes label Aug 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpo-36502: Correct documentation of `str.isspace`. #15019

bpo-36502: Correct documentation of `str.isspace`. #15019

gnprice commented Jul 30, 2019 •

edited by bedevere-bot

Loading

the-knights-who-say-ni commented Jul 30, 2019

tirkarthi commented Jul 30, 2019

mangrisano commented Jul 30, 2019

gnprice commented Jul 30, 2019

vstinner Jul 30, 2019

gnprice Jul 30, 2019

gnprice Jul 30, 2019

gnprice Jul 31, 2019

vstinner Jul 31, 2019

gnprice Aug 1, 2019

vstinner Jul 31, 2019

gnprice Aug 1, 2019

vstinner Aug 1, 2019

gnprice Aug 1, 2019

gnprice commented Aug 1, 2019

vstinner left a comment

vstinner Aug 1, 2019

gnprice Aug 1, 2019

gnprice commented Aug 14, 2019

miss-islington commented Aug 14, 2019

miss-islington commented Aug 14, 2019

miss-islington commented Aug 14, 2019

miss-islington commented Aug 14, 2019

vstinner commented Aug 14, 2019

bedevere-bot commented Aug 15, 2019

gnprice commented Aug 15, 2019

bpo-36502: Correct documentation of str.isspace. #15019

bpo-36502: Correct documentation of str.isspace. #15019

Conversation

gnprice commented Jul 30, 2019 • edited by bedevere-bot Loading

the-knights-who-say-ni commented Jul 30, 2019

tirkarthi commented Jul 30, 2019

mangrisano commented Jul 30, 2019

gnprice commented Jul 30, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice commented Aug 1, 2019

vstinner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnprice commented Aug 14, 2019

miss-islington commented Aug 14, 2019

miss-islington commented Aug 14, 2019

miss-islington commented Aug 14, 2019

miss-islington commented Aug 14, 2019

vstinner commented Aug 14, 2019

bedevere-bot commented Aug 15, 2019

gnprice commented Aug 15, 2019

bpo-36502: Correct documentation of `str.isspace`. #15019

bpo-36502: Correct documentation of `str.isspace`. #15019

gnprice commented Jul 30, 2019 •

edited by bedevere-bot

Loading