-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-28180: Implementation for PEP 538 #659
Conversation
- new PYTHONCOERCECLOCALE config setting - coerces legacy C locale to C.UTF-8, C.utf8 or UTF-8 by default TODO: - configure option to disable locale coercion at build time - configure option to disable C locale warning at build time - skip runtime locale warning on Mac OS X
* --with(out)-c-locale-coercion for PY_COERCE_C_LOCALE * --with(out)-c-locale-warning for PY_WARN_ON_C_LOCALE
Programs/python.c
Outdated
/* Set PYTHONIOENCODING if not already set */ | ||
if (setenv("PYTHONIOENCODING", "utf-8:surrogateescape", 0)) { | ||
fprintf(stderr, | ||
"Error setting PYTHONIOENCODING during C locale coercion\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may break old Python 2 in subprocess.
$ cat t.py
# encoding: utf-8
print(u"こんにちは")
$ /usr/bin/python2.6 t.py
こんにちは
$ export PYTHONIOENCODING=utf-8:surrogateescape
$ /usr/bin/python2.6 t.py
TypeError: an integer is required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, my environment was wrong.
I accidently test above in Lib/
directory in cpython's checkout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still raises a good question though, as that setting does affect Python 2 differently from the way it affects Python 3 - it changes the implicit encoding step on stdout, but stdin still relies on passing the raw bytes through without interpretation:
$ LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>>
$ PYTHONIOENCODING=utf-8:surrogateescape LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
こんにちは
>>> print("こんにちは")
こんにちは
>>>
It's also a potential problem that 'surrogateescape' doesn't exist in Python 2, so it may be better to just use Py_SetStandardStreamEncoding
in PEP 538, and leave enabling surrogateescape
in subprocesses as well to PEP 540 (via PYTHONUTF8=1
in the parent environment).
It also turns out that LANG=C python2
is an easy way to demonstrate that GNU readline just plain doesn't handle UTF-8 properly in the C locale - attempting to edit the print(u"こんにちは")
line at the interactive prompt to remove the u
prefix or add it back results in nonsense:
>>> print(u"こんにちは")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print(�こんにちは")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> print(u��にちは") ")
こ�u��にちは
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving out for the moment whether this is a good idea or not (we'll use the PEP process for that), I have some questions about the code.
Doc/using/cmdline.rst
Outdated
to skip coercing the legacy ASCII-based C locale to a more capable UTF-8 | ||
based alternative. Note that this setting is checked even when the | ||
:option:`-E` or :option:`-I` options are used, as it is handled prior to | ||
the processing of command line options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I reading this right? It seems odd that setting PYTHONCOERCECLOCALE
would disable coercing the legacy locale. The sense is exactly opposite of what I'd expect.
Should the envar be named PYTHONNOCOERCECLOCALE
?
Also, what if the environment variable is set to "0"
? Given the above description, that should still skip coercion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, that's a holdover from when the setting was PYTHONALLOWCLOCALE - presumably I changed the title of the section, but then got distracted by something else before updating the body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 1c3a270
Lib/test/test_locale_coercion.py
Outdated
# Details of the CLI warning emitted at runtime | ||
CLI_COERCION_WARNING_FMT = ( | ||
"Python detected LC_CTYPE=C: {} coerced to {} (set another locale " | ||
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour)." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the value of the envar matter or not? It just needs to be any string value, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs bug - the warning matches the behaviour, the docs don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs fixed in 1c3a270
Lib/test/test_locale_coercion.py
Outdated
"LC_CTYPE": "", | ||
"LC_ALL": "", | ||
} | ||
for env_var in ("LC_ALL", "LC_CTYPE", "LANG"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the order matter? If not, what about using for env_var in base_var_dict:
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in d12b412
Lib/test/test_locale_coercion.py
Outdated
"LC_CTYPE": "", | ||
"LC_ALL": "", | ||
} | ||
for env_var in ("LC_ALL", "LC_CTYPE", "LANG"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in d12b412
Lib/test/test_locale_coercion.py
Outdated
self._check_c_locale_coercion("utf-8", coerce_c_locale=None) | ||
|
||
def test_PYTHONCOERCECLOCALE_not_zero(self): | ||
# *Any* string other that "0" is considered "set" for our purposes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this contradict the documentation at the top of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs fixed in 1c3a270
Lib/test/test_locale_coercion.py
Outdated
os.chdir(basepath) | ||
|
||
def tearDown(self): | ||
os.chdir(self.oldcwd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using self.addCleanup(os.chdir, self.oldcwd)
in the setUp()
instead and getting rid of the tearDown()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I stole the basic structure from test_capi
(simplified a bit since this doesn't need to run on Windows), so it's missing some modern niceties. With addCleanup
, we wouldn't even need self.oldcwd
as a record, we can just add the cleanup operation before changing the directory:
self.addCleanup(os.chdir, os.getcwd())
os.chdir(basepath)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in d12b412
Lib/test/test_locale_coercion.py
Outdated
p = subprocess.Popen(cmd, | ||
stdout=subprocess.PIPE, | ||
stderr=subprocess.PIPE, | ||
universal_newlines=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make any sense to run this without universal_newlines
also? Since the locale changes should only affect text mode, that additional test would just ensure that nothing's changed when running subprocesses in bytes mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this particular test, I don't think so, as it's just checking that the runtime warning gets emitted when we bypass the locale coercion in the standard Python CLI, rather than doing anything with the standard streams.
However, it does make sense for test_locale_coercion
itself (which I should rename to test_c_locale_coercion
): that should check the behaviour when the Python subprocess is run with the -u
option for unbuffered binary streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it turns out that even -u
doesn't change anything of interest to this PEP any more, and there's just an error in the --help
docs that makes it sound like it still does: http://bugs.python.org/issue28647
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, now that the PEP potentially also affects the way the standard streams are configured, I expanded on the test cases to also cover that: 4e6d502
Programs/python.c
Outdated
@@ -15,6 +15,110 @@ wmain(int argc, wchar_t **argv) | |||
} | |||
#else | |||
|
|||
/* Helpers to better handle the legacy C locale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's been some discussion about the "legacy C locale" nomenclature. What about "bare C locale"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even glibc devs consider 7-bit ASCII a legacy encoding at this point, it's just an enormous ecosystem to try to migrate to a new default way of doing things.
My view would likely be different if we were planning to invest time and energy in getting Python 3 to "work properly" in the C locale, but the underlying assumptions are sufficiently incompatible that it isn't even clear what "working properly" would even mean. (PEP 540 gets closer than the status quo, but it's definition of "working properly" is "ignoring the declared locale encoding as almost certainly being wrong")
Programs/python.c
Outdated
#ifdef PY_COERCE_C_LOCALE | ||
static const char *_C_LOCALE_COERCION_WARNING = | ||
"Python detected LC_CTYPE=C: %.20s coerced to %.20s (set another locale " | ||
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to come up in every project I work on. Do we use UK or US spelling? :)
I think in Python we use US spellings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll typically use my native spellings when writing PEPs and email, but anything in the actual code or docs should be using the US spelling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in ccfc83f
Python/pylifecycle.c
Outdated
* For consistency with the corresponding check in Programs/python.c | ||
* we ignore the Python -E and -I flags here. | ||
*/ | ||
if (coerce_c_locale == NULL || strncmp(coerce_c_locale, "0", 2) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do this same check in two places. Would it make sense to refactor that into a call or macro so there's no chance in getting out of sync? (See my question at the top of the PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative would be to check a C global here, rather than checking the environment variable directly, and have a _Py_DisableCLocaleWarning()
helper that the CLI calls when PYTHONCOERCECLOCALE=0
is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of that approach would be that it means embedding applications can't use PYTHONCOERCECLOCALE=0
to silence the warning without changing the configured locale, and I don't want to define a new public embedding configuration API just to avoid checking the environment variable twice.
So I'll unconditionally export a private helper function from pylifecycle.c (so the ABI doesn't depend on the configure flag setting), and re-use that from the CLI implementation.
- setting PYTHONIOENCODING has unintended side effects on Python 2 instances run in a subprocess (since Python 2 has no `surrogateescape` error handler - Py_SetStandardStreamEncoding enables surrogateescape for the current process without any side effects on subprocesses
Windows doesn't use setenv to set environment variables, so set PYTHONIOENCODING from test_capi instead of _testembed when running the forced_io_encoding test.
- move all required logic inside the shared library - explicitly setting one of the coercion target locales now also automatically enables "surrogateescape" on sys.stdin and sys.stdout
Locale coercion no longer has any effect if LC_ALL is explicitly set in the environment. When locale coercion triggers, it sets either both LC_CTYPE & LANG (for full locales) or only LC_CTYPE (for partial locales). This change also eliminated the need for a custom test case for the locale coercion warning - instead, the test suite is able to check for that just by setting LC_ALL in the child process environment.
@zooba @brettcannon Would one of you have some time to look at the Windows failure here? I suspect what's happened is that I've changed some code that I thought was *nix specific, and it needs a preprocessor guard to get it to retain the old behaviour on Windows. |
Doc/using/cmdline.rst
Outdated
|
||
* ``C.UTF-8`` (``LC_ALL``) | ||
* ``C.utf8`` (``LC_ALL``) | ||
* ``UTF-8`` (``LC_CTYPE``) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These LC_ALL
should be updated.
@ncoghlan Is there some state involved in checking/configuring the locale override that must be run before As an aside, I'm somewhat uncomfortable about the "let's ignore -E -S" parts, but if you're sure it's not going to enable loading arbitrary code (via an encoding) then I won't push back too hard. |
if (ctype_loc != NULL) { | ||
/* "surrogateescape" is the default in the legacy C locale */ | ||
if (strcmp(ctype_loc, "C") == 0) { | ||
return "surrogateescape"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zooba This is the bit that I suspect may be unintentionally changing the interpreter startup behaviour on Windows and thus causing the Appveyor test failures: there's currently nothing restricting this particular code path to only run on non-Windows systems, so I suspect it may be triggering and activating surrogateescape
by default on the Windows standard streams, even though Windows uses its own code page system, rather than the POSIX locale system. If I'm right, then the entire body of this function should be inside some kind of #ifdef
, such that the Windows case degenerates to return NULL;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly. The only thing that is available for coercion on Windows is locale.getpreferredencoding()
, and there are no results from setlocale()
that you can reasonably interpret as "wrong".
When/if the "force UTF-8" option comes along, that should apply to Windows, but I don't think we need to determine the error handler here at all for Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing this to the old code (as a result of the test_sys
failure), it seems we were setting surrogateescape
by default in the C locale, even on Windows. So I've had a second go at fixing this by keeping that, and only skipping the new checks for the coercion target locales.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out this wasn't the problem, but rather the earlier call to setlocale(LC_CTYPE, "")
, where I had removed a HAVE_SETLOCALE
guard as being redundant in 2017. It turns out that even though Windows does have setlocale
available:
- Passing an empty string doesn't work the same way it does in glibc (presumably it syncs the active C locale with the active Windows code page rather than looking at environment variables)
- The Windows builds don't define
HAVE_SETLOCALE
, so that guard was previously serving as a cryptic#ifndef MS_WINDOWS
check
So rather than restoring the original guard, I replaced it with an explicit #ifndef MS_WINDOWS
check instead.
Python/pylifecycle.c
Outdated
* to give end users a way to force even scripts that are otherwise | ||
* isolated from their environment to use the legacy ASCII-centric C | ||
* locale. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @zooba: I need to explain the (lack of) security implications here, since:
- the envvar can only be used to turn off locale coercion, it doesn't do anything else (i.e. it's a boolean toggle, not input data)
- if an attacker is in a position to corrupt the behaviour of the
ascii
encoding, they're likely to be able to attack theutf-8
encoding just as effectively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
- avoid unintended side effects on Windows behaviour - remove a single-use function that made the code harder to follow - clarify the security considerations around ignoring -E and -I when checking PYTHONCOERCECLOCALE
The inline check for "Is this env var exactly zero?" is still more self-explanatory than factoring out the helper function.
A HAVE_SETLOCALE guard was removed when adding a check for __ANDROID__, and that may be affecting the default locale reported on Windows.
The What's New entry has been added, so I consider this PR to be feature complete now - to reduce the chance of conflicts, I won't add the NEWS entry until we're just about to merge it. |
AppVeyor seems to be MIA, but the last Windows run passed, and I've only made docs and metadata changes since then, so I'll be going ahead with the merge and marking the PEP as Final once the Travis run completes. |
@@ -70,7 +103,11 @@ main(int argc, char **argv) | |||
|
|||
/* Force again malloc() allocator to release memory blocks allocated | |||
before Py_Main() */ | |||
#ifdef Py_DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be an unrelated change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that it's technically unrelated now, but an earlier iteration of the patch failed in debug mode without it - that iteration was using Py_SetStandardStreamEncoding
, and when you do that, you allocate some memory that gets freed during interpreter startup after the default allocator has been configured. Without this change, that free attempt failed in debug mode due to the allocator mismatch.
So these guards mean that the "safe to call before Py_Initialize
" APIs actually are safe to call from the CLI entry point.
Reference implementation for PEP 538, which is in turn a partial fix for bpo-28180.
This updates the CPython CLI to attempt to set
LC_CTYPE
to a suitable UTF-8based locale before loading the runtime when it detects that it is running in
the C locale.
It also updates the CPython runtime to emit a compatibility warning on
stderr
when running in the C locale.
Remaining work: