Soft-deprecate `PyUnicode_AsUTF8` #39

encukou · 2023-11-08T12:37:04Z

(I'll include the problem here, as the "problems" repo seems "done" now that PEP-733 is up).

Traditional C APIs take zero-terminated strings, which means that Python strings with embedded NUL bytes appear truncated. There are many ways to get such a char*: converting directly with PyUnicode_AsUTF8, encoding and accessing the result via PyBytes_AsString, or calling something like PyUnicode_AsUTF8AndSize while ignoring the size.
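To make the truncation concrete, here is a minimal sketch in plain C. It does not call the Python C API: `EXAMPLE_UTF8` is a hypothetical stand-in for the buffer that a function such as PyUnicode_AsUTF8AndSize would hand back for the Python string "utf8\0spam", and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the UTF-8 buffer of the Python string
 * "utf8\0spam": 9 payload bytes, plus the terminator the compiler adds. */
static const char EXAMPLE_UTF8[] = "utf8\0spam";
static const size_t EXAMPLE_SIZE = sizeof(EXAMPLE_UTF8) - 1;

/* What a "pointer only" C API sees: everything up to the first NUL. */
static size_t pointer_only_length(const char *data)
{
    assert(data != NULL);
    return strlen(data);
}

/* Number of bytes silently dropped when the size is ignored. */
static size_t bytes_hidden(const char *data, size_t size)
{
    return size - pointer_only_length(data);
}
```

With these definitions, `pointer_only_length(EXAMPLE_UTF8)` covers only the `utf8` prefix, while `EXAMPLE_SIZE` accounts for the whole buffer; `bytes_hidden` is the difference between the two.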

Many APIs that convert to char* raise an error on embedded NUL bytes. On that:

- This is safe, but it needs an extra O(n) search, which is not necessary for all tasks.
- It is too late to change widely used existing API (PyUnicode_AsUTF8) to do this. See the reverted [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters python/cpython#111089
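The extra O(n) search mentioned above can be sketched as follows. This is an illustration in plain C, not CPython's actual implementation: `has_embedded_nul` is a hypothetical helper, and in real code `data` and `size` would come from a call such as PyUnicode_AsUTF8AndSize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Scan the first `size` bytes for a NUL. This is linear in the length of
 * the string, which is the cost the issue refers to. */
static bool has_embedded_nul(const char *data, size_t size)
{
    assert(data != NULL);
    return memchr(data, '\0', size) != NULL;
}
```

A caller that must pass the pointer to a zero-terminated C API could run this check first and raise an error on `true`; a caller that keeps the size around has no need for the scan.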

We could:

- Soft-deprecate PyUnicode_AsUTF8, nudging people toward PyUnicode_AsUTF8AndSize. (It's still possible to ignore the size, but it's much less likely to do so on purpose -- unless we encourage people to mechanically replace PyUnicode_AsUTF8(s) with PyUnicode_AsUTF8AndSize(s, NULL)).
  - See Macro to hide deprecated functions #24 for making soft-deprecation more relevant
- In CPython, use the "pointer+size" representation more -- only use "pointer only" for working with external APIs or for backwards compatibility. This might help find APIs we might want to expose.
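As a rough illustration of why the "pointer+size" representation is the safer default, the sketch below compares two names both ways. It is plain C with hypothetical helper names, not code from CPython.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* "Pointer only" comparison: stops at the first NUL in either string. */
static bool equal_pointer_only(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* "Pointer+size" comparison: every byte counts, including NULs. */
static bool equal_sized(const char *a, size_t a_size,
                        const char *b, size_t b_size)
{
    assert(a != NULL && b != NULL);
    return a_size == b_size && memcmp(a, b, a_size) == 0;
}
```

Under `equal_pointer_only`, "utf8" and "utf8\0spam" are the same name; under `equal_sized` with sizes 4 and 9 they are different. Code that carries the size along can decide which behavior it wants, whereas code that only has the pointer cannot tell the two strings apart.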

Notes on some of the issues @vstinner collected in python/cpython#111656 (comment):

- In APIs that look up names and take aliases (codec names, hash algorithm names, timezone names, etc.), the embedded NUL is not a security issue. For example, I don't see a problem with UTF-8, utf8 and utf8\0spamspamspaaam all naming the same encoding. (The fact that some APIs will reject the latter string, and others will not, is unfortunate but not terrible.)
- In error/warning messages, we might want to filter out newlines, backspaces, terminal escape sequences and the like. If we're not doing that, there's not much additional harm in allowing an “end of message” control character. (FWIW, PyObject_Repr is very useful for arbitrary strings, though we shouldn't call it “safe” as it still passes Unicode lookalikes or BIDI characters through.)


encukou · 2023-11-08T12:49:05Z

Other approaches that were considered:

- [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters python/cpython#111089
  - A backwards-incompatible change is not appropriate given the low severity of the issue
- gh-111089: Add PyUnicode_AsUTF8NoNUL() function python/cpython#111688
  - The new API is not always appropriate as a replacement for PyUnicode_AsUTF8, and when it is appropriate it only replaces 2 lines (plus declarations/error handling)
  - It does nothing to nudge people away from PyUnicode_AsUTF8
  - IMO, it encourages going in the wrong direction -- "pointer only" representation rather than "pointer+size"

encukou mentioned this issue Nov 8, 2023

[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters python/cpython#111089

Closed
