
Soft-deprecate PyUnicode_AsUTF8 #39

Open
encukou opened this issue Nov 8, 2023 · 1 comment

encukou commented Nov 8, 2023

(I'll include the problem here, as the "problems" repo seems "done" now that PEP-733 is up).

Traditional C APIs take zero-terminated strings, which means that Python strings with embedded NUL characters appear truncated. There are many ways to get such a char*: converting directly with PyUnicode_AsUTF8, encoding and then accessing the result via PyBytes_AsString, or calling something like PyUnicode_AsUTF8AndSize while ignoring the size.
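To make the truncation concrete, here is a minimal sketch in plain C. It does not call the Python C API: the `example_utf8` buffer and the `visible_length` helper are hypothetical stand-ins for the bytes a UTF-8 conversion would hand back and for any "pointer only" consumer.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the UTF-8 bytes of the Python string "utf8\0spam".
 * The array holds 9 content bytes plus the terminator added by the literal. */
static const char example_utf8[] = "utf8\0spam";

/* The real size, as a "pointer+size" API would report it. */
static const size_t example_size = sizeof(example_utf8) - 1;

/* Hypothetical helper: what a consumer that only gets the pointer sees.
 * strlen() stops at the first NUL byte, so the tail is silently lost. */
static size_t visible_length(const char *buf)
{
    return strlen(buf);
}
```

With this buffer, `visible_length(example_utf8)` stops after `utf8`, while `example_size` still accounts for every byte.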

Many APIs that convert to char* raise an error on embedded NUL bytes. On that:


We could:

  • Soft-deprecate PyUnicode_AsUTF8, nudging people toward PyUnicode_AsUTF8AndSize. (It's still possible to ignore the size, but that is much less likely to happen by accident -- unless we encourage people to mechanically replace PyUnicode_AsUTF8(s) with PyUnicode_AsUTF8AndSize(s, NULL)).
  • In CPython, use the "pointer+size" representation more --- only use "pointer only" for working with external APIs or for backwards compatibility. This might help find APIs we might want to expose.
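As a rough illustration of what the "pointer+size" representation buys, the sketch below checks for an embedded NUL before a buffer is handed to an external zero-terminated API. `has_embedded_nul` is a hypothetical helper, not a CPython function; in real extension code the pointer and size would come from something like PyUnicode_AsUTF8AndSize, and the size would be a `Py_ssize_t` rather than the `size_t` assumed here so that the example compiles without Python headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first `size` bytes of `buf`
 * contain a NUL byte, 0 otherwise. This is only possible because the
 * caller kept the size instead of relying on the terminator. */
static int has_embedded_nul(const char *buf, size_t size)
{
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}
```

A caller could use the result to raise an error, or to decide that truncation is harmless for the API in question.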

Notes on some of the issues @vstinner collected in python/cpython#111656 (comment):

  • In APIs that look up names and take aliases (codec names, hash algorithm names, timezone names, etc.), the embedded NUL is not a security issue. For example, I don't see a problem with UTF-8, utf8 and utf8\0spamspamspaaam all naming the same encoding. (The fact that some APIs will reject the latter string, and others will not, is unfortunate but not terrible.)
  • In error/warning messages, we might want to filter out newlines, backspaces, terminal escape sequences and the like. If we're not doing that, there's not much additional harm in allowing an “end of message” control character. (FWIW, PyObject_Repr is very useful for arbitrary strings, though we shouldn't call it “safe” as it still passes Unicode lookalikes or BIDI characters through.)
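The alias case from the first note can be sketched with a toy lookup table. Everything here is hypothetical (the `aliases` table, the `lookup_codec` function and the numeric ids are invented for illustration and do not mirror CPython's codec registry); the point is only that a `strcmp`-based lookup treats the truncated name as just another spelling of the same entry.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table: several spellings map to one codec id. */
struct alias {
    const char *name;
    int codec_id;
};

static const struct alias aliases[] = {
    {"UTF-8", 1},
    {"utf8", 1},
    {"latin-1", 2},
};

/* Hypothetical lookup: returns the codec id, or -1 if the name is unknown.
 * strcmp() compares only up to the first NUL byte of `name`. */
static int lookup_codec(const char *name)
{
    for (size_t i = 0; i < sizeof(aliases) / sizeof(aliases[0]); i++) {
        if (strcmp(aliases[i].name, name) == 0) {
            return aliases[i].codec_id;
        }
    }
    return -1;
}
```

Passing `"utf8\0spamspamspaaam"` resolves to the same id as `"UTF-8"`, so the extra bytes add one more alias rather than a way to reach a different codec.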

encukou commented Nov 8, 2023

Other approaches that were considered:
