Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.12]: gh-113055: Use pointer for interp->obmalloc state #118618

Closed
wants to merge 3 commits into from

Conversation

nascheme
Copy link
Member

@nascheme nascheme commented May 6, 2024

A backport of this change has been requested by a some users. See Gh-113412 comments.

Crash in WeeChat which might be fixed by this: gh-116510

Comment from @bacon-cheeseburger:

Just wanted to offer that this issue appears to impact Kodi, which has a considerable user base in the millions. The number of users affected by the problem could be quite substantial.

For interpreters that share state with the main interpreter, this points
to the same static memory structure.  For interpreters with their own
obmalloc state, it is heap allocated.  Add free_obmalloc_arenas() which
will free the obmalloc arenas and radix tree structures for interpreters
with their own obmalloc state.
@nascheme nascheme added DO-NOT-MERGE 3.12 bugs and security fixes labels May 6, 2024
@nascheme
Copy link
Member Author

nascheme commented May 6, 2024

This unfortunately causes test_import to crash in the multiple interpreter tests (e.g. test_basic_multiple_interpreters_reset_each).

The crash error is "double free or corruption" and the backtrace shows _int_free() as the last CPython function called. My suspicion is that it might be some immortal refcnt bug related to ints that got fixed in the 3.13 branch but not backported. Maybe gh-94673?

@neonene
Copy link
Contributor

neonene commented May 6, 2024

It looks like L640 and L2062 in pylifecycle.c are different insert positions from 3.13.

@neonene
Copy link
Contributor

neonene commented May 6, 2024

It looks like L640 and L2062 in pylifecycle.c are different insert positions from 3.13.

Leaving a patch just in case. Not sure if it's correct, but at least avoids the test_import crash for me.

diff --git a/Python/pylifecycle.c b/Python/pylifecycle.c
index fb833ba..31a24d4 100644
--- a/Python/pylifecycle.c
+++ b/Python/pylifecycle.c
@@ -637,13 +637,6 @@ pycore_create_interpreter(_PyRuntimeState *runtime,
         return status;
     }

-    // initialize the interp->obmalloc state.  This must be done after
-    // the settings are loaded (so that feature_flags are set) but before
-    // any calls are made to obmalloc functions.
-    if (_PyMem_init_obmalloc(interp) < 0) {
-        return  _PyStatus_NO_MEMORY();
-    }
-
     /* Auto-thread-state API */
     status = _PyGILState_Init(interp);
     if (_PyStatus_EXCEPTION(status)) {
@@ -658,6 +651,13 @@ pycore_create_interpreter(_PyRuntimeState *runtime,
         return status;
     }

+    // initialize the interp->obmalloc state.  This must be done after
+    // the settings are loaded (so that feature_flags are set) but before
+    // any calls are made to obmalloc functions.
+    if (_PyMem_init_obmalloc(interp) < 0) {
+        return  _PyStatus_NO_MEMORY();
+    }
+
     PyThreadState *tstate = _PyThreadState_New(interp);
     if (tstate == NULL) {
         return _PyStatus_ERR("can't make first thread");
@@ -2059,14 +2059,6 @@ new_interpreter(PyThreadState **tstate_p, const PyInterpreterConfig *config)
         return _PyStatus_OK();
     }

-    // initialize the interp->obmalloc state.  This must be done after
-    // the settings are loaded (so that feature_flags are set) but before
-    // any calls are made to obmalloc functions.
-    if (_PyMem_init_obmalloc(interp) < 0) {
-        status = _PyStatus_NO_MEMORY();
-        goto error;
-    }
-
     PyThreadState *tstate = _PyThreadState_New(interp);
     if (tstate == NULL) {
         PyInterpreterState_Delete(interp);
@@ -2110,6 +2102,14 @@ new_interpreter(PyThreadState **tstate_p, const PyInterpreterConfig *config)
         goto error;
     }

+    // initialize the interp->obmalloc state.  This must be done after
+    // the settings are loaded (so that feature_flags are set) but before
+    // any calls are made to obmalloc functions.
+    if (_PyMem_init_obmalloc(interp) < 0) {
+        status = _PyStatus_NO_MEMORY();
+        goto error;
+    }
+
     status = init_interp_create_gil(tstate, config->gil);
     if (_PyStatus_EXCEPTION(status)) {
         goto error;

@nascheme
Copy link
Member Author

nascheme commented May 6, 2024

Thanks @neonene, I think your fix is correct. When I merged then those calls got moved. They need to be done after the settings are loaded and feature flags are set. Maybe that was a problem. After moving the calls as you suggest, all the unit tests pass.

@nascheme
Copy link
Member Author

nascheme commented May 6, 2024

There is a failing check about the ABI changing. The PyThreadState structure has indeed changed. due to make the obmalloc state a pointer rather than an embedded structure. I think it might be okay because it seems the interp member is an opaque type to extensions. I tried accessing tstate->interp->id from a C extension and got this error from GCC:

error: invalid use of incomplete typedef ‘PyInterpreterState’ {aka ‘struct _is’}

@ericsnowcurrently
Copy link
Member

The ABI check (which is only in non-main branches) currently catches changes to internal ABI (e.g. _PyRuntimeState). It is okay to manually update the data file to pass the check as long as it really is only internal ABI that changed. When in doubt, you can always check with the release manager, @Yhg1s.

You can regenerate the file yourself (per https://devguide.python.org/setup/#regenerate-the-abi-dump) or you can copy the data file that was generated when the check ran. It's the "abi-data" artifact on https://github.com/python/cpython/actions/runs/8972849336?pr=118618 (the PR actions "Summary" page). Replace Doc/data/python3.12.abi with the artifact.

@obaterspok
Copy link

@nascheme, I did report this issue for kodi (xbmc/xbmc#24440) and I have just verified the PR fixes the kodi crash that occurred during exit.
If it matters I'm running Fedora 40 using python3-3.12.3-2.fc40.x86_64 for which I manually patched and rebuilt

@nascheme
Copy link
Member Author

nascheme commented May 7, 2024

Output of abidiff attached. I believe the PyInterpreterState structure is internal/private and so changing the size should be allowed in a bugfix release. This will need review from the release manager however.

abi-diff.txt

@nascheme
Copy link
Member Author

ping @Yhg1s

@Yhg1s
Copy link
Member

Yhg1s commented Jun 4, 2024

I'm uncomfortable with this backport, more for its size and impact than the ABI difference. (I think the ABI difference is fine, although it may make debuggers' lives harder.) I'm not sure the bugs it fixes in subinterpreter usage warrant the risk of the change, but I can be persuaded otherwise.

@dguglielmi
Copy link

Tested on Gentoo Linux with dev-lang/python-3.12.3 (cpython), works well!

@Yhg1s : To try to convince you, it's a rather annoying bug for Kodi users. I understand that it probably doesn't represent a large number of users (compared to the overall use of CPython), but for them, it creates a crash report file every time Kodi is shut down when they encounter this issue.

@bacon-cheeseburger
Copy link

There may be more Python users overall but the Kodi user base numbers in the millions across multiple platforms. That's not insignificant. I'd think debuggers, IF this fix makes anything harder for them, would certainly be more capable of dealing with it than regular users. A simple search relating to this problem easily warrants giving the fix serious consideration. However, if the motivation to do something about it isn't there then ...

@nascheme
Copy link
Member Author

Perhaps there is a way to avoid the Kodi crash with a smaller and simpler for to 3.12.x. I can investigate that. Thomas is correct in that this PR is a fairly big change to merge as a backport.

@trygveaa
Copy link

Crash in WeeChat which might be fixed by this: #116510

Sorry I'm a bit late with testing this PR, but I can confirm it fixes the crash in WeeChat as well.

@graysky2
Copy link

@nascheme - I think this needs a rebase to apply.

@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2024

Closing since we are not planning to backport this change.

@nascheme nascheme closed this Jul 3, 2024
@graysky2
Copy link

graysky2 commented Jul 3, 2024

@nascheme - is there a work-around that has not yet made it into a tagged release? For my issue of kodi crashing on exit, applying this PR to 3.12.4 fixes the issue. Running 3.12.4 unmodified triggers the crash.

@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2024

We don't have a work-around. I can investigate the kodi crash and see if I can figure out why it happens.

@graysky2
Copy link

graysky2 commented Jul 3, 2024

Thanks @nascheme - beyond in the info in the linked bug report (some crash dumps and dmesg output) please let me know if I can provide you with anything or test a patch etc.

@nascheme
Copy link
Member Author

nascheme commented Jul 3, 2024

I cannot comment on the gitlab bug (account registration closed). Based on that call stack dump, I would guess it's a similar problem to what WeeChat is encountering. The traceback is starting from Py_EndInterpreter and ending in _PyObject_GC_UNTRACK. I would suggest calling clear on the kodi module(s) before calling Py_EndInterpreter. Module functions have reference cycles and they make interpreter cleanup very complex. There could be a bug in Python or there could be a bug in kodi (e.g. incorrect refcnt inc/dec somewhere). Adding a module dict clear might fix it. See the gist linked from the comment below:

#116510 (comment)

@graysky2
Copy link

graysky2 commented Jul 4, 2024

Thanks @nascheme - implementing what you're asking is beyond my skill set (basic shell scripting only).

@dguglielmi
Copy link

It seems that a solution has been found to prevent the problem in Kodi.

xbmc/xbmc@0f5db1a

@graysky2
Copy link

graysky2 commented Jul 4, 2024

Thanks for pointing that out, @dguglielmi

@graysky2
Copy link

graysky2 commented Jul 4, 2024

To update - building Kodi with that commit applied does not fix the issue.

@hbiyik
Copy link

hbiyik commented Jul 12, 2024

@graysky2 at this point it would make sense to depend kodi on python311 at least on AUR, not only enter/exit crashes also a lot of memory leaks are happening with python3.12

@graysky2
Copy link

No, because kodi is in [extra] and cannot have any depends that are not in the official repos.

@hbiyik
Copy link

hbiyik commented Jul 12, 2024

yeah, sure i meant the kodi-stable-git in AUR, sorry too much valgrind.

trying this approach now at least it is a workaround until python fixes its own stuff.

@hbiyik
Copy link

hbiyik commented Jul 12, 2024

here is a diff for
https://aur.archlinux.org/packages/kodi-stable-git

diff --git a/PKGBUILD b/PKGBUILD
index 8c49304..4882273 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -25,7 +25,7 @@ _renderer=gles
 
 pkgbase=kodi-stable-git
 pkgname=("$pkgbase" "$pkgbase-eventclients" "$pkgbase-tools-texturepacker" "$pkgbase-dev")
-pkgver=r65506.24278a28fb6
+pkgver=r65642.7e5bb325bda
 pkgrel=1
 arch=('x86_64')
 url="https://kodi.tv"
@@ -47,6 +47,7 @@ makedepends=(
   'wayland-protocols' 'waylandpp' 'libxkbcommon'
   # gbm
   'libinput'
+  'python311'
 )
 options=(!lto)
 
@@ -166,6 +167,7 @@ build() {
     -DENABLE_INTERNAL_FSTRCMP=ON
     -DENABLE_INTERNAL_FLATBUFFERS=ON
     -DENABLE_INTERNAL_UDFREAD=ON
+    -DPYTHON_VER=3.11
     -Dlibdvdcss_URL="$srcdir/libdvdcss-$_libdvdcss_version.tar.gz"
     -Dlibdvdnav_URL="$srcdir/libdvdnav-$_libdvdnav_version.tar.gz"
     -Dlibdvdread_URL="$srcdir/libdvdread-$_libdvdread_version.tar.gz"
@@ -193,7 +195,7 @@ package_kodi-stable-git() {
     'mariadb-libs' 'mesa' 'libpipewire' 'python-pillow' 'python-pycryptodomex'
     'python-simplejson' 'shairplay' 'smbclient' 'sndio' 'spdlog' 'sqlite'
     'taglib' 'tinyxml' 'libxrandr' 'libxkbcommon' 'waylandpp' 'libinput'
-    'pcre' 'tinyxml2' 'libdisplay-info'
+    'pcre' 'tinyxml2' 'libdisplay-info' 'python311'
   )
   [[ -n "$_clangbuild" ]] && depends+=('glu')

everything is back to normal again, sorry if this sidetracking the original issue..

@encukou
Copy link
Member

encukou commented Sep 23, 2024

I talked to Thomas, Eric and Neil.
The understanding is that:

Is that right? If not, let's get everyone on the same page.

@graysky2
Copy link

@encukou - that is my understand. Python 3.12.6 for Kodi is still affected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes awaiting core review
Projects
None yet
Development

Successfully merging this pull request may close these issues.