
bpo-44032: Defer clearing last thread state until last GC has been run. #26285

Merged

Conversation

@markshannon (Member) commented May 21, 2021

@markshannon (Member Author)

@erlend-aasland Can you confirm that this fixes the issue?

Python/pystate.c Outdated
@@ -324,6 +326,7 @@ interpreter_clear(PyInterpreterState *interp, PyThreadState *tstate)

    /* Last garbage collection on this interpreter */
    _PyGC_CollectNoFail(tstate);
    PyThreadState_Clear(tstate);
Member

For safety, I would prefer to ensure that the tstate belongs to the interpreter, in case PyInterpreterState_Clear() is called from a different interpreter (even though I don't know if that's currently possible):

Suggested change
    PyThreadState_Clear(tstate);
    // Don't clear tstate if it belongs to another interpreter
    if (tstate->interp == interp) {
        PyThreadState_Clear(tstate);
    }
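For reference, applying this suggestion on top of the hunk above would make the tail of interpreter_clear() read roughly like the sketch below; this is only an illustration of the combined change, not the exact code that was merged:

    /* Last garbage collection on this interpreter */
    _PyGC_CollectNoFail(tstate);

    /* Clear the last thread state only after the final GC has run,
       and only if it actually belongs to this interpreter. */
    if (tstate->interp == interp) {
        PyThreadState_Clear(tstate);
    }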

@vstinner (Member) left a comment

A thread state should remain more or less usable after PyInterpreterState_Clear() is called. I don't think this PR fixes the root issue of https://bugs.python.org/issue44032#msg394104

IMO it would be safer to only release datastack_chunk memory in tstate_delete_common(), rather than in PyThreadState_Clear().

This change remains interesting ;-)
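For illustration, moving the release of the data stack into tstate_delete_common() could look roughly like the sketch below. The _PyStackChunk layout and _PyObject_VirtualFree() follow the CPython sources of that period, but the helper name and exact placement are assumptions, not the change that was merged:

static void
free_datastack_chunks(PyThreadState *tstate)
{
    /* Walk the chain of stack chunks owned by this thread state and
       return their memory to the system.  In this sketch the helper is
       called from tstate_delete_common() rather than PyThreadState_Clear(). */
    _PyStackChunk *chunk = tstate->datastack_chunk;
    tstate->datastack_chunk = NULL;
    tstate->datastack_top = NULL;
    tstate->datastack_limit = NULL;
    while (chunk != NULL) {
        _PyStackChunk *previous = chunk->previous;
        _PyObject_VirtualFree(chunk, chunk->size);
        chunk = previous;
    }
}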

@erlend-aasland (Contributor) commented May 21, 2021

@erlend-aasland Can you confirm that this fixes the issue?

I'll do so as soon as I'm back at my computer.

UPDATE: This PR does not fix the issue. I've used @pablogsal's reproducer from #26274 (comment). Here's an excerpt of the dump (complete dump attached):

=================================================================
==23898==ERROR: AddressSanitizer: heap-use-after-free on address 0x619000041b48 at pc 0x0001093d7fbe bp 0x7ffee6d13700 sp 0x7ffee6d136f8
READ of size 8 at 0x619000041b48 thread T0
    #0 0x1093d7fbd in subtype_dealloc typeobject.c:1456
    #1 0x10939e680 in _Py_Dealloc object.c:2288
    #2 0x1092d81ac in _Py_DECREF object.h:500
    #3 0x1092d8bdb in _Py_XDECREF object.h:567
    #4 0x1092de271 in list_dealloc listobject.c:342
    #5 0x10939e680 in _Py_Dealloc object.c:2288
    #6 0x10932dafc in _Py_DECREF object.h:500
    #7 0x10933a41b in _Py_XDECREF object.h:567
    #8 0x10933fbfb in dict_dealloc dictobject.c:2068
    #9 0x10939e680 in _Py_Dealloc object.c:2288
    #10 0x1095bfedc in _Py_DECREF object.h:500
    #11 0x1095eb47d in ast_clear Python-ast.c:782
    #12 0x1093f6c51 in subtype_clear typeobject.c:1303
    #13 0x1098c6e70 in delete_garbage gcmodule.c:1017
    #14 0x1098be9af in gc_collect_main gcmodule.c:1300
    #15 0x1098bdf9c in _PyGC_CollectNoFail gcmodule.c:2123
    #16 0x109812be0 in interpreter_clear pystate.c:328
    #17 0x109812e81 in _PyInterpreterState_Clear pystate.c:361
    #18 0x1098017b9 in finalize_interp_clear pylifecycle.c:1634
    #19 0x10980123b in Py_FinalizeEx pylifecycle.c:1812
    #20 0x1098b7143 in Py_RunMain main.c:668
    #21 0x1098b83e4 in pymain_main main.c:696
    #22 0x1098b8694 in Py_BytesMain main.c:720
    #23 0x108eeda71 in main python.c:15
    #24 0x7fff20582f3c in start+0x0 (libdyld.dylib:x86_64+0x15f3c)

0x619000041b48 is located 200 bytes inside of 968-byte region [0x619000041a80,0x619000041e48)
freed by thread T0 here:
    #0 0x10a3ad609 in wrap_free+0xa9 (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x48609)
    #1 0x10939ff08 in _PyMem_RawFree obmalloc.c:127
    #2 0x1093a0ed9 in _PyMem_DebugRawFree obmalloc.c:2540
    #3 0x1093a0fb8 in _PyMem_DebugFree obmalloc.c:2673
    #4 0x1093a1f8f in PyObject_Free obmalloc.c:721
    #5 0x1098c1151 in PyObject_GC_Del gcmodule.c:2345
    #6 0x1093db0e7 in type_dealloc typeobject.c:4041
    #7 0x10939e680 in _Py_Dealloc object.c:2288
    #8 0x1095bfedc in _Py_DECREF object.h:500
    #9 0x1095eb1dd in ast_dealloc Python-ast.c:768
    #10 0x1093d7f83 in subtype_dealloc typeobject.c:1450

dump.txt

@markshannon (Member Author) commented May 21, 2021

@vstinner I don't see how a thread state can remain usable after the interpreter has been cleared.

@erlend-aasland (Contributor) commented May 21, 2021

I just synced with upstream/main to get @vstinner's fix for issue 44184 (GH-26274, 615069e). With that in main and this PR applied, the issue is fixed; thanks! You can disregard my previous comment.

@erlend-aasland (Contributor)

A thread state should remain more or less usable after PyInterpreterState_Clear() is called. I don't think this PR fixes the root issue of https://bugs.python.org/issue44032#msg394104

IMO it would be safer to only release datastack_chunk memory in tstate_delete_common(), rather than in PyThreadState_Clear().

FYI, freeing the data stack chunks in tstate_delete_common() instead of deferring PyThreadState_Clear() works fine with the reproducer.

@@ -900,13 +897,6 @@ PyThreadState_Clear(PyThreadState *tstate)
    if (tstate->on_delete != NULL) {
        tstate->on_delete(tstate->on_delete_data);
Contributor

Should this also be moved to tstate_delete_common, or is it unrelated to deleting the data stack?

Member Author

I have no idea what it does, so I'm not touching it 🙂

@erlend-aasland (Contributor) May 22, 2021

Me neither :) I tried moving it, and it seems to work fine in either place. BTW, I found a clue in Include:

* So instead of holding the lock directly, the tstate holds a weakref to
* the lock: that's the value of on_delete_data below. Decref'ing a
* weakref is harmless.
* on_delete points to _threadmodule.c's static release_sentinel() function.
* After the tstate is unlinked, release_sentinel is called with the
* weakref-to-lock (on_delete_data) argument, and release_sentinel releases
* the indirectly held lock.
*/
void (*on_delete)(void *);
void *on_delete_data;

Contributor

Also: https://docs.python.org/3/c-api/init.html#c.PyThreadState_Clear:

"Changed in version 3.9: This function now calls the PyThreadState.on_delete callback. Previously, that happened in PyThreadState_Delete()."

@erlend-aasland (Contributor) May 22, 2021

... and here:

release_sentinel(void *wr_raw)
{
    PyObject *wr = _PyObject_CAST(wr_raw);
    /* Tricky: this function is called when the current thread state
       is being deleted. Therefore, only simple C code can safely
       execute here. */
    PyObject *obj = PyWeakref_GET_OBJECT(wr);
    lockobject *lock;
    if (obj != Py_None) {
        lock = (lockobject *) obj;
        if (lock->locked) {
            PyThread_release_lock(lock->lock_lock);
            lock->locked = 0;
        }
    }
    /* Deallocating a weakref with a NULL callback only calls
       PyObject_GC_Del(), which can't call any Python code. */
    Py_DECREF(wr);
}

static PyObject *
thread__set_sentinel(PyObject *module, PyObject *Py_UNUSED(ignored))
{
    PyObject *wr;
    PyThreadState *tstate = PyThreadState_Get();
    lockobject *lock;

    if (tstate->on_delete_data != NULL) {
        /* We must support the re-creation of the lock from a
           fork()ed child. */
        assert(tstate->on_delete == &release_sentinel);
        wr = (PyObject *) tstate->on_delete_data;
        tstate->on_delete = NULL;
        tstate->on_delete_data = NULL;
        Py_DECREF(wr);
    }
    lock = newlockobject(module);
    if (lock == NULL)
        return NULL;
    /* The lock is owned by whoever called _set_sentinel(), but the weakref
       hangs to the thread state. */
    wr = PyWeakref_NewRef((PyObject *) lock, NULL);
    if (wr == NULL) {
        Py_DECREF(lock);
        return NULL;
    }
    tstate->on_delete_data = (void *) wr;
    tstate->on_delete = &release_sentinel;
    return (PyObject *) lock;
}

Contributor

It was added by @vstinner in GH-18296 (issue 39511), so I guess he knows whether it should stay or follow, then :)

@markshannon added the 🔨 test-with-buildbots label May 22, 2021
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @markshannon for commit ddad17e 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots label May 22, 2021
@vstinner (Member) left a comment

LGTM.

Calling _PyObject_VirtualFree() in tstate_delete_common() sounds like the right thing to do. It doesn't mean that a tstate must remain usable after calling PyThreadState_Clear(tstate); it's just about consistency.

@vstinner (Member)

buildbot/AMD64 RHEL7 Refleaks PR — Build done.
buildbot/AMD64 RHEL8 Refleaks PR — Build done.
buildbot/s390x Fedora Refleaks PR — Build done.

test_asyncio hangs, unrelated error: https://bugs.python.org/issue44112

@markshannon (Member Author)

I'm not worried about asyncio, but test_multiprocessing also fails on one buildbot and test_posix on another.

There doesn't seem to be a coherent model of how threads and interpreters are related, or what memory belongs to the interpreter or thread.
As a result, cleaning up after a fork is impossible to get right.

@erlend-aasland (Contributor)

There doesn't seem to be a coherent model of how threads and interpreters are related, or what memory belongs to the interpreter or thread.
As a result, cleaning up after a fork is impossible to get right.

Sounds like an issue should be opened for this, if there's not already one on bpo.

@vstinner (Member)

AMD64 RHEL8 Refleaks PR:

test_posix failed
...
test_posix leaked [1, 0, 0] file descriptors, sum=1

It can be a random error that doesn't fail when re-run in verbose mode; we don't know, since the job hangs before test_posix can be re-run.

@vstinner (Member)

I'm not worried about asyncio, but test_multiprocessing also fails on one buildbot and test_posix on another.

I fail to see the test_multiprocessing error. The multiprocessing tests are full of race conditions; some are legit bugs, some are bugs in the tests. I fixed tons of bugs in multiprocessing and in its test suite over the last few years, but if you search for "multiprocessing" in the bug tracker, you will see a long list of issues that have been open for several weeks or months.

IMO it's perfectly safe to merge this PR. All the issues that you saw are very likely already known (not regressions of your change).

Clearing the current thread state would be another can of worms, but your PR no longer does that ;-)

@markshannon (Member Author)

Ok, I'll merge this then.
It seems to not make things worse, and probably makes them better.

@markshannon merged commit af5d497 into python:main May 24, 2021
@markshannon deleted the free-thread-after-its-finished branch May 24, 2021 15:22
@vstinner (Member)

Yeah, it's fine; in the very worst case, we can just revert the change and revisit the options. But again, IMO this change is perfectly safe.
