
bpo-44032: Defer clearing last thread state until last GC has been run. #26285

Merged

Conversation

@markshannon (Member) commented May 21, 2021

@markshannon (Member Author)

@erlend-aasland Can you confirm that this fixes the issue?

Python/pystate.c Outdated
@@ -324,6 +326,7 @@ interpreter_clear(PyInterpreterState *interp, PyThreadState *tstate)

    /* Last garbage collection on this interpreter */
    _PyGC_CollectNoFail(tstate);
    PyThreadState_Clear(tstate);
Member

For safety, I would prefer to ensure that the tstate belongs to the interpreter, in case PyInterpreterState_Clear() is called from a different interpreter (even though I don't know if that's currently possible):

Suggested change
    PyThreadState_Clear(tstate);
    // Don't clear tstate if it belongs to another interpreter
    if (tstate->interp == interp) {
        PyThreadState_Clear(tstate);
    }
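For reference, applying this suggestion on top of the hunk above would make the tail of interpreter_clear() read roughly like the sketch below; this is only an illustration of the combined change, not the exact code that was merged:

    /* Last garbage collection on this interpreter */
    _PyGC_CollectNoFail(tstate);

    /* Clear the last thread state only after the final GC has run,
       and only if it actually belongs to this interpreter. */
    if (tstate->interp == interp) {
        PyThreadState_Clear(tstate);
    }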

@vstinner (Member) left a comment

A thread state should remain more or less usable after PyInterpreterState_Clear() is called. I don't think this PR fixes the root issue of https://bugs.python.org/issue44032#msg394104

IMO it would be safer to only release datastack_chunk memory in tstate_delete_common(), rather than in PyThreadState_Clear().

This change remains interesting ;-)
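For illustration, moving the release of the data stack into tstate_delete_common() could look roughly like the sketch below. The _PyStackChunk layout and _PyObject_VirtualFree() follow the CPython sources of that period, but the helper name and exact placement are assumptions, not the change that was merged:

static void
free_datastack_chunks(PyThreadState *tstate)
{
    /* Walk the chain of stack chunks owned by this thread state and
       return their memory to the system.  In this sketch the helper is
       called from tstate_delete_common() rather than PyThreadState_Clear(). */
    _PyStackChunk *chunk = tstate->datastack_chunk;
    tstate->datastack_chunk = NULL;
    tstate->datastack_top = NULL;
    tstate->datastack_limit = NULL;
    while (chunk != NULL) {
        _PyStackChunk *previous = chunk->previous;
        _PyObject_VirtualFree(chunk, chunk->size);
        chunk = previous;
    }
}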

@erlend-aasland (Contributor) commented May 21, 2021

@erlend-aasland Can you confirm that this fixes the issue?

I'll do so as soon as I'm back at my computer.

UPDATE: This PR does not fix the issue. I've used @pablogsal's reproducer from #26274 (comment). Here's an excerpt of the dump (complete dump attached):

=================================================================
==23898==ERROR: AddressSanitizer: heap-use-after-free on address 0x619000041b48 at pc 0x0001093d7fbe bp 0x7ffee6d13700 sp 0x7ffee6d136f8
READ of size 8 at 0x619000041b48 thread T0
    #0 0x1093d7fbd in subtype_dealloc typeobject.c:1456
    #1 0x10939e680 in _Py_Dealloc object.c:2288
    #2 0x1092d81ac in _Py_DECREF object.h:500
    #3 0x1092d8bdb in _Py_XDECREF object.h:567
    #4 0x1092de271 in list_dealloc listobject.c:342
    #5 0x10939e680 in _Py_Dealloc object.c:2288
    #6 0x10932dafc in _Py_DECREF object.h:500
    #7 0x10933a41b in _Py_XDECREF object.h:567
    #8 0x10933fbfb in dict_dealloc dictobject.c:2068
    #9 0x10939e680 in _Py_Dealloc object.c:2288
    #10 0x1095bfedc in _Py_DECREF object.h:500
    #11 0x1095eb47d in ast_clear Python-ast.c:782
    #12 0x1093f6c51 in subtype_clear typeobject.c:1303
    #13 0x1098c6e70 in delete_garbage gcmodule.c:1017
    #14 0x1098be9af in gc_collect_main gcmodule.c:1300
    #15 0x1098bdf9c in _PyGC_CollectNoFail gcmodule.c:2123
    #16 0x109812be0 in interpreter_clear pystate.c:328
    #17 0x109812e81 in _PyInterpreterState_Clear pystate.c:361
    #18 0x1098017b9 in finalize_interp_clear pylifecycle.c:1634
    #19 0x10980123b in Py_FinalizeEx pylifecycle.c:1812
    #20 0x1098b7143 in Py_RunMain main.c:668
    #21 0x1098b83e4 in pymain_main main.c:696
    #22 0x1098b8694 in Py_BytesMain main.c:720
    #23 0x108eeda71 in main python.c:15
    #24 0x7fff20582f3c in start+0x0 (libdyld.dylib:x86_64+0x15f3c)

0x619000041b48 is located 200 bytes inside of 968-byte region [0x619000041a80,0x619000041e48)
freed by thread T0 here:
    #0 0x10a3ad609 in wrap_free+0xa9 (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x48609)
    #1 0x10939ff08 in _PyMem_RawFree obmalloc.c:127
    #2 0x1093a0ed9 in _PyMem_DebugRawFree obmalloc.c:2540
    #3 0x1093a0fb8 in _PyMem_DebugFree obmalloc.c:2673
    #4 0x1093a1f8f in PyObject_Free obmalloc.c:721
    #5 0x1098c1151 in PyObject_GC_Del gcmodule.c:2345
    #6 0x1093db0e7 in type_dealloc typeobject.c:4041
    #7 0x10939e680 in _Py_Dealloc object.c:2288
    #8 0x1095bfedc in _Py_DECREF object.h:500
    #9 0x1095eb1dd in ast_dealloc Python-ast.c:768
    #10 0x1093d7f83 in subtype_dealloc typeobject.c:1450

dump.txt

@markshannon (Member Author) commented May 21, 2021

@vstinner I don't see how a thread state can remain usable after the interpreter has been cleared.

@erlend-aasland (Contributor) commented May 21, 2021

I just synced with upstream/main to get @vstinner's fix for issue 44184 (GH-26274, 615069e). With that in main and this PR applied, the issue is fixed; thanks! You can disregard my previous comment.

@erlend-aasland (Contributor)

A thread state should remain more or less usable after PyInterpreterState_Clear() is called. I don't think this PR fixes the root issue of https://bugs.python.org/issue44032#msg394104

IMO it would be safer to only release datastack_chunk memory in tstate_delete_common(), rather than in PyThreadState_Clear().

FYI, freeing the data stack chunks in tstate_delete_common() instead of deferring PyThreadState_Clear() works fine with the reproducer.

@@ -900,13 +897,6 @@ PyThreadState_Clear(PyThreadState *tstate)
    if (tstate->on_delete != NULL) {
        tstate->on_delete(tstate->on_delete_data);
Contributor

Should this also be moved to tstate_delete_common, or is it unrelated to deleting the data stack?

Member Author

I have no idea what it does, so I'm not touching it 🙂

@erlend-aasland (Contributor) May 22, 2021

Me neither :) I tried moving it, and it seems to work fine in either place. BTW, I found a clue in Include:

* So instead of holding the lock directly, the tstate holds a weakref to
* the lock: that's the value of on_delete_data below. Decref'ing a
* weakref is harmless.
* on_delete points to _threadmodule.c's static release_sentinel() function.
* After the tstate is unlinked, release_sentinel is called with the
* weakref-to-lock (on_delete_data) argument, and release_sentinel releases
* the indirectly held lock.
*/
void (*on_delete)(void *);
void *on_delete_data;

Contributor

Also: https://docs.python.org/3/c-api/init.html#c.PyThreadState_Clear:

"Changed in version 3.9: This function now calls the PyThreadState.on_delete callback. Previously, that happened in PyThreadState_Delete()."

@erlend-aasland (Contributor) May 22, 2021

... and here:

release_sentinel(void *wr_raw)
{
    PyObject *wr = _PyObject_CAST(wr_raw);
    /* Tricky: this function is called when the current thread state
       is being deleted. Therefore, only simple C code can safely
       execute here. */
    PyObject *obj = PyWeakref_GET_OBJECT(wr);
    lockobject *lock;
    if (obj != Py_None) {
        lock = (lockobject *) obj;
        if (lock->locked) {
            PyThread_release_lock(lock->lock_lock);
            lock->locked = 0;
        }
    }
    /* Deallocating a weakref with a NULL callback only calls
       PyObject_GC_Del(), which can't call any Python code. */
    Py_DECREF(wr);
}

static PyObject *
thread__set_sentinel(PyObject *module, PyObject *Py_UNUSED(ignored))
{
    PyObject *wr;
    PyThreadState *tstate = PyThreadState_Get();
    lockobject *lock;

    if (tstate->on_delete_data != NULL) {
        /* We must support the re-creation of the lock from a
           fork()ed child. */
        assert(tstate->on_delete == &release_sentinel);
        wr = (PyObject *) tstate->on_delete_data;
        tstate->on_delete = NULL;
        tstate->on_delete_data = NULL;
        Py_DECREF(wr);
    }
    lock = newlockobject(module);
    if (lock == NULL)
        return NULL;
    /* The lock is owned by whoever called _set_sentinel(), but the weakref
       hangs to the thread state. */
    wr = PyWeakref_NewRef((PyObject *) lock, NULL);
    if (wr == NULL) {
        Py_DECREF(lock);
        return NULL;
    }
    tstate->on_delete_data = (void *) wr;
    tstate->on_delete = &release_sentinel;
    return (PyObject *) lock;
}

Contributor

It was added by @vstinner in GH-18296 (issue 39511), so I guess he knows whether it should stay or follow, then :)

@markshannon added the 🔨 test-with-buildbots label May 22, 2021
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @markshannon for commit ddad17e 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots label May 22, 2021
@vstinner (Member) left a comment

LGTM.

Calling _PyObject_VirtualFree() in tstate_delete_common() sounds like the right thing to do. It doesn't mean that a tstate must remain usable after calling PyThreadState_Clear(tstate); it's just about consistency.

@vstinner (Member)

buildbot/AMD64 RHEL7 Refleaks PR — Build done.
buildbot/AMD64 RHEL8 Refleaks PR — Build done.
buildbot/s390x Fedora Refleaks PR — Build done.

test_asyncio hangs, unrelated error: https://bugs.python.org/issue44112

@markshannon (Member Author)

I'm not worried about asyncio, but test_multiprocessing also fails on one buildbot and test_posix on another.

There doesn't seem to be a coherent model of how threads and interpreters are related, or what memory belongs to the interpreter or thread.
As a result, cleaning up after a fork is impossible to get right.

@erlend-aasland (Contributor)

There doesn't seem to be a coherent model of how threads and interpreters are related, or what memory belongs to the interpreter or thread.
As a result, cleaning up after a fork is impossible to get right.

Sounds like an issue should be opened for this, if there's not already one on bpo.

@vstinner (Member)

AMD64 RHEL8 Refleaks PR:

test_posix failed
...
test_posix leaked [1, 0, 0] file descriptors, sum=1

It can be a random error that doesn't fail when re-run in verbose mode; we don't know, since the job hangs before test_posix can be re-run.

@vstinner (Member)

I'm not worried about asyncio, but test_multiprocessing also fails on one buildbot and test_posix on another.

I fail to see the test_multiprocessing error. The multiprocessing tests are full of race conditions; some are legit bugs, some are bugs in the tests. I fixed tons of bugs in multiprocessing and in its test suite over the last few years, but if you search for "multiprocessing" in the bug tracker, you will see a long list of issues that have been open for several weeks or months.

IMO it's perfectly safe to merge this PR. All the issues that you saw are very likely already known (not regressions of your change).

Clearing the current thread state would be another can of worms, but your PR no longer does that ;-)

@markshannon (Member Author)

Ok, I'll merge this then.
It seems to not make things worse, and probably makes them better.

@markshannon merged commit af5d497 into python:main May 24, 2021
@markshannon deleted the free-thread-after-its-finished branch May 24, 2021 15:22
@vstinner (Member)

Yeah, it's fine; in the very worst case, we can just revert the change and revisit the options. But again, IMO this change is perfectly safe.
