
gh-106176: Fix reference leak in importlib/_bootstrap.py #106207

Closed
wants to merge 1 commit

Conversation

@Eclips4 (Member, Author) commented Jun 28, 2023

There are no tests for the case solved by #94504, so I'm somewhat unsure whether this is the right solution.

However, the refleaks in test_concurrency and PyUnpicklerTests (which are related to the refleaks in test_pickle on Windows) are gone.

@brettcannon (Member)

@exarkun any opinions on this fix?

@sunmy2019 (Member)

I think we are almost here!

sys.modules["_frozen_importlib"]._blocking_on

is the root of the leak.

@sunmy2019 (Member)

As for the fix, we usually clean up things like these within the test case and leave the cache as is.

@sunmy2019 sunmy2019 added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Jun 28, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @sunmy2019 for commit 5328e32 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Jun 28, 2023
@Eclips4 (Member, Author) commented Jun 28, 2023

I think we are almost here!

sys.modules["_frozen_importlib"]._blocking_on

is the root of the leak.

Yeah, you're right. The refleak happens in _blocking_on or somewhere nearby. I tried playing around with this, and it seems this is the most correct solution.

@sunmy2019 (Member)

Confirmed fixed on Windows
https://buildbot.python.org/all/#/builders/1214/builds/34

@sunmy2019 (Member)

How about this?

diff --git a/Lib/test/libregrtest/utils.py b/Lib/test/libregrtest/utils.py
index fd46819fd90..c5f301c2d18 100644
--- a/Lib/test/libregrtest/utils.py
+++ b/Lib/test/libregrtest/utils.py
@@ -125,6 +125,15 @@ def clear_caches():
         if stream is not None:
             stream.flush()
 
+    try:
+        _frozen_importlib = sys.modules['_frozen_importlib']
+    except KeyError:
+        pass
+    else:
+        _frozen_importlib._blocking_on = {
+            k: v for k, v in _frozen_importlib._blocking_on.items() if v
+        }
+
     try:
         re = sys.modules['re']
     except KeyError:
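For illustration, the pruning step in that hunk can be exercised standalone (the dict contents below are made up, standing in for the real _frozen_importlib state):

```python
# Simulated _blocking_on contents after several threads have come and gone:
# exited threads left empty lists behind; one live thread is still blocked.
blocking_on = {
    1001: [],               # exited thread: leaked entry
    1002: [],               # exited thread: leaked entry
    1003: ["module_lock"],  # thread currently blocked on an import lock
}

# Keep only entries whose list is non-empty, as the clear_caches() hunk does.
blocking_on = {k: v for k, v in blocking_on.items() if v}
print(blocking_on)  # {1003: ['module_lock']}
```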

@vstinner (Member)

How about this?

I would prefer not to leak so many implementation details in libregrtest. If importlib cannot clean that up automatically and it must be cleared, maybe add a private function for that.

But I don't see why all tests should have to clear that dict when only one test is affected. For me it would make sense to clear that dict in test_import and test_importlib, in a tearDownModule() function or something like that.

@Eclips4 (Member, Author) commented Jun 29, 2023

How about this?

I would prefer not to leak so many implementation details in libregrtest. If importlib cannot clean that up automatically and it must be cleared, maybe add a private function for that.

But I don't see why all tests should have to clear that dict when only one test is affected. For me it would make sense to clear that dict in test_import and test_importlib, in a tearDownModule() function or something like that.

This leak is also present in test_pickle; see #104702 for more details. Should we add a tearDownModule in test_pickle?
So, I think there are two solutions here.
First: modify libregrtest as @sunmy2019 proposed.
Second: write a private support function and call it from a tearDownModule function in test_import, test_importlib, and test_pickle.

@vstinner (Member)

This leak is also present in test_pickle; see #104702 for more details. Should we add a tearDownModule in test_pickle?

I don't understand how a pickle test leaks a deep internal importlib "blocking" object. Why is this "blocking" object created? Why is it not deleted automatically?

The refleaks test checks that each test iteration either creates no new objects or deletes all objects that it created.

@Eclips4 (Member, Author) commented Jun 29, 2023

This leak is also present in test_pickle; see #104702 for more details. Should we add a tearDownModule in test_pickle?

I don't understand how a pickle test leaks a deep internal importlib "blocking" object. Why is this "blocking" object created? Why is it not deleted automatically?

The refleaks test checks that each test iteration either creates no new objects or deletes all objects that it created.

Let me try to explain. Previously, the structure of _blocking_on looked like this: thread_id: module_lock_instance.
When we try to acquire a lock, we add a thread_id: module_lock_instance pair to _blocking_on, and when we finally acquire the lock, we pop the thread_id key from _blocking_on.

After the changes, it took this form: thread_id: list_of_module_lock_instances.
When we try to acquire a module lock, a thread_id: [module_lock_instance] key-value pair is created in the _blocking_on dict. After we finally acquire the lock, we remove the current module_lock_instance from _blocking_on[thread_id]. So what happens when the thread with thread_id dies? Nothing. It just stays in _blocking_on as it is (as thread_id: empty_list). A key-value pair is created, some work happens, and the key is never deleted from the dict when the thread with thread_id dies (but it should be).
So that key needs to be deleted from the dictionary explicitly.

Why does test_pickle leak? In test_pickle we have a test similar to test_concurrency in test_import: pickletester/test_unpickle_module_race.

_blocking_on is not cleared between test iterations; it just keeps growing (if we don't do something like this PR). Why is that? Well... I don't know. Probably _blocking_on is cleared when the interpreter finalizes. (Does the interpreter really finalize in each iteration of the refleak tests?)

At this point in my research, I think the solution presented in this PR is the only correct one.

@vstinner (Member)

So what happens when the thread with thread_id dies?

I guess that by "dying", you just mean that the thread exits cleanly.

Can this "per-thread" dictionary become thread-local storage, like threading.local(), so it's automatically cleared when the thread exits? You could use _thread._local().

Or can you register a function with threading._register_atexit() and clear _blocking_on[thread_id] at thread exit?

A practical problem is that the threading.py module is not imported yet when _bootstrap is run :-(

For me, this memory leak is a real issue. If an application spawns many threads, each thread creates an item in _blocking_on and the thread identifier is not recycled, so this dictionary will only grow and never be cleared: the memory usage only increases, even though the threads are no longer running (they have exited).

Prototype using thread local:

import threading
import _thread
import tracemalloc

NTHREAD = 10

def big_alloc():
    return bytearray(1024 * 50)

if 1:
    def mylock():
        thread_locals = _thread._local()
        thread_locals.blocks_on = big_alloc()
else:
    _blocks_on = {}

    def mylock():
        _blocks_on[_thread.get_ident()] = big_alloc()

def test():
    threads = [threading.Thread(target=mylock, args=()) for _ in range(NTHREAD)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    threads = None

# warmup to have a more accurate memory usage measurement
test()

tracemalloc.start()
before = tracemalloc.get_traced_memory()[0]

test()

after = tracemalloc.get_traced_memory()[0]
usage = after - before
print(f"usage: {usage} bytes")

It shows less than 1 kB of memory usage: no leak. If you replace if 1: with if 0:, it leaks about 500 kB: 50 kB per thread.

@brettcannon (Member)

I don't have a specific opinion on this, but do note that you have to make sure _thread can be used by importlib at start-up and that it works appropriately on platforms w/o fully-functioning threading (e.g. WebAssembly).

I guess that by "dying", you just mean that the thread exits cleanly.

It should mean the thread exited while in the middle of an import, else there's a logic error causing keys to get left behind in the dict even when all imports resolve.

@vstinner (Member)

It should mean the thread exited while in the middle of an import, else there's a logic error causing keys to get left behind in the dict even when all imports resolve.

_BlockingOnManager.__exit__() removes the lock from _blocking_on[thread_id], but it doesn't remove the _blocking_on[thread_id] entry from the dictionary. Maybe it should remove it, if the list becomes empty after the removal?

@brettcannon (Member)

Maybe it should remove it, if the list becomes empty after the removal?

That SGTM!

@sunmy2019 (Member)

Maybe it should remove it, if the list becomes empty after the removal?

That's just what this PR does.

@@ -85,6 +85,8 @@ def __enter__(self):
     def __exit__(self, *args, **kwargs):
         """Remove self.lock from this thread's _blocking_on list."""
         self.blocked_on.remove(self.lock)
+        if not self.blocked_on:
+            del _blocking_on[self.thread_id]
Review comment (Member):

Just have one question, is this always true?

assert self.blocked_on is _blocking_on[self.thread_id]

@Eclips4 (Member, Author) replied:

I think that if it were sometimes not true, self.blocked_on.remove(...) would fail, no?
Though currently the test suite didn't fail with the new assert check.

@sunmy2019 (Member) commented Jun 30, 2023:

See #101942 (comment)

It may not be True under certain circumstances.

@sunmy2019 (Member)

Actually, I think weakref delivers exactly what we need here. But is weakref available here?


@chris-eibl

ISTM that a similar fix for this problem has been tried before, but then got reverted:
#101942 (comment)

@vstinner (Member)

Oh. This problem, which looks simple, seems quite complicated to fix.

@Eclips4 (Member, Author) commented Jun 30, 2023

Actually, I think weakref delivers exactly what we need here. But is weakref available here?

Unfortunately, it seems not.
(Some functionality of weakref is available here, but not WeakValueDictionary.)
Although we could probably fully imitate WeakValueDictionary here.

@Eclips4 (Member, Author) commented Jun 30, 2023

As Guido says, __del__ (in _BlockingOnManager, just del _blocking_on[self.thread_id]) actually solves the refleak issue. But does it really solve the problem that #94504 is trying to solve?

@sunmy2019 (Member)

As Guido says, __del__ (in _BlockingOnManager, just del _blocking_on[self.thread_id]) actually solves the refleak issue. But does it really solve the problem that #94504 is trying to solve?

No. Deleting the list inside either __exit__ or __del__ can be interrupted by GC (triggered by the interpreter core).

Imports can happen during GC, causing the issue that #94504 is trying to solve.

@sunmy2019 (Member)

Actually, I think weakref delivers exactly what we need here. But is weakref available here?

The key point is that _weakref provides an atomic removal function which won't be interrupted by GC.

        def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
            self = selfref()
            if self is not None:
                if self._iterating:
                    self._pending_removals.append(wr.key)
                else:
                    # Atomic removal is necessary since this function
                    # can be called asynchronously by the GC
                    _atomic_removal(self.data, wr.key)
static PyObject *
_weakref__remove_dead_weakref_impl(PyObject *module, PyObject *dct,
                                   PyObject *key)
/*[clinic end generated code: output=d9ff53061fcb875c input=19fc91f257f96a1d]*/
{
    if (_PyDict_DelItemIf(dct, key, is_dead_weakref) < 0) {
        if (PyErr_ExceptionMatches(PyExc_KeyError))
            /* This function is meant to allow safe weak-value dicts
               with GC in another thread (see issue #28427), so it's
               ok if the key doesn't exist anymore.
               */
            PyErr_Clear();
        else
            return NULL;
    }
    Py_RETURN_NONE;
}

We can provide similar things inside some C module.

@vstinner (Member)

We can provide similar things inside some C module.

importlib._bootstrap can use _weakref._remove_dead_weakref().
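A sketch of that idea (illustrative only; the class and helper names here are made up, and the real importlib fix may differ): a weakref callback uses _weakref._remove_dead_weakref() to prune the dict entry atomically once the referenced per-thread list dies.

```python
import weakref
import _weakref

class _List(list):
    """A plain list cannot be weak-referenced; a trivial subclass can."""

data = {}  # stand-in for _blocking_on: thread_id -> weakref to a _List

def add_entry(key, value):
    def on_dead(wr, key=key):
        # Atomic removal: safe even when the callback fires during GC,
        # and a no-op if the key has already been removed.
        _weakref._remove_dead_weakref(data, key)
    data[key] = weakref.ref(value, on_dead)

blocked = _List(["module_lock"])
add_entry(1001, blocked)
assert data[1001]() is blocked

del blocked              # drop the last strong reference
assert 1001 not in data  # the callback pruned the dead entry
```

This relies on CPython's reference counting to run the callback as soon as the list's last strong reference disappears.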

@exarkun (Contributor) commented Jun 30, 2023

Maybe it should remove it, if the list becomes empty after the removal?

That's just what this PR does.

You have to be sure to do it thread-safely and re-entrant-safely or the original bug is re-introduced.

@Eclips4 (Member, Author) commented Jun 30, 2023

Actually, I think weakref delivers exactly what we need here. But is weakref available here?

The key point is that _weakref provides an atomic removal function which won't be interrupted by GC. [...]

We can provide similar things inside some C module.

I was able to implement a WeakValueDictionary in _bootstrap.py. It seems weird, but it works. Is there a less painful way, without implementing an entire (in fact, half of a) WeakValueDictionary?
Can we somehow just use _remove_dead_weakref?

@brettcannon (Member)

Can we somehow just use _remove_dead_weakref?

If you're asking if you can use it at all, then the answer is "yes, you can use it" since _weakref is a built-in module and since the stdlib itself can use private APIs.

@brettcannon (Member)

Another option is if people can come up with a different solution to the import deadlock/race condition problem. It's obviously rather tricky (thanks, threads), but maybe someone else has another way to approach the problem?

@brettcannon (Member)

We can also revert the fix if this continues to block 3.12.0rc1 and rethink how to tackle the threaded import issue.

@exarkun (Contributor) commented Aug 2, 2023

We can also revert the fix if this continues to block 3.12.0rc1 and rethink how to tackle the threaded import issue.

The bug fixed caused real Python programs to crash in multiple environments in mysterious ways due to unpredictable interactions between the import system and third-party modules.

The problem introduced by the fix seems to be a leak of one dictionary entry per thread that performs an import and then exits. Some Python programs might create hundreds of thousands or millions of unique threads over their lifetime and might see noticeable memory usage as a result - but at least they won't crash when one of those threads gets unlucky and enters the importlib machinery at just the wrong time.

@brettcannon (Member)

The bug fixed caused real Python programs to crash in multiple environments in mysterious ways due to unpredictable interactions between the import system and third-party modules.

I'm aware it fixed actual issues (I did the code review and merge of your PR), but the issue isn't something a large portion of people run into. And considering it wasn't (potentially) fixed until Python 3.12, I would argue it wasn't crippling either.

The problem introduced by the fix seems to be a leak of one dictionary entry per thread that performs an import and then exits.

And the release manager for Python 3.12 is blocking the release due to this memory leak, which I consider the more critical issue.

As I said, I would love if someone can come up with a fix for the refleak and keep the fix, but if it's a choice between keeping the fix or stopping the refleak, the RM has chosen the latter as the more critical thing at the moment.

@exarkun (Contributor) commented Aug 2, 2023

The bug fixed caused real Python programs to crash in multiple environments in mysterious ways due to unpredictable interactions between the import system and third-party modules.

I'm aware it fixed actual issues (I did the code review and merge of your PR), but the issue isn't something a large portion of people run into. And considering it wasn't (potentially) fixed until Python 3.12, I would argue it wasn't crippling either.

The problem introduced by the fix seems to be a leak of one dictionary entry per thread that performs an import and then exits.

And the release manager for Python 3.12 is blocking the release due to this memory leak, which I consider the more critical issue.

This is news to me. Who is the release manager and where can I read about their concerns?

@brettcannon (Member)

where can I read about their concerns?

It's in the issue attached to this PR (you actually commented on the issue about how you didn't have time, so maybe you unsubscribed?).

@sunmy2019 (Member)

Is it possible to temporarily disable gc when deleting this empty list?

@brettcannon (Member)

Is it possible to temporarily disable gc when deleting this empty list?

How does that help with the memory staying around? The key issue is making sure appropriate cleanup occurs regardless of what eventually happens to the thread, as otherwise there end up being dangling objects.

@Yhg1s (Member) commented Aug 4, 2023

Eeh, I would like to clarify that I have not taken a stance on whether the original bug or the leak is more important. What does matter is that test_import currently leaks references, and that is a blocker. The leak in test_import can be fixed in different ways. I also don't think accumulating a new, effectively immortal object per thread is a good thing to do, but that in itself isn't blocking 3.12rc1.

@brettcannon (Member)

So what do we want to do here? I have been assuming @Eclips4 and @vstinner were working on this, but now I'm not sure.

@brettcannon (Member)

To help move this forward, I took the idea from @Eclips4 of using WeakValueDictionary and created #108497 .

@Eclips4 (Member, Author) commented Aug 29, 2023

#108497 has been merged, so let's close this.
Thanks to everyone who participated!

@Eclips4 Eclips4 closed this Aug 29, 2023
@Eclips4 Eclips4 deleted the issue-106176 branch August 29, 2023 09:58
8 participants