
gh-106176: Fix reference leak in importlib/_bootstrap.py #106207

Closed
wants to merge 1 commit

Conversation

@Eclips4 (Member, Author) commented Jun 28, 2023

There are no tests for the case solved by #94504, so I'm somewhat unsure whether this is the right solution.

However, the refleaks in test_concurrency and PyUnpicklerTests (which are related to the refleaks in test_pickle on Windows) are gone.

@brettcannon (Member)

@exarkun any opinions on this fix?

@sunmy2019 (Member)

I think we are almost here!

sys.modules["_frozen_importlib"]._blocking_on

is the root of the leak.

@sunmy2019 (Member)

As for the fix, we usually clean up things like these within the test case and leave the cache as is.

@sunmy2019 sunmy2019 added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Jun 28, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @sunmy2019 for commit 5328e32 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Jun 28, 2023
@Eclips4 (Member, Author) commented Jun 28, 2023

I think we are almost here!

sys.modules["_frozen_importlib"]._blocking_on

is the root of the leak.

Yeah, you're right. The refleak happens in _blocking_on or somewhere nearby. I tried playing around with this, and it seems this is the most correct solution.

@sunmy2019 (Member)

Confirmed fixed on Windows
https://buildbot.python.org/all/#/builders/1214/builds/34

@sunmy2019 (Member)

How about this?

diff --git a/Lib/test/libregrtest/utils.py b/Lib/test/libregrtest/utils.py
index fd46819fd90..c5f301c2d18 100644
--- a/Lib/test/libregrtest/utils.py
+++ b/Lib/test/libregrtest/utils.py
@@ -125,6 +125,15 @@ def clear_caches():
         if stream is not None:
             stream.flush()
 
+    try:
+        _frozen_importlib = sys.modules['_frozen_importlib']
+    except KeyError:
+        pass
+    else:
+        _frozen_importlib._blocking_on = {
+            k: v for k, v in _frozen_importlib._blocking_on.items() if v
+        }
+
     try:
         re = sys.modules['re']
     except KeyError:
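For illustration, the pruning step in that hunk can be exercised standalone (the dict contents below are made up, standing in for the real _frozen_importlib state):

```python
# Simulated _blocking_on contents after several threads have come and gone:
# exited threads left empty lists behind; one live thread is still blocked.
blocking_on = {
    1001: [],               # exited thread: leaked entry
    1002: [],               # exited thread: leaked entry
    1003: ["module_lock"],  # thread currently blocked on an import lock
}

# Keep only entries whose list is non-empty, as the clear_caches() hunk does.
blocking_on = {k: v for k, v in blocking_on.items() if v}
print(blocking_on)  # {1003: ['module_lock']}
```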

@vstinner (Member)

How about this?

I would prefer not to leak so many implementation details in libregrtest. If importlib cannot clean that up automatically and it must be cleared, maybe add a private function for that.

But I don't see why all tests should have to clear that dict when only one test is affected. For me it would make sense to clear that dict in test_import and test_importlib, in a tearDownModule() function or something like that.

@Eclips4 (Member, Author) commented Jun 29, 2023

How about this?

I would prefer not to leak so many implementation details in libregrtest. If importlib cannot clean that up automatically and it must be cleared, maybe add a private function for that.

But I don't see why all tests should have to clear that dict when only one test is affected. For me it would make sense to clear that dict in test_import and test_importlib, in a tearDownModule() function or something like that.

This leak is also present in test_pickle; see #104702 for more details. Should we add a tearDownModule in test_pickle?
So, I think there are two solutions here.
First: modify libregrtest as @sunmy2019 proposed.
Second: write a private support function and call it from a tearDownModule function in test_import, test_importlib, and test_pickle.

@vstinner (Member)

This leak is also present in test_pickle; see #104702 for more details. Should we add a tearDownModule in test_pickle?

I don't understand how a pickle test leaks a deep internal importlib "blocking" object. Why is this "blocking" object created? Why is it not deleted automatically?

The refleaks test checks that each test iteration either creates no new objects or deletes all objects that it created.

@Eclips4 (Member, Author) commented Jun 29, 2023

This leak is also present in test_pickle; see #104702 for more details. Should we add a tearDownModule in test_pickle?

I don't understand how a pickle test leaks a deep internal importlib "blocking" object. Why is this "blocking" object created? Why is it not deleted automatically?

The refleaks test checks that each test iteration either creates no new objects or deletes all objects that it created.

Let me try to explain. Previously, the structure of _blocking_on looked like this: thread_id: module_lock_instance.
When we try to acquire a lock, we add a thread_id: module_lock_instance pair to _blocking_on, and when we finally acquire the lock, we pop the thread_id key from _blocking_on.

After the changes, it took this form: thread_id: list_of_module_lock_instances.
When we try to acquire a module lock, a thread_id: [module_lock_instance] key-value pair is created in the _blocking_on dict. After we finally acquire the lock, we remove the current module_lock_instance from _blocking_on[thread_id]. So what happens when the thread with thread_id dies? Nothing. It just stays in _blocking_on as it is (as thread_id: empty_list). A key-value pair is created, some work happens, and the key is never deleted from the dict when the thread with thread_id dies (but it should be).
So that key needs to be deleted from the dictionary explicitly.

Why does test_pickle leak? In test_pickle we have a test similar to test_concurrency in test_import: pickletester/test_unpickle_module_race.

_blocking_on is not cleared between test iterations; it just keeps growing (if we don't do something like this PR). Why is that? Well... I don't know. Probably _blocking_on is cleared when the interpreter finalizes. (Does the interpreter really finalize in each iteration of the refleak tests?)

At this point in my research, I think the solution presented in this PR is the only correct one.

@vstinner (Member)

So what happens when the thread with thread_id dies?

I guess that by "dying", you just mean that the thread exits cleanly.

Can this "per-thread" dictionary become thread-local storage, like threading.local(), so it's automatically cleared when the thread exits? You could use _thread._local().

Or can you register a function with threading._register_atexit() and clear _blocking_on[thread_id] at thread exit?

A practical problem is that the threading.py module is not imported yet when _bootstrap is run :-(

For me, this memory leak is a real issue. If an application spawns many threads, each thread creates an item in _blocking_on and the thread identifier is not recycled, so this dictionary will only grow and never be cleared: the memory usage only increases, even though the threads are no longer running (they have exited).

Prototype using thread local:

import threading
import _thread
import tracemalloc

NTHREAD = 10

def big_alloc():
    return bytearray(1024 * 50)

if 1:
    def mylock():
        thread_locals = _thread._local()
        thread_locals.blocks_on = big_alloc()
else:
    _blocks_on = {}

    def mylock():
        _blocks_on[_thread.get_ident()] = big_alloc()

def test():
    threads = [threading.Thread(target=mylock, args=()) for _ in range(NTHREAD)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    threads = None

# warmup to have a more accurate memory usage measurement
test()

tracemalloc.start()
before = tracemalloc.get_traced_memory()[0]

test()

after = tracemalloc.get_traced_memory()[0]
usage = after - before
print(f"usage: {usage} bytes")

It shows less than 1 kB of memory usage: no leak. If you replace if 1: with if 0:, it leaks about 500 kB: 50 kB per thread.

@brettcannon (Member)

I don't have a specific opinion on this, but do note that you have to make sure _thread can be used by importlib at start-up and that it works appropriately on platforms w/o fully-functioning threading (e.g. WebAssembly).

I guess that by "dying", you just mean that the thread exits cleanly.

It should mean the thread exited while in the middle of an import, else there's a logic error causing keys to get left behind in the dict even when all imports resolve.

@vstinner (Member)

It should mean the thread exited while in the middle of an import, else there's a logic error causing keys to get left behind in the dict even when all imports resolve.

_BlockingOnManager.__exit__() removes the lock from _blocking_on[thread_id], but it doesn't remove the _blocking_on[thread_id] entry from the dictionary. Maybe it should remove it, if the list becomes empty after the removal?

@brettcannon (Member)

Maybe it should remove it, if the list becomes empty after the removal?

That SGTM!

@sunmy2019 (Member)

Maybe it should remove it, if the list becomes empty after the removal?

That's just what this PR does.

@@ -85,6 +85,8 @@ def __enter__(self):
     def __exit__(self, *args, **kwargs):
         """Remove self.lock from this thread's _blocking_on list."""
         self.blocked_on.remove(self.lock)
+        if not self.blocked_on:
+            del _blocking_on[self.thread_id]
Review comment (Member):

Just have one question, is this always true?

assert self.blocked_on is _blocking_on[self.thread_id]

@Eclips4 (Member, Author) replied:

I think that if it were sometimes not true, self.blocked_on.remove(...) would fail, no?
Though currently the test suite didn't fail with the new assert check.

@sunmy2019 (Member) commented Jun 30, 2023:

See #101942 (comment)

It may not be True under certain circumstances.

@sunmy2019 (Member)

Actually, I think weakref delivers exactly what we need here. But is weakref available here?


@chris-eibl

ISTM that a similar fix for this problem has been tried before, but then got reverted:
#101942 (comment)

@vstinner (Member)

Oh. This problem, which looks simple, seems quite complicated to fix.

@Eclips4 (Member, Author) commented Jun 30, 2023

Actually, I think weakref delivers exactly what we need here. But is weakref available here?

Unfortunately, it seems not.
(Some functionality of weakref is available here, but not WeakValueDictionary.)
Although we could probably fully imitate WeakValueDictionary here.

@Eclips4 (Member, Author) commented Jun 30, 2023

As Guido says, __del__ (in _BlockingOnManager, just del _blocking_on[self.thread_id]) actually solves the refleak issue. But does it really solve the problem that #94504 is trying to solve?

@sunmy2019 (Member)

As Guido says, __del__ (in _BlockingOnManager, just del _blocking_on[self.thread_id]) actually solves the refleak issue. But does it really solve the problem that #94504 is trying to solve?

No. Deleting the list inside either __exit__ or __del__ can be interrupted by GC (triggered by the interpreter core).

Imports can happen during GC, causing the issue that #94504 is trying to solve.

@sunmy2019 (Member)

Actually, I think weakref delivers exactly what we need here. But is weakref available here?

The key point is that _weakref provides an atomic removal function which won't be interrupted by GC.

        def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):
            self = selfref()
            if self is not None:
                if self._iterating:
                    self._pending_removals.append(wr.key)
                else:
                    # Atomic removal is necessary since this function
                    # can be called asynchronously by the GC
                    _atomic_removal(self.data, wr.key)
static PyObject *
_weakref__remove_dead_weakref_impl(PyObject *module, PyObject *dct,
                                   PyObject *key)
/*[clinic end generated code: output=d9ff53061fcb875c input=19fc91f257f96a1d]*/
{
    if (_PyDict_DelItemIf(dct, key, is_dead_weakref) < 0) {
        if (PyErr_ExceptionMatches(PyExc_KeyError))
            /* This function is meant to allow safe weak-value dicts
               with GC in another thread (see issue #28427), so it's
               ok if the key doesn't exist anymore.
               */
            PyErr_Clear();
        else
            return NULL;
    }
    Py_RETURN_NONE;
}

We can provide similar things inside some C module.

@vstinner (Member)

We can provide similar things inside some C module.

importlib._bootstrap can use _weakref._remove_dead_weakref().
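A sketch of that idea (illustrative only; the class and helper names here are made up, and the real importlib fix may differ): a weakref callback uses _weakref._remove_dead_weakref() to prune the dict entry atomically once the referenced per-thread list dies.

```python
import weakref
import _weakref

class _List(list):
    """A plain list cannot be weak-referenced; a trivial subclass can."""

data = {}  # stand-in for _blocking_on: thread_id -> weakref to a _List

def add_entry(key, value):
    def on_dead(wr, key=key):
        # Atomic removal: safe even when the callback fires during GC,
        # and a no-op if the key has already been removed.
        _weakref._remove_dead_weakref(data, key)
    data[key] = weakref.ref(value, on_dead)

blocked = _List(["module_lock"])
add_entry(1001, blocked)
assert data[1001]() is blocked

del blocked              # drop the last strong reference
assert 1001 not in data  # the callback pruned the dead entry
```

This relies on CPython's reference counting to run the callback as soon as the list's last strong reference disappears.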

@exarkun (Contributor) commented Jun 30, 2023

Maybe it should remove it, if the list becomes empty after the removal?

That's just what this PR does.

You have to be sure to do it thread-safely and re-entrant-safely or the original bug is re-introduced.

@Eclips4 (Member, Author) commented Jun 30, 2023

Actually, I think weakref delivers exactly what we need here. But is weakref available here?

The key point is that _weakref provides an atomic removal function which won't be interrupted by GC. [...]

We can provide similar things inside some C module.

I was able to implement a WeakValueDictionary in _bootstrap.py. It seems weird, but it works. Is there a less painful way, without implementing an entire (in fact, half of a) WeakValueDictionary?
Can we somehow just use _remove_dead_weakref?

@brettcannon (Member)

Can we somehow just use _remove_dead_weakref?

If you're asking if you can use it at all, then the answer is "yes, you can use it" since _weakref is a built-in module and since the stdlib itself can use private APIs.

@brettcannon (Member)

Another option is if people can come up with a different solution to the import deadlock/race condition problem. It's obviously rather tricky (thanks, threads), but maybe someone else has another way to approach the problem?

@brettcannon (Member)

We can also revert the fix if this continues to block 3.12.0rc1 and rethink how to tackle the threaded import issue.

@exarkun (Contributor) commented Aug 2, 2023

We can also revert the fix if this continues to block 3.12.0rc1 and rethink how to tackle the threaded import issue.

The bug fixed caused real Python programs to crash in multiple environments in mysterious ways due to unpredictable interactions between the import system and third-party modules.

The problem introduced by the fix seems to be a leak of one dictionary entry per thread that performs an import and then exits. Some Python programs might create hundreds of thousands or millions of unique threads over their lifetime and might see noticeable memory usage as a result - but at least they won't crash when one of those threads gets unlucky and enters the importlib machinery at just the wrong time.

@brettcannon (Member)

The bug fixed caused real Python programs to crash in multiple environments in mysterious ways due to unpredictable interactions between the import system and third-party modules.

I'm aware it fixed actual issues (I did the code review and merge of your PR), but the issue isn't something a large portion of people run into. And considering it wasn't (potentially) fixed until Python 3.12, I would argue it wasn't crippling either.

The problem introduced by the fix seems to be a leak of one dictionary entry per thread that performs an import and then exits.

And the release manager for Python 3.12 is blocking the release due to this memory leak, which I consider the more critical issue.

As I said, I would love if someone can come up with a fix for the refleak and keep the fix, but if it's a choice between keeping the fix or stopping the refleak, the RM has chosen the latter as the more critical thing at the moment.

@exarkun (Contributor) commented Aug 2, 2023

The bug fixed caused real Python programs to crash in multiple environments in mysterious ways due to unpredictable interactions between the import system and third-party modules.

I'm aware it fixed actual issues (I did the code review and merge of your PR), but the issue isn't something a large portion of people run into. And considering it wasn't (potentially) fixed until Python 3.12, I would argue it wasn't crippling either.

The problem introduced by the fix seems to be a leak of one dictionary entry per thread that performs an import and then exits.

And the release manager for Python 3.12 is blocking the release due to this memory leak, which I consider the more critical issue.

This is news to me. Who is the release manager and where can I read about their concerns?

@brettcannon (Member)

where can I read about their concerns?

It's in the issue attached to this PR (you actually commented on the issue about how you didn't have time, so maybe you unsubscribed?).

@sunmy2019 (Member)

Is it possible to temporarily disable gc when deleting this empty list?

@brettcannon (Member)

Is it possible to temporarily disable gc when deleting this empty list?

How does that help with the memory staying around? The key issue is making sure appropriate cleanup occurs regardless of what eventually happens to the thread, as otherwise there end up being dangling objects.

@Yhg1s (Member) commented Aug 4, 2023

Eeh, I would like to clarify that I have not taken a stance on whether the original bug or the leak is more important. What does matter is that test_import currently leaks references, and that is a blocker. The leak in test_import can be fixed in different ways. I also don't think accumulating a new, effectively immortal object per thread is a good thing to do, but that in itself isn't blocking 3.12rc1.

@brettcannon (Member)

So what do we want to do here? I have been assuming @Eclips4 and @vstinner were working on this, but now I'm not sure.

@brettcannon (Member)

To help move this forward, I took the idea from @Eclips4 of using WeakValueDictionary and created #108497 .

@Eclips4 (Member, Author) commented Aug 29, 2023

#108497 has been merged, so let's close this.
Thanks to everyone who participated!

@Eclips4 Eclips4 closed this Aug 29, 2023
@Eclips4 Eclips4 deleted the issue-106176 branch August 29, 2023 09:58
8 participants