Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple racing class attribute read-write crashes on free-threaded builds #118362

Closed
Fidget-Spinner opened this issue Apr 28, 2024 · 14 comments
Closed
Assignees
Labels
topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@Fidget-Spinner
Copy link
Member

Fidget-Spinner commented Apr 28, 2024

Crash report

What happened?

The following segfaults:

from multiprocessing.dummy import Pool


NTHREADS = 6

class A:
    attr = 1

BOTTOM = 0
TOP = 1000

ITERS = 100
def read(id0):
    for _ in range(ITERS):
        for _ in range(BOTTOM, TOP):
            A.attr
            # print(A.attr)

def write(id0):
    for _ in range(ITERS):
        for _ in range(BOTTOM, TOP):
            # Make _PyType_Lookup cache hot first
            A.attr
            A.attr
            x = A.attr
            x += 1
            A.attr = x
            # print(f"WRITTEN {x}\n")


def main():    
    with Pool(NTHREADS) as pool:    
        pool.apply_async(read, (1,))
        pool.apply_async(write, (1,))
        pool.close()
        pool.join()

main()
print("done")

The first issue is that we have assert(_PyType_HasFeature(type, Py_TPFLAGS_VALID_VERSION_TAG)); in typeobject.c, which may not be true as another thread can modify the attribute in the meantime. We should remove that assertion.

The next issue, which I have not yet been able to solve, is that the attribute evaporates midway in _Py_type_getattro_impl. This is likely caused by a decref of another thread. So far I've tried changing _PyType_Lookup and find_name_in_mro to return strong references, but they don't seem to solve it for some reason.

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Output from running 'python -VV' on the command line:

No response

Linked PRs

@Fidget-Spinner Fidget-Spinner added type-crash A hard crash of the interpreter, possibly with a core dump topic-free-threading labels Apr 28, 2024
@Fidget-Spinner
Copy link
Member Author

I'm not assigning myself because I'm still unable to figure this out at the moment. So anyone's free to take it up.

@colesbury
Copy link
Contributor

The modification to the type's dictionary happens before type_modified_unlocked, which invalidates the cache. So the old attribute may be destroyed before the cache entry is invalidated.

There's a somewaht similar issue with the GIL because the destructors can call arbitrary code before the cache is invalidated, but it's much harder to trigger a crash because the old value finalizer is called before the memory is freed.

cc @DinoV

cpython/Objects/typeobject.c

Lines 5364 to 5379 in 8b56d82

BEGIN_TYPE_LOCK()
res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);
if (res == 0) {
/* Clear the VALID_VERSION flag of 'type' and all its
subclasses. This could possibly be unified with the
update_subclasses() recursion in update_slot(), but carefully:
they each have their own conditions on which to stop
recursing into subclasses. */
type_modified_unlocked(type);
if (is_dunder_name(name)) {
res = update_slot(type, name);
}
assert(_PyType_CheckConsistency(type));
}
END_TYPE_LOCK()

@colesbury
Copy link
Contributor

A starting point would be to call type_modified_unlocked(type) to invalidate the cache before _PyObject_GenericSetAttrWithDict. However, _PyObject_GenericSetAttrWithDict is complex and may suspend the type lock critical section, which might allow another thread to re-cache the stale value before the attribute is updated. I think we might want to also disable cache updates for the duration of the update, e.g.:

For example, something like the following pseudo-code:

BEGIN_TYPE_LOCK();
PyInterpreterState_Get()->types.cache_enabled = false;
type_modified_unlocked(type);
res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);
if (res == 0) {
    ...
}
PyInterpreterState_Get()->types.cache_enabled = true;
END_TYPE_LOCK();

The caching logic in _PyType_Lookup would only modify the cache if it's not disabled (i.e., not in the middle of an attribute update).

@DinoV DinoV self-assigned this Apr 30, 2024
@DinoV
Copy link
Contributor

DinoV commented Apr 30, 2024

type_modified_unlocked(type);
res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL);

I don't think we can mark the type as being modified before doing the set, as it delivers an event that may care about seeing the new value. There's also other potential issues around re-entrancy in the set, e.g. if there's a meta-type with a descriptor that then does a lookup against the type it'll have a valid version tag re-assigned before the mutation happens.

We could clear this cache entry first if it's the type and still use the flag to not update it. Another possibility would be that we lookup the current attribute (via the cache or MRO) and incref it and then decref it after the set. I'm kind of leaning towards the later just because the type cache is so hot and mutation of types is so rare.

@DinoV
Copy link
Contributor

DinoV commented Apr 30, 2024

I think there's an additional issue here though which is that _PyType_Lookup is just not thread safe because it's returning a borrowed reference.

@colesbury
Copy link
Contributor

it delivers an event that may care about seeing the new value...

Ok, we'd need to split up type_modified_unlocked and only do the clearing of the version tag first

if there's a meta-type with a descriptor that then does a lookup against the type it'll have a valid version tag re-assigned before the mutation happens...

I'm not sure I understand if there's some issue specific to meta-types here. I'm suggesting explicitly disabling all type cache updates for essentially the duration we hold the type lock. BEGIN/END_TYPE_LOCK almost does this already (it's a mutex) but not quite because the critical section can be suspended in lots of places in the middle allowing a concurrent type cache update to proceed.

Another possibility would be that we lookup the current attribute (via the cache or MRO) and incref it and then decref it after the set

This doesn't sound robust to me. One of the big problems is that the type lock is a critical section that doesn't really exclude concurrent updates because there are many opportunities for it to be suspended.

I think there's an additional issue here though which is that _PyType_Lookup is just not thread safe because it's returning a borrowed reference.

Yeah that's an additional issue.

@DinoV
Copy link
Contributor

DinoV commented Apr 30, 2024

it delivers an event that may care about seeing the new value...

Ok, we'd need to split up type_modified_unlocked and only do the clearing of the version tag first

if there's a meta-type with a descriptor that then does a lookup against the type it'll have a valid version tag re-assigned before the mutation happens...

I'm not sure I understand if there's some issue specific to meta-types here. I'm suggesting explicitly disabling all type cache updates for essentially the duration we hold the type lock. BEGIN/END_TYPE_LOCK almost does this already (it's a mutex) but not quite because the critical section can be suspended in lots of places in the middle allowing a concurrent type cache update to proceed.

My comment here was specifically about the pseudo-code moving the type_modified_unlocked(type); call above the set attr. I think the fact that this delivers the type modified event to external listeners alone prevents us from doing it. But the reason I called out meta-types in particular as they're more likely to cause an additional _PyType_Lookup to happen on the type causing a new version to be assigned to it before the set happens. I'm not sure there's a descriptor in the base type which would cause that but maybe there is. But either way we could:

  • Invalidate the version
  • Run some other Python code which assigns a version tag (whether through re-entrancy or a race, this can happen outside of the type cache)
  • Finally do the store into the types dictionary

Not updating the type cache during this process would mean we're not corrupting the type cache with a stale value but a type-watcher would still become out of sync. So at the very least I think the type_modified_unlocked call needs to happen after the set.

Another possibility would be that we lookup the current attribute (via the cache or MRO) and incref it and then decref it after the set

This doesn't sound robust to me. One of the big problems is that the type lock is a critical section that doesn't really exclude concurrent updates because there are many opportunities for it to be suspended.

But if the object is kept alive by type_setattro then those suspensions don't matter? The other threads have no other synchronization against the type so not seeing the write until after the type_modified_unlocked is delivered seems fine.

@colesbury
Copy link
Contributor

colesbury commented Apr 30, 2024

Ok, I think I understand now. I'd revise my suggestion so that type_setattro does:

  1. Acquire the type lock (critical section)
  2. Invalidate the types version cache recursively (i.e., the first part of type_modified_unlocked)
  3. Disable type cache updates
  4. Call _PyObject_GenericSetAttrWithDict
  5. Re-invalidate the version cache and trigger watchers (i.e., type_modified_unlocked)
  6. Enable type cache updates
  7. Release the type lock

The other threads have no other synchronization against the type so not seeing the write until after the type_modified_unlocked is delivered seems fine...

But they'll get stale cache entries -- that seems bad to me. To a certain extent, this already happens in CPython with the GIL, but it's going to be more common and weirder with the GIL disabled.

@DinoV
Copy link
Contributor

DinoV commented Apr 30, 2024

But they'll get stale cache entries -- that seems bad to me. To a certain extent, this already happens in CPython with the GIL, but it's going to be more common and weirder with the GIL disabled.

It seems like logically though there's no other synchronization happening and so the write isn't actually done until the version is invalidated. I guess it's possible that the re-entrancy could cause a 2nd write to happen that could be observed while still seeing the old value in the cache though breaking sequential consistency (although I guess to your point that's probably possible w/ the GIL already too).

@colesbury
Copy link
Contributor

colesbury commented Apr 30, 2024

Here's a single-threaded example of seeing a stale cached value (even with the GIL):
https://gist.github.com/colesbury/49ab040d89faab3b0a8bdb9552bfe266

I basically have two concerns:

  1. This is longstanding buggy CPython behavior, but if it becomes more frequently visible in the free-threaded build then that's a real problem.
  2. I'm not sure how we can keep alive the correct object current attribute given that the cache can have stale values and the type lock can be suspended. (I might just be missing something here; it's hard to reason about without seeing actual code.)

@DinoV
Copy link
Contributor

DinoV commented Apr 30, 2024

Here's a single-threaded example of seeing a stale cached value (even with the GIL): https://gist.github.com/colesbury/49ab040d89faab3b0a8bdb9552bfe266

I basically have two concerns:

1. This is longstanding buggy CPython behavior, but if it becomes more frequently visible in the free-threaded build then that's a real problem.

2. I'm not sure how we can keep alive the _correct_ object current attribute given that the cache can have stale values and the type lock can be suspended. (I might just be missing something here; it's hard to reason about without seeing actual code.)

This is what I was thinking for keeping it alive in type_setattro, we just see if it's currently in the cache and keep that alive, and if not we look it up in the MRO before hand and keep that alive:

    BEGIN_TYPE_LOCK()

#ifdef Py_GIL_DISABLED
    PyObject *old_value = NULL;
    // We could have the old value in the cache and we might destroy it before
    // we mark the type as being modified.  Therefore we keep the object alive
    // between the set and the type modified.
    unsigned int h = MCACHE_HASH_METHOD(type, name);
    struct type_cache *cache = get_type_cache();
    struct type_cache_entry *entry = &cache->hashtable[h];
    while (1) {
        int sequence = _PySeqLock_BeginRead(&entry->sequence);
        if (_Py_atomic_load_uint32_relaxed(&entry->version) == type->tp_version_tag &&
            _Py_atomic_load_ptr_relaxed(&entry->name) == name) {
            old_value = _Py_atomic_load_ptr_relaxed(&entry->value);

            // If the sequence is still valid then we're done
            if (_PySeqLock_EndRead(&entry->sequence, sequence)) {
                Py_XINCREF(old_value);
                break;
            }
        } else {
            int error;
            old_value = find_name_in_mro(type, name, &error);
            if (error) {
                return -1;
            }
            Py_XINCREF(old_value);
            break;
        }
    }
#endif

@DinoV
Copy link
Contributor

DinoV commented May 1, 2024

I think #118454 will fix it.

I went with something closer to what @colesbury described with the multi-phase type version invalidation. But to split the invalidation into 2 phases could potentially leave the types in an inconsistent state, or we'd need to remember what types we actually invalidated.

I've instead made the update of the dict atomic with the invalidation of the type. This isn't very difficult to guarantee because other than the descriptor case the only thing that can cause re-entrancy below the lock is the finalizer on the object. The keys we're using against the dict are all exact unicode, and types only contain exact unicode, so we know the lookups are otherwise safe.

@DinoV
Copy link
Contributor

DinoV commented May 1, 2024

Here's a single-threaded example of seeing a stale cached value (even with the GIL): https://gist.github.com/colesbury/49ab040d89faab3b0a8bdb9552bfe266
I basically have two concerns:

1. This is longstanding buggy CPython behavior, but if it becomes more frequently visible in the free-threaded build then that's a real problem.

2. I'm not sure how we can keep alive the _correct_ object current attribute given that the cache can have stale values and the type lock can be suspended. (I might just be missing something here; it's hard to reason about without seeing actual code.)

This is what I was thinking for keeping it alive in type_setattro, we just see if it's currently in the cache and keep that alive, and if not we look it up in the MRO before hand and keep that alive

And that wouldn't have worked if the value was cached for a subclass but not for the defining class.

DinoV added a commit that referenced this issue May 6, 2024
…e face of concurrent mutators (#118454)

Add _PyType_LookupRef and use incref before setting attribute on type
Makes setting an attribute on a class and signaling type modified atomic
Avoid adding re-entrancy exposing the type cache in an inconsistent state by decrefing after type is updated
DinoV added a commit that referenced this issue May 6, 2024
* Skip tests when threads aren't available

* Use ThreadPoolExecutor
@colesbury
Copy link
Contributor

This is fixed now

SonicField pushed a commit to SonicField/cpython that referenced this issue May 8, 2024
… in the face of concurrent mutators (python#118454)

Add _PyType_LookupRef and use incref before setting attribute on type
Makes setting an attribute on a class and signaling type modified atomic
Avoid adding re-entrancy exposing the type cache in an inconsistent state by decrefing after type is updated
SonicField pushed a commit to SonicField/cpython that referenced this issue May 8, 2024
…8666)

* Skip tests when threads aren't available

* Use ThreadPoolExecutor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

No branches or pull requests

3 participants