bpo-40089: Add _at_fork_reinit() method to locks #19195
Conversation
@pitrou, @methane or @pablogsal: Would you mind reviewing this change? I have no strong opinion on the method name :-) threading.Thread uses the name "_reset_internal_locks". The logging module uses the name "_after_at_fork_child_reinit_locks".
I don't understand: where is the new method actually used?
With this PR, it's used by threading._after_fork() to reinitialize Thread._tstate_lock and Thread.started._cond: see Thread._reset_internal_locks(). Then, I plan to use the new method in logging._after_at_fork_child_reinit_locks(): see bpo-40091.
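A rough, hypothetical sketch (not the exact CPython code) of what the threading-side reset can look like with this PR: the existing lock objects are reset in place instead of being replaced by new objects. The standalone function and its arguments are illustrative only.

```python
def _reset_internal_locks_sketch(thread, is_alive):
    # Reset the Event guarding the "started" state; with this PR, its
    # private reset method is also named _at_fork_reinit().
    thread._started._at_fork_reinit()
    if is_alive:
        # Reset the existing _tstate_lock to the unlocked state instead of
        # binding a brand-new lock object to the attribute.
        thread._tstate_lock._at_fork_reinit()
    else:
        # The thread isn't alive after fork: it has no tstate anymore.
        thread._tstate_lock = None
```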
Also cc @gpshead
Ok, I managed to do what I wanted: the _at_fork_reinit() method now also releases memory! So the overall PR has multiple advantages compared to the current code:
Without this PR, we have to create new lock objects and so rebind variables in namespaces. It means that we have to be very careful not to keep old copies of these variables. For example, nothing prevents someone from writing "acquire = condition.acquire", which is a bound method of the inner condition lock! Another minor bug in the current code: _reset_internal_locks() doesn't remove the old lock from the _shutdown_locks set. That's another example of a bug that this change prevents.
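To illustrate the stale-reference hazard mentioned above, here is a small sketch (not code from the PR) showing how a cached bound method keeps pointing at the old inner lock after the "create a new object and rebind" approach:

```python
import threading

cond = threading.Condition()
# Condition exposes acquire/release as bound methods of its inner lock,
# so nothing stops calling code from caching one:
acquire = cond.acquire

# Pre-PR approach after fork: build a fresh Condition (a fresh inner lock)
# and rebind the name in the owning namespace.
cond = threading.Condition()

# The cached name still targets the *old* inner lock; in a forked child
# that lock may have been inherited in the held state with no owner left.
acquire(timeout=1)   # operates on the orphaned lock, not cond's current one
```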
Sadly, it's not safe... I updated my PR with a comment explaining why we leak memory on purpose. Maybe I will experiment with an enhancement to avoid leaking memory, but let's start with the safe approach.
FWIW I agree with not releasing the memory. I wouldn't even consider this a leak, though I understand if msan might, if we don't keep references to the old untouchable zombie lock memory somewhere. (Err, rather than keeping a reference, I think this works via sanitizer API calls that can be used to declare something an intentional leak?)
I created https://bugs.python.org/issue40089 when I worked on a crash in logging on AIX. I proposed a simple fix for logging which creates a new lock object: PR #19196. Multiprocessing tests hang with this simple change, whereas they just work if I reinitialize the existing lock in place instead.
@gpshead: So what do you think of the overall approach? I would also be interested in your review of the implementation ;-)
@pitrou: So what do you think of this change?
Add a private _at_fork_reinit() method to _thread.Lock, _thread.RLock, threading.RLock and threading.Condition classes: reinitialize the lock after fork in the child process; reset the lock to the unlocked state. Rename also the private _reset_internal_locks() method of threading.Event to _at_fork_reinit().

* Add _PyThread_at_fork_reinit() private function. It is excluded from the limited C API.
* threading.Thread._reset_internal_locks() now calls _at_fork_reinit() on self._tstate_lock rather than creating a new Python lock object.
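A minimal sketch of how a module could use the new private method together with os.register_at_fork(); this mirrors the logging use case mentioned above (bpo-40091) but is not code from this PR, and the names _lock and _after_fork_in_child are illustrative:

```python
import os
import threading

_lock = threading.Lock()   # e.g. a module-level lock, as in logging

def _after_fork_in_child():
    # Reset the existing lock object in the child process instead of
    # creating a new one, so every cached reference to _lock stays valid.
    # _at_fork_reinit() is the private CPython API added by this change.
    _lock._at_fork_reinit()

os.register_at_fork(after_in_child=_after_fork_in_child)
```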
I plan to merge this change at the beginning of next week. I rebased my PR on master to get the fix for Azure Pipelines, I squashed commits, and I made it even more explicit that the function leaks memory on purpose. I modified the _PyThread_at_fork_reinit() comment in the pthread implementation.
Thank you for your work on this. Just noticed a couple of typos while I was reading the patch.
@melwitt: Thanks for the review, I fixed the two typos that you reported ;-)
@melwitt: I'm curious, are you affected by the https://bugs.python.org/issue40089 issue?
Yes, I believe so. I wanted to gather a bit more data before confirming. I think we are seeing this issue in the OpenStack CI gate [1][2], where jobs occasionally fail seemingly because the database never returns results when queried.

As I dug more into it, I found there's a "start" lock in the oslo.db library (the library we use for database access), and sometimes, during a service start that forks child process workers (the CI job does a stop and start of services), the child processes begin in a stuck state, unable to access the database because they are stuck behind the "start" lock, never able to acquire it. I added a lot of DNM patch logging and found that indeed, the parent process was holding the lock at the time the child processes were forked, and they inherited held locks.

I think I have a way to handle the issue in Nova [3] for now by clearing a cache of ours, but am anyway glad to see this PR. I found your work while googling something like "parent process fork child lock already locked" and https://bugs.python.org/issue6721 was in the results :)

[1] http://status.openstack.org/elastic-recheck/#1844929
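For reference, a minimal POSIX-only sketch (not from this PR or from OpenStack) of the failure mode described above: a lock held by the parent at fork time is inherited in the held state by the child and can never be acquired there.

```python
import os
import threading

lock = threading.Lock()
lock.acquire()                      # the parent holds the lock when it forks

pid = os.fork()                     # POSIX only
if pid == 0:
    # Child: the lock was copied in the held state, and whatever held it
    # only exists in the parent, so acquiring it can never succeed.
    print("child acquired lock:", lock.acquire(timeout=2))   # prints False
    os._exit(0)
else:
    os.waitpid(pid, 0)
```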