bpo-40089: Add _at_fork_reinit() method to locks #19195
Conversation
@pitrou, @methane or @pablogsal: Would you mind reviewing this change? I have no strong opinion on the method name :-) threading.Thread uses the name "_reset_internal_locks". The logging module uses the name "_after_at_fork_child_reinit_locks".
I don't understand: where is the new method actually used?
With this PR, it's used by threading._after_fork() to reinitialize Thread._tstate_lock and Thread.started._cond: see Thread._reset_internal_locks(). Then, I plan to use the new method in logging._after_at_fork_child_reinit_locks(): see bpo-40091.
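A rough, hypothetical sketch (not the exact CPython code) of what the threading-side reset can look like with this PR: the existing lock objects are reset in place instead of being replaced by new objects. The standalone function and its arguments are illustrative only.

```python
def _reset_internal_locks_sketch(thread, is_alive):
    # Reset the Event guarding the "started" state; with this PR, its
    # private reset method is also named _at_fork_reinit().
    thread._started._at_fork_reinit()
    if is_alive:
        # Reset the existing _tstate_lock to the unlocked state instead of
        # binding a brand-new lock object to the attribute.
        thread._tstate_lock._at_fork_reinit()
    else:
        # The thread isn't alive after fork: it has no tstate anymore.
        thread._tstate_lock = None
```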
Also cc @gpshead
Ok, I managed to do what I wanted: the _at_fork_reinit() method now also releases memory! So the overall PR has multiple advantages compared to the current code:
Without this PR, we have to create new lock objects and so rebind variables in namespaces. It means that we have to be very careful not to keep old copies of these variables. For example, nothing prevents someone from writing "acquire = condition.acquire", which is a bound method of the inner condition lock! Another minor bug in the current code: _reset_internal_locks() doesn't remove the old lock from the _shutdown_locks set. That's another example of a bug that this change prevents.
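To illustrate the stale-reference hazard mentioned above, here is a small sketch (not code from the PR) showing how a cached bound method keeps pointing at the old inner lock after the "create a new object and rebind" approach:

```python
import threading

cond = threading.Condition()
# Condition exposes acquire/release as bound methods of its inner lock,
# so nothing stops calling code from caching one:
acquire = cond.acquire

# Pre-PR approach after fork: build a fresh Condition (a fresh inner lock)
# and rebind the name in the owning namespace.
cond = threading.Condition()

# The cached name still targets the *old* inner lock; in a forked child
# that lock may have been inherited in the held state with no owner left.
acquire(timeout=1)   # operates on the orphaned lock, not cond's current one
```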
Sadly, it's not safe... I updated my PR with a comment explaining why we leak memory on purpose. Maybe I will experiment with an enhancement to avoid leaking memory, but let's start with the safe approach.
FWIW I agree with not releasing the memory. I wouldn't even consider this a leak, though I understand if msan might, if we don't keep references to the old untouchable zombie lock memory somewhere. (Err, rather than keeping a reference, I think this works via sanitizer API calls that can be used to declare something an intentional leak?)
I created https://bugs.python.org/issue40089 when I worked on a crash in logging on AIX. I proposed a simple fix for logging which creates a new lock object: PR #19196. Multiprocessing tests hang with this simple change, whereas they just work if I reinitialize the existing lock in place instead.
@gpshead: So what do you think of the overall approach? I would also be interested in your review of the implementation ;-)
@pitrou: So what do you think of this change?
Add a private _at_fork_reinit() method to _thread.Lock, _thread.RLock, threading.RLock and threading.Condition classes: reinitialize the lock after fork in the child process; reset the lock to the unlocked state. Rename also the private _reset_internal_locks() method of threading.Event to _at_fork_reinit().

* Add _PyThread_at_fork_reinit() private function. It is excluded from the limited C API.
* threading.Thread._reset_internal_locks() now calls _at_fork_reinit() on self._tstate_lock rather than creating a new Python lock object.
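A minimal sketch of how a module could use the new private method together with os.register_at_fork(); this mirrors the logging use case mentioned above (bpo-40091) but is not code from this PR, and the names _lock and _after_fork_in_child are illustrative:

```python
import os
import threading

_lock = threading.Lock()   # e.g. a module-level lock, as in logging

def _after_fork_in_child():
    # Reset the existing lock object in the child process instead of
    # creating a new one, so every cached reference to _lock stays valid.
    # _at_fork_reinit() is the private CPython API added by this change.
    _lock._at_fork_reinit()

os.register_at_fork(after_in_child=_after_fork_in_child)
```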
I plan to merge this change at the beginning of next week. I rebased my PR on master to get the fix for Azure Pipelines, I squashed commits, and I made it even more explicit that the function leaks memory on purpose. I modified the _PyThread_at_fork_reinit() comment in the pthread implementation.
Thank you for your work on this. Just noticed a couple of typos while I was reading the patch.
@melwitt: Thanks for the review, I fixed the two typos that you reported ;-)
@melwitt: I'm curious, are you affected by the https://bugs.python.org/issue40089 issue?
Yes, I believe so. I wanted to gather a bit more data before confirming. I think we are seeing this issue in the OpenStack CI gate [1][2], where jobs occasionally fail seemingly because the database never returns results when queried.

As I dug more into it, I found there's a "start" lock in the oslo.db library (the library we use for database access), and sometimes, during a service start that forks child process workers (the CI job does a stop and start of services), the child processes begin in a stuck state, unable to access the database because they are stuck behind the "start" lock, never able to acquire it. I added a lot of DNM patch logging and found that indeed, the parent process was holding the lock at the time the child processes were forked, and they inherited held locks.

I think I have a way to handle the issue in Nova [3] for now by clearing a cache of ours, but am anyway glad to see this PR. I found your work while googling something like "parent process fork child lock already locked" and https://bugs.python.org/issue6721 was in the results :)

[1] http://status.openstack.org/elastic-recheck/#1844929
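For reference, a minimal POSIX-only sketch (not from this PR or from OpenStack) of the failure mode described above: a lock held by the parent at fork time is inherited in the held state by the child and can never be acquired there.

```python
import os
import threading

lock = threading.Lock()
lock.acquire()                      # the parent holds the lock when it forks

pid = os.fork()                     # POSIX only
if pid == 0:
    # Child: the lock was copied in the held state, and whatever held it
    # only exists in the parent, so acquiring it can never succeed.
    print("child acquired lock:", lock.acquire(timeout=2))   # prints False
    os._exit(0)
else:
    os.waitpid(pid, 0)
```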