bpo-40089: Add _at_fork_reinit() method to locks #19195


Merged
merged 2 commits into from
Apr 7, 2020

Conversation

@vstinner (Member) commented Mar 27, 2020

Add a private _at_fork_reinit() method to _thread.Lock,
_thread.RLock, threading.RLock and threading.Condition classes:
reinitialize the lock after fork in the child process; reset the lock
to the unlocked state.

Also rename the private _reset_internal_locks() method of
threading.Event to _at_fork_reinit().

  • Add _PyThread_at_fork_reinit() private function. It is excluded
    from the limited C API.
  • threading.Thread._reset_internal_locks() now calls
    _at_fork_reinit() on self._tstate_lock rather than creating a new
    Python lock object.

https://bugs.python.org/issue40089
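The behavior described above can be sketched with a short example (a hedged illustration, not the PR's test code: it requires CPython 3.9+, where the private _at_fork_reinit() method exists, and calls the method outside a forked child purely for demonstration):

```python
import threading

lock = threading.Lock()
lock.acquire()  # simulate a lock that was held at the moment of fork

# A held Lock cannot be acquired again; in a forked child this would
# deadlock, because the owning thread does not exist there:
print("reacquire:", lock.acquire(timeout=0.1))

# _at_fork_reinit() (private, CPython 3.9+) resets the lock to the
# unlocked state in place, keeping the same Python object:
lock._at_fork_reinit()
print("after reinit:", lock.acquire(blocking=False))
lock.release()
```

This prints `reacquire: False` followed by `after reinit: True`: the same lock object becomes usable again without being replaced.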

@vstinner (Member, Author)

@pitrou, @methane or @pablogsal: Would you mind reviewing this change?

I have no strong opinion on the method name :-) threading.Thread uses the name "_reset_internal_locks". The logging module uses the name "_after_at_fork_child_reinit_locks".

@pitrou (Member) commented Mar 27, 2020

I don't understand: where is _at_fork_reinit called after fork?

@vstinner (Member, Author) commented Mar 27, 2020

> I don't understand: where is _at_fork_reinit called after fork?

With this PR, it's used by threading._after_fork() to reinitialize Thread._tstate_lock and Thread.started._cond: see Thread._reset_internal_locks().

Then, I plan to use the new method in logging._after_at_fork_child_reinit_locks(): see bpo-40091.
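A runnable toy sketch of that flow (not the actual CPython code: MiniThread is a made-up stand-in for threading.Thread, and both _at_fork_reinit() calls require CPython 3.9+):

```python
import threading

class MiniThread:
    """Toy stand-in for threading.Thread (hypothetical class), holding
    only the two locks mentioned above: the started Event and
    _tstate_lock."""
    def __init__(self):
        self._started = threading.Event()
        self._tstate_lock = threading.Lock()
        self._tstate_lock.acquire()   # held while the "thread" is alive

    def _reset_internal_locks(self, is_alive):
        # After this PR: reinitialize the existing lock objects in
        # place instead of replacing them with new ones.
        self._started._at_fork_reinit()
        if is_alive:
            self._tstate_lock._at_fork_reinit()
            self._tstate_lock.acquire()
        else:
            self._tstate_lock = None

t = MiniThread()
old_lock = t._tstate_lock
t._reset_internal_locks(is_alive=True)
print("same lock object:", t._tstate_lock is old_lock)
```

The point is the last line: after reinitialization, `t._tstate_lock` is still the same object, so no other reference to it goes stale.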

@pitrou (Member) commented Mar 27, 2020

Also cc @gpshead

@vstinner (Member, Author)

Ok, I managed to do what I wanted to do: the _at_fork_reinit() method now also releases the memory!

So the overall PR has multiple advantages compared to the current code:

  • Python no longer leaks memory on fork
  • Thread._reset_internal_locks() avoids calling _set_tstate_lock() which updates tstate->on_delete and tstate->on_delete_data.
  • Keep the same Python objects across fork: there is no longer any need
    to create new objects. For example, for threading.Condition, it avoids
    having to update the acquire() and release() methods, which are taken
    directly from the inner lock:

```python
class Condition:
    def __init__(self, lock=None):
        (...)
        self._lock = lock
        # Export the lock's acquire() and release() methods
        self.acquire = lock.acquire
        self.release = lock.release
        (...)
```

Without this PR, we have to create new lock objects and so override variables in namespaces. That means we have to be very careful not to keep old copies of these variables. For example, nothing prevents someone from writing "acquire = condition.acquire", which is a bound method of the inner condition lock!

Another minor bug in the current code: _reset_internal_locks() doesn't remove the old lock from the _shutdown_locks set. That's another example of a bug that this change prevents.
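The stale-bound-method hazard can be demonstrated directly (a hedged sketch that pokes at Condition internals the way the old replace-the-lock approach did; it is not code from the PR):

```python
import threading

cond = threading.Condition(threading.Lock())
saved_acquire = cond.acquire      # bound method of the *old* inner lock

# Old post-fork approach: build a brand-new lock and re-export its methods.
new_lock = threading.Lock()
cond._lock = new_lock
cond.acquire = new_lock.acquire
cond.release = new_lock.release

# The saved reference still operates on the abandoned lock, not the new one:
saved_acquire()
print("new lock still free:", new_lock.acquire(blocking=False))
new_lock.release()
```

This prints `new lock still free: True`: the saved method locked the old, now-orphaned lock, so the condition and the saved reference silently diverge. Reinitializing the lock in place avoids the problem entirely.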

@vstinner (Member, Author)

> Ok, I managed to do what I wanted to do: the _at_fork_reinit() method now also releases the memory!

Sadly, it's not safe... I updated my PR with a comment explaining why we leak memory on purpose.

I may experiment with an enhancement to avoid leaking memory, but let's start with the safe approach.

@gpshead (Member) commented Mar 27, 2020

FWIW I agree with not releasing the memory. I wouldn't even consider this a leak, though I understand that msan might, if we don't keep references to the old untouchable zombie lock memory somewhere. (Err, rather than keeping a reference, I think this works via sanitizer API calls that can be used to declare something an intentional leak?)

@vstinner (Member, Author)

I created https://bugs.python.org/issue40089 when I worked on a crash in logging on AIX. I proposed a simple fix for logging which creates a new lock object: PR #19196. Multiprocessing tests hang with this simple change, whereas they just work if I replace _releaseLock() with _lock._at_fork_reinit() in logging._after_at_fork_child_reinit_locks().

@vstinner (Member, Author)

@gpshead: So what do you think of the overall approach? I would also be interested in your review of the implementation ;-)

@vstinner (Member, Author)

@pitrou: So what do you think of this change?

@vstinner (Member, Author) commented Apr 2, 2020

I plan to merge this change at the beginning of next week.

I rebased my PR on master to get the fix for Azure Pipelines, squashed the commits, and made it even more explicit that the function leaks memory on purpose: I updated the _PyThread_at_fork_reinit() comment in the pthread implementation.

@melwitt (Contributor) left a comment


Thank you for your work on this. Just noticed a couple of typos while I was reading the patch.

@vstinner (Member, Author) commented Apr 3, 2020

@melwitt: Thanks for the review, I fixed the two typos that you reported ;-)

@vstinner (Member, Author) commented Apr 3, 2020

@melwitt: I'm curious, are you affected by https://bugs.python.org/issue40089?

@melwitt (Contributor) commented Apr 6, 2020

Yes, I believe so. I wanted to gather a bit more data before confirming.

I think we are seeing this issue in the OpenStack CI gate [1][2], where jobs occasionally fail because the database never returns results when queried. As I dug into it, I found there's a "start" lock in the oslo.db library (the lib we use for database access). Sometimes, during a service start that forks child process workers (the CI job does a stop and start of services), the child processes begin in a stuck state: they cannot access the database because they are stuck behind the "start" lock, never able to acquire it.

I added a lot of DNM patch logging and found that, indeed, the parent process was holding the lock at the time the child processes were forked, so they inherited held locks.

I think I have a way to handle the issue in Nova [3] for now by clearing a cache of ours, but I am glad to see this PR anyway. I found your work while googling something like "parent process fork child lock already locked", and https://bugs.python.org/issue6721 was in the results :)

[1] http://status.openstack.org/elastic-recheck/#1844929
[2] https://bugs.launchpad.net/nova/+bug/1844929
[3] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
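The scenario described above can be reproduced with a small POSIX-only sketch (an illustration, not OpenStack code: it assumes CPython 3.9+ for the private _at_fork_reinit() method and os.waitstatus_to_exitcode(); os.register_at_fork() is public since 3.7):

```python
import os
import threading

lock = threading.Lock()

# Reinitialize the inherited lock in every forked child.  Without this
# hook, a child forked while the parent holds the lock inherits it in
# the locked state and deadlocks on acquire().
os.register_at_fork(after_in_child=lock._at_fork_reinit)

lock.acquire()              # the parent holds the lock when it forks
pid = os.fork()
if pid == 0:
    # Child: thanks to the at-fork hook, this acquire() succeeds
    # instead of hanging behind the inherited held lock.
    ok = lock.acquire(timeout=5)
    os._exit(0 if ok else 1)

_, status = os.waitpid(pid, 0)
lock.release()
print("child exit code:", os.waitstatus_to_exitcode(status))
```

With the hook registered, the child exits with code 0; commenting out the os.register_at_fork() line reproduces the stuck-child symptom (the child times out and exits with code 1).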

@vstinner vstinner merged commit 87255be into python:master Apr 7, 2020
@vstinner vstinner deleted the at_fork_init branch April 7, 2020 21:11