Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hangs where a lock is held by a suspended thread. #401

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

schveiguy
Copy link
Contributor

The sequence this protects against:

  1. thread A acquires a lock
  2. thread B starts a collection, stops the world
  3. thread A is suspended with the lock held
  4. thread B runs mark phase, and during collect phase, runs a finalizer
  5. finalizer attempts to acquire the lock held by thread A
  6. deadlock

This fixes the problem by allowing the world to resume during the collect phase. In order to prevent changes to the GC state which is fully marked, but not yet collected, this introduces a "Probation" mode of the thread state. The probation mode is like the run mode, but if a thread in this mode attempts to enter a busy state, it is paused until the collection thread is finished.

Please closely review if the logic is sound that entering a busy state is the only way to cause problems.

To avoid using signals, which are very error prone, we use a mutex/condition instead.

A new mode must be introduced, because when I tried to do it by just leaving the thread in the "Resumed" mode, I found there was a signaling race condition.

@deadalnix
Copy link
Contributor

You elected to make the collect phase concurrent. This is good, but please review the logic carefully isn't going to cut it. One needs to know the invariant we are working with.

@schveiguy
Copy link
Contributor Author

The collect phase must be concurrent, or we will hang on normal D programs (we already do).

The invariant is that no functions can execute that would affect the loops through the blocks, or change which pages/slots are marked as allocated or not.

@deadalnix
Copy link
Contributor

Clearly there is more to this. There is the notion of world probation, new thread state, a mutex in there. We aren't going to play wack a mole with this. None of the invariant are either explained or documented, let alone the code structured in a way that enforces them. This means that even if they are correct in that PR, which I don't know how to check, really, it is only a mater of time before they are broken.

One easy first step is whether we can make the minimize part of the collection cycle concurrent first. This wouldn't directly solve the problem, but it's a step in the right direction as it reduces the time we stop the world, and reduce the surface of things that can go wrong.

@deadalnix
Copy link
Contributor

By the way, if we do not stop the world during collection, then we might have many more possible states for the mark bits, and I don't see this being handled here. That would probably require factoring setting/checking these bits in the extent so that the modification take place everywhere at once.

@deadalnix deadalnix force-pushed the master branch 3 times, most recently from 7ce40a5 to 4dbe602 Compare December 4, 2024 01:18
@schveiguy schveiguy force-pushed the fixhangsenterbusystate branch from 8395039 to cdd5b7e Compare December 4, 2024 21:49
@schveiguy schveiguy force-pushed the fixhangsenterbusystate branch from cdd5b7e to d40d3a8 Compare December 26, 2024 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants