-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hangs where a lock is held by a suspended thread. #401
base: master
Are you sure you want to change the base?
Conversation
You elected to make the collect phase concurrent. This is good, but please review the logic carefully isn't going to cut it. One needs to know the invariant we are working with. |
The collect phase must be concurrent, or we will hang on normal D programs (we already do). The invariant is that no functions can execute that would affect the loops through the blocks, or change which pages/slots are marked as allocated or not. |
Clearly there is more to this. There is the notion of world probation, new thread state, a mutex in there. We aren't going to play wack a mole with this. None of the invariant are either explained or documented, let alone the code structured in a way that enforces them. This means that even if they are correct in that PR, which I don't know how to check, really, it is only a mater of time before they are broken. One easy first step is whether we can make the minimize part of the collection cycle concurrent first. This wouldn't directly solve the problem, but it's a step in the right direction as it reduces the time we stop the world, and reduce the surface of things that can go wrong. |
By the way, if we do not stop the world during collection, then we might have many more possible states for the mark bits, and I don't see this being handled here. That would probably require factoring setting/checking these bits in the extent so that the modification take place everywhere at once. |
7ce40a5
to
4dbe602
Compare
8395039
to
cdd5b7e
Compare
cdd5b7e
to
d40d3a8
Compare
The sequence this protects against:
This fixes the problem by allowing the world to resume during the collect phase. In order to prevent changes to the GC state which is fully marked, but not yet collected, this introduces a "Probation" mode of the thread state. The probation mode is like the run mode, but if a thread in this mode attempts to enter a busy state, it is paused until the collection thread is finished.
Please closely review if the logic is sound that entering a busy state is the only way to cause problems.
To avoid using signals, which are very error prone, we use a mutex/condition instead.
A new mode must be introduced, because when I tried to do it by just leaving the thread in the "Resumed" mode, I found there was a signaling race condition.