Godot 4.6: CommandQueueMT Recursive Flushing Causes Deadlock
The Godot Engine community has recently encountered a significant issue in the 4.6 development cycle related to the CommandQueueMT. Specifically, the system no longer supports recursive flushing, leading to deadlocks within the physics server. This article delves into the details of this problem, its causes, and how it manifests within the engine.
Understanding the Issue
The core of the problem lies in a change introduced by commit #112506, which altered the behavior of CommandQueueMT. Before this change, some code paths relied on recursive flushing of the command queue without that being intended. This became apparent in the Jolt Physics implementation of PhysicsServer3D::joint_disable_collisions_between_bodies. During the periodic flushing of the command queue on the dedicated physics thread, this function called PhysicsServer3DWrap::body_add_collision_exception. That call caused CommandQueueMT::flush_if_pending to run again and attempt to acquire a lock that the same thread already held. The result was a deadlock in the server thread, which in turn caused the main thread to deadlock while waiting on it.
The specific location of the lock acquisition attempt is in the command_queue_mt.h file, around line 160. The code tries to acquire a lock during the flushing process, but since the lock is already held due to the recursive call, the system grinds to a halt.
To better grasp the implications, consider the sequence of events leading to the deadlock:
- The physics thread initiates the flushing of the command queue.
- During this flush, PhysicsServer3DWrap::body_add_collision_exception is called.
- This, in turn, triggers CommandQueueMT::flush_if_pending.
- The flush_if_pending function attempts to acquire the lock, which is already held by the initial flush.
- Deadlock occurs, freezing the physics thread.
- The main thread, waiting for the physics thread, also deadlocks.
This issue highlights the critical importance of managing thread synchronization and avoiding recursive lock acquisition in multi-threaded environments. The change in CommandQueueMT exposed this vulnerability, leading to the observed deadlocks.
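The sequence above can be modeled in a few lines of standard C++. This is a minimal sketch under stated assumptions, not engine code: ToyQueue, lock_held, and reproduce_reentry are hypothetical names, and the non-recursive lock is represented by a plain flag so that the example reports the re-entrant call instead of actually hanging.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for a command queue; not Godot's CommandQueueMT.
// Single-threaded on purpose: the only possible contender for the lock
// is the thread that already holds it.
class ToyQueue {
    std::queue<std::function<void()>> commands;
    bool lock_held = false;        // models a non-recursive lock
    bool reentry_detected = false; // set when flush() is re-entered

public:
    void push(std::function<void()> cmd) { commands.push(std::move(cmd)); }

    void flush() {
        if (lock_held) {
            // A real non-recursive lock would block here forever,
            // because the owner would be waiting on itself.
            reentry_detected = true;
            return;
        }
        lock_held = true;
        while (!commands.empty()) {
            std::function<void()> cmd = std::move(commands.front());
            commands.pop();
            cmd(); // a command may call back into flush()
        }
        lock_held = false;
    }

    bool would_deadlock() const { return reentry_detected; }
};

// Returns true when a command executed during a flush re-enters flush().
bool reproduce_reentry() {
    ToyQueue queue;
    assert(!queue.would_deadlock()); // nothing has been flushed yet
    // Plays the role of the joint call whose nested server call ends up
    // flushing the same queue again.
    queue.push([&queue] { queue.flush(); });
    queue.flush();
    return queue.would_deadlock();
}
```

Calling reproduce_reentry() returns true, which corresponds to the point where the physics thread would stop making progress.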
Reproducing the Deadlock
To reproduce this issue, follow these steps:
- Check out the specific commit: git checkout 7e1ef4c9e
- Run the Minimal Reproduction Project (MRP) provided in issue #113620.
By following these steps, developers can observe the deadlock in action and gain a deeper understanding of the problem.
System Information
The issue was reproduced on the following system configuration:
- Godot Version: v4.6.dev (af1e71d32)
- Operating System: Fedora Linux 42 (Adams) on Wayland
- Display Driver: X11, Multi-window, 2 monitors
- Graphics API: Vulkan (Forward+)
- GPU: dedicated AMD Radeon RX 9070 XT (RADV GFX1201)
- CPU: AMD Ryzen 9 9950X3D 16-Core Processor (32 threads)
- Memory: 60.39 GiB
This information helps to contextualize the issue and provides a reference point for other developers encountering similar problems.
Impact on Jolt Physics Implementation
The Jolt Physics implementation is particularly affected by this change due to its reliance on PhysicsServer3D::joint_disable_collisions_between_bodies. This function, when called during the periodic flushing of the command queue, triggers the recursive lock acquisition, leading to the described deadlock. As a result, any project using Jolt Physics and relying on this function may experience severe stability issues.
Code Analysis: CommandQueueMT and Thread Synchronization
The CommandQueueMT class is designed to manage a queue of commands that need to be executed in a thread-safe manner. It uses a mutex to protect the queue from concurrent access. The flush_if_pending function is responsible for processing the commands in the queue. However, the change in #112506 introduced a constraint that prevents recursive calls to this function. This constraint, while intended to improve thread safety, has inadvertently exposed a vulnerability in the Jolt Physics implementation.
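The following simplified snippet illustrates the locking pattern at issue in command_queue_mt.h. It uses standard C++ types for readability and is not a verbatim copy of the engine source: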
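    #include <functional>
    #include <mutex>

    bool is_flushing(); // assumed to be defined elsewhere

    std::mutex mutex;

    template <typename R>
    R flush_if_pending(std::function<R()> p_func) {
        // A flush is already in progress: run directly without locking again.
        if (is_flushing()) {
            return p_func();
        }
        // Non-recursive lock: the thread that holds it must not lock it again.
        std::unique_lock<std::mutex> lock(mutex);
        return p_func();
    }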
This code acquires the mutex before executing the provided function, unless a flush is already reported as in progress. If a re-entrant call reaches the lock while the same thread still holds it, that thread waits on itself and a deadlock occurs.
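One way to make such a check robust is a per-thread re-entrancy flag combined with a drain loop that releases the mutex before running each command. The following is a sketch of that general technique, assuming hypothetical names (GuardedQueue, flushing, run_guarded_demo); it is not how the engine implements or fixes CommandQueueMT.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical guarded queue; names are illustrative, not engine API.
class GuardedQueue {
    std::mutex mutex;
    std::queue<std::function<void()>> commands;
    // True while the current thread is inside flush(). Shared by all
    // GuardedQueue instances on that thread, which is acceptable here
    // because the sketch only ever uses one queue.
    static thread_local bool flushing;

public:
    void push(std::function<void()> cmd) {
        std::lock_guard<std::mutex> lock(mutex);
        commands.push(std::move(cmd));
    }

    void flush() {
        if (flushing) {
            return; // nested call: the outer loop will drain the queue
        }
        flushing = true;
        for (;;) {
            std::function<void()> cmd;
            {
                std::lock_guard<std::mutex> lock(mutex);
                if (commands.empty()) {
                    break;
                }
                cmd = std::move(commands.front());
                commands.pop();
            }
            cmd(); // runs with the mutex released, so it may push or flush
        }
        flushing = false;
    }
};
thread_local bool GuardedQueue::flushing = false;

// Records the order in which the steps run.
std::vector<int> run_guarded_demo() {
    GuardedQueue queue;
    std::vector<int> order;
    assert(order.empty());
    queue.push([&] {
        order.push_back(1);
        queue.push([&] { order.push_back(3); });
        queue.flush(); // nested: returns immediately instead of deadlocking
        order.push_back(2);
    });
    queue.flush();
    return order;
}
```

The trade-off is that a nested flush returns before the newly queued commands have run, so a caller that needs their results immediately cannot rely on this pattern. The sketch also does not reset the flag if a command throws.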
Proposed Solutions and Workarounds
To address this issue, several solutions and workarounds can be considered:
- Reverting the Change: One option is to revert the change introduced by #112506. However, this may reintroduce the original issue that the change was intended to fix.
- Refactoring the Jolt Physics Implementation: Another approach is to refactor the Jolt Physics implementation to avoid recursive calls to CommandQueueMT::flush_if_pending. This may involve restructuring the code to decouple the flushing of the command queue from the PhysicsServer3D::joint_disable_collisions_between_bodies function.
- Implementing a Recursive Lock: A more advanced solution is to use a recursive lock instead of a standard mutex. A recursive lock allows the same thread to acquire the lock multiple times without blocking. However, this approach requires careful consideration, because re-entrant code can then run while the outer call is only partway through updating the protected state.
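To illustrate the recursive-lock option, here is a small sketch using std::recursive_mutex from the C++ standard library. RecursiveFlusher and run_recursive_demo are hypothetical names, and Godot's own synchronization types are deliberately not used, so treat this as a demonstration of the lock's behavior rather than a proposed patch.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical example; not engine code.
class RecursiveFlusher {
    std::recursive_mutex mutex;
    int depth = 0;     // current nesting level on the owning thread
    int max_depth = 0; // deepest nesting observed so far

public:
    // Re-enters itself `nested_calls` times while still holding the lock.
    void flush(int nested_calls) {
        // The owning thread may lock a recursive_mutex again without blocking.
        std::lock_guard<std::recursive_mutex> lock(mutex);
        ++depth;
        if (depth > max_depth) {
            max_depth = depth;
        }
        if (nested_calls > 0) {
            flush(nested_calls - 1);
        }
        --depth;
        assert(depth >= 0);
    }

    int deepest() const { return max_depth; }
};

// One outer call plus two nested calls on the same thread.
int run_recursive_demo() {
    RecursiveFlusher flusher;
    flusher.flush(2);
    return flusher.deepest();
}
```

run_recursive_demo() returns 3: all three levels held the lock at once without blocking. The same property is what makes this option risky, since every nested level runs while the outer level's work is still incomplete.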
Community Discussion and Collaboration
The Godot Engine community has been actively discussing this issue in various forums and issue trackers. Developers are sharing their experiences, proposing solutions, and collaborating to find the best way to address the problem. This collaborative effort is crucial to ensure the stability and reliability of the engine.
Conclusion
The CommandQueueMT issue in Godot Engine 4.6 highlights the complexities of multi-threaded programming and the importance of careful thread synchronization. The inadvertent reliance on recursive flushing in the Jolt Physics implementation exposed a vulnerability that led to deadlocks. By understanding the root cause of the problem and considering various solutions, the Godot Engine community can work together to resolve this issue and ensure the stability of the engine.
For more information on multi-threaded programming and synchronization techniques, visit the pthreads Tutorial.