Fixing Tensor Parallelism Issues On Dual RTX 6000 Blackwell GPUs
Experiencing issues with tensor parallelism when trying to leverage the power of multiple high-end GPUs can be a frustrating roadblock. This is especially true when you're working with cutting-edge hardware like the NVIDIA RTX PRO 6000 Blackwell Max-Q GPUs and sophisticated frameworks like sglang. One common problem that can halt your progress is the process hanging indefinitely, particularly when tensor parallelism is enabled. This article dives deep into a specific scenario where running sglang.launch_server with a tp-size of 2 on these powerful GPUs leads to this exact issue, and we’ll explore how to diagnose and potentially resolve it. Understanding tensor parallelism is key to unlocking the full potential of distributed deep learning, allowing models to be split across multiple devices for faster training and inference. When this process gets stuck, it often points to a communication bottleneck or configuration mismatch between the GPUs or the software orchestrating them.
The Challenge: SGLang and Tensor Parallelism Hangs
Our main challenge revolves around setting up tensor parallelism with sglang across two NVIDIA RTX PRO 6000 Blackwell Max-Q GPUs. The command that triggers the issue is: python -m sglang.launch_server --model Qwen/Qwen3-4B --trust-remote-code --port 30000 --tp-size 2 --attention-backend flashinfer. Without setting the NCCL_P2P_DISABLE=1 environment variable, the process simply hangs, leaving you staring at a frozen terminal.

The log snippets offer crucial clues. We see the initialization of PyTorch distributed, NCCL (NVIDIA Collective Communications Library), and the network setup. Both ranks (TP0 and TP1, representing the two GPUs) initialize and connect. The critical point is that the process gets stuck after NCCL initialization completes, before any actual model inference or computation happens. This suggests that the problem lies in the foundational communication layer between the GPUs once tensor parallelism is activated.

The fact that setting NCCL_P2P_DISABLE=1 resolves the hang is a significant hint. NCCL is designed to facilitate high-performance communication between GPUs, and P2P (peer-to-peer) access is a key optimization for direct GPU-to-GPU data transfer. When that direct transfer mechanism causes a hang, it points to an incompatibility, a misconfiguration in how P2P is being used, or a subtle driver or hardware issue that prevents direct communication from succeeding.
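For quick reference, here are the two invocations side by side, exactly as described above: the command that hangs on this dual-GPU machine, and the same command with the NCCL_P2P_DISABLE=1 workaround applied.

```bash
# Hangs after NCCL initialization on this dual RTX PRO 6000 Blackwell setup
python -m sglang.launch_server \
  --model Qwen/Qwen3-4B \
  --trust-remote-code \
  --port 30000 \
  --tp-size 2 \
  --attention-backend flashinfer

# Workaround: disable direct GPU-to-GPU (P2P) transfers in NCCL.
# The server starts, at the cost of slower inter-GPU communication.
NCCL_P2P_DISABLE=1 python -m sglang.launch_server \
  --model Qwen/Qwen3-4B \
  --trust-remote-code \
  --port 30000 \
  --tp-size 2 \
  --attention-backend flashinfer
```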
Deciphering the Logs and Error Messages
The detailed logs are invaluable for pinpointing the issue. We can observe the initialization of both TP0 (the first GPU) and TP1 (the second GPU). The logs show successful NCCL initialization, including network plugin setup over sockets and, importantly, successful completion of ncclCommInitRank for both ranks. NCCL has therefore set up its communication context between the two GPUs. The hang occurs after these initialization steps, which suggests the problem is not the basic NCCL setup but how it is used for actual data transfer during tensor parallelism.

A key observation is the line: NCCL INFO NCCL_P2P_DISABLE set by environment to 1. This confirms the environment variable was set, telling NCCL to skip direct P2P transfers and instead stage data through host (shared) memory rather than copying directly between the GPUs. With that variable set, the hang disappears, which strongly implies that the direct P2P path between the two RTX PRO 6000 Blackwell GPUs is where the problem lies.

The logs also show the network configuration, including attempts to use NET/IB (InfiniBand) and NET/Socket. The absence of libnccl-net.so and the NET/IB : No device found message indicate that InfiniBand is not in play and the system relies on standard Ethernet sockets for inter-process communication, which is normal for a single-machine setup. The line NCCL INFO Check P2P Type isAllDirectP2p 0 directMode 0 is also interesting: NCCL has detected that not all GPU pairs can use fully direct P2P, which may be a symptom of the underlying issue or simply a consequence of P2P being disabled for that run. Finally, the watchdog timeouts (Watchdog timeout (self.watchdog_timeout=300)) at the end of the log are not the root cause; they are a consequence of the process hanging until the watchdog timer gives up on the unresponsive process.
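If the default INFO log does not show where the stall happens, NCCL can be asked to log only the subsystems relevant here and to write one log file per rank. A minimal sketch; the subsystem selection and the /tmp log path are illustrative choices, not sglang requirements.

```bash
# Capture P2P- and transport-related NCCL logs for the run that hangs.
# %h expands to the hostname and %p to the process ID, so each rank
# writes its own file.
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=INIT,P2P,SHM,NET \
NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log \
python -m sglang.launch_server --model Qwen/Qwen3-4B --trust-remote-code \
  --port 30000 --tp-size 2 --attention-backend flashinfer
```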
Potential Causes and Solutions
Given the symptoms, especially the resolution by disabling NCCL P2P, the primary suspect is an issue with direct GPU-to-GPU communication between your two NVIDIA RTX PRO 6000 Blackwell Max-Q GPUs. Here are several potential causes and troubleshooting steps:
- GPU Driver and CUDA Toolkit Compatibility: Ensure you are running the latest stable NVIDIA driver that is compatible with your CUDA Toolkit version (12.8 in this case) and your GPU hardware. Older drivers or mismatched CUDA versions can cause subtle problems with NCCL's advanced features such as P2P. Verify that your driver version (580.95.05) is the recommended one for the RTX PRO 6000 Blackwell cards and the CUDA 12.8 toolkit.
- NCCL Configuration: While NCCL_P2P_DISABLE=1 works, it is not ideal because it can reduce performance. NCCL has other environment variables that influence communication. You could experiment with NCCL_DEBUG=INFO (already set, but good to keep in mind); NCCL_IB_HCA and NCCL_IB_DISABLE if you have InfiniBand that is not being detected correctly (since no IB device was found, this is unlikely to be the issue here); and NCCL_P2P_LEVEL, which limits how far across the topology NCCL will attempt direct P2P and can be a less drastic alternative to disabling P2P outright.
- Hardware Configuration and GPU Interconnect: The NVIDIA topology output is crucial. It reports the GPU0-GPU1 link as NODE, meaning traffic traverses PCIe and the interconnect between PCIe host bridges within a NUMA node; an NVLink connection would instead appear as NV#. In other words, peer-to-peer transfers between these two cards run over PCIe, and P2P across PCIe host bridges is exactly the kind of path that can stall if it is not fully supported or is blocked by platform settings. If your cards do have an NVLink bridge, confirm it is enabled and functioning with nvidia-smi nvlink -s; if they do not, focus on the PCIe path (see the diagnostic commands after this list).
- PCIe Configuration: Ensure the GPUs are seated in slots that provide sufficient bandwidth (e.g., x16) and that no bifurcation setting or other configuration impedes direct communication. If direct P2P over this PCIe path fails or stalls, NCCL can end up waiting on a transfer that never completes.
- sglang and FlashInfer Settings: While less likely to be the direct cause of a P2P hang, make sure your sglang and flashinfer versions are compatible and up to date. The --attention-backend flashinfer setting is relevant; specific combinations of backends and hardware can have quirks.
- System Resources and Power Limits: Although the logs do not show this explicitly, confirm that your power supply can handle both GPUs at full load and that thermal throttling is not indirectly destabilizing communication.
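Before changing any settings, it is worth confirming what the driver itself reports about the path between the two GPUs. A short diagnostic sketch; the PyTorch one-liner assumes the same Python environment used to launch sglang.

```bash
# How are the two GPUs connected? NVLink shows up as NV#, while a pure
# PCIe path shows up as PIX/PXB/PHB/NODE/SYS in the matrix.
nvidia-smi topo -m

# Per-GPU NVLink status (only meaningful if the cards actually have an
# NVLink bridge installed).
nvidia-smi nvlink -s

# Does the CUDA driver believe GPU0 and GPU1 can access each other directly?
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1), torch.cuda.can_device_access_peer(1, 0))"
```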
Focusing on the Direct GPU-to-GPU Link
The most promising avenue, given the P2P disable workaround, is to thoroughly investigate the direct link between the two GPUs. NVLink provides much higher bandwidth and lower latency than PCIe and is used automatically when a bridge is present, but the topology output above (NODE rather than NV#) indicates that these two cards are communicating over PCIe, so the PCIe peer-to-peer path deserves equal scrutiny.
- Verification: Run nvidia-smi nvlink -s and nvidia-smi topo -m. If the cards are bridged with NVLink, the status output lists active links and the topology matrix shows NV# between GPU0 and GPU1; a NODE entry means the peer-to-peer path runs over PCIe instead.
- Physical Connection: If an NVLink bridge is installed, ensure it is securely seated on both GPUs, and make sure both cards are fully seated in their PCIe slots.
- BIOS Settings: Some motherboards have settings that affect inter-GPU communication, such as PCIe ACS (Access Control Services) and IOMMU options; ACS in particular is known to interfere with GPU peer-to-peer transfers. Ensure these are configured appropriately for a multi-GPU setup.
If the direct link is not working correctly, whether because of a missing or faulty NVLink bridge or a PCIe path on which peer access is blocked, NCCL may attempt a P2P transfer that can never complete because the underlying hardware path is unavailable or malfunctioning. Disabling P2P (NCCL_P2P_DISABLE=1) forces NCCL onto a different, more robust (though potentially slower) route through host shared memory, which avoids the hang.
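To exercise the direct path outside of sglang and NCCL entirely, the simpleP2P and p2pBandwidthLatencyTest programs from NVIDIA's cuda-samples repository are a useful sanity check: if they hang or report failures, the problem sits below the framework. The paths and build steps below assume a recent checkout of that repository and may differ between releases.

```bash
# Exercise GPU0<->GPU1 peer access without sglang/NCCL in the picture.
# Paths and build system (make vs. CMake) vary between cuda-samples releases.
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/0_Introduction/simpleP2P
make && ./simpleP2P                  # copies data directly between the two GPUs

cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make && ./p2pBandwidthLatencyTest    # prints bandwidth/latency with P2P enabled and disabled
```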
Conclusion and Next Steps
The issue described, where sglang with tensor parallelism hangs unless NCCL_P2P_DISABLE=1 is set, strongly points towards a problem with direct peer-to-peer (P2P) communication between the two NVIDIA RTX PRO 6000 Blackwell Max-Q GPUs. While disabling P2P is a viable workaround for now, it may not be the most performant solution. The most probable cause is an issue along the direct GPU-to-GPU path, which in this topology is a PCIe route between host bridges rather than an NVLink bridge, or a driver/CUDA toolkit incompatibility that prevents NCCL from reliably using direct P2P transfers.
To further diagnose and resolve this:
- Verify the GPU-to-GPU Link: Run nvidia-smi topo -m and nvidia-smi nvlink -s to confirm how the two GPUs are connected (NVLink shows up as NV#; this system reports NODE, i.e. a PCIe path) and address any physical or configuration issues you find.
- Update Drivers and CUDA: Ensure you have the latest stable NVIDIA drivers and that they are fully compatible with your CUDA 12.8 installation.
- Test with Different Attention Backends: While flashinfer is excellent, try another backend such as --attention-backend triton or --attention-backend torch_native to see whether the issue is specific to flashinfer.
- Monitor GPU Communication: Use Nsight Systems (nvprof is deprecated and does not support recent GPU architectures) to try to capture communication patterns during the hang, and dump the Python stacks of the stuck worker processes if you can; this is challenging with an unresponsive process, but see the sketch after this list.
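When the server is already stuck, dumping the Python stack of each worker often tells you more than a profiler run. py-spy (a third-party tool, not part of sglang) can do this against a live process, and Nsight Systems can record GPU activity up to the stall. A sketch under those assumptions; the PIDs are placeholders for whatever ps reports.

```bash
# Install py-spy, a sampling profiler that can also dump stacks of a running process.
pip install py-spy

# Find the hung sglang worker processes and dump their Python stacks.
ps aux | grep sglang
py-spy dump --pid <PID_OF_TP0_WORKER>
py-spy dump --pid <PID_OF_TP1_WORKER>

# Alternatively, wrap the launch in Nsight Systems to record CUDA activity
# until you interrupt the stalled run.
nsys profile -t cuda,nvtx,osrt -o sglang_tp2_hang \
  python -m sglang.launch_server --model Qwen/Qwen3-4B --trust-remote-code \
  --port 30000 --tp-size 2 --attention-backend flashinfer
```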
By systematically investigating these areas, you should be able to identify the root cause and achieve stable, high-performance tensor parallelism on your powerful GPU setup.
For more in-depth information on NVIDIA's NCCL and GPU communication, you can refer to the official NVIDIA NCCL Developer Guide. For discussions and community support regarding sglang, the sglang GitHub Discussions page is an excellent resource.