B200 CUDA Error 802: A Strange Issue Solved by Enabling MIG
Author: Borris, Developer
Hello, this is Borris from the REALDRAW Tech Team.
Our company provides a product that supports webtoon production through training on and generating webtoon images. While testing the training and generation features on an AWS B200 instance, we ran into a problem that led to an unexpected but interesting discovery.
The Problem Appeared Right at the Start
AWS EC2 provides DLAMI (Deep Learning AMI), a preconfigured Amazon Machine Image designed to help you quickly build deep learning applications. When we launched a B200 instance using an Ubuntu-based DLAMI, all essential libraries such as PyTorch, NVIDIA drivers, and CUDA were already installed. We started testing our training/generation immediately — but encountered an issue right away.
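Even a minimal availability check was enough to surface the problem. The one-liner below is a rough sketch of the kind of check we ran, not the exact test script:

```bash
# Minimal CUDA availability check; an approximation of what produced the output below.
python -c 'import torch; print("CUDA available:", torch.cuda.is_available())'
```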
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount().
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?
Error 802: system not yet initialized
(Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
CUDA available: False
The warning message indicated that CUDA was not initializing correctly — meaning GPU computation wasn’t available. However, running the nvidia-smi command showed that the driver was installed and the GPU was recognized normally.
What I Tried
- Checked whether NVIDIA MIG mode was disabled
- Installed both the latest and older versions of the NVIDIA driver
- Installed a FabricManager version compatible with the driver and modified fabricmanager.cfg
- Installed the CUDA Toolkit (versions 12.6 through 12.9), then reinstalled PyTorch
- Followed the NVIDIA DGX OS 7 User Guide to reinstall CUDA, the driver, and FabricManager
- Verified GPU status with nvidia-smi and ran torch.cuda.is_available() (see the sketch after this list)
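For reference, the verification loop looked roughly like the commands below. This is a sketch rather than a transcript of the exact session; nvidia-fabricmanager is the standard systemd service name shipped with NVIDIA's FabricManager packages.

```bash
# Driver and GPU visibility: this always looked normal.
nvidia-smi

# FabricManager must be running on NVLink/NVSwitch-based multi-GPU systems.
sudo systemctl status nvidia-fabricmanager

# The PyTorch-side check that kept failing with Error 802.
python -c 'import torch; print("CUDA available:", torch.cuda.is_available())'
```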
Despite all these efforts, nothing worked; the same error kept appearing. Since the troubleshooting looked like it would take a long time, I decided to experiment with MIG (Multi-Instance GPU), one of the main features available on the B200.
Enabling MIG and Fixing the Issue
MIG lets you partition a single physical GPU into multiple isolated GPU instances, each with its own dedicated memory and compute resources. Since the B200 instance supports MIG, I partitioned the GPU into MIG 3g.90gb instances. After doing so, I ran torch.cuda.is_available() again, and the issue was suddenly resolved.
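For anyone who wants to reproduce this, MIG can be enabled and partitioned entirely through nvidia-smi. The commands below are a sketch of that procedure, not a transcript of our exact session:

```bash
# Enable MIG mode on GPU 0 (the change may require a GPU reset or a reboot to take effect).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports.
sudo nvidia-smi mig -lgip

# Create a 3g.90gb GPU instance together with its default compute instance (-C).
sudo nvidia-smi mig -cgi 3g.90gb -C

# The MIG devices should now appear as separate entries.
nvidia-smi -L
```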
Testing our training/generation on the virtualized GPU confirmed that it worked perfectly. We also benchmarked how training and generation performance scaled across these MIG partitions — and achieved the results we were hoping for.
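As a side note on running workloads across partitions: CUDA_VISIBLE_DEVICES accepts the MIG device UUIDs that nvidia-smi -L prints, so each training or generation process can be pinned to its own partition. The UUID and script name below are placeholders:

```bash
# List the MIG devices and their UUIDs.
nvidia-smi -L

# Pin one process to one MIG partition; replace the UUID with a real one from the list above
# and train.py with your own entry point.
CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx python train.py
```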
Wrapping Up
Although enabling MIG solved the issue, I’m still left wondering why CUDA failed when MIG was disabled.
Normally, it's CUDA malfunctioning with MIG enabled that you might expect; the reverse situation, where CUDA fails only while MIG is disabled, is quite puzzling.
Some users on the NVIDIA Developer Forum suggest that the issue might stem from an improperly installed MLNX_OFED package, though I haven’t yet confirmed this.
I hope this post helps anyone else encountering a similar issue with the B200 instance. If you’ve resolved this error using the solution mentioned in the forum, please share your experience in the comments — it would be greatly appreciated.