Run the Self-Check Binary for CUDA Memory Allocation
On Linux, you can use the self-check binary to verify CUDA memory allocation and confirm that your system properly recognizes and utilizes your GPUs.
- Self-Check Binary: Download from GitHub
Example of a Successful Output:
Reported 1 CUDA devices
Device #0: name=NVIDIA GeForce RTX 3080: memory alloc test pass
all cards look ok
Example of an Error Output:
Cannot get device count: cuda error=35 - CUDA driver version is insufficient for CUDA runtime version
If you see an error like the one above, it suggests that your CUDA driver version is not compatible with the CUDA runtime version, and you may need to update your drivers or CUDA toolkit.
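For reference, the check performed by the binary is conceptually similar to the minimal CUDA sketch below. This is not the source of the downloadable self-check binary; it simply enumerates the devices the runtime can see and attempts a small allocation on each one:

```cpp
// Minimal sketch of a CUDA memory-allocation self-check (illustrative only;
// not the source of the downloadable self-check binary).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        // Mirrors the "Cannot get device count" failure shown above.
        std::printf("Cannot get device count: cuda error=%d - %s\n",
                    err, cudaGetErrorString(err));
        return 1;
    }
    std::printf("Reported %d CUDA devices\n", deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        cudaSetDevice(i);

        void* buf = nullptr;
        err = cudaMalloc(&buf, 256 * 1024 * 1024);  // try a 256 MiB allocation
        if (err == cudaSuccess) {
            std::printf("Device #%d: name=%s: memory alloc test pass\n", i, prop.name);
            cudaFree(buf);
        } else {
            std::printf("Device #%d: name=%s: memory alloc test FAILED (%s)\n",
                        i, prop.name, cudaGetErrorString(err));
            return 1;
        }
    }
    std::printf("all cards look ok\n");
    return 0;
}
```

Compiled with nvcc (for example, nvcc self_check.cu -o self_check), this produces output in the same spirit as the examples above.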
System Checks
1️⃣ Check The Driver Version
Ensure that you are using NVIDIA driver version 535.xx or 550.xx.
Please refrain from using experimental or beta drivers; on Linux, these branches typically start with 555.xx or 560.xx.
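The installed driver version is shown in the header of the nvidia-smi output. If you want to compare the driver's supported CUDA version against the CUDA runtime from code (the mismatch behind error 35 shown earlier), a minimal sketch:

```cpp
// Print the driver's supported CUDA version versus the runtime's CUDA version.
// cuda error 35 (cudaErrorInsufficientDriver) occurs when driver < runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // e.g. 12020 means CUDA 12.2
    cudaRuntimeGetVersion(&runtimeVersion);

    std::printf("Driver supports CUDA   %d.%d\n",
                driverVersion / 1000, (driverVersion % 100) / 10);
    std::printf("Runtime built for CUDA %d.%d\n",
                runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    if (driverVersion < runtimeVersion) {
        std::printf("Driver is older than the runtime: update the NVIDIA driver.\n");
        return 1;
    }
    return 0;
}
```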
2️⃣ Avoid Overclocking & Under-volting
Your GPU(s) must run at their original factory settings, without any overclocking or under-volting. If you've applied modifications using tools like Afterburner or GPU Tweak, revert all settings to the factory defaults to maintain stability and performance.
3️⃣ Check for Competing GPU Processes
Ensure that no other applications are utilizing GPU resources, including web browsers, messaging apps (such as Discord, Telegram), and other background processes. To check for active processes using the GPU, open a Terminal and run the command nvidia-smi.
All workers should be fully dedicated to IO tasks while their containers are online; close all other applications completely to prevent them from interfering with performance.
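nvidia-smi lists the processes holding GPU memory in its lower table. If you prefer a programmatic check, the same information is available through NVML, the library behind nvidia-smi. A rough sketch, assuming the NVML header and libnvidia-ml are installed:

```cpp
// List compute processes currently holding GPU memory on each device.
// Build with: g++ check_procs.cpp -lnvidia-ml   (illustrative sketch only)
// Note: graphics processes (browsers, desktop compositors) are reported
// separately via nvmlDeviceGetGraphicsRunningProcesses.
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) { std::printf("NVML init failed\n"); return 1; }

    unsigned int deviceCount = 0;
    nvmlDeviceGetCount(&deviceCount);

    for (unsigned int i = 0; i < deviceCount; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);

        unsigned int procCount = 16;
        nvmlProcessInfo_t procs[16];
        nvmlReturn_t rc = nvmlDeviceGetComputeRunningProcesses(dev, &procCount, procs);

        if (rc == NVML_SUCCESS && procCount > 0) {
            std::printf("GPU %u has %u compute process(es) running:\n", i, procCount);
            for (unsigned int p = 0; p < procCount; ++p)
                std::printf("  pid=%u usedGpuMemory=%llu bytes\n",
                            procs[p].pid, (unsigned long long)procs[p].usedGpuMemory);
        } else if (rc == NVML_SUCCESS) {
            std::printf("GPU %u: no competing compute processes\n", i);
        }
    }
    nvmlShutdown();
    return 0;
}
```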
4️⃣ Review System Power Requirements
Ensure your Power Supply Unit (PSU) can deliver the maximum power required by all components in your worker, with an additional 20% buffer to accommodate power surges.
For example, consider a setup with an Intel 14900K CPU and two NVIDIA RTX 4090 GPUs:
- Intel 14900K: up to 253 watts
- NVIDIA RTX 4090: 450-480 watts each
- Other Components (e.g. NVME, SSD, RAM, system fans, etc): ~100 watts
To calculate the PSU requirement:
- Sum the maximum power needs for all components:
- 253w + (480w x 2) + 100w = 1313 watts
- Add a 20% buffer for power surges:
- (1313w x 0.20) + 1313w ≈ 1576 watts
In this case, a 1600W PSU would be recommended for optimal performance and reliability.
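The same arithmetic expressed as a small helper, using the illustrative wattages from the example above (substitute the ratings of your own components):

```cpp
// Rough PSU sizing: sum component peak draw and add a 20% surge buffer.
#include <cstdio>

int main() {
    const double cpuWatts    = 253.0;   // Intel 14900K peak draw
    const double gpuWatts    = 480.0;   // per RTX 4090 (upper bound)
    const int    gpuCount    = 2;
    const double otherWatts  = 100.0;   // NVMe, RAM, fans, etc.
    const double surgeBuffer = 0.20;    // 20% headroom for surges

    double total       = cpuWatts + gpuWatts * gpuCount + otherWatts;  // 1313 W
    double recommended = total * (1.0 + surgeBuffer);                  // ~1576 W

    std::printf("Component total: %.0f W, recommended PSU: %.0f W\n",
                total, recommended);
    return 0;
}
```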
There are many tools to assist in calculating PSU requirements, such as: https://outervision.com/power-supply-calculator
5️⃣ Review CPU PCI Express Lane Availability
The number of PCI Express (PCIe) lanes supported by your CPU determines the maximum number of GPUs your system can support.
- Consumer GPUs (e.g., 30-series, 40-series) require 8x PCIe lanes each.
- Enterprise GPUs (e.g., H100, A100) require 16x PCIe lanes each.
Other devices, such as storage drives, network adapters, or sound cards, also consume PCIe lanes, which reduces the lanes available for GPUs (see the budgeting sketch after this list). Ideally, populate the PCIe slots only with GPUs unless the other devices are essential.
- NVMe SSDs use 4x PCIe lanes each
- SATA SSDs use 2x PCIe lanes each
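A rough lane-budget sketch based on the figures above; the CPU lane count and drive counts are placeholders, so check your CPU's specification for the real numbers:

```cpp
// Rough PCIe lane budget: lanes left for GPUs after other devices are counted.
#include <cstdio>

int main() {
    const int cpuLanes    = 24;  // placeholder: check your CPU's specification
    const int lanesPerGpu = 8;   // consumer GPUs; use 16 for H100/A100-class cards
    const int nvmeDrives  = 1;   // 4 lanes each (per the list above)
    const int sataSsds    = 0;   // 2 lanes each (per the list above)

    int usedByOther  = nvmeDrives * 4 + sataSsds * 2;
    int lanesForGpus = cpuLanes - usedByOther;
    int maxGpus      = lanesForGpus / lanesPerGpu;

    std::printf("Lanes free for GPUs: %d -> supports up to %d GPU(s) at x%d\n",
                lanesForGpus, maxGpus, lanesPerGpu);
    return 0;
}
```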
6️⃣ Review PCI Express Lane Width Configuration
In your motherboard's BIOS, verify that the PCI Express lane width is correctly configured for each GPU, ensuring all GPUs are set to the same lane width (e.g., all x8 or all x16) based on the GPU type.
- Consumer GPUs (e.g., 30-series, 40-series) should be configured for an x8 PCIe lane width.
- Enterprise GPUs (e.g., H100, A100) should be configured for an x16 PCIe lane width.
Consistency in lane width across all GPUs helps ensure optimal performance and stability.
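To confirm the width each GPU has actually negotiated (as opposed to what the BIOS setting intends), nvidia-smi -q reports it under "GPU Link Info", and NVML exposes it programmatically. A sketch, again assuming NVML is available:

```cpp
// Report the negotiated vs. maximum PCIe link width per GPU (link with -lnvidia-ml).
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);

        unsigned int currWidth = 0, maxWidth = 0;
        nvmlDeviceGetCurrPcieLinkWidth(dev, &currWidth);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);

        std::printf("GPU %u: running at x%u (card maximum x%u)\n", i, currWidth, maxWidth);
        if (currWidth < maxWidth)
            std::printf("  -> lane width is reduced; check BIOS settings, slot, or riser.\n");
    }
    nvmlShutdown();
    return 0;
}
```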
7️⃣ Verify The Physical GPU Connection
Check that the GPU is firmly and properly seated in the PCI Express slot on your motherboard. If you are using riser cables, ensure they match the correct PCIe generation (e.g., PCIe Gen 4) and support the required lane width (e.g., x8 or x16) for your GPU.
Using PCIe splitters (cables that split a single PCIe slot into multiple slots) is not recommended, as they can have a considerable impact on performance and stability.