Overview
AINode discovers peers automatically via UDP broadcast on port 5679. Once peers are discovered, the master can shard one large model across all GPUs using tensor-parallel inference over NCCL. Verified: 4× DGX Spark, 487 GB aggregated VRAM, NCCL over RoCE at 200 Gbps.
Step 1: Install on each node
Step 2: Open the cluster UI
Navigate to http://<master-ip>:3000 → Cluster tab. Nodes appear automatically as they start broadcasting.
Step 3: Launch distributed inference
- Go to Config → Cluster
- Set Minimum Nodes to the desired count
- Pick Tensor Parallel mode
- Click Launch
Network requirements
| Requirement | Why |
|---|---|
| Passwordless SSH between nodes | eugr’s launcher SSHes into each peer |
| One NIC per cluster subnet | Multi-NIC routing ambiguity causes NCCL hangs |
| UDP port 5679 open on cluster subnet | Peer discovery |
| TCP ports 6379, 10001–10009, 29500 | Ray + NCCL rendezvous |
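The TCP requirements in the table can be spot-checked from one node against another with a short script. This is a minimal sketch, not part of AINode: the function name `check_tcp_ports` and the constant `REQUIRED_TCP_PORTS` are illustrative, and the port list is copied from the table above.

```python
import socket

# Ports copied from the requirements table: 6379, 10001-10009, 29500.
REQUIRED_TCP_PORTS = [6379, *range(10001, 10010), 29500]


def check_tcp_ports(host, ports=REQUIRED_TCP_PORTS, timeout=1.0):
    """Return {port: bool}; True means a TCP connection to host:port succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

A False result only means the connection did not succeed. Before a cluster launch nothing may be listening on these ports yet, so a refused connection is expected then; a timeout is the more likely sign of a firewall dropping traffic.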
VRAM reference
| Nodes | Aggregated VRAM | Largest model |
|---|---|---|
| 1× GB10 | 122 GB | Llama 3.1 70B (fp16) |
| 2× GB10 | 244 GB | Llama 3.1 70B (fp16) |
| 4× GB10 | 487 GB | Llama 3.1 405B (AWQ) |
Troubleshooting
Ray placement group hangs indefinitely
More than one NIC has an IP on the cluster subnet. Remove the extra IP or set NCCL_SOCKET_IFNAME explicitly in the cluster config.
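One way to find the offending interfaces is to parse the JSON output of `ip -j addr show` (iproute2 on Linux) and list every interface holding an IPv4 address inside the cluster subnet. The sketch below assumes a Linux node; the function names and the example subnet are hypothetical and not part of AINode.

```python
import ipaddress
import json
import subprocess


def interfaces_on_subnet(addr_json, cidr):
    """List interface names that hold an IPv4 address inside `cidr`.

    `addr_json` is the parsed output of `ip -j addr show`.
    """
    subnet = ipaddress.ip_network(cidr)
    names = []
    for iface in addr_json:
        for info in iface.get("addr_info", []):
            if info.get("family") != "inet":
                continue  # skip IPv6 entries
            if ipaddress.ip_address(info["local"]) in subnet:
                names.append(iface["ifname"])
                break  # count each interface once
    return names


def local_addr_json():
    """Run `ip -j addr show` on this machine and return the parsed JSON."""
    out = subprocess.run(
        ["ip", "-j", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

Calling `interfaces_on_subnet(local_addr_json(), "192.168.100.0/24")` with the real cluster subnet in place of the placeholder returns the matching interface names. More than one name in the result points to the ambiguity described above.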
NCCL using socket transport instead of RoCE
Check which transport NCCL selected:
ainode logs -f | grep "Using network"
The output should say NET/IB RoCE. If it says Socket, the mlx5 driver isn’t loaded or the ConnectX-7 isn’t on the cluster fabric.
Workers not appearing in topology
Confirm UDP port 5679 is open. All nodes must be on the same broadcast domain (same L2 subnet).
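To confirm that discovery traffic actually reaches a node, a small listener can record which hosts send UDP datagrams to port 5679. This is a rough sketch under assumptions: the payload format of AINode's broadcasts is not described here, so the code only collects sender addresses, and the function name is illustrative. It is best run on a node where AINode is stopped, since the port may otherwise already be bound.

```python
import socket
import time

DISCOVERY_PORT = 5679  # discovery port named in the requirements table


def listen_for_broadcasts(port=DISCOVERY_PORT, duration=10.0, bind_addr=""):
    """Collect the source IPs of UDP datagrams received on `port`.

    Listens for `duration` seconds; an empty `bind_addr` means all interfaces.
    """
    seen = set()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((bind_addr, port))
        deadline = time.monotonic() + duration
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                _payload, (src_ip, _src_port) = sock.recvfrom(65535)
            except socket.timeout:
                break
            seen.add(src_ip)
    return seen
```

If `listen_for_broadcasts()` returns an empty set while other nodes are running, the datagrams are likely being blocked by a firewall or the nodes are not in the same broadcast domain.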
