How it works
AINode uses Ray for cross-node orchestration and NCCL for GPU-to-GPU tensor communication. The head node:

- Discovers workers via UDP broadcast
- SSHes into each peer and starts a vLLM worker container
- Forms a Ray cluster
- Launches vLLM with `--tensor-parallel-size N`

With tensor parallelism, each of the N workers holds 1/N of the model weights. All-reduce collectives run over NCCL, preferring RoCE (RDMA) at 200 Gbps when ConnectX-7 hardware is present.
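To make the 1/N split and the all-reduce step concrete, here is a toy, pure-Python sketch of a row-parallel linear layer. It is not vLLM's or NCCL's implementation: the function names (`shard_rows`, `all_reduce_sum`, `tensor_parallel_matvec`) are hypothetical, the "workers" are simulated in a single process, and real tensor parallelism mixes column- and row-parallel layers on GPUs.

```python
def matvec(x, w):
    """Multiply row vector x (length k) by matrix w (k rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def shard_rows(w, n):
    """Split w into n equal row blocks, so each worker holds 1/n of the weights."""
    if len(w) % n != 0:
        raise ValueError("row count must be divisible by n")
    step = len(w) // n
    return [w[r * step:(r + 1) * step] for r in range(n)]


def all_reduce_sum(partials):
    """Element-wise sum of per-worker partial outputs.

    Stands in for the all-reduce collective that would run across GPUs.
    """
    return [sum(vals) for vals in zip(*partials)]


def tensor_parallel_matvec(x, w, n):
    """Compute x @ w with the weights sharded across n simulated workers."""
    shards = shard_rows(w, n)
    step = len(w) // n
    # Each worker multiplies its slice of the input by its weight shard.
    partials = [
        matvec(x[r * step:(r + 1) * step], shards[r]) for r in range(n)
    ]
    # Summing the partial results recovers the full output.
    return all_reduce_sum(partials)


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    w = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(matvec(x, w))                     # unsharded result
    print(tensor_parallel_matvec(x, w, 2))  # same result with N=2
```

Each simulated worker only ever touches its own shard of `w`; the only cross-worker traffic is the summed partial output, which is why the interconnect bandwidth matters for this layout.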
Verified configurations
| Config | Total VRAM | Models |
|---|---|---|
| 2× DGX Spark (TP=2) | 244 GB | Llama 3.1 70B fp16 |
| 4× DGX Spark + ASUS (TP=4) | 487 GB | Llama 3.1 405B AWQ |
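
As a rough sanity check on the table, the weight footprint can be estimated from parameter count and bytes per parameter. This is a back-of-the-envelope sketch, not an AINode utility: the helper names are hypothetical, parameter counts are the nominal 70B and 405B, AWQ is assumed to mean 4-bit weights (0.5 bytes per parameter), and KV cache, activations, and quantization metadata are ignored.

```python
# Assumed bytes per parameter; "awq-4bit" assumes 4-bit AWQ quantization.
BYTES_PER_PARAM = {"fp16": 2.0, "awq-4bit": 0.5}


def weight_gb(params_billion, dtype):
    """Approximate weight size in GB (1e9 params * bytes/param / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def per_worker_weight_gb(params_billion, dtype, tp):
    """Weights held by each worker when sharded with tensor-parallel size tp."""
    return weight_gb(params_billion, dtype) / tp


def fits(params_billion, dtype, total_vram_gb):
    """True if the weights alone fit in the listed total VRAM."""
    return weight_gb(params_billion, dtype) < total_vram_gb


if __name__ == "__main__":
    # (label, params in billions, dtype, TP size, total VRAM from the table)
    configs = [
        ("Llama 3.1 70B fp16", 70, "fp16", 2, 244),
        ("Llama 3.1 405B AWQ", 405, "awq-4bit", 4, 487),
    ]
    for label, params, dtype, tp, vram in configs:
        print(label, weight_gb(params, dtype),
              per_worker_weight_gb(params, dtype, tp),
              fits(params, dtype, vram))
```

Under these assumptions both models leave headroom for KV cache and activations, but the estimate covers weights only and should not be read as a guarantee that a given context length or batch size will fit.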
