
Ports

Port          Protocol  Service
3000          TCP       AINode web UI + management API
8000          TCP       Inference API + Prometheus metrics
5679          UDP       Peer discovery broadcast
6379          TCP       Ray head
10001–10009   TCP       Ray workers
29500         TCP       NCCL rendezvous

Critical: single NIC per cluster subnet

Each node must have exactly one NIC with an IP on the cluster subnet. Multiple NICs on the same subnet cause Ray placement group creation to hang indefinitely with no error message.
# Check: should show exactly ONE address on your cluster subnet (192.168.0.x in this example)
ip -4 addr show | grep 'inet 192\.168\.0\.'
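
The manual check above can also be scripted, for example as a pre-flight step before starting a node. The following is a minimal sketch, not part of AINode: the function names addrs_on_subnet and has_single_cluster_addr are hypothetical, and it assumes the one-line output format of `ip -o -4 addr show`, where the interface name is the second field and the address is the fourth.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# "<interface> <address/prefixlen>" for every address starting with the given prefix.
# Pass the prefix with a trailing dot (e.g. "192.168.0.") so that "192.168.1."
# cannot accidentally match 192.168.10.x.
addrs_on_subnet() {
  awk -v p="$1" '$3 == "inet" && index($4, p) == 1 { print $2, $4 }'
}

# Hypothetical helper: succeed only if exactly one address on stdin matches the prefix.
has_single_cluster_addr() {
  local count
  count=$(addrs_on_subnet "$1" | wc -l | tr -d ' ')
  [ "$count" -eq 1 ]
}

# Example usage (commented out so the helpers can be sourced safely):
#   ip -o -4 addr show | has_single_cluster_addr "192.168.0." \
#     || echo "expected exactly one address on the cluster subnet" >&2
```

Reading from stdin rather than calling `ip` inside the function keeps the check easy to test against captured output from another node.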

NCCL interface selection

Set in ~/.ainode/config.json:
{"cluster_interface": "enP2p1s0f1np1"}
AINode sets NCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME, UCX_NET_DEVICES, and NCCL_IB_HCA automatically from this value.
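
Because all four variables are derived from this one setting, a typo in the interface name is worth catching early. The sketch below is an illustration only and does not reproduce how AINode reads its config: the helper names are hypothetical, the value is extracted with sed (assuming a plain string with no escaped quotes), and the existence check relies on the standard Linux sysfs path /sys/class/net.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the "cluster_interface" value from a JSON config file.
# Uses sed to avoid a dependency on jq; assumes the value contains no escaped quotes.
cluster_interface_from_config() {
  sed -n 's/.*"cluster_interface"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if a non-empty interface name exists on this host.
interface_exists() {
  [ -n "$1" ] && [ -e "/sys/class/net/$1" ]
}

# Example usage (commented out so the helpers can be sourced safely):
#   iface=$(cluster_interface_from_config ~/.ainode/config.json)
#   interface_exists "$iface" || echo "cluster_interface '$iface' not found" >&2
```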

RoCE tuning (ConnectX-7)

# Verify NCCL is using RoCE
ainode logs -f | grep "Using network"
# Expected: "NET/IB ... mlx5_0:1/RoCE ... speed=200000"

# Optional: enable GPU Direct RDMA
sudo modprobe nvidia-peermem
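
To confirm that the module actually loaded, its name can be looked up in the `lsmod` listing. This is a minimal sketch with a hypothetical helper name (module_loaded); it relies on the fact that the kernel reports module names with underscores, so nvidia-peermem appears as nvidia_peermem.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lsmod-style output on stdin and succeed if the named
# module is listed. Hyphens are normalised to underscores to match lsmod's naming.
module_loaded() {
  local name
  name=$(printf '%s' "$1" | sed 's/-/_/g')
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }'
}

# Example usage (commented out so the helper can be sourced safely):
#   lsmod | module_loaded nvidia-peermem && echo "nvidia-peermem is loaded"
```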

Tailscale

Use Tailscale for management SSH only. Never use Tailscale IPs for NCCL: the ~2 ms RTT collapses multi-GPU throughput.
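
One way to guard against this mistake is to reject Tailscale addresses before they reach any cluster setting. Tailscale assigns IPv4 addresses from the 100.64.0.0/10 range, so the first octet is 100 and the second falls between 64 and 127. The helper below is a hypothetical sketch of that range check and is not part of AINode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the IPv4 address lies in 100.64.0.0/10,
# the range Tailscale assigns addresses from.
is_tailscale_ip() {
  local first rest second
  first=${1%%.*}        # first octet
  rest=${1#*.}          # everything after the first dot
  second=${rest%%.*}    # second octet
  [ "$first" = "100" ] \
    && [ "$second" -ge 64 ] 2>/dev/null \
    && [ "$second" -le 127 ] 2>/dev/null
}

# Example usage (commented out so the helper can be sourced safely):
#   is_tailscale_ip "$candidate_ip" && echo "refusing Tailscale address for NCCL" >&2
```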