Run a quantization
Open Training → Quantize a Model.
Base model
A Hugging Face repo id (Qwen/Qwen3.5-4B) or a model already in Installed.
An on-disk copy is used automatically when present (offline + reproducible).
Scheme
- AWQ: W4A16, 4-bit weights. Serves as awq_marlin on GB10. Proven.
- NVFP4: Blackwell-native 4-bit float. Newer; verified on dense text models.
Calibration samples
Default 256, drawn from HuggingFaceH4/ultrachat_200k. More samples =
better calibration, longer job.
Push to Hugging Face (optional)
Tick Push result to Hugging Face and (optionally) name the repo. Requires
a write token — see Secrets. The push happens after the
job finishes and creates a private repo under your token’s namespace.
<org--name>-<scheme> (e.g. Qwen--Qwen3.5-4B-awq), ready to launch.
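The naming pattern above can be sketched as a small helper. This is an illustration inferred from the single example shown (Qwen/Qwen3.5-4B with AWQ giving Qwen--Qwen3.5-4B-awq); the function name `output_name` is hypothetical and the lowercasing of the scheme is an assumption, not a documented AINode rule.

```python
def output_name(repo_id: str, scheme: str) -> str:
    """Build '<org--name>-<scheme>' from a repo id of the form 'org/name'.

    Assumption: the '/' in the repo id becomes '--' and the scheme is
    appended in lowercase, as in the documented example.
    """
    return f"{repo_id.replace('/', '--')}-{scheme.lower()}"


print(output_name("Qwen/Qwen3.5-4B", "AWQ"))  # Qwen--Qwen3.5-4B-awq
```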
Multimodal & hybrid models (Qwen3.5)
Qwen3.5 models bundle a vision tower and use Gated-DeltaNet linear attention. AINode handles this automatically: it loads the full model class so the saved config is complete (vLLM-servable), keeps the vision tower, embeddings, lm_head, and the linear_attn projections in bf16, and saves the image processor.
AWQ on Qwen3.5 is verified servable. NVFP4 on Qwen3.5 is experimental and not
yet validated — prefer AWQ for the Qwen3.5 family today.
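The bf16 exclusions described in this section can be pictured as a filter over module names: anything that matches stays in bf16, everything else is a candidate for 4-bit quantization. The sketch below is only an illustration of that idea. The patterns and the example module names (`visual`, `embed_tokens`, and the layer paths) are assumptions for demonstration; only `lm_head` and `linear_attn` appear in the text above, and AINode's actual selection logic is not shown here.

```python
import re

# Hypothetical name patterns; real module names depend on the model class.
KEEP_BF16_PATTERNS = [
    r"(^|\.)visual(\.|$)",   # vision tower (assumed prefix)
    r"embed_tokens",         # token embeddings (assumed name)
    r"(^|\.)lm_head$",       # output head
    r"\.linear_attn\.",      # Gated-DeltaNet linear-attention projections
]


def keep_in_bf16(module_name: str) -> bool:
    """Return True if the module should be left unquantized (bf16)."""
    return any(re.search(p, module_name) for p in KEEP_BF16_PATTERNS)


# Illustrative module names, not read from a real checkpoint.
names = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
    "lm_head",
]

quantized = [n for n in names if not keep_in_bf16(n)]
print(quantized)
```

With these example names, only the `self_attn` and `mlp` projections remain in the quantized set.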
API
"push_to_hf": true (and optionally "hf_repo": "name").
The job validates write scope before running, so a read-only token fails fast.
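As a rough sketch of how the two push fields fit into a job request, the helper below assembles a JSON body. Only `push_to_hf` and `hf_repo` come from the text above; the other keys (`model`, `scheme`, `calibration_samples`) and the function name are hypothetical placeholders, and no endpoint is called because the request URL is not given here.

```python
import json
from typing import Optional


def build_quantize_request(
    model: str,
    scheme: str,
    calibration_samples: int = 256,
    push_to_hf: bool = False,
    hf_repo: Optional[str] = None,
) -> dict:
    """Assemble a request body for a quantization job.

    'model', 'scheme' and 'calibration_samples' are hypothetical key names;
    'push_to_hf' and 'hf_repo' are the fields named in the documentation.
    """
    body = {
        "model": model,
        "scheme": scheme,
        "calibration_samples": calibration_samples,
    }
    if push_to_hf:
        body["push_to_hf"] = True
        if hf_repo:  # optional: omit it to use the default repo name
            body["hf_repo"] = hf_repo
    return body


body = build_quantize_request(
    "Qwen/Qwen3.5-4B", "awq", push_to_hf=True, hf_repo="name"
)
print(json.dumps(body, indent=2))
```

Leaving `hf_repo` out of the body mirrors the "optionally" in the text: the push is still requested, just without an explicit repo name.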
Why 4-bit on GB10
GB10 decode is memory-bandwidth bound. 4-bit weights mean fewer bytes read per token, so a quantized model both fits more easily and decodes faster. AWQ (awq_marlin) is the proven kernel path on GB10’s sm120.
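The bandwidth argument can be made concrete with back-of-envelope arithmetic. The sketch below assumes a 4-billion-parameter model and a memory bandwidth of 273 GB/s (an assumed figure for GB10, not stated in this document), and it ignores quantization scales, the layers kept in bf16, KV-cache reads, and compute time. The result is therefore an optimistic ceiling, not a measured throughput.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights at the given precision."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tok_s(
    n_params: float, bits_per_weight: float, bandwidth_bytes_s: float
) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_bytes_s / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 4e9        # assumed parameter count for a "4B" model
BANDWIDTH = 273e9     # assumed memory bandwidth in bytes/s

bf16_ceiling = decode_ceiling_tok_s(N_PARAMS, 16, BANDWIDTH)
int4_ceiling = decode_ceiling_tok_s(N_PARAMS, 4, BANDWIDTH)

print(f"bf16 : {weight_bytes(N_PARAMS, 16) / 1e9:.1f} GB, <= {bf16_ceiling:.1f} tok/s")
print(f"4-bit: {weight_bytes(N_PARAMS, 4) / 1e9:.1f} GB, <= {int4_ceiling:.1f} tok/s")
```

Under these assumptions the 4-bit weights take a quarter of the memory and raise the bandwidth-bound ceiling by the same factor of four, which is the effect the paragraph above describes.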