
Endpoints

Method  Path                                   Description
POST    /api/training/jobs                     Submit a new training job
GET     /api/training/jobs                     List all jobs
GET     /api/training/jobs/{id}                Job status + progress
DELETE  /api/training/jobs/{id}                Cancel a job
GET     /api/training/jobs/{id}/logs           Tail logs (?tail=N)
GET     /api/training/jobs/{id}/output         List output artifacts
GET     /api/training/jobs/{id}/output/{file}  Download an artifact
POST    /api/training/jobs/{id}/merge          Merge LoRA into base model
POST    /api/training/jobs/{id}/resume         Resume from checkpoint
GET     /api/training/templates                List templates (built-in + custom)
POST    /api/training/templates                Save a custom template
GET     /api/training/stats                    Aggregate counters
POST    /api/training/estimate                 Memory/time estimates
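For scripted use, the paths above can be assembled with a small helper. This is a minimal Python sketch using only the standard library, assuming the server is reachable at http://localhost:3000 as in the curl examples; training_url, build_request and call are hypothetical helper names, not part of the API.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:3000"  # assumption: host/port from the curl examples


def training_url(*segments, base=BASE_URL, **query):
    """Build <base>/api/training/<seg>/<seg>...?<query>, percent-encoding each segment."""
    path = "/".join(quote(str(s), safe="") for s in segments)
    url = f"{base.rstrip('/')}/api/training/{path}"
    return f"{url}?{urlencode(query)}" if query else url


def build_request(method, url, payload=None):
    """Prepare (but do not send) a request; JSON bodies get a JSON Content-Type."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


def call(method, url, payload=None, timeout=30):
    """Send the request and decode a JSON response body (JSON endpoints only)."""
    with urlopen(build_request(method, url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Examples (not executed here, they need a running server):
# call("GET", training_url("jobs", "abc123", "logs", tail=100))
# call("DELETE", training_url("jobs", "abc123"))
```

call() decodes JSON, so it suits the JSON endpoints; GET /api/training/jobs/{id}/output/{file} presumably returns the artifact itself and is better streamed to disk.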

Submit a job

curl -X POST http://localhost:3000/api/training/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "dataset_path": "alpaca.jsonl",
    "method": "lora",
    "num_epochs": 3,
    "batch_size": 4,
    "lora_rank": 16,
    "eval_split": 0.1
  }'
Response:
{
  "job_id": "abc123",
  "status": "pending",
  "config": { "base_model": "...", "method": "lora", ... }
}
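The same submission from Python, as a sketch with urllib from the standard library. It assumes the local server URL from the curl example and that a successful response carries job_id as shown above; submit_request and submit_job are illustrative names.

```python
import json
from urllib.request import Request, urlopen

JOBS_URL = "http://localhost:3000/api/training/jobs"  # assumption: local server as above


def submit_request(config: dict) -> Request:
    """Build the same POST the curl example sends (JSON body + Content-Type header)."""
    return Request(
        JOBS_URL,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(config: dict, timeout: float = 30) -> str:
    """Submit a job and return its job_id; urlopen raises HTTPError on 4xx/5xx."""
    with urlopen(submit_request(config), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["job_id"]


config = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "dataset_path": "alpaca.jsonl",
    "method": "lora",
    "num_epochs": 3,
    "batch_size": 4,
    "lora_rank": 16,
    "eval_split": 0.1,
}
# job_id = submit_job(config)  # requires a running server
```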

Training config fields

{
  "base_model": "Qwen/Qwen2.5-7B-Instruct",
  "dataset_path": "path/or/hf-dataset-id",
  "method": "lora",
  "num_epochs": 3,
  "batch_size": 4,
  "learning_rate": 0.0002,
  "lora_rank": 16,
  "lora_alpha": 32,
  "max_seq_length": 2048,
  "eval_split": 0.1,
  "eval_steps": 0,
  "use_gradient_checkpointing": false,
  "distributed": false,
  "num_nodes": 1,
  "hf_token": "hf_xxx",
  "wandb_project": "my-project",
  "run_name": "alpaca-7b-run1"
}
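Only some of these fields appear in the submit example, which suggests the rest are optional, but the server's defaults and validation rules are not documented here. The sketch below is a hypothetical client-side wrapper: unset fields are left out of the payload, the range checks are assumptions rather than server rules, and the token is read from the HF_TOKEN environment variable so it is not hard-coded in scripts.

```python
import os
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TrainingConfig:
    """Client-side mirror of the fields above. Fields left as None are omitted
    from the payload (assumption: the server then applies its own defaults)."""
    base_model: str
    dataset_path: str
    method: Optional[str] = None
    num_epochs: Optional[int] = None
    batch_size: Optional[int] = None
    learning_rate: Optional[float] = None
    lora_rank: Optional[int] = None
    lora_alpha: Optional[int] = None
    max_seq_length: Optional[int] = None
    eval_split: Optional[float] = None
    eval_steps: Optional[int] = None
    use_gradient_checkpointing: Optional[bool] = None
    distributed: Optional[bool] = None
    num_nodes: Optional[int] = None
    hf_token: Optional[str] = None
    wandb_project: Optional[str] = None
    run_name: Optional[str] = None

    def to_payload(self) -> dict:
        """Apply hypothetical sanity checks, then drop unset fields."""
        for name in ("num_epochs", "batch_size", "lora_rank", "max_seq_length", "num_nodes"):
            value = getattr(self, name)
            if value is not None and value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")
        if self.eval_split is not None and not 0 <= self.eval_split < 1:
            raise ValueError(f"eval_split must be in [0, 1), got {self.eval_split}")
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        # An explicit hf_token wins; otherwise fall back to the environment.
        token = self.hf_token or os.environ.get("HF_TOKEN")
        if token:
            payload["hf_token"] = token
        return payload
```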

List output artifacts

curl http://localhost:3000/api/training/jobs/abc123/output
Response:
{
  "job_id": "abc123",
  "status": "completed",
  "files": [
    {"name": "adapter_model.safetensors", "size_mb": 142.3, "download_url": "..."},
    {"name": "config.json", "size_mb": 0.01, "download_url": "..."}
  ],
  "total_files": 2,
  "total_size_mb": 142.31
}
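Artifacts can be fetched one by one through GET /api/training/jobs/{id}/output/{file}. A sketch that walks a listing like the one above and streams each file to disk; it builds URLs from that documented path instead of relying on download_url (whose exact form is not shown), and refusing to download unless status is completed is a conservative choice of this sketch, not a documented requirement.

```python
import os
import shutil
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:3000"  # assumption: local server as in the examples


def artifact_urls(listing: dict, base: str = BASE_URL) -> list:
    """Map each file in an /output listing to (name, download URL)."""
    job_id = quote(listing["job_id"], safe="")
    return [
        (entry["name"], f"{base}/api/training/jobs/{job_id}/output/{quote(entry['name'], safe='')}")
        for entry in listing["files"]
    ]


def download_all(listing: dict, dest_dir: str, base: str = BASE_URL) -> list:
    """Stream every artifact into dest_dir and return the local paths."""
    if listing.get("status") != "completed":
        raise RuntimeError(f"job {listing['job_id']} is {listing.get('status')!r}, not completed")
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for name, url in artifact_urls(listing, base):
        # basename() keeps a name containing "/" from escaping dest_dir
        path = os.path.join(dest_dir, os.path.basename(name))
        with urlopen(url, timeout=60) as resp, open(path, "wb") as out:
            shutil.copyfileobj(resp, out)  # stream; adapter weights can be large
        paths.append(path)
    return paths
```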

Merge LoRA adapter

curl -X POST http://localhost:3000/api/training/jobs/abc123/merge
The response includes a merge_job_id; poll GET /api/training/jobs/{merge_job_id} to track the merge.
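A polling loop for the merge job (or any job ID) might look like the following. Only completed is confirmed as a status value by the responses above; failed and cancelled are assumed names for the other terminal states, and the interval and timeout are arbitrary. fetch, sleep and clock are injectable so the loop can be exercised without a server.

```python
import json
import time
from urllib.request import urlopen

BASE_URL = "http://localhost:3000"  # assumption: local server as in the examples
# "completed" appears in the responses above; "failed" and "cancelled" are
# assumed names for the other terminal states.
TERMINAL = {"completed", "failed", "cancelled"}


def fetch_job(job_id: str) -> dict:
    """GET /api/training/jobs/{id} and decode the JSON body."""
    with urlopen(f"{BASE_URL}/api/training/jobs/{job_id}", timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_job(job_id, fetch=fetch_job, interval=10.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status; return the last status body."""
    deadline = clock() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r} after {timeout}s")
        sleep(interval)
```

Typical use: pass the merge_job_id from the merge response to wait_for_job, then check the returned status before fetching the merged model.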

Resume from checkpoint

# Use latest checkpoint
curl -X POST http://localhost:3000/api/training/jobs/abc123/resume

# Specify a checkpoint
curl -X POST http://localhost:3000/api/training/jobs/abc123/resume \
  -H "Content-Type: application/json" \
  -d '{"checkpoint": "checkpoint-500"}'
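The resume call from Python, as a sketch: with no checkpoint it sends an empty POST (latest checkpoint), otherwise a JSON body naming one. The response body for resume is not documented here, so resume_job simply decodes whatever JSON comes back; resume_request and resume_job are illustrative names.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:3000"  # assumption: local server as in the examples


def resume_request(job_id: str, checkpoint: Optional[str] = None) -> Request:
    """POST /api/training/jobs/{id}/resume; no body means 'latest checkpoint'."""
    url = f"{BASE_URL}/api/training/jobs/{quote(job_id, safe='')}/resume"
    if checkpoint is None:
        return Request(url, method="POST")
    body = json.dumps({"checkpoint": checkpoint}).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


def resume_job(job_id: str, checkpoint: Optional[str] = None) -> dict:
    """Send the resume request; assumes the server answers with JSON."""
    with urlopen(resume_request(job_id, checkpoint), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```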