Training

Training jobs are batch workloads that run to completion. Unlike workspaces, they do not provide an interactive interface.

Submitting a Training Job

  1. Go to Training in the Run:AI UI
  2. Click New Training
  3. Configure the job:
     • Name: a descriptive name for the job
     • Project: your assigned project
     • Environment: the framework environment
     • Compute Resource: GPU type and count
     • Data Sources: the training data to attach
     • Command: the training script to execute

Example Command

python /workspace/train.py --epochs 10 --batch-size 32 --lr 0.001
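A minimal sketch of the argument handling such a `train.py` might contain, using Python's standard `argparse` module. The flag names mirror the example command above; the defaults and the rest of the script are illustrative, not a Run:AI requirement:

```python
import argparse

def parse_args(argv=None):
    # Flags matching the example command above.
    # Defaults apply when a flag is omitted from the job's Command field.
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"epochs={args.epochs} batch_size={args.batch_size} lr={args.lr}")
    # ... training loop would go here ...
```

Because the Command field is passed to the container verbatim, any flags your parser defines can be adjusted per-job without rebuilding the environment image.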

Distributed Training

For distributed training across multiple GPU nodes:

  1. Set Workers to the number of GPU nodes
  2. Run:AI automatically configures MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK
  3. Use PyTorch DDP or Horovod in your training script
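The steps above can be sketched as follows. This snippet reads the environment variables Run:AI configures and shows where a PyTorch DDP initialization call would go; the helper name and the localhost defaults (used only when running outside the cluster) are illustrative:

```python
import os

def dist_config():
    # Run:AI injects these variables into each worker's environment.
    # The fallback values let the sketch run outside the cluster.
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

if __name__ == "__main__":
    cfg = dist_config()
    # In a real PyTorch DDP script you would instead call
    # torch.distributed.init_process_group("nccl"), which reads the
    # same MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK variables itself.
    print(cfg)
```

Since PyTorch's `env://` initialization reads these variables directly, a DDP script usually needs no Run:AI-specific code at all.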

Monitoring

  • Logs: Click on a training job to view stdout/stderr in real time
  • Metrics: GPU utilization, memory usage, and throughput are shown in the job dashboard
  • TensorBoard: If your script writes TensorBoard logs, you can view them from the job details

Job States

State       Meaning
Pending     Queued, waiting for resources
Running     Actively training
Succeeded   Completed successfully
Failed      Exited with an error
Preempted   Stopped to make room for higher-priority jobs

Training jobs are not interactive. Use a workspace if you need to debug your code before submitting a training job.