# Training
Training jobs are batch workloads that run to completion. Unlike workspaces, they do not provide an interactive interface.
## Submitting a Training Job
- Go to Training in the Run:AI UI
- Click New Training
- Configure:
  - **Name**: A descriptive name for the job
  - **Project**: Your assigned project
  - **Environment**: Select the framework environment
  - **Compute Resource**: GPU type and count
  - **Data Sources**: Attach training data
  - **Command**: The training script to execute
### Example Command

```shell
python /workspace/train.py --epochs 10 --batch-size 32 --lr 0.001
```
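Inside the training script, the flags above can be consumed with standard argument parsing. A minimal sketch of a hypothetical `/workspace/train.py` entry point (the flag names mirror the example command; the script body is illustrative, not a Run:AI requirement):

```python
import argparse


def parse_args(argv=None):
    """Parse the training flags used in the example command."""
    parser = argparse.ArgumentParser(description="Toy training entry point")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # Training loop would go here; stdout appears in the job's log view.
    print(f"epochs={args.epochs} batch_size={args.batch_size} lr={args.lr}")
```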
## Distributed Training
For multi-GPU training:
- Set Workers to the number of GPU nodes
- Run:AI automatically configures `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`
- Use PyTorch DDP or Horovod in your training script
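A training script can read the injected variables before initializing the process group. A minimal sketch (the `ddp_env` helper and its single-process fallback defaults are illustrative; the PyTorch `init_process_group` call is shown in a comment because it only works inside a launched worker):

```python
import os


def ddp_env(env=None):
    """Read the distributed-training variables Run:AI injects into each worker.

    Falls back to single-process defaults so the script also runs locally.
    """
    env = os.environ if env is None else env
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }


if __name__ == "__main__":
    cfg = ddp_env()
    # With PyTorch DDP you would then initialize the process group, e.g.:
    #   import torch.distributed as dist
    #   dist.init_process_group("nccl", rank=cfg["rank"],
    #                           world_size=cfg["world_size"])
    print(cfg)
```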
## Monitoring
- Logs: Click on a training job to view stdout/stderr in real time
- Metrics: GPU utilization, memory usage, and throughput are shown in the job dashboard
- TensorBoard: If your script writes TensorBoard logs, you can view them from the job details
## Job States
| State | Meaning |
|---|---|
| Pending | Queued, waiting for resources |
| Running | Actively training |
| Succeeded | Completed successfully |
| Failed | Exited with an error |
| Preempted | Stopped to make room for higher-priority jobs |
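Because a job can be preempted at any time, long-running training scripts should checkpoint periodically and resume from the last checkpoint on restart. A minimal, framework-agnostic sketch (the JSON file layout and helper names are illustrative; a real script would typically save framework state such as model and optimizer weights instead):

```python
import json
import os


def save_checkpoint(path, epoch, state):
    """Write a checkpoint atomically so a preemption mid-write
    cannot leave a corrupt file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems


def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return {"epoch": 0, "state": {}}
    with open(path) as f:
        return json.load(f)
```

Writing checkpoints to an attached data source (rather than the container's local disk) ensures they survive the pod being rescheduled.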
Training jobs are not interactive. Use a workspace if you need to debug your code before submitting a training job.