Training

Training jobs are batch workloads that run to completion. Unlike workspaces, they do not provide an interactive interface.

Submitting a Training Job

  1. Go to Training in the Run:AI UI
  2. Click New Training
  3. Configure the job:
     • Name: a descriptive name for the job
     • Project: your assigned project
     • Environment: the framework environment
     • Compute Resource: GPU type and count
     • Data Sources: the training data to attach
     • Command: the training script to execute

Example Command

python /workspace/train.py --epochs 10 --batch-size 32 --lr 0.001
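A minimal sketch of the argument handling such a `train.py` might contain, using Python's standard `argparse` module. The flag names mirror the example command above; the defaults and the rest of the script are illustrative, not a Run:AI requirement:

```python
import argparse

def parse_args(argv=None):
    # Flags matching the example command above.
    # Defaults apply when a flag is omitted from the job's Command field.
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"epochs={args.epochs} batch_size={args.batch_size} lr={args.lr}")
    # ... training loop would go here ...
```

Because the Command field is passed to the container verbatim, any flags your parser defines can be adjusted per-job without rebuilding the environment image.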

Distributed Training

For distributed training across multiple GPU nodes:

  1. Set Workers to the number of GPU nodes
  2. Run:AI automatically configures MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK
  3. Use PyTorch DDP or Horovod in your training script
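The steps above can be sketched as follows. This snippet reads the environment variables Run:AI configures and shows where a PyTorch DDP initialization call would go; the helper name and the localhost defaults (used only when running outside the cluster) are illustrative:

```python
import os

def dist_config():
    # Run:AI injects these variables into each worker's environment.
    # The fallback values let the sketch run outside the cluster.
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

if __name__ == "__main__":
    cfg = dist_config()
    # In a real PyTorch DDP script you would instead call
    # torch.distributed.init_process_group("nccl"), which reads the
    # same MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK variables itself.
    print(cfg)
```

Since PyTorch's `env://` initialization reads these variables directly, a DDP script usually needs no Run:AI-specific code at all.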

Monitoring

  • Logs: Click on a training job to view stdout/stderr in real time
  • Metrics: GPU utilization, memory usage, and throughput are shown in the job dashboard
  • TensorBoard: If your script writes TensorBoard logs, you can view them from the job details

Job States

State       Meaning
Pending     Queued, waiting for resources
Running     Actively training
Succeeded   Completed successfully
Failed      Exited with an error
Preempted   Stopped to make room for higher-priority jobs

Training jobs are not interactive. Use a workspace if you need to debug your code before submitting a training job.