Multi-GPU Distributed Training

Exercise guide — refer to the official documentation for full details.


Overview

Multi-GPU distributed training workloads run a distributed training job to completion and then exit. Unlike workspaces, they're not interactive.


Create a Multi-GPU Distributed Training Workload

  1. Navigate to Workload Manager > Workloads
  2. Click + New Workload and select Training
  3. Select the omega-project-4-gpu project
  4. Set Workload architecture to Distributed (rather than Standard)
  5. Select Start from Scratch
  6. Select PyTorch as the Framework
  7. Select Workers Only
  8. Set the Name to pytorch-distributed-training-example
  9. Click Continue

Environment

  1. Select an existing environment, or create a new one whose image and command run your PyTorch training script
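
The environment's image and command are what actually launch the distributed job. As a sketch only (the script name below is a placeholder, not a value from this exercise), a PyTorch environment typically starts one process per GPU with torchrun:

```shell
# Hypothetical launch command for a 1-worker, 4-GPU PyTorch workload.
# --nproc_per_node should match the number of GPU devices requested per worker.
torchrun --nnodes=1 --nproc_per_node=4 train.py
```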

Compute Resources

  1. Set 4 GPU devices as the compute resource
  2. Set the number of workers to 1
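
With 1 worker and 4 GPU devices, the job runs 4 training processes in total. A small sketch (function name is illustrative) of how the world size follows from these two settings, mirroring the RANK/LOCAL_RANK environment variables a PyTorch launcher such as torchrun injects into each process:

```python
import os

def world_size(num_workers: int, gpus_per_worker: int) -> int:
    """Total number of training processes across the job."""
    return num_workers * gpus_per_worker

# For this exercise: 1 worker x 4 GPUs = 4 processes.
print(world_size(1, 4))  # → 4

# Inside a running process, the launcher provides (illustrative defaults):
rank = int(os.environ.get("RANK", 0))            # global process index
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
```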

Data & Storage

  1. Select distributed-training-pvc as the data source

Extended Resources

  1. Toggle Increase shared memory size
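
This toggle matters because PyTorch DataLoader worker processes hand batches back to the trainer through shared memory (/dev/shm); with a container's small default limit, multi-worker loading can crash. A minimal sketch, with an illustrative toy dataset, of the pattern that depends on shared memory:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; real workloads move far more data through shared memory.
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))

# num_workers > 0 spawns loader processes that return tensors via /dev/shm,
# which is why the workload benefits from an increased shared memory size.
loader = DataLoader(dataset, batch_size=16, num_workers=2)

batches = sum(1 for _ in loader)
print(batches)  # 64 samples / 16 per batch = 4 batches
```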

Submit

  1. Click Create Training
  2. Wait for the workload to reach Running status
  3. Select your workload and click Show Details to monitor progress

Verify

The training job should complete and move to Completed status. Check the logs to confirm the training output.

Note: Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
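
The note above can be made concrete with PyTorch's DistributedDataParallel, which handles both cases: the same code runs whether the processes sit on one node's GPUs or span several nodes. A minimal single-process, CPU-only sketch using the gloo backend (the rendezvous address, model, and data here are illustrative, not part of this exercise):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank: int = 0, world_size: int = 1) -> float:
    # Rendezvous settings; in a real workload the launcher provides these.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        # DDP wraps the model; gradients are averaged across all ranks
        # during backward(), whether ranks share a node or not.
        model = DDP(nn.Linear(8, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(16, 8), torch.randn(16, 1)
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        return loss.item()
    finally:
        dist.destroy_process_group()

print(isinstance(train_step(), float))  # → True
```

On GPUs, each process would additionally pin itself to one device (via LOCAL_RANK) and use the nccl backend instead of gloo.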