Multi-GPU Distributed Training

Exercise guide — refer to the official documentation for full details.


Overview

Multi-GPU distributed training workloads run a distributed training job to completion and then exit. Unlike workspaces, they're not interactive.


Create a Multi-GPU Distributed Training Workload

  1. Navigate to Workload Manager > Workloads
  2. Click + New Workload and select Training
  3. Select the omega-project-4-gpu project
  4. Set Workload architecture to Distributed (rather than Standard)
  5. Select Start from Scratch
  6. Select PyTorch as the Framework
  7. Select Workers Only
  8. Set the Name to pytorch-distributed-training-example
  9. Click Continue

Environment

  1. Select an existing environment, or create a new one whose image and command run your PyTorch training script
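
The environment's image and command are what actually launch the distributed job. As a sketch only (the script name below is a placeholder, not a value from this exercise), a PyTorch environment typically starts one process per GPU with torchrun:

```shell
# Hypothetical launch command for a 1-worker, 4-GPU PyTorch workload.
# --nproc_per_node should match the number of GPU devices requested per worker.
torchrun --nnodes=1 --nproc_per_node=4 train.py
```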

Compute Resources

  1. Set 4 GPU devices as the compute resource
  2. Set the number of workers to 1
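
With 1 worker and 4 GPU devices, the job runs 4 training processes in total. A small sketch (function name is illustrative) of how the world size follows from these two settings, mirroring the RANK/LOCAL_RANK environment variables a PyTorch launcher such as torchrun injects into each process:

```python
import os

def world_size(num_workers: int, gpus_per_worker: int) -> int:
    """Total number of training processes across the job."""
    return num_workers * gpus_per_worker

# For this exercise: 1 worker x 4 GPUs = 4 processes.
print(world_size(1, 4))  # → 4

# Inside a running process, the launcher provides (illustrative defaults):
rank = int(os.environ.get("RANK", 0))            # global process index
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
```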

Data & Storage

  1. Select distributed-training-pvc as the data source

Extended Resources

  1. Toggle Increase shared memory size
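
This toggle matters because PyTorch DataLoader worker processes hand batches back to the trainer through shared memory (/dev/shm); with a container's small default limit, multi-worker loading can crash. A minimal sketch, with an illustrative toy dataset, of the pattern that depends on shared memory:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; real workloads move far more data through shared memory.
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))

# num_workers > 0 spawns loader processes that return tensors via /dev/shm,
# which is why the workload benefits from an increased shared memory size.
loader = DataLoader(dataset, batch_size=16, num_workers=2)

batches = sum(1 for _ in loader)
print(batches)  # 64 samples / 16 per batch = 4 batches
```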

Submit

  1. Click Create Training
  2. Wait for the workload to reach Running status
  3. Select your workload and click Show Details to monitor progress

Verify

The training job should complete and move to Completed status. Check the logs to confirm the training output.

Note: Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
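
The note above can be made concrete with PyTorch's DistributedDataParallel, which handles both cases: the same code runs whether the processes sit on one node's GPUs or span several nodes. A minimal single-process, CPU-only sketch using the gloo backend (the rendezvous address, model, and data here are illustrative, not part of this exercise):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank: int = 0, world_size: int = 1) -> float:
    # Rendezvous settings; in a real workload the launcher provides these.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        # DDP wraps the model; gradients are averaged across all ranks
        # during backward(), whether ranks share a node or not.
        model = DDP(nn.Linear(8, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(16, 8), torch.randn(16, 1)
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        return loss.item()
    finally:
        dist.destroy_process_group()

print(isinstance(train_step(), float))  # → True
```

On GPUs, each process would additionally pin itself to one device (via LOCAL_RANK) and use the nccl backend instead of gloo.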