Multi-GPU Distributed Training
Exercise guide — refer to the official documentation for full details.
Overview
Multi-GPU distributed training workloads run a distributed training job to completion and then exit. Unlike workspaces, they're not interactive.
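As background for what such a workload runs: each process in a distributed PyTorch job typically learns its place in the job from environment variables that the launcher (for example, torchrun) exports. The variable names below (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) are the standard torchrun conventions; the helper function itself is a hypothetical sketch for illustration, not part of this exercise.

```python
import os

def read_dist_config(env=os.environ):
    """Read the rendezvous settings a launcher such as torchrun
    exports into each worker process (standard variable names).
    Hypothetical helper, shown for illustration only."""
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }

# Example: values as they might appear inside one of four processes.
cfg = read_dist_config({"RANK": "2", "WORLD_SIZE": "4",
                        "MASTER_ADDR": "10.0.0.5", "MASTER_PORT": "29500"})
print(cfg["rank"], cfg["world_size"])
```

A real training script would pass these values to `torch.distributed.init_process_group` before wrapping the model.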
Create a Multi-GPU Distributed Training Workload
- Navigate to Workload Manager > Workloads
- Click + New Workload and select Training
- Select the omega-project-4-gpu project
- Set Workload architecture to Distributed (not Standard)
- Select Start from Scratch
- Select PyTorch as the Framework
- Select Workers Only
- Fill in:

  | Field | Value |
  | --- | --- |
  | Name | pytorch-distributed-training-example |

- Click Continue
Environment
- Select or create an environment
Compute Resources
- Set 4 GPU devices as the compute resource
- Set the number of workers to 1
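With 1 worker and 4 GPU devices, the job runs four training processes, each of which sees a distinct slice of the dataset. The sketch below shows the round-robin index split that samplers such as PyTorch's DistributedSampler perform; `shard_indices` is a hypothetical stand-in written here for illustration.

```python
def shard_indices(num_samples, world_size, rank):
    """Round-robin split of dataset indices across processes,
    mirroring the strategy of PyTorch's DistributedSampler.
    Hypothetical helper for illustration."""
    return list(range(rank, num_samples, world_size))

# 1 worker x 4 GPU devices -> world_size of 4 processes.
for rank in range(4):
    print(rank, shard_indices(10, world_size=4, rank=rank))
```

Each rank processes only its shard, so a full pass over all ranks covers every sample exactly once.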
Data & Storage
- Select distributed-training-pvc as the data source
Extended Resources
- Toggle Increase shared memory size
Submit
- Click Create Training
- Wait for the workload to reach Running status
- Select your workload and click on Show Details to monitor progress
Verify
The training job should complete and move to Completed status. Check the logs to confirm the training output.
Note: Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
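The distinction can be made concrete with the rank arithmetic multi-node launchers use: each node contributes a block of consecutive global ranks, so single-node multi-GPU training is simply the `node_rank = 0` case. The formula is standard; the function wrapping it is a hypothetical sketch.

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Standard mapping used by multi-node launchers: each node
    contributes nproc_per_node consecutive global ranks.
    Hypothetical helper for illustration."""
    return node_rank * nproc_per_node + local_rank

# Single node, 4 GPUs: global ranks 0-3 (multi-GPU training).
# Second node (node_rank=1), same 4 GPUs: ranks 4-7 (distributed training).
print(global_rank(node_rank=1, local_rank=2, nproc_per_node=4))
```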