# Inference
Inference deployments serve trained models as REST API endpoints. They run continuously and auto-scale based on request load.
## Deploying a Model
- Go to Inference in the Run:AI UI
- Click New Deployment
- Configure:
  - Name: A name for the endpoint (e.g., `my-model-api`)
  - Project: Your assigned project
  - Environment: Select Triton or a custom serving image
  - Model Path: Mount path to your model artifacts
  - Compute Resource: GPU type and count per replica
  - Replicas: Minimum and maximum replica count
## Using Triton Inference Server
NVIDIA Triton Inference Server supports multiple model formats:
| Format | Framework |
|---|---|
| ONNX | Framework-agnostic |
| TensorRT | Optimized for NVIDIA GPUs |
| PyTorch (TorchScript) | PyTorch models |
| TensorFlow SavedModel | TensorFlow models |
### Model Repository Structure

Triton loads models from a versioned repository layout:

```
models/
  my_model/
    config.pbtxt
    1/
      model.onnx
```
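The `config.pbtxt` file describes the model to Triton. A minimal sketch for the ONNX model above, assuming a single FP32 image input and a classification output — the tensor names, shapes, and batch size are illustrative and must match your actual model:

```protobuf
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```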
## Calling the Endpoint
Once deployed, the inference endpoint provides a REST API:
```shell
curl -X POST https://<endpoint-url>/v2/models/my_model/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "input", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": [...]}]}'
```
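Building the request body programmatically avoids hand-writing JSON. A minimal Python sketch that constructs the same KServe v2 inference payload as the curl example — the tensor name `input` and the shape are assumptions that must match your model's `config.pbtxt`:

```python
import json

def build_infer_request(name, shape, datatype, data):
    """Build a KServe v2 inference request body (the format shown in the curl example)."""
    return {"inputs": [{"name": name, "shape": shape, "datatype": datatype, "data": data}]}

# The data field is the tensor flattened in row-major order:
# a 1x3x224x224 FP32 input becomes a flat list of 150528 floats.
payload = build_infer_request("input", [1, 3, 224, 224], "FP32", [0.0] * (3 * 224 * 224))
body = json.dumps(payload)
# POST `body` to https://<endpoint-url>/v2/models/my_model/infer
# with the Content-Type: application/json header.
```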
## Monitoring
- Request rate: Requests per second to each endpoint
- Latency: p50, p95, p99 response times
- GPU utilization: How efficiently the model uses the GPU
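To make the latency percentiles concrete: p50/p95/p99 are cut points in the sorted distribution of response times, so a handful of slow requests inflates p99 long before it moves p50. A stdlib sketch over a made-up sample of latencies:

```python
import statistics

# Hypothetical response times in milliseconds; two slow outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 12, 15,
                13, 14, 90, 17, 15, 13, 12, 16, 14, 13]

# statistics.quantiles(n=100) returns the 1st..99th percentile cut points.
pct = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]
```

Note how the outliers dominate p95/p99 while the median stays near the typical request time.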
Inference deployments consume GPU resources continuously. Delete deployments you no longer need.