# Inference
Inference deployments serve trained models as REST API endpoints. They run continuously and auto-scale based on request load.
## Deploying a Model
- Go to Inference in the Run:AI UI
- Click New Deployment
- Configure:
  - Name: A name for the endpoint (e.g., `my-model-api`)
  - Project: Your assigned project
  - Environment: Select Triton or a custom serving image
  - Model Path: Mount path to your model artifacts
  - Compute Resource: GPU type and count per replica
  - Replicas: Minimum and maximum replica count
## Using Triton Inference Server
NVIDIA Triton Inference Server supports multiple model formats:
| Format | Framework |
|---|---|
| ONNX | Framework-agnostic |
| TensorRT | Optimized for NVIDIA GPUs |
| PyTorch (TorchScript) | PyTorch models |
| TensorFlow SavedModel | TensorFlow models |
### Model Repository Structure

Triton loads models from a versioned repository layout:

```
models/
  my_model/
    config.pbtxt
    1/
      model.onnx
```
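The `config.pbtxt` file describes the model to Triton. A minimal sketch for the ONNX model above, assuming a single FP32 image input and a classification output — the tensor names, shapes, and batch size are illustrative and must match your actual model:

```protobuf
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```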
## Calling the Endpoint
Once deployed, the inference endpoint provides a REST API:
```shell
curl -X POST https://<endpoint-url>/v2/models/my_model/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "input", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": [...]}]}'
```
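Building the request body programmatically avoids hand-writing JSON. A minimal Python sketch that constructs the same KServe v2 inference payload as the curl example — the tensor name `input` and the shape are assumptions that must match your model's `config.pbtxt`:

```python
import json

def build_infer_request(name, shape, datatype, data):
    """Build a KServe v2 inference request body (the format shown in the curl example)."""
    return {"inputs": [{"name": name, "shape": shape, "datatype": datatype, "data": data}]}

# The data field is the tensor flattened in row-major order:
# a 1x3x224x224 FP32 input becomes a flat list of 150528 floats.
payload = build_infer_request("input", [1, 3, 224, 224], "FP32", [0.0] * (3 * 224 * 224))
body = json.dumps(payload)
# POST `body` to https://<endpoint-url>/v2/models/my_model/infer
# with the Content-Type: application/json header.
```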
## Monitoring
- Request rate: Requests per second to each endpoint
- Latency: p50, p95, p99 response times
- GPU utilization: How efficiently the model uses the GPU
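To make the latency percentiles concrete: p50/p95/p99 are cut points in the sorted distribution of response times, so a handful of slow requests inflates p99 long before it moves p50. A stdlib sketch over a made-up sample of latencies:

```python
import statistics

# Hypothetical response times in milliseconds; two slow outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 12, 15,
                13, 14, 90, 17, 15, 13, 12, 16, 14, 13]

# statistics.quantiles(n=100) returns the 1st..99th percentile cut points.
pct = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]
```

Note how the outliers dominate p95/p99 while the median stays near the typical request time.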
Inference deployments consume GPU resources continuously. Delete deployments you no longer need.