Inference

Inference deployments serve trained models as REST API endpoints. They run continuously and auto-scale based on request load.

Deploying a Model

  1. Go to Inference in the Run:AI UI
  2. Click New Deployment
  3. Configure:
     • Name: A name for the endpoint (e.g., my-model-api)
     • Project: Your assigned project
     • Environment: Select Triton or a custom serving image
     • Model Path: Mount path to your model artifacts
     • Compute Resource: GPU type and count per replica
     • Replicas: Minimum and maximum replica count
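Conceptually, the form fields above map onto a declarative spec. The YAML below is purely illustrative — the field names are hypothetical and do not reflect the actual Run:AI schema — and only shows how the pieces relate:

```
# Illustrative only — field names are hypothetical, not the Run:AI schema.
name: my-model-api
project: team-a            # your assigned project
environment: triton        # Triton or a custom serving image
modelPath: /models         # mount path to model artifacts
compute:
  gpu: 1                   # GPUs per replica
replicas:
  min: 1
  max: 4                   # auto-scales with request load
```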

Using Triton Inference Server

NVIDIA Triton supports multiple model formats:

Format                   Framework
ONNX                     Framework-agnostic
TensorRT                 Optimized for NVIDIA GPUs
PyTorch (TorchScript)    PyTorch models
TensorFlow SavedModel    TensorFlow models

Model Repository Structure

models/
  my_model/
    config.pbtxt
    1/
      model.onnx
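For the layout above, a minimal config.pbtxt for an ONNX model might look like the following sketch. The input/output names, shapes, and batch size are placeholders — they must match the tensors of your exported model:

```
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Triton picks up the numbered subdirectory (1/) as the model version; adding a 2/ directory alongside it deploys a new version without changing the config.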

Calling the Endpoint

Once deployed, the inference endpoint provides a REST API:

curl -X POST https://<endpoint-url>/v2/models/my_model/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "input", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": [...]}]}'

Monitoring

  • Request rate: Requests per second to each endpoint
  • Latency: p50, p95, p99 response times
  • GPU utilization: How efficiently the model uses the GPU

Inference deployments consume GPU resources continuously. Delete deployments you no longer need.