To start the Triton server on a bare-metal GCP VM:
- Create (or start) a CUDA-enabled VM in GCP Compute Engine.
- See instructions on creating one here: https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#dlvm-image
- IMPORTANT: If you are creating a new VM, make sure it includes the network tag
triton-server
so it gets the correct firewall rules for Triton servers. A sketch of a matching create command follows below.
- For our tests, the VM simulates the conditions of the Triton server used for previous (Kubernetes-based) tests on an isolated machine.
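A minimal sketch of a matching create command (the machine type, image family, disk size, and GPU count here are assumptions; the name, zone, and tag match the commands used elsewhere in these steps):
gcloud compute instances create triton-compute-tesla-t4 \
    --zone=us-central1-f \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=4 \
    --image-family=common-cu110 \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --boot-disk-size=200GB \
    --tags=triton-server
If the firewall rule targeting the tag does not exist in the project yet, something along these lines would open the Triton ports (the rule name is hypothetical):
gcloud compute firewall-rules create triton-server-ports --target-tags=triton-server --allow=tcp:8000-8002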
- Start the VM from the console:
triton-compute-tesla-t4
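Alternatively, the VM can be started from the CLI (zone and project as in the ssh command below):
gcloud compute instances start triton-compute-tesla-t4 --zone "us-central1-f" --project "harrisgroup-223921"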
- Make sure you are authenticated with Google Cloud. Follow the instructions in the prompt:
gcloud auth login
- Connect to the Triton VM in the HarrisGroup project:
gcloud beta compute ssh --zone "us-central1-f" "triton-compute-tesla-t4" --project "harrisgroup-223921"
- Make sure the model repository GCS bucket is mounted via gcsfuse, in this case at /home/$(whoami)/model_repository, which is then bind-mounted into the server container.
- Make sure the destination directory exists, then mount the bucket via gcsfuse:
mkdir -p /home/$(whoami)/model_repository
gcsfuse sonic-model-repo /home/$(whoami)/model_repository
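As a quick sanity check, list the mount point; the model directories from the bucket should appear:
ls /home/$(whoami)/model_repository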
- Start the server via Docker run:
docker run -d --gpus=4 --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 --mount type=bind,source=/home/$(whoami)/model_repository,target=/srv nvcr.io/nvidia/tritonserver:21.05-py3 tritonserver --model-repository=/srv
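To verify the container came up, check its status and logs (the ancestor filter assumes the same image tag as above):
docker ps --filter ancestor=nvcr.io/nvidia/tritonserver:21.05-py3
docker logs $(docker ps -q --filter ancestor=nvcr.io/nvidia/tritonserver:21.05-py3)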
- Find the VM's external IP and use it as the host for your tests. The container is configured to expose HTTP on 8000, gRPC on 8001, and metrics on 8002:
gcloud compute instances describe triton-compute-tesla-t4 --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
- Try querying the metrics endpoint. If the server started correctly, the following should work:
curl -vvv http://<vm_ip>:8002/metrics
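You can also check overall readiness via Triton's standard HTTP health endpoint on the HTTP port; it should return HTTP 200 once the models are loaded:
curl -v http://<vm_ip>:8000/v2/health/ready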