NOTE: Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly. See https://issuetracker.google.com/issues/141255952. Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
The steps below describe how to set up a model repository, use custom manifests to launch the inference server, and then send inference requests to the running server. If desired, you can also access a Grafana endpoint to see real-time metrics reported by the inference server.
Remember to deploy the NVIDIA device plugin in the cluster before placing GPU workloads on it. The plugin allows each node to expose its installed GPUs to the Kubernetes environment:
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml
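As an optional sanity check (allocatable counts depend on your node pool), you can confirm that the plugin has exposed the GPUs by inspecting the node resources:

$ kubectl describe nodes | grep -i nvidia.com/gpu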
If you already have a model repository you may use that with this helm chart. If you do not have a model repository, you can check out a local copy of the inference server source repository to create an example model repository:
$ git clone https://github.com/NVIDIA/triton-inference-server.git
Triton Server needs a repository of models that it will make available for inferencing. For this example you will place the model repository in a Google Cloud Storage bucket:
$ gsutil mb gs://triton-inference-server-repository
Following the instructions, download the example model repository to your system and copy it into the GCS bucket:
$ gsutil cp -r docs/examples/model_repository gs://triton-inference-server-repository/model_repository
Make sure the bucket permissions are set so that the inference server can access the model repository. If the bucket is public then no additional changes are needed and you can proceed to the "Running The Inference Server" section.
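For example, one way to grant read access with gsutil (shown here as public read for simplicity; tighten this to your own security requirements):

$ gsutil iam ch allUsers:objectViewer gs://triton-inference-server-repository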
If bucket permissions need to be set with the GOOGLE_APPLICATION_CREDENTIALS environment variable then perform the following steps:
Generate a Google service account JSON key file, named gcp-creds.json, with the proper permissions.
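As a sketch, one way to create such a key with gcloud (the service account name triton-sa and the objectViewer role are illustrative; adjust the project and permissions to your setup):

$ gcloud iam service-accounts create triton-sa
$ gcloud projects add-iam-policy-binding myproject --member "serviceAccount:triton-sa@myproject.iam.gserviceaccount.com" --role "roles/storage.objectViewer"
$ gcloud iam service-accounts keys create gcp-creds.json --iam-account triton-sa@myproject.iam.gserviceaccount.com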
Create a Kubernetes secret from this file:
$ kubectl create configmap gcpcreds --from-literal "project-id=myproject"
$ kubectl create secret generic gcpcreds --from-file gcp-creds.json
Modify templates/deployment.yaml to include the GOOGLE_APPLICATION_CREDENTIALS environment variable:
env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: /secret/gcp-creds.json
Modify templates/deployment.yaml to mount the secret in a volume at /secret:
volumeMounts:
  - name: vsecret
    mountPath: "/secret"
    readOnly: true
...
volumes:
  - name: vsecret
    secret:
      secretName: gcpcreds
If you already have a model repository you may use that with this helm chart. If you do not have a model repository, you can check out a local copy of the inference server source repository to create an example model repository:
$ git clone https://github.com/NVIDIA/triton-inference-server.git
Triton Server needs a repository of models that it will make available for inferencing. For the AWS case you will place the model repository in an S3 bucket (IMPORTANT: the bucket must be in the same region as the Kubernetes/EKS cluster):
$ aws s3api create-bucket --bucket inf-server-repo --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
Copy desired models into the S3 bucket:
$ aws s3 cp ~/facile s3://inf-server-repo/model_repository/facile --recursive
Make sure bucket permissions are set so that the inference server can access the model repository. In addition, mount an AWS config file inside the pod specifying the region where the bucket is located.
Create Kubernetes secrets with the right access key values; the pod will then consume these secrets in the form of environment variables:
$ kubectl create secret generic aws-access-key-id --from-literal=aws_access_key_id=<KEY_ID>
$ kubectl create secret generic aws-secret-access-key --from-literal=aws_secret_access_key=<SECRET_ACCESS_KEY>
Create Kubernetes secrets from the AWS config and credentials files so they can be mounted inside the pod:
$ kubectl create secret generic aws-credentials --from-file=./credentials
$ kubectl create secret generic aws-config --from-file=./config
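For reference, the ./config and ./credentials files above use the standard AWS CLI format. A minimal version for the us-west-2 bucket in this example (the key values are placeholders) might look like:

./config:
[default]
region = us-west-2

./credentials:
[default]
aws_access_key_id = <KEY_ID>
aws_secret_access_key = <SECRET_ACCESS_KEY>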
Then point an environment variable to the config file, which must be readable by the user running the container.
Modify the deployment.yaml manifest to include the AWS_CONFIG_FILE environment variable:
env:
  - name: AWS_CONFIG_FILE
    value: '/opt/tensorrtserver/aws/config'
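The aws-config secret also needs to be mounted at that path. A possible volume stanza, mirroring the GCP example above (the volume name vaws is illustrative):

volumeMounts:
  - name: vaws
    mountPath: "/opt/tensorrtserver/aws"
    readOnly: true
...
volumes:
  - name: vaws
    secret:
      secretName: aws-config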
The inference server metrics are collected by Prometheus and viewable by Grafana. The inference server helm chart assumes that Prometheus and Grafana are available so this step must be followed even if you don't want to use Grafana.
Use the prometheus-operator to install these components. The serviceMonitorSelectorNilUsesHelmValues flag is needed so that Prometheus can find the inference server metrics in the example release deployed below:
$ helm install --name example-metrics --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false stable/prometheus-operator
Then port-forward to the Grafana service so you can access it from your local browser:
$ kubectl port-forward service/example-metrics-grafana 8080:80
Now you should be able to navigate in your browser to localhost:8080 and see the Grafana login page. Use username=admin and password=prom-operator to login.
An example Grafana dashboard is available in dashboard.json. Use the import function in Grafana to import and view this dashboard.
Deploy the inference server using the default configuration with:
$ kubectl apply -f deployment_tensorrt-inference-server-v100.yaml
$ kubectl apply -f service_inference-v100-svc.yaml
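The exact contents of these manifests depend on your cluster. As a rough sketch only (the name and selector labels here are illustrative, not the actual file contents), a LoadBalancer service exposing the three server ports could look like:

apiVersion: v1
kind: Service
metadata:
  name: example-triton-inference-server
spec:
  type: LoadBalancer
  selector:
    app: triton-inference-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002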
Use kubectl to see status and wait until the inference server pods are running:
$ kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
example-triton-inference-server-5f74b55885-n6lt7   1/1     Running   0          2m21s
Now that the inference server is running you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server. In this case it is 34.83.9.133:
$ kubectl get services
NAME                              TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                                        AGE
...
example-triton-inference-server   LoadBalancer   10.18.13.28   34.83.9.133   8000:30249/TCP,8001:30068/TCP,8002:32723/TCP   47m
The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the status of the inference server from the HTTP endpoint:
$ curl 34.83.9.133:8000/api/status
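You can also check the metrics endpoint directly from port 8002; /metrics is the standard Prometheus scrape path:

$ curl 34.83.9.133:8002/metrics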
Follow the instructions to get the example image classification client, which can be used to perform inferencing with the image classification models being served by the inference server. For example:
$ image_client -u 34.83.9.133:8000 -m resnet50_netdef -s INCEPTION -c3 mug.jpg
Request 0, batch size 1
Image 'images/mug.jpg':
    504 (COFFEE MUG) = 0.723992
    968 (CUP) = 0.270953
    967 (ESPRESSO) = 0.00115997
Once you've finished using the inference server, use kubectl to delete the deployment and service:
$ kubectl delete deployment <deployment_name>
$ kubectl delete service <service_name>
For the Prometheus and Grafana services you should explicitly delete CRDs as described in https://github.com/helm/charts/tree/master/stable/prometheus-operator#uninstalling-the-chart:
$ kubectl delete crd alertmanagers.monitoring.coreos.com servicemonitors.monitoring.coreos.com podmonitors.monitoring.coreos.com prometheuses.monitoring.coreos.com prometheusrules.monitoring.coreos.com
You may also want to delete the GCS bucket you created to hold the model repository:
$ gsutil rm -r gs://triton-inference-server-repository
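Similarly, if you created the S3 bucket for the AWS case you may want to remove it (--force deletes the bucket contents as well):

$ aws s3 rb s3://inf-server-repo --force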