This sample provides code to integrate Intel® Extension for PyTorch with the Triton Inference Server framework. The project provides a custom Python backend for Intel® Extension for PyTorch and an additional dynamic batching algorithm to improve performance. The code can be used as a performance benchmark for the Bert-Base and Bert-Large models.
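For context, a Triton Python backend is a model.py file that implements the TritonPythonModel interface. The sketch below shows only the general shape of that interface; it is not this project's model.py (which additionally loads the Bert model, applies Intel® Extension for PyTorch optimizations, and implements the custom batching), and the tensor names INPUT0/OUTPUT0 are illustrative placeholders.

# Generic Triton Python backend skeleton (illustrative sketch, not this project's model.py).
# triton_python_backend_utils is provided by the Triton Python backend at runtime.
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # args carries the model configuration; a real backend would load the
        # PyTorch model and apply Intel® Extension for PyTorch optimizations here.
        self.model_config = args["model_config"]

    def execute(self, requests):
        # Triton may hand over several requests at once; return one response per request.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # Placeholder computation; the real backend runs Bert inference here.
            output0 = pb_utils.Tensor("OUTPUT0", input0.astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses

    def finalize(self):
        # Called when the model is unloaded; release any resources here.
        pass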
You'll need to install Docker Engine on your development system. Note that while Docker Engine is free to use, Docker Desktop may require you to purchase a license. See the Docker Engine Server installation instructions for details.
Currently, the AI Inference samples support the following Bert models fine-tuned on the Squad dataset:
- bert_base - PyTorch+Intel® Extension for PyTorch Bert Base uncased
- bert_large - PyTorch+Intel® Extension for PyTorch Bert Large uncased
To add Intel® Extension for PyTorch to the Triton Inference Server container image, use the following command to build the container used in this example.
docker build -t triton:ipex .
Tip
You can customize the PyTorch package versions that get installed by adding --build-arg PYTORCH_VERSION=<new-version>, based on the arguments found in the Dockerfile.
Note
If you are working behind a corporate proxy, you will need to include the following parameters in your docker build command: --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy}. An example that combines both options follows.
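For example, a build that pins the PyTorch version and passes the proxy settings through might look like the following sketch; the <new-version> placeholder is illustrative, so substitute a version supported by the Dockerfile, and drop the proxy arguments if you do not need them.

docker build \
--build-arg PYTORCH_VERSION=<new-version> \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
-t triton:ipex .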
Start the Inference Server.
docker run \
-d --rm --shm-size=1g \
--net host --name server \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $PWD:/models \
triton:ipex
Note
If you are working behind a corporate proxy, you will need to include the following parameters in your docker run command: -e http_proxy=${http_proxy} -e https_proxy=${https_proxy}.
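For example, the same run command with the proxy variables passed through looks like this:

docker run \
-d --rm --shm-size=1g \
--net host --name server \
-e http_proxy=${http_proxy} -e https_proxy=${https_proxy} \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $PWD:/models \
triton:ipex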
Check the server logs and verify that both models have been registered successfully:
docker logs server
Test the server connection, get the model metadata, and make a test inference request to a model; a Python client alternative is sketched after the curl commands below.
curl -v localhost:8000/v2
curl -v localhost:8000/v2/health/ready
curl -v localhost:8000/v2/models/bert_base/versions/1
curl -v -X POST localhost:8000/v2/models/bert_base/versions/1/infer -d \
'{
"inputs": [
{
"name": "INPUT0",
"shape": [ 1, 4 ],
"datatype": "INT64",
"data": [ 1, 2, 3, 4 ]
}
]
}'
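If you prefer a scripted client over curl, Triton's Python client library can issue the same requests. The sketch below assumes tritonclient[http] and numpy are installed on the client machine; the output tensor name is model-specific, so read it from the metadata response rather than assuming it.

# Minimal Python client sketch (assumes: pip install "tritonclient[http]" numpy).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("Server ready:", client.is_server_ready())
print("Model metadata:", client.get_model_metadata("bert_base", model_version="1"))

# Build the same [1, 4] INT64 input as the curl example above.
input0 = httpclient.InferInput("INPUT0", [1, 4], "INT64")
input0.set_data_from_numpy(np.array([[1, 2, 3, 4]], dtype=np.int64))

result = client.infer(model_name="bert_base", model_version="1", inputs=[input0])
print(result.get_response())  # full response, including the model's output names and shapes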
Tip
For more information about the Triton Inference Server HTTP/REST and GRPC APIs, see the Predict Protocol v2.
Triton Inference Server comes with the perf_analyzer tool, which can be used to benchmark the inference server from any client. Use the docker run command below to benchmark the inference server from the same container image.
docker run \
-it --rm \
--net host --name client \
triton:ipex \
perf_analyzer \
-u localhost:8000 \
-m bert_base \
--shape INPUT0:128 \
--input-data zero \
--sync \
--concurrency-range 1 \
--measurement-mode count_windows \
--measurement-request-count 1000
Modify the perf_analyzer command to test different models with various concurrency levels, request counts, and input data, as in the example below.
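For instance, the following (illustrative) variation benchmarks the bert_large model at client concurrency levels 1 through 4 with a larger request count; adjust the values to match your own test plan.

docker run \
-it --rm \
--net host --name client \
triton:ipex \
perf_analyzer \
-u localhost:8000 \
-m bert_large \
--shape INPUT0:128 \
--input-data zero \
--sync \
--concurrency-range 1:4 \
--measurement-mode count_windows \
--measurement-request-count 5000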
When finished with benchmarking, stop the inference server with the following command:
docker container stop server
The AI Inference samples project is licensed under the Apache License, Version 2.0. Refer to the LICENSE file for the full license text and copyright notice.
This distribution includes third party software governed by separate license terms.
3-clause BSD license:
- model.py - for Intel® Extension for PyTorch optimized workload
This third party software, even if included with the distribution of the Intel software, may be governed by separate license terms, including without limitation, third party license terms, other Intel software license terms, and open source software license terms. These separate license terms govern your use of the third party programs as set forth in the THIRD-PARTY-PROGRAMS file.
Intel, the Intel logo and Intel Xeon are trademarks of Intel Corporation or its subsidiaries.
- Other names and brands may be claimed as the property of others.
© Intel Corporation