Set up the inference endpoint

Running the inference demo

Connect to your inf1.xlarge instance and update tensorflow-neuron, aws-neuron-runtime and aws-neuron-tools.

Update inference EC2 instance

Update to the latest Neuron software by executing the following commands:

source activate aws_neuron_tensorflow_p36
conda install numpy=1.17.2 --yes --quiet
conda update tensorflow-neuron

Launching the BERT-Large demo server

Copy the compiled model (bert-saved-model-neuron) from your c5.4xlarge instance to your inf1.xlarge instance.

scp -r -i <PEM key file>  ./bert-saved-model-neuron ubuntu@<instance DNS>:~/ # if Ubuntu-based AMI
scp -r -i <PEM key file>  ./bert-saved-model-neuron ec2-user@<instance DNS>:~/  # if using AML2-based AMI

Place the model in the same directory as the bert_demo scripts. Then, from the same conda environment, launch the BERT-Large demo server:

sudo systemctl restart neuron-rtd
python bert_server.py --dir bert-saved-model-neuron --batch 6 --parallel 4

This loads 4 BERT-Large models, one into each of the 4 NeuronCores found in an inf1.xlarge instance. For each of the 4 models, the BERT-Large demo server opportunistically stitches asynchronous requests together into batches of 6.
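The opportunistic stitching described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the demo server's actual code; the queue, `next_batch`, and dummy-padding names are assumptions:

```python
# Sketch of opportunistic batching: drain whatever real requests are waiting,
# then pad the remainder of the fixed-size batch with dummy entries.
from queue import Queue, Empty

BATCH_SIZE = 6
DUMMY = None  # placeholder input used only to fill out an incomplete batch

def next_batch(pending: Queue):
    """Collect up to BATCH_SIZE real requests; pad the rest with dummies."""
    batch = []
    while len(batch) < BATCH_SIZE:
        try:
            batch.append(pending.get_nowait())
        except Empty:
            break  # no more waiting requests; stop stitching and pad
    real = len(batch)
    batch.extend([DUMMY] * (BATCH_SIZE - real))
    return batch, real

q = Queue()
for r in ["req1", "req2"]:
    q.put(r)
batch, real = next_batch(q)
# batch now has 6 slots: 2 real requests followed by 4 dummy entries
```

The fixed batch size matches what the model was compiled for; padding lets a partially filled batch run immediately instead of waiting for more traffic.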

Wait for the bert_server to finish loading the BERT-Large models into Inferentia memory. When it is ready to accept requests, it prints the inferences per second once every second. This count reflects real inferences only; dummy requests created to fill batches are not credited toward Inferentia performance.
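Crediting only real inferences can be sketched as a small meter that subtracts dummy padding before counting. This is a hypothetical illustration, not the demo server's implementation:

```python
# Per-second throughput accounting that credits only real requests:
# dummy slots added for batching are subtracted before counting.
import time

class ThroughputMeter:
    def __init__(self):
        self.count = 0
        self.last = time.monotonic()

    def record(self, batch_size: int, dummies: int):
        self.count += batch_size - dummies  # only real inferences count

    def rate(self):
        """Return inferences/second since the last call, then reset."""
        now = time.monotonic()
        elapsed = now - self.last
        ips = self.count / elapsed if elapsed > 0 else 0.0
        self.count, self.last = 0, now
        return ips

m = ThroughputMeter()
m.record(6, 4)  # a batch of 6 with 4 dummy slots counts as 2 inferences
```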

Sending requests to server from multiple clients

Wait until the bert demo server is ready to accept requests. Then, on the same inf1.xlarge instance, launch a separate Linux terminal. From the bert_demo directory, execute the following commands:

source activate aws_neuron_tensorflow_p36
for i in {1..96}; do python <client script> --cycle 128 & done

This spins up 96 clients, each of which sends 128 inference requests. The expected performance is about 360 inferences/second for a single inf1.xlarge instance.
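A quick sanity check on the numbers above, using only the figures quoted in this section:

```python
# 96 clients each send 128 requests; at the quoted ~360 inferences/second
# for one inf1.xlarge, the whole run should take roughly half a minute.
clients = 96
requests_per_client = 128
total = clients * requests_per_client      # 12288 total inference requests
throughput = 360                           # inferences/second (quoted figure)
expected_seconds = total / throughput      # roughly 34 seconds
```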

Using public BERT SavedModels

We now provide a compilation script with better compatibility across the various flavors of BERT SavedModels generated from the public BERT code. The current limitations are:

  1. Using the original file
  2. The BERT SavedModel is generated using estimator.export_saved_model
  3. The BERT SavedModel uses a fixed sequence length of 128 (you can verify this with saved_model_cli show --dir /path/to/user/bert/savedmodel --all)
  4. The neuron-cc version is at least 1.0.12000.0
  5. The aws-neuron-runtime version is at least 1.0.7000.0
  6. The --batch_size argument passed to this script is at most 4
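The version requirements in items 4 and 5 can be checked with a small helper that compares dotted version strings numerically (string comparison would get "1.0.12000.0" vs "1.0.7000.0" wrong). This helper is not part of the demo, just a sketch:

```python
# Compare dotted version strings component by component, as integers,
# so that 1.0.12000.0 correctly ranks above 1.0.7000.0.
def at_least(installed: str, minimum: str) -> bool:
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(minimum)

assert at_least("1.0.12000.0", "1.0.12000.0")   # neuron-cc minimum
assert at_least("1.0.9999.0", "1.0.7000.0")     # runtime newer than minimum
assert not at_least("1.0.6000.0", "1.0.7000.0")
```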

Example usage is shown below:

export BERT_LARGE_SAVED_MODEL="/path/to/user/bert-large/savedmodel"
python <compile script> --input_saved_model $BERT_LARGE_SAVED_MODEL --output_saved_model ./bert-saved-model-neuron --batch_size=1