Setup EC2 Instance

Launch EC2 instances

For this demo, launch two EC2 instances :

  • a c5.4xlarge instance for compiling the BERT-Large Model and
  • an inf1.xlarge instance for running inference

For both of these instances choose the latest Ubuntu 18 Deep Learning AMI (DLAMI).

Compiling Neuron compatible BERT-Large

First connect to a c5.4xlarge instance and update tensorflow-neuron and neuron-cc

Update compilation EC2 instance

Update to the latest Neuron software by executing the following commands :

source activate aws_neuron_tensorflow_p36
conda install numpy=1.17.2 --yes --quiet
conda update tensorflow-neuron

Note: if your tensorflow-neuron version on the inference instance is lower than 1.15.0.1.0.1333.0, you will need to run this demo on inf1.2xlarge instead of inf1.xlarge.

Compile open source BERT-Large saved model using Neuron compatible BERT-Large implementation

Neuron software works with tensorflow saved models. Users should bring their own BERT-Large saved model for this section. This demo will run inference for the Microsoft Research Paraphrase Corpus(MRPC) task and the saved model should be fine tuned for MRPC. Users who need additional help to fine-tune the model for MRPC or to create a saved model can refer to Appendix 1.

In the same conda environment and directory bert_demo scripts, run the following :

export BERT_LARGE_SAVED_MODEL="/path/to/user/bert-large/savedmodel"
python bert_model.py --input_saved_model $BERT_LARGE_SAVED_MODEL --output_saved_model ./bert-saved-model-neuron --batch_size=6 --aggressive_optimizations

This compiles BERT-Large pointed to by $BERT_LARGE_SAVED_MODEL for an input size of 128 and batch size of 6. The compilation output is stored in bert-saved-model-neuron. Copy this to your Inf1 instance for inferencing.

The bert_model.py script encapsulates all the steps necessary for this process. For details on what is done by bert_model.py please refer to Appendix 2.