Appendix

Appendix 1

Users who need help finetuning BERT-Large for MRPC and creating a saved model may follow the instructions here.

Connect to the c5.4xlarge compilation EC2 instance you started above and download these three items:

  1. git clone this GitHub repo.
  2. Download the GLUE data as described here. Do not run the fine-tuning command.
  3. Download the desired pre-trained BERT-Large checkpoint from here. This is the model we will fine-tune.

Next, edit run_classifier.py in the cloned bert repo to apply the patch described in the following git diff. Alternatively, you can copy the lines prefixed with "+" and add them to run_classifier.py.

diff --git a/run_classifier.py b/run_classifier.py
index 817b147..c9426bc 100644
--- a/run_classifier.py
+++ b/run_classifier.py
@@ -955,6 +955,18 @@ def main(_):
         drop_remainder=predict_drop_remainder)
 
     result = estimator.predict(input_fn=predict_input_fn)
+    features = {
+        "input_ids": tf.placeholder(shape=[None, FLAGS.max_seq_length], dtype=tf.int32, name='input_ids'),
+        "input_mask": tf.placeholder(shape=[None, FLAGS.max_seq_length], dtype=tf.int32, name='input_mask'),
+        "segment_ids": tf.placeholder(shape=[None, FLAGS.max_seq_length], dtype=tf.int32, name='segment_ids'),
+        "label_ids": tf.placeholder(shape=[None], dtype=tf.int32, name='label_ids'),
+        "is_real_example": tf.placeholder(shape=[None], dtype=tf.int32, name='is_real_example'),
+    }
+    serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(features)
+    estimator._export_to_tpu = False  ## !!important to add this
+    estimator.export_saved_model(
+        export_dir_base='./bert_classifier_saved_model',
+        serving_input_receiver_fn=serving_input_fn)
 
     output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
     with tf.gfile.GFile(output_predict_file, "w") as writer:

NOTE: Users who are interested may refer to this link for additional background information on the patch, but it is not necessary for running this demo.

Then, from the bert_demo directory, run the following:

source activate aws_neuron_tensorflow_p36
export BERT_REPO_DIR="/path/to/cloned/bert/repo/directory"
export GLUE_DIR="/path/to/glue/data/directory"
export BERT_BASE_DIR="/path/to/pre-trained/bert-large/checkpoint/directory"
./tune_save.sh

A saved model will be created in $BERT_REPO_DIR/bert-saved-model/random_number/, where random_number is a random number generated for every run. Use this saved model to continue with the rest of the demo.
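
Optionally, you can sanity-check the export by loading the saved model in a TF 1.x session and printing its serving signature. The following is a minimal sketch; the path is a placeholder for the directory reported by tune_save.sh.

import tensorflow as tf

# Placeholder path: substitute the actual $BERT_REPO_DIR/bert-saved-model/<random_number>/
saved_model_dir = "/path/to/bert-saved-model/1234567890"

with tf.Session(graph=tf.Graph()) as sess:
    # Load the exported saved model and inspect its default serving signature
    meta_graph = tf.saved_model.loader.load(sess, ["serve"], saved_model_dir)
    sig = meta_graph.signature_def["serving_default"]
    print("Inputs :", {name: t.name for name, t in sig.inputs.items()})
    print("Outputs:", {name: t.name for name, t in sig.outputs.items()})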

Appendix 2

For all BERT variants, we currently need to augment the standard Neuron compilation process for performance tuning. In the future, we intend to automate this tuning so that users can rely on the standard Neuron compilation process, which requires only a one-line change in user source code. The standard compilation process is described here.
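
For reference, the standard process amounts to roughly the single call sketched below with tensorflow-neuron for TF 1.x; the paths are hypothetical, and the linked documentation remains the authoritative usage.

import tensorflow.neuron as tfn

# The single compile() call is the "one line change"; paths are placeholders
tfn.saved_model.compile("/path/to/original_saved_model", "/path/to/compiled_saved_model")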

The augmented Neuron compilation process is encapsulated in the bert_model.py script, which performs the following steps:

  1. Define a Neuron-compatible implementation of BERT-Large. For inference, this is functionally equivalent to the open source BERT-Large. The changes needed to create a Neuron-compatible BERT-Large implementation are described in Appendix 3.
  2. Extract the BERT-Large weights from the open source saved model pointed to by --input_saved_model and associate them with the Neuron-compatible model (a sketch of this step follows the list).
  3. Invoke TensorFlow-Neuron to compile the Neuron-compatible model for Inferentia using the newly associated weights.
  4. Save the compiled model to the location given by --output_saved_model.
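
As an illustration of step 2 only (this is a sketch, not the actual bert_model.py code), the weights of the open source saved model can be read back into a Python dict in a TF 1.x session before being re-associated with the Neuron-compatible graph; the path is a placeholder for --input_saved_model.

import tensorflow as tf

input_saved_model = "/path/to/bert-saved-model/1234567890"  # placeholder

with tf.Session(graph=tf.Graph()) as sess:
    # Import the open source saved model and evaluate every variable to a numpy array
    tf.saved_model.loader.load(sess, ["serve"], input_saved_model)
    weights = {var.name: sess.run(var) for var in tf.global_variables()}

print("Extracted {} weight tensors".format(len(weights)))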

Appendix 3

The Neuron-compatible implementation of BERT-Large is functionally equivalent to the open source version when used for inference. However, the implementation details differ; here is the list of changes:

  1. Data Type Casting: If the original BERT-Large is an FP32 model, bert_model.py contains manually defined cast operators to enable mixed precision. FP16 is used for multi-head attention and fully-connected layers, and FP32 everywhere else (see the sketch after this list). This will be automated in a future release.
  2. Remove Unused Operators: A model typically contains training operators that are not used in inference, including a subset of the reshape operators. These operators do not affect inference functionality and have been removed.
  3. Reimplementation of Selected Operators: A number of operators (mainly mask operators) have been reimplemented to bypass a known compiler issue. This will be fixed in a planned future release.
  4. Manually Partition Embedding Ops to CPU: The embedding portion of BERT-Large has been manually partitioned into a subgraph that executes on the host CPU, without noticeable performance impact. In the near future, we plan to implement this through compiler auto-partitioning, without the need for user intervention.
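
Below is a minimal sketch of the manual mixed-precision pattern from item 1, assuming a plain fully-connected layer; the function and argument names are illustrative and do not come from bert_model.py.

import tensorflow as tf

def dense_mixed_precision(x, kernel, bias):
    # Cast FP32 activations and weights down to FP16 for the matmul ...
    y16 = tf.matmul(tf.cast(x, tf.float16), tf.cast(kernel, tf.float16))
    # ... then cast the result back to FP32 before the bias add and the rest of the graph
    return tf.cast(y16, tf.float32) + bias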