Performance Optimization

Building on the inference script created in section c., let's measure the inference performance on Inferentia.
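Before starting, you can optionally confirm that the compiled SavedModel from section c. is in your working directory. A minimal sketch, assuming the directory name resnet50_neuron from section c.:

import os

# A SavedModel directory always contains a saved_model.pb file
compiled_model_dir = 'resnet50_neuron'
print(os.path.exists(os.path.join(compiled_model_dir, 'saved_model.pb')))  # expect: True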

Step 1. Modify inference script

Create a Python script named infer_resnet50_perf.py with the following content.

We will use the SavedModel compiled in section c., so work in the same directory.

import os
import time
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions

# Create four NeuronCore groups of one core each ('4x1') so the Neuron
# runtime loads a copy of the model onto all four NeuronCores
os.environ['NEURONCORE_GROUP_SIZES'] = '4x1'

# Load models
compiled_model_dir = 'resnet50_neuron'
predictor_inferentia = tf.contrib.predictor.from_saved_model(compiled_model_dir)

# Create input from image
img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
img_arr = image.img_to_array(img_sgl)
img_arr2 = np.expand_dims(img_arr, axis=0)
img_arr3 = preprocess_input(img_arr2)

model_feed_dict = {'input': img_arr3}

# warmup (the first call triggers one-time initialization on the NeuronCores)
infa_rslts = predictor_inferentia(model_feed_dict)

num_inferences = 10000

# Run inference on Neuron Cores, Display results
start = time.time()
for _ in range(num_inferences):
    infa_rslts = predictor_inferentia(model_feed_dict)
elapsed_time = time.time() - start

print('By Neuron Core - num_inferences:{:>6}[images], elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(num_inferences, elapsed_time, num_inferences / elapsed_time))

Step 2. Run inference script

Run the inference script infer_resnet50_perf.py to run inference on the Neuron cores.

python infer_resnet50_perf.py

You will get a result similar to the following.

By Neuron Core - num_inferences: 10000[images], elapsed_time: 38.45[sec], Throughput:  260.06[images/sec]
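The throughput figure averages the whole run and hides per-request behavior. If you also want latency percentiles, a small variation of the timing loop will record them; this is a sketch that reuses predictor_inferentia, model_feed_dict, and num_inferences from infer_resnet50_perf.py:

import time
import numpy as np

# Time each request individually instead of only the aggregate
latencies = []
for _ in range(num_inferences):
    t0 = time.time()
    infa_rslts = predictor_inferentia(model_feed_dict)
    latencies.append(time.time() - t0)

lat_ms = np.array(latencies) * 1000.0  # seconds -> milliseconds
print('p50: {:6.2f} ms, p99: {:6.2f} ms'.format(
    np.percentile(lat_ms, 50), np.percentile(lat_ms, 99)))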

Step 3. View Neuron core usage

Run the neuron-top command in another terminal while inference on the Neuron cores is active to show NeuronCore and memory usage.

neuron-top

You will see all four Neuron cores on the Inferentia chip running at only 15-20% utilization. Because '4x1' places a model copy on each core and the script issues requests one at a time from a single thread, only one copy is busy at any moment, so each core spends most of its time idle.

neuron-top - 05:48:03
Models: 4 loaded, 4 running. NeuronCores: 4 used.
0000:00:1f.0 Utilizations: NC0 17.54%, NC1 17.81%, NC2 17.64%, NC3 17.81%,
Model ID   Device    NeuronCore%   Device Mem   Host Mem   Model Name
10012      nd0:nc3   17.81           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733
10011      nd0:nc2   17.64           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733
10010      nd0:nc1   17.81           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733
10009      nd0:nc0   17.54           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733

Step 4. Modify inference script

Create a Python script named infer_resnet50_perf2.py with the following content.

In this script, inference requests are submitted in parallel using Python's ThreadPoolExecutor so that all four Neuron cores on the Inferentia chip are kept fully loaded.

import os
import time
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from concurrent import futures

# Create four NeuronCore groups of one core each ('4x1') so the Neuron
# runtime loads a copy of the model onto all four NeuronCores
os.environ['NEURONCORE_GROUP_SIZES'] = '4x1'

# Load models
compiled_model_dir = 'resnet50_neuron'
predictor_inferentia = tf.contrib.predictor.from_saved_model(compiled_model_dir)

# Create input from image
img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
img_arr = image.img_to_array(img_sgl)
img_arr2 = np.expand_dims(img_arr, axis=0)
img_arr3 = preprocess_input(img_arr2)

model_feed_dict = {'input': img_arr3}

# warmup (the first call triggers one-time initialization on the NeuronCores)
infa_rslts = predictor_inferentia(model_feed_dict)

num_inferences = 10000

# Run inference on Neuron Cores, Display results
start = time.time()
with futures.ThreadPoolExecutor(8) as exe:
    fut_list = []
    for _ in range(num_inferences):
        fut = exe.submit(predictor_inferentia, model_feed_dict)
        fut_list.append(fut)
    for fut in fut_list:
        infa_rslts = fut.result()
elapsed_time = time.time() - start

print('By Neuron Core - num_inferences:{:>6}[images], elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(num_inferences, elapsed_time, num_inferences / elapsed_time))
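The pool size of 8 here is a reasonable default rather than a tuned value: with four NeuronCores, a few more threads than cores helps keep each core's input queue full while requests are in flight. To find the best value on your own instance, you could sweep pool sizes with a helper like this sketch (our own measure_throughput function, reusing predictor_inferentia and model_feed_dict from the script above):

import time
from concurrent import futures

def measure_throughput(num_threads, num_requests=1000):
    # Push num_requests inferences through a pool of num_threads workers
    start = time.time()
    with futures.ThreadPoolExecutor(num_threads) as exe:
        fut_list = [exe.submit(predictor_inferentia, model_feed_dict)
                    for _ in range(num_requests)]
        for fut in fut_list:
            fut.result()
    return num_requests / (time.time() - start)

for n in (1, 2, 4, 8, 16):
    print('{:2d} threads: {:8.2f} images/sec'.format(n, measure_throughput(n)))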

Step 5. Run modified inference script

Run the modified inference script infer_resnet50_perf2.py.

python infer_resnet50_perf2.py

You will see a result like the following. Compared to the previous run, inference throughput has improved by more than 3x (from roughly 260 to 844 images/sec).

By Neuron Core - num_inferences: 10000[images], elapsed_time: 11.85[sec], Throughput:  844.06[images/sec]

Step 6. View Neuron core usage

As we did in Step 3, run the neuron-top command to show Neuron core usage.

neuron-top

You will see all four Neuron cores on the Inferentia chip running at nearly 100% utilization.

neuron-top - 05:49:12
Models: 4 loaded, 4 running. NeuronCores: 4 used.
0000:00:1f.0 Utilizations: NC0 97.95%, NC1 99.97%, NC2 99.89%, NC3 99.87%,
Model ID   Device    NeuronCore%   Device Mem   Host Mem   Model Name
10016      nd0:nc3   99.87           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733
10015      nd0:nc2   99.89           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733
10014      nd0:nc1   99.97           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733
10013      nd0:nc0   97.95           50 MB         1 MB    p/tmpwxmh31_5/neuron_op_d6f098c01c780733