Changing batch size

Batching refers to grouping multiple inference requests and processing them together. It is typically used to optimize throughput at the expense of higher latency. Batching is usually implemented layer by layer, so that each layer's weights are reused for every inference in the batch before new weights need to be fetched. This lets Neuron amortize the cost of reading weights from external memory and thus improves overall hardware efficiency.

To enable the batching optimization, we first need to compile the model for a target batch size. This is done by specifying the batch size in the input tensor's batch dimension during compilation. Users are encouraged to evaluate multiple batch sizes to determine the optimal latency/throughput deployment point, which is application-dependent.

Step 1. Create a Python script for compiling the model

Create a Python script with the following content:

import shutil
import tensorflow.neuron as tfn

model_dir = 'resnet50'

for batch_size in [1, 2, 4, 8, 16]:
    # Prepare the export directory (remove any previous one)
    compiled_model_dir = 'resnet50_neuron_batch' + str(batch_size)
    shutil.rmtree(compiled_model_dir, ignore_errors=True)

    # Compile using Neuron
    tfn.saved_model.compile(model_dir, compiled_model_dir, batch_size=batch_size, dynamic_batch_size=True)

Step 2. Run the compilation script

Run the compilation script; it takes roughly 10 minutes on an inf1.2xlarge instance and compiles the model for target batch sizes 1, 2, 4, 8, and 16.
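Once the script finishes, there should be one compiled SavedModel directory per batch size. An optional sanity check (a minimal sketch; the directory names follow the compile script above):

```python
import os

# Verify that a compiled SavedModel directory exists for every batch size.
for batch_size in [1, 2, 4, 8, 16]:
    compiled_model_dir = 'resnet50_neuron_batch' + str(batch_size)
    status = 'found' if os.path.isdir(compiled_model_dir) else 'MISSING'
    print('{}: {}'.format(compiled_model_dir, status))
```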


Step 3. Create a Python script for inference

Create a Python script with the following content. The script loads each of the Neuron-compiled models from Step 2, one per batch size, and measures inference throughput.

import os
import time
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from concurrent import futures

# added for utilizing 4 neuron cores
os.environ['NEURONCORE_GROUP_SIZES'] = '4x1'

# measure the performance per batch size
for batch_size in [1, 2, 4, 8, 16]:
    USER_BATCH_SIZE = batch_size
    print("batch_size: {}, USER_BATCH_SIZE: {}".format(batch_size, USER_BATCH_SIZE))

    # Load model
    compiled_model_dir = 'resnet50_neuron_batch' + str(batch_size)
    predictor_inferentia = tf.contrib.predictor.from_saved_model(compiled_model_dir)

    # Create input from image
    img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
    img_arr = image.img_to_array(img_sgl)
    img_arr2 = np.expand_dims(img_arr, axis=0)
    img_arr3 = preprocess_input(np.repeat(img_arr2, USER_BATCH_SIZE, axis=0))

    model_feed_dict={'input': img_arr3}

    # Warm up
    infa_rslts = predictor_inferentia(model_feed_dict)

    num_loops = 1000
    num_inferences = num_loops * USER_BATCH_SIZE

    # Run inference on Neuron cores, display results
    start = time.time()
    with futures.ThreadPoolExecutor(8) as exe:
        fut_list = []
        for _ in range(num_loops):
            fut = exe.submit(predictor_inferentia, model_feed_dict)
            fut_list.append(fut)
        for fut in fut_list:
            infa_rslts = fut.result()
    elapsed_time = time.time() - start

    print('By Neuron Core - num_inferences:{:>6}[images], elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(num_inferences, elapsed_time, num_inferences / elapsed_time))

Step 4. Run inference script

Run the inference script.


You will see a result similar to the following. As the batch size increases, the inference throughput on the Neuron cores improves significantly, while the inference throughput on the CPU (measured in an earlier section) remains roughly constant.

batch_size: 1, USER_BATCH_SIZE: 1
By Neuron Core - num_inferences:  1000[images], elapsed_time:  1.20[sec], Throughput:  836.67[images/sec]
batch_size: 2, USER_BATCH_SIZE: 2
By Neuron Core - num_inferences:  2000[images], elapsed_time:  1.54[sec], Throughput: 1299.02[images/sec]
batch_size: 4, USER_BATCH_SIZE: 4
By Neuron Core - num_inferences:  4000[images], elapsed_time:  2.69[sec], Throughput: 1485.90[images/sec]
batch_size: 8, USER_BATCH_SIZE: 8
By Neuron Core - num_inferences:  8000[images], elapsed_time:  5.17[sec], Throughput: 1547.62[images/sec]
batch_size: 16, USER_BATCH_SIZE: 16
By Neuron Core - num_inferences: 16000[images], elapsed_time: 10.66[sec], Throughput: 1500.63[images/sec]
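Throughput in the output is simply num_inferences / elapsed_time. Recomputing it from the (rounded) elapsed times above also yields the per-batch latency, which makes the latency/throughput trade-off explicit; small differences from the printed figures are due to rounding:

```python
# batch_size -> elapsed_time [sec] for 1000 loops, taken from the sample output
results = {1: 1.20, 2: 1.54, 4: 2.69, 8: 5.17, 16: 10.66}
num_loops = 1000

for batch_size, elapsed in results.items():
    throughput = num_loops * batch_size / elapsed   # images/sec
    latency_ms = elapsed / num_loops * 1000.0       # ms per batch
    print('batch {:>2}: {:7.1f} images/sec, {:5.2f} ms/batch'.format(
        batch_size, throughput, latency_ms))
```

Note that throughput roughly doubles going from batch 1 to batch 2 but flattens out beyond batch 8, while per-batch latency keeps growing.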

In the output, batch_size indicates the batch size specified at compile time, and USER_BATCH_SIZE indicates the batch size used at runtime. To maximize throughput, specify a USER_BATCH_SIZE that is a multiple of batch_size. Refer to the Neuron batching documentation for more information on batch size.
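A rough sketch of why a runtime batch that is a multiple of the compile-time batch size works well: with dynamic_batch_size=True, the input can be divided into full compile-time-sized chunks, so every pass through the model runs at exactly the batch size it was compiled for (the sizes below are illustrative):

```python
import numpy as np

compiled_batch = 4    # batch size baked in at compile time
user_batch = 16       # runtime batch; a multiple of compiled_batch

inputs = np.zeros((user_batch, 224, 224, 3), dtype=np.float32)

# A multiple of the compile-time batch splits into full chunks with no padding,
# so no hardware cycles are wasted on partially filled batches.
chunks = np.split(inputs, user_batch // compiled_batch, axis=0)
print(len(chunks), chunks[0].shape)   # 4 chunks of shape (4, 224, 224, 3)
```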