OpenPose Inf1


For this walkthrough, you need an AWS account with access to the AWS Management Console and the ability to create Amazon Elastic Compute Cloud (Amazon EC2) instances with public-facing IP and Amazon Simple Storage Service (Amazon S3) buckets.

Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.

Throughout this workshop, you will interact with this instance and the model using a Jupyter Lab server. For more information about provisioning a Jupyter Lab on Amazon EC2, see Set Up a Jupyter Notebook Server.

Compiling a TensorFlow OpenPose model with the Neuron SDK

You can start the compilation process by setting up an EC2 instance in AWS for compiling the model. We recommend a z1d.xlarge, due to its good single-core performance and memory size. Use the AWS Deep Learning AMI (Ubuntu 18.04) Version 29.0—ami-043f9aeaf108ebc37—in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia.

For more information about running AWS Deep Learning AMIs on EC2 instances, see Launching and Configuring a DLAMI.

When you can connect to the instance through SSH, you activate the aws_neuron_tensorflow_p36 Conda environment and update the Neuron Compiler to the latest release. The compilation script depends on requirements listed in the file requirements-compile.txt. For compilation scripts and requirements files, see the GitHub repo. Download and install them in the environment with the following code:

source activate aws_neuron_tensorflow_p36
pip install neuron-cc --upgrade --extra-index-url=
git clone /tmp/aws-neuron-sdk && cp /tmp/aws-neuron-sdk/src/examples/tensorflow/<name_of_the_new_folder>/* . && rm -rf /tmp/aws-neuron-sdk/
pip install -r requirements-compile.txt

You can then start working on the compilation process. You compile the tf-pose-estimation network frozen graph, available on the GitHub repo. You can adapt the original download script to a single-line wget command:

wget -c --tries=2 $( wget -q -O - | grep -o 'http*://download[^"]*' | tail -n 1 ) -O graph_opt.pb

When the download is complete, run the script to compile it for the AWS Inferentia chip. Because Neuron is an ahead-of-time (AOT) compiler, you need to define a specific image size prior to compilation. You can adjust the network input image resolution with the argument --net_resolution (for example, net_resolution=656x368).

The compiled model can accept arbitrary batch size inputs at inference runtime. This property enables benchmarking large-scale deployments of the model; however, the pipeline available for image and video process in the tf-pose-estimation repo utilizes batch size 1.

To start the compilation process, enter the following code:

python graph_opt.pb graph_opt_neuron_656x368.pb

The compilation process can take up to 20 minutes to complete. During this time, the compiler optimizes the TensorFlow graph operations and provides the AWS Inferentia version of the saved model. During the process you can expect detailed logs such as the following:

2020-07-15 21:44:43.008627: I bazel-out/k8-opt/bin/tensorflow/neuron/convert/] There are 11 ops of 7 different types in the graph that are not compiled by neuron-cc: Const, NoOp, Placeholder, RealDiv, Sub, Cast, Transpose, (For more information see
INFO:tensorflow:fusing subgraph neuron_op_ed41d2deb8c54255 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 474
INFO:tensorflow:Number of operations after tf.neuron optimizations: 474
INFO:tensorflow:Number of operations placed on Neuron runtime: 465

Before you can measure the performance of the compiled model, you need to switch to an EC2 Inf1 instance, powered by the AWS Inferentia chip. To share the compiled model between the two instances, create an S3 bucket with the following code:

aws s3 mb s3://<MY_BUCKET_NAME>
aws s3 cp graph_opt_neuron_656x368.pb s3://<MY_BUCKET_NAME>/graph_model.pb

Setting up the Conda Environment for tf-pose

When you can log in to the Jupyter Lab server, you can clone the GitHub repo containing the TensorFlow version of OpenPose.

On the Jupyter Launcher page, under Other, choose Terminal. In the terminal, activate the aws_neuron_tensorflow_p36 environment, which contains the Neuron SDK. Activating the environment and cloning are done with the following code:

conda activate aws_neuron_tensorflow_p36
git clone
cd tf-pose-estimation

When the cloning is complete, we recommend following the package install instructions to install the repo. From the same terminal screen, you customize the environment by installing opencv-python and dependencies listed on the requirements.txt of the GitHub repo.

You run two pip commands: the first takes care of opencv-python and the second completes the installation of the requirements.txt:

pip install opencv-python 
pip install -r requirements.txt

You’re now ready to build the notebooks.

On the repo’s root directory, create a new Jupyter notebook by choosing Notebook, Environment (conda_aws_neuron_tensorflow_p36). On the first cell of the notebook, import the library as defined in the script, which is the reference pipeline for image processing. In the following cell, create a logger to record the benchmarking. See the following code:

import argparse
import logging
import sys
import time

from tf_pose import common
import cv2
import numpy as np
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh
logger = logging.getLogger('TfPoseEstimatorRun')
ch = logging.StreamHandler()
formatter = logging.Formatter('[%(asctime)s] [%(name)s] [%(levelname)s] %(message)s')

Define the main inferencing function main() and a helper plotter function plotter(). These functions directly replicate the OpenPose inference pipeline from One simple modification is the addition of a repeats argument, which allows you to run many inference steps in sequence and improve the measure of the average model throughput (measured in seconds per image):

def main(argString='--image ./images/contortion1.jpg --model cmu', repeats=10):
    parser = argparse.ArgumentParser(description='tf-pose-estimation run')
    parser.add_argument('--image', type=str, default='./images/apink2.jpg')
    parser.add_argument('--model', type=str, default='cmu',
                        help='cmu / mobilenet_thin / mobilenet_v2_large / mobilenet_v2_small')
    parser.add_argument('--resize', type=str, default='0x0',
                        help='if provided, resize images before they are processed. '
                             'default=0x0, Recommends : 432x368 or 656x368 or 1312x736 ')
    parser.add_argument('--resize-out-ratio', type=float, default=2.0,
                        help='if provided, resize heatmaps before they are post-processed. default=1.0')

    args = parser.parse_args(argString.split())

    w, h = model_wh(args.resize)
    if w == 0 or h == 0:
        e = TfPoseEstimator(get_graph_path(args.model), target_size=(432, 368))
        e = TfPoseEstimator(get_graph_path(args.model), target_size=(w, h))

    # estimate human poses from a single image !
    image = common.read_imgfile(args.image, None, None)
    if image is None:
        logger.error('Image can not be read, path=%s' % args.image)

    t = time.time()
    for _ in range(repeats):
        humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=args.resize_out_ratio)
    elapsed = time.time() - t'%d times inference on image: %s at %.4f seconds/image.' % (repeats, args.image, elapsed/repeats))

    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    return image, e
def plotter(image):
        import matplotlib.pyplot as plt

        fig = plt.figure(figsize=(12,12))
        a = fig.add_subplot(1, 1, 1)
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    except Exception as e:
        logger.warning('matplitlib error, %s' % e)
        cv2.imshow('result', image)

Additionally, you can modify the same code structure for inferencing on videos or batches of images, based on the or, if you’re feeling adventurous!

The main() function takes as input the same string of arguments as described in the Test Inference section of the GitHub repo. To test the notebook implementation, you use a reference set of arguments (make sure to download the cmu model using the original download script):

img, e = main('--model cmu --resize 656x368 --image=./images/ski.jpg --resize-out-ratio 2.0')

The logs show your first multi-person pose analyzed:

‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 1.5624 seconds/image.’
This results in lower than one frame per second (FPS) throughput, which is not a great performance. In this use case, you’re running a TensorFlow graph, --model cmu, without a GPU. The performance of such a model isn’t optimal on CPU. If you repeat the setup and run the environment on a g4dn.xlarge instance, with one NVIDIA T4 GPU, the result is quite different:

‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 0.1708 seconds/image’ 
The result is 5.85 FPS, which is much better.

Using the Neuron compiled CMU model

So far, you’ve used model artifacts that came with the repo. Instead of using the original download script to retrieve the CMU model, copy the Neuron compiled model into ./models/graph/cmu/graph_model.pb and rerun the test:

aws s3 cp s3://<MY_BUCKET_NAME>/graph_opt.pb ./models/graph/cmu/graph_model.pb

Make sure to restart the Python kernel on the notebook if you previously ran a test of the non-Neuron compiled model. Restarting the kernel helps make sure all TensorFlow sessions are closed and get a fresh start for the benchmark. Running the same notebook again results in the following log entry:

‘[TfPoseEstimatorRun] [INFO] 10 times inference on image: ./images/ski.jpg at 0.1709 seconds/image.’

The results show the same frame rate as compared to the g4dn.xlarge instance, in an environment that costs approximately 30% less on demand. Despite the cost benefit from moving the workload to an AWS Inferentia-based instance, this throughput doesn’t convey the observed large performance gains of other reported results.

We decided to profile our version of the compiled graph and look for opportunities to fine-tune the end-to-end inference performance of the OpenPose pipeline. The integration of Neuron with TensorFlow gives access to native profiling libraries. To profile the Neuron compiled graph, we instrumented the TensorFlow session run command on the estimator method using the TensorFlow Python profiler:

from tensorflow.core.protobuf import config_pb2
from tensorflow.python.profiler import model_analyzer, option_builder

run_options = config_pb2.RunOptions(trace_level=config_pb2.RunOptions.FULL_TRACE)
run_metadata = config_pb2.RunMetadata()

peaks, heatMat_up, pafMat_up =
    [self.tensor_peaks, self.tensor_heatMat_up, self.tensor_pafMat_up], feed_dict={
        self.tensor_image: [img], self.upsample_size: upsample_size
    options=run_options, run_metadata=run_metadata

options = option_builder.ProfileOptionBuilder.time_and_memory()
model_analyzer.profile(self.persistent_sess.graph, run_metadata, op_log=None, cmd='scope', options=options)

The model_analyzer.profile method prints on StdErr the time and memory consumption of each operation on the TensorFlow graph. With the original code, the Neuron operation and a smoothing operation dominated the total graph runtime. The following output from the StdErr log shows that the total graph runtime took 108.02 milliseconds, of which the smoothing operation took 43.07 milliseconds:

node name | requested bytes | total execution time | accelerator execution time | cpu execution time
_TFProfRoot (--/16.86MB, --/108.02ms, --/0us, --/108.02ms)
   TfPoseEstimator/conv5_2_CPM_L1/weights/neuron_op_ed41d2deb8c54255 (430.01KB/430.01KB, 58.42ms/58.42ms, 0us/0us, 58.42ms/58.42ms)
smoothing (0B/2.89MB, 0us/43.07ms, 0us/0us, 0us/43.07ms)
   smoothing/depthwise (2.85MB/2.85MB, 43.05ms/43.05ms, 0us/0us, 43.05ms/43.05ms)
   smoothing/gauss_weight (47.50KB/47.50KB, 18us/18us, 0us/0us, 18us/18us)

The smoothing method provides a gaussian blur of the confidence maps calculated by OpenPose. By optimizing this operation, we can extract even more performance out of our end-to-end pose estimation. We modified the filter argument of the smoother on the script from 25 to 5. This new configuration took down the total runtime to 67.44 milliseconds, of which the smoother now only takes 2.37ms—a 37% reduction! You can also optimize your version of the end-to-end pipeline by changing the same parameters and reinstalling the tf-pose-estimation repo from your local copy.

Safely shutting down and cleaning up

On the Amazon EC2 console, choose the compilation and inference instances, and choose Terminate from the Actions drop-down menu. You persisted the compiled model in your s3://<MY_BUCKET_NAME> so it can be reused later. If you’ve made changes to the code inside the instances, remember to persist those as well by transferring it back to the S3. The instance termination discards data stored only in the instance’s home volume.