Compile and Deploy a PyTorch model on Inf1 using SageMaker Neo

Amazon SageMaker now supports Inf1 instances for high-performance, cost-effective inference. Inf1 instances are ideal for large-scale machine learning inference applications like image recognition, speech recognition, natural language processing, personalization, and fraud detection. In this example, we import a pre-trained model from TorchVision, compile it with SageMaker Neo, and deploy it on Inf1 instances behind a SageMaker endpoint.

Create a new Notebook

In SageMaker Studio, press the ‘+’ button to open a new Launcher.

Under Notebooks and compute resources, click on Notebook - Python 3. Be sure to select the PyTorch 1.6 Python 3.6 (optimized for CPU) image.

Set up the environment

We first need to make sure ipywidgets is installed and restart the kernel.

import IPython
!pip install ipywidgets
IPython.Application.instance().kernel.do_shutdown(True)  # restart the kernel so the install takes effect

PyTorch model compilation to Inf1 requires SageMaker version 2.11.0 or later.

!pip install -qU "sagemaker==2.11.0"
import sagemaker
import boto3
import time
from sagemaker.utils import name_from_base

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name
bucket = sess.default_bucket()
sm_client = boto3.client('sagemaker')

Download the model script

!curl -sS --output
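The script provides the model-serving code that will run on the endpoint. Its exact contents are not shown here, but a typical PyTorch inference script for this model decodes the request image and normalizes it with the standard ImageNet statistics before running the traced model. A minimal sketch of that preprocessing step (the helper name is ours, not taken from the script):

```python
import numpy as np

# Standard ImageNet channel statistics used by TorchVision's pre-trained models
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(img_hwc_uint8):
    """Convert an HxWx3 uint8 image into a 1x3xHxW float32 batch."""
    x = img_hwc_uint8.astype(np.float32) / 255.0   # scale pixels to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD         # per-channel normalization
    x = np.transpose(x, (2, 0, 1))                 # HWC -> CHW layout for PyTorch
    return x[np.newaxis, ...]                      # add the batch dimension

# Example with a dummy 224x224 image
batch = to_model_input(np.zeros((224, 224, 3), dtype=np.uint8))
print(batch.shape)  # (1, 3, 224, 224)
```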

Import ResNet18 from TorchVision

We’re importing a pre-trained ResNet18 model from TorchVision, tracing it with TorchScript, and creating a tar.gz archive from the saved model.

import torch
import torchvision.models as models
import tarfile

resnet18 = models.resnet18(pretrained=True)
input_shape = [1, 3, 224, 224]
trace = torch.jit.trace(resnet18.float().eval(), torch.zeros(input_shape).float())
trace.save('model.pth')

with tarfile.open('model.tar.gz', 'w:gz') as f:
    f.add('model.pth')

Upload the model to S3

We start by uploading the model archive to S3; this S3 artifact will then be passed to the Neo compilation API.

compilation_job_name = name_from_base('ResNet18-PreTrained-Neo-Inf1')
model_key = '{}/model/model.tar.gz'.format(compilation_job_name)
model_path = 's3://{}/{}'.format(bucket, model_key)
boto3.resource('s3').Bucket(bucket).upload_file('model.tar.gz', model_key)
print("Uploaded model to S3:")
print(model_path)

Create the PyTorch model

At the moment, SageMaker Neo only compiles PyTorch 1.5.1 models for Inferentia.

from sagemaker.pytorch.model import PyTorchModel

# Note: the entry_point file name is an assumption -- it should match the
# inference script downloaded earlier.
pytorch_model = PyTorchModel(model_data=model_path,
                             role=role,
                             entry_point='resnet18.py',
                             framework_version='1.5.1',
                             py_version='py3',
                             sagemaker_session=sess)

Compile the model

After creating the PyTorch model, we compile it using Amazon SageMaker Neo to optimize performance for our desired deployment target. To compile the model for Inf1 instances, we call the compile() method with 'ml_inf1' as the target instance family. input_shape defines the model’s input tensor, and output_path is the S3 location where the compiled model will be stored.

The compilation will take about two minutes.

compiled_model_path = 's3://{}/{}/output'.format(bucket, compilation_job_name)
print("Output path for compiled model:")
print(compiled_model_path)

# The input name and shape must match what the model was traced with
neo_model = pytorch_model.compile(target_instance_family='ml_inf1',
                                  input_shape={'input0': [1, 3, 224, 224]},
                                  output_path=compiled_model_path,
                                  framework='pytorch',
                                  framework_version='1.5.1',
                                  role=role,
                                  job_name=compilation_job_name)

Important: if the following command results in a permission error, scroll up and locate the value of the execution role returned by get_execution_role(). The role must have access to the S3 bucket specified in output_path.
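Optionally, the `sm_client` created during setup can be used to inspect compilation jobs after the fact. This is just a sketch: the placeholder job name must be replaced with the actual name, which SageMaker prints in the compile() output.

```python
import boto3

sm_client = boto3.client('sagemaker')

# List the most recent compilation jobs and their statuses
jobs = sm_client.list_compilation_jobs(SortBy='CreationTime',
                                       SortOrder='Descending',
                                       MaxResults=5)
for job in jobs['CompilationJobSummaries']:
    print(job['CompilationJobName'], job['CompilationJobStatus'])

# Describe one job in detail ('your-compilation-job-name' is a placeholder)
desc = sm_client.describe_compilation_job(CompilationJobName='your-compilation-job-name')
print(desc['CompilationJobStatus'])
print(desc['ModelArtifacts']['S3ModelArtifacts'])
```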

Deploy the compiled model on a SageMaker endpoint

Now that we have the compiled model, we will deploy it on an Amazon SageMaker endpoint. Inf1 instances in Amazon SageMaker are available in four sizes: ml.inf1.xlarge, ml.inf1.2xlarge, ml.inf1.6xlarge, and ml.inf1.24xlarge. In this example, we are using 'ml.inf1.xlarge'.

The deployment takes about 6-7 minutes. SageMaker takes care of all of the provisioning, scaling, patching, high availability, etc. associated with hosting the model.

predictor = neo_model.deploy(instance_type='ml.inf1.xlarge', initial_instance_count=1)

Invoking the endpoint

Once the endpoint is ready, you can send requests to it and receive inference results in real-time with low latency.

Let’s first get a cat picture and display it.

!curl -o cat.jpg
from IPython.display import Image
Image('cat.jpg')  # display the downloaded image

And invoke the endpoint

import json
import numpy as np

sm_runtime = boto3.Session().client('sagemaker-runtime')

with open('cat.jpg', 'rb') as f:
    payload = f.read()

# ContentType is an assumption: the inference script must accept raw image bytes
response = sm_runtime.invoke_endpoint(EndpointName=predictor.endpoint_name,
                                      ContentType='application/x-image',
                                      Body=payload)
result = json.loads(response['Body'].read().decode())
print('Most likely class: {}'.format(np.argmax(result)))

To check what this class label is, let’s get the ImageNet labels and convert the class id to text.

!curl -o imagenet1000_clsidx_to_labels.txt
object_categories = {}
with open("imagenet1000_clsidx_to_labels.txt", "r") as f:
    for line in f:
        key, val = line.strip().split(':')
        object_categories[key] = val
print("Result: label - " + object_categories[str(np.argmax(result))]+ " probability - " + str(np.amax(result)))

The label should be tiger cat.
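Note that `result` holds the model’s raw logits, so the value printed as "probability" above is really an unnormalized score. To obtain a true probability, you can apply a softmax to the logits first — a sketch, not part of the original notebook:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Example with made-up logits for three classes
probs = softmax([2.0, 1.0, 0.1])
print(probs.round(3))  # sums to 1.0; the largest logit gets the largest probability
```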

Congratulations, you just deployed a ResNet18 model on an Inferentia-powered SageMaker endpoint!

Delete the Endpoint

A running endpoint incurs costs. Therefore, as a final clean-up step, we should delete the endpoint.
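A minimal clean-up cell, assuming the `predictor` object returned by the deploy step above is still in scope:

```python
# Delete the endpoint to stop incurring charges.
# `predictor` is the object returned by neo_model.deploy() above.
predictor.delete_endpoint()
```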