Introduction
Amazon Elastic Inference lets you attach the right amount of GPU-powered inference acceleration to any Amazon EC2 or Amazon SageMaker instance type. You can select the instance type that best suits your application's overall compute, memory, and storage requirements, and then add just the inference acceleration you need. Amazon Elastic Inference supports TensorFlow, Apache MXNet, PyTorch, and ONNX models.
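For example, when hosting a model on Amazon SageMaker, an Elastic Inference accelerator can be attached at deployment time. The sketch below uses the SageMaker Python SDK; the model artifact location, IAM role, framework version, and accelerator size are placeholder assumptions rather than recommendations.

```python
# Minimal sketch: attaching an Elastic Inference accelerator to a SageMaker
# endpoint. The S3 path, IAM role, and accelerator size are placeholders.
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data="s3://my-bucket/models/model.tar.gz",          # hypothetical model artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",      # hypothetical execution role
    framework_version="2.3",
)

# accelerator_type attaches an Elastic Inference accelerator to the hosting
# instance, so a CPU instance (ml.m5.xlarge) runs the application while the
# accelerator supplies the GPU-powered inference compute.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    accelerator_type="ml.eia2.medium",
)
```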
Inference is the process of producing predictions with a trained model. In deep learning applications, inference accounts for up to 90% of overall operational costs, for two reasons. First, standalone GPU instances are designed for model training rather than inference. Inference jobs typically process a single input in real time and use only a small amount of GPU compute, which makes standalone GPU inference inefficient. Standalone CPU instances, on the other hand, are not specialized for matrix operations and are usually too slow for deep learning inference.
Second, different models need different amounts of CPU, GPU, and memory. Optimizing for one of these resources typically leaves the others under-utilized, which drives up costs.
Benefits
Reduce inference costs by up to 75%
With Amazon Elastic Inference, you choose the instance type that best fits your application's overall compute and memory requirements, and then independently specify the amount of inference acceleration you need. Because you no longer need to over-provision GPU compute for inference, you can cut inference costs by up to 75%.
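A rough sketch of this pairing at launch time, using boto3: the AMI, instance type, and accelerator size below are placeholder assumptions, not recommendations, and in practice the instance also needs network access to the Elastic Inference service endpoint.

```python
# Minimal sketch: launching an EC2 instance with a separately sized
# Elastic Inference accelerator attached (all identifiers are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical Deep Learning AMI
    InstanceType="c5.xlarge",          # CPU instance sized for the application
    MinCount=1,
    MaxCount=1,
    # The accelerator is chosen independently of the instance type,
    # so GPU compute for inference is not over-provisioned.
    ElasticInferenceAccelerators=[{"Type": "eia2.medium", "Count": 1}],
)
print(response["Instances"][0]["InstanceId"])
```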
Get exactly what you need
With Amazon Elastic Inference, inference acceleration can be as little as one single-precision TFLOPS (trillion floating-point operations per second) or as much as 32 mixed-precision TFLOPS. This is a far more appropriate range of inference compute than the up to 1,000 TFLOPS delivered by a standalone Amazon EC2 P3 instance. A simple language processing model, for example, may need only one TFLOPS to run inference efficiently, whereas a sophisticated computer vision model may need up to 32 TFLOPS.
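If it helps to see what sizes are offered in a given region, the Elastic Inference API can list the available accelerator types along with their memory and throughput. The sketch below assumes the boto3 "elastic-inference" client, configured credentials, and a placeholder region.

```python
# Minimal sketch: querying the Elastic Inference API for the accelerator
# types available in a region, to compare their memory and throughput.
import boto3

ei = boto3.client("elastic-inference", region_name="us-east-1")

# DescribeAcceleratorTypes returns each accelerator type together with its
# memory size and throughput information.
response = ei.describe_accelerator_types()
for accelerator_type in response.get("acceleratorTypes", []):
    print(accelerator_type)
```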
Respond to changes in demand
Using Amazon EC2 Auto Scaling groups, you can quickly scale the amount of inference acceleration up and down to meet your application's demands without over-provisioning capacity. When EC2 Auto Scaling increases the number of EC2 instances, it also scales up the accelerator attached to each instance; when it scales your EC2 instances down as demand drops, it scales down the attached accelerators as well. This way, you pay only for what you need, when you need it.
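One hedged sketch of this pattern, assuming boto3 and placeholder resource identifiers: declaring the accelerator in an EC2 launch template means every instance the Auto Scaling group launches or terminates brings its accelerator along with it.

```python
# Minimal sketch: an EC2 launch template that includes an Elastic Inference
# accelerator, referenced by an Auto Scaling group so accelerators scale
# with the instances (AMI, subnet, and names are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
asg = boto3.client("autoscaling", region_name="us-east-1")

ec2.create_launch_template(
    LaunchTemplateName="inference-with-eia",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",   # hypothetical Deep Learning AMI
        "InstanceType": "c5.xlarge",
        "ElasticInferenceAccelerators": [{"Type": "eia2.medium", "Count": 1}],
    },
)

asg.create_auto_scaling_group(
    AutoScalingGroupName="inference-fleet",
    LaunchTemplate={"LaunchTemplateName": "inference-with-eia", "Version": "$Latest"},
    MinSize=1,
    MaxSize=4,
    VPCZoneIdentifier="subnet-0123456789abcdef0",  # hypothetical subnet
)
```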
Some of the other benefits are:
- As little as 1 TFLOPS of single-precision inference acceleration.
- As much as 32 TFLOPS of mixed-precision inference acceleration.
- Scale inference acceleration up and down using Auto Scaling groups integrated with Amazon SageMaker and Amazon EC2.
- Support for TensorFlow, Apache MXNet, and PyTorch.
- Support for the Open Neural Network Exchange (ONNX) format.
- Single-precision and mixed-precision operations are available.