Amazon Web Services (AWS), Amazon’s cloud services division, today announced the general availability of Elastic Compute Cloud (EC2) DL1 instances. While new instance types generally aren’t particularly novel, DL1 (specifically DL1.24xlarge) is the first type in AWS designed for training machine learning models, Amazon says — powered by Gaudi accelerators from Intel-owned Habana Labs.

Developers including Seagate, Fractal, Indel, Riskfuel, and Leidos were given early access to Gaudi running on AWS prior to today’s launch. “This is the first AI training instance by AWS that is not based on GPUs,” Habana wrote in a blog post. “The primary motivation to create this new training instance class was presented by Andy Jassy in 2020 re:Invent: ‘To provide our end-customers with up to 40% better price-performance than the current generation of GPU-based instances.’”

Cheaper model training

Machine learning is becoming mainstream as enterprises realize the business impact of deploying AI models in their organizations. Using machine learning generally starts with training a model to recognize patterns by learning from datasets, then applying the model to new data to make predictions. Maintaining the prediction accuracy of a model requires retraining the model frequently, which takes a considerable amount of resources — resulting in increased expenses. Google subsidiary DeepMind is estimated to have spent $35 million training a system to learn the Chinese board game Go.

With DL1 (AWS’ first answer to Google’s tensor processing units, or TPUs, a set of custom accelerator chips running in Google Cloud Platform), Amazon and Habana claim that AWS customers can now train models faster and with up to 40% better price-performance when compared to the latest GPU-powered EC2 instances. The DL1 instances leverage up to eight Gaudi accelerators built specifically to speed up training, paired with 256GB of high-bandwidth memory, 768GB of system memory, second-generation Amazon custom Intel Xeon Scalable (Cascade Lake) processors, 400 Gbps of networking throughput, and up to 4TB of local NVMe storage.

Figure 1: Habana's new training chip was designed for high performance AI training at significant scale.

Gaudi features one of the industry’s first on-die implementations of Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) on an AI chip. This provides ten 100Gbps or twenty 50Gbps communication links, enabling it to scale up to as many as “thousands” of discrete accelerator cards. When coming from a GPU- or CPU-based instance, customers have to use Habana’s SynapseAI SDK to migrate existing algorithms due to architectural differences. Habana alternatively provides pre-trained models for image classification, object detection, natural language processing, and recommendation systems in its GitHub repository.

“The use of machine learning has skyrocketed. One of the challenges with training machine learning models, however, is that it is computationally intensive and can get expensive as customers refine and retrain their models,” AWS EC2 vice president David Brown said in a statement. “AWS already has the broadest choice of powerful compute for any machine learning project or application. The addition of DL1 instances featuring Gaudi accelerators provides the most cost-effective alternative to GPU-based instances in the cloud to date. Their optimal combination of price and performance makes it possible for customers to reduce the cost to train, train more models, and innovate faster.”

Sizing up the competition

In the June 2021 results from MLPerf Training, an industry benchmark for AI training hardware, an eight-Gaudi system took 62.55 minutes to train a variant of the popular computer vision model ResNet and 164.37 minutes to train the natural language model BERT. Direct comparisons to the latest generation of Google’s TPUs are hard to come by, but 4,096 fourth-gen TPUs (TPUv4) can train a ResNet model in about 1.82 minutes and 256 TPUv4 chips can train a BERT model in 1.82 minutes, MLPerf Training shows.

Beyond ostensible performance advantages, DL1 delivers cost savings, or so assert Amazon and Habana. Compared with three GPU-based instances, namely p4d.24xlarge (which features eight Nvidia A100 40GB GPUs), p3dn.24xlarge (eight Nvidia V100 32GB GPUs), and p3.16xlarge (eight V100 16GB GPUs), DL1 delivers an on-demand hourly rate of $13.11 when training a ResNet model. That compares with hourly rates ranging from $24.48 for p3 to $32.77 for p4d.

Eight A100 40GB GPUs can process more images per second during training (18,251) than an eight-Gaudi system (12,987). But Habana is emphasizing the efficiency of its chips over their raw throughput.

“Based on Habana’s testing of the various EC2 instances and the pricing published by Amazon, we find that relative to the p4d instance, the DL1 provides 44% cost savings in training ResNet-50. For p3dn end-users, the cost-saving to train ResNet-50 is 69%,” Habana wrote. “While … Gaudi does not pack as many transistors as the 7-nanometer … A100 GPU, Gaudi’s architecture — designed from the ground-up for efficiency — achieves higher utilization of resources and comprises fewer system components than the GPU architecture. As a result, lower system costs ultimately enable lower pricing to end-users.”
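The 44% figure can be roughly reproduced from the numbers quoted above. The following is a minimal sketch, assuming the on-demand hourly rates and ResNet-50 throughput figures cited in this article hold for the whole training run; the function and variable names are illustrative and do not come from Habana or AWS.

```python
# Figures quoted in the article for ResNet-50 training (assumed constant over a run).
DL1_HOURLY_USD = 13.11        # dl1.24xlarge on-demand rate
P4D_HOURLY_USD = 32.77        # p4d.24xlarge on-demand rate
DL1_IMAGES_PER_SEC = 12_987   # eight Gaudi accelerators
P4D_IMAGES_PER_SEC = 18_251   # eight A100 40GB GPUs


def cost_per_million_images(hourly_usd: float, images_per_sec: float) -> float:
    """Return the dollar cost of processing one million training images."""
    images_per_hour = images_per_sec * 3600
    return hourly_usd / images_per_hour * 1_000_000


dl1_cost = cost_per_million_images(DL1_HOURLY_USD, DL1_IMAGES_PER_SEC)
p4d_cost = cost_per_million_images(P4D_HOURLY_USD, P4D_IMAGES_PER_SEC)

# Relative saving of DL1 versus p4d for the same amount of training work.
savings = 1 - dl1_cost / p4d_cost

print(f"DL1: ${dl1_cost:.3f} per million images")
print(f"p4d: ${p4d_cost:.3f} per million images")
print(f"DL1 saving vs. p4d: {savings:.0%}")
```

Under these assumptions the saving comes out at roughly 44%, in line with Habana's claim, even though the p4d instance has the higher raw throughput. The article does not give the throughput Habana measured for p3dn, so the 69% figure cannot be checked the same way.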

Future developments

When Intel acquired Habana for roughly $2 billion in December 2019, twilighting the AI accelerator hardware developed by its Nervana division, it looked to be a shrewd move on the part of the chip giant. Indeed, at its re:Invent conference last year, Jassy revealed that AWS had invested in Habana’s chips to expedite their time to market.

As an EETimes piece notes, cloud providers have been cautious so far when it comes to investing in third-party chips with new compute architectures for AI acceleration. Some have built their own instead: Baidu offers the Kunlun, while Alibaba developed Hanguang. Chips from startups Graphcore and Groq are available in Microsoft’s Azure cloud and Nimbix, respectively, but are prioritized for customers “pushing the boundaries of machine learning.”

The DL1 instances will sit alongside Amazon’s AWS Trainium hardware, a custom accelerator set to become available to AWS customers this year. As for Habana, the company says it’s working on its next-generation Gaudi2 AI processor, which takes the Gaudi architecture from 16 nanometers to seven nanometers.

DL1 instances are available for purchase as on-demand instances, with savings plans, as reserved instances, or as spot instances. They’re currently available in the US East (N. Virginia) and US West (Oregon) AWS regions.