Amazon SageMaker HyperPod makes it easier to train and fine-tune LLMs

At its re:Invent conference, Amazon’s AWS cloud arm today announced the launch of SageMaker HyperPod, a new purpose-built service for training and fine-tuning large language models. SageMaker HyperPod is generally available now.

Amazon has long bet on SageMaker, its service for building, training and deploying machine learning models, as the backbone of its machine learning strategy. Now, with the advent of generative AI, it is perhaps no surprise that the company is also leaning on SageMaker as the core product for helping its users train and fine-tune large language models (LLMs).

“SageMaker HyperPod gives you the ability to create a distributed cluster with accelerated instances that’s optimized for distributed training,” said Ankur Mehrotra, AWS’ general manager for SageMaker, in an interview ahead of today’s announcement. “It gives you the tools to efficiently distribute models and data across your cluster – and that speeds up your training process.”
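To make that concrete, here is a minimal sketch of the kind of distributed data-parallel training loop such a cluster is built to run. This is generic PyTorch rather than HyperPod-specific code, and the model and dataset are stand-ins:

```python
# Generic distributed data-parallel loop (illustrative; not HyperPod code).
# Typically launched with: torchrun --nproc_per_node=<gpus per node> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train():
    dist.init_process_group(backend="nccl")      # NCCL: standard for multi-GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda()     # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # handles gradient sync
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    dataset = TensorDataset(torch.randn(4096, 512), torch.randn(4096, 512))
    # DistributedSampler shards the data so each worker sees a distinct slice.
    loader = DataLoader(dataset, batch_size=32,
                        sampler=DistributedSampler(dataset))

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()   # DDP all-reduces gradients across workers here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```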

He also noted that SageMaker HyperPod allows users to frequently save checkpoints, letting them pause, analyze and optimize the training process without having to start over. The service also includes a number of fail-safes so that if a GPU goes down for some reason, the entire training process doesn’t fail with it.
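The pattern he describes looks roughly like the sketch below: save training state frequently, then resume from the last checkpoint rather than from scratch. Again, this is plain PyTorch for illustration; the path and save interval are assumptions, not HyperPod defaults:

```python
# Checkpoint-and-resume pattern (illustrative; path and interval are made up).
import os
import torch

CKPT_PATH = "/shared/checkpoints/latest.pt"  # hypothetical shared storage

def save_checkpoint(model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    """Return the step to resume from (0 if no checkpoint exists yet)."""
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

# Inside the training loop:
#   start = load_checkpoint(model, optimizer)
#   for step in range(start, total_steps):
#       ...train one step...
#       if step % 500 == 0:          # checkpoint frequently
#           save_checkpoint(model, optimizer, step)
```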

“For an ML team, for example, that just wants to train the model – for them, it becomes a zero-touch experience and the cluster becomes a self-healing cluster, in a way,” Mehrotra explained. “Altogether, these capabilities help you train foundation models up to 40 percent faster, which, when you consider cost and time to market, is a big differentiator.”

Image Credits: AWS

Users can choose to train on Amazon’s own custom Trainium (and now Trainium 2) chips or on Nvidia-based GPU instances, including those using the H100 processor. The company promises that HyperPod can speed up the training process by up to 40%.
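For a sense of how that choice surfaces in practice, here is a hedged sketch of defining a cluster through boto3’s SageMaker create_cluster call. The cluster name, instance type, count, role ARN and S3 URI are all placeholders, and the AWS documentation remains the authority on the exact request shape:

```python
# Hedged sketch of a HyperPod cluster definition via boto3 (all values are
# placeholders; check the SageMaker CreateCluster docs for the real shape).
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_cluster(
    ClusterName="llm-training-cluster",            # hypothetical name
    InstanceGroups=[{
        "InstanceGroupName": "gpu-workers",
        "InstanceType": "ml.p5.48xlarge",          # H100-based; Trainium
        "InstanceCount": 16,                       # instance types also exist
        "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle/",  # setup scripts
            "OnCreate": "on_create.sh",
        },
    }],
)
```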

The company already has experience using SageMaker to build LLMs. The Falcon 180B model, for example, was trained on SageMaker, using a cluster of thousands of A100 GPUs. Mehrotra noted that AWS drew on that prior experience scaling SageMaker to build HyperPod.

Image Credits: AWS

Perplexity AI co-founder and CEO Aravind Srinivas told me his company got early access to the service during its private beta. He noted that his team was initially skeptical about using AWS for training and fine-tuning its models.

“We had never worked with AWS before,” he said. “There was a myth – it’s a myth, it’s not a fact – that AWS doesn’t have good infrastructure for large-scale model training, and obviously we didn’t have time to do the due diligence, so we believed it.” The team got connected to AWS, however, and the engineers there asked them to give the service a try (for free). He also noted that it was easy to get support from AWS – and access to enough GPUs for Perplexity’s use case. It obviously helped that the team was already familiar with doing inference on AWS.

Srinivas also emphasized that the AWS HyperPod team focused strongly on speeding up the interconnects that link Nvidia’s graphics cards. “They went and optimized the primitives – different Nvidia primitives – that allow you to communicate these gradients and parameters across different nodes,” he explained.
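The primitive he is pointing to is a collective operation such as all-reduce, which sums each gradient tensor across every node so that all workers end up with the same averaged result. The sketch below uses generic torch.distributed/NCCL calls, not AWS’s optimized versions:

```python
# Generic all-reduce of gradients (illustrative, not AWS's implementation).
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce each parameter's gradient, then average by world size."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```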

Read more about AWS re:Invent 2023 on TechCrunch
