Today Qualcomm is announcing that its Cloud AI 100 PCIe card, first debuted in 2019 and expanded in 2023, is coming to the wider embedded market. The card is built on an architecture similar to Qualcomm's smartphone AI chips of the time, and the company has been diligently working on the software side to allow that architecture to run all types of machine learning models, from computer vision to natural language processing, up to and including Llama3.1-70B-class language models.
Several OEMs, including Lenovo and Aetina, will start to offer AI Embedded Appliances – microwave-sized boxes with an AMD or Intel CPU as well as one to four of these cards. These appliances are designed for on-location edge AI needs – imagine one in a local fast-food restaurant that isn’t running a server farm but also isn’t continually communicating up to the cloud.
OEMs/customers can choose between Qualcomm’s three offerings, scaling with compute demand and/or concurrent users. The standard and Pro AI 100 cards differ in total compute (350 TOPS vs 400 TOPS) and memory capacity (16 GB vs 32 GB). The AI 100 Ultra puts four of the 32 GB chips onto a single card, running at a lower, more efficient frequency, for a total of 64 AI cores and 128 GB of memory at 150 W. The Ultra was the updated card announced in 2023.
Why is this important? Well let’s go back to the whirlwind that is Qualcomm’s Cloud AI family.
I want to start by broadly dividing the machine learning hardware market into three or four main segments. We have:
Data center training – NVIDIA B200, H200, AMD MI350X
Data center inference – NVIDIA PCIe, AMD PCIe, Intel Gaudi 3, most startups, AI-as-a-service
Edge inference – PCIe or M.2 but on location in a dedicated appliance
Device inference – AI PC, wrist watch, ultra low power
First announced in 2019, with a deeper dive at the Hot Chips conference in 2021, the Cloud AI 100 hardware was seen as an attempt by Qualcomm to be a player in that second market – one that has become synonymous with lots and lots of smaller chips with high memory capacity, designed to cater to the lumpier (but ultimately more profitable) inference workloads. Several dozen companies fit into that space today, each often focusing on specific customers, workloads, or verticals.
After that 2021 presentation, the AI 100 hardware seemingly dropped off the radar. We didn’t hear anything about it until mid-2023. A lot of questions were asked through official and unofficial channels about its status – was it actually ready at launch? What software did it support? What was the performance? Who was using it? Where can I get one? Official answers were few and far between, partly because the AI 100 team didn’t have a focused marketing message. Most of Qualcomm’s outbound discussion points at the time were in its growth markets – smartphone, automotive, and telco. At points it seemed that no one among the teams the press talked to knew who was in charge of the AI 100 family, let alone could provide any official commentary.
Over time I came to learn that the engineering team behind the Cloud AI 100 hardware had been moved around a couple of times during corporate re-shuffles. Companies often move teams under different banners to realign with corporate needs – between 2019 and today, Qualcomm has cancelled a server chip, bought a CPU startup, changed CEOs, gone through the 5G transition, pivoted its messaging to AI PC and automotive, and made many other major changes. To top it off, the Cloud AI 100 team had ended up as part of the automotive division, despite the hardware not being targeted at automotive at the outset – even once I’d learned of this, it was difficult to tell whether there was a new automotive angle to the hardware. Nonetheless, through 2019-2023 it was unclear whether the product was generating revenue, whether a roadmap was in place, or whether the hardware was even suitable for the new transformer workloads hitting the market – workloads that had killed a number of startups whose architectures couldn’t port as easily.
In 2023, and through 2024, we learned that Qualcomm did have some public engagements with Cloud AI 100, and the new Cloud AI 100 Ultra. We saw the startup Neureality work with customers to provide boxes using Neureality hardware as the host for Cloud AI 100 cards. We also saw Cerebras, as part of its G42 engagement worth over a billion dollars, collaborate with Qualcomm on the inference side of those installations. The AI 100 had also started to appear in several Tier-2 cloud instances.
Partners I’d spoken to highlighted that Cloud AI 100 was efficient, plentiful, and cheap – in specific cases offering a 2x TCO advantage over NVIDIA. The other benefit of a data center inference card from a behemoth like Qualcomm is that the company knows how to manufacture at scale on the leading edge, and isn’t going to disappear any time soon – whereas a startup in this field could run out of money or lack the expertise.
Even with these announcements, Qualcomm itself has remained fairly quiet when it comes to the development, capability, and roadmap of the hardware. We have seen several benchmark submissions of the hardware to MLPerf, the quickly coalescing industry benchmark for ML performance, whereas other competitors in this field haven’t submitted results.
However, that changes this week. As part of today’s announcement we were able to spend time with one of the senior leaders of the Cloud AI 100 team, talk about the new engagements, and get some insight into the future of Cloud AI 100.
The engagement with Lenovo and Aetina is based on downstream customers requesting easy-to-use edge AI hardware in a microwave-sized form factor. Qualcomm supplies the PCIe cards, and the OEMs build appliances around them for their customers – when asked, Qualcomm said it is up to the OEM to decide on the CPU, networking, or any other management features. Essentially this is a very typical component/OEM relationship.
As some of the use cases for this hardware involve language models, especially big models like Llama3.1-70B, we asked about the flexibility of a five-year-old architecture when running transformer models. Qualcomm stated that the smarts here are in the software: they’ve tailored the models to deliver output at user-expected rates. This means, in the worst case, around 15 tokens per second of output, with a couple of seconds of time-to-first-token. Smaller models obviously go faster, and the 32 GB of LPDDR per chip helps several 7B-class models run concurrently. These appliances are designed for lumpy inference workloads – for example in retail, where a model might be used to show new clothes on shoppers as they walk past, or to respond to customer requests. An individual Burger King storefront isn’t taking 1000 requests a second, but 5-15 during peak hours might well happen.
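To put those numbers in context, here is a rough back-of-envelope sketch of how many 7B-class models might fit on one 32 GB card and what that implies for concurrent streams. The bit-widths and the KV-cache overhead factor are my own illustrative assumptions, not Qualcomm figures – the only quoted number reused here is the ~15 tokens/s worst case above:

```python
# Back-of-envelope sizing for concurrent 7B-class models on one 32 GB card.
# Bit-widths and the runtime overhead factor are illustrative assumptions,
# not Qualcomm specifications.

CARD_MEMORY_GB = 32          # LPDDR per AI 100 Pro-class chip
PARAMS_BILLION = 7           # 7B-class model
RUNTIME_OVERHEAD = 1.2       # assume ~20% extra for KV cache / activations

def model_footprint_gb(bits_per_weight: int) -> float:
    """Approximate resident size of weights plus runtime overhead, in GB."""
    weights_gb = PARAMS_BILLION * bits_per_weight / 8  # 1e9 params * bytes/param ~ GB
    return weights_gb * RUNTIME_OVERHEAD

for bits in (16, 8, 4):
    per_model = model_footprint_gb(bits)
    concurrent = int(CARD_MEMORY_GB // per_model)
    print(f"{bits}-bit weights: ~{per_model:.1f} GB per model, "
          f"~{concurrent} concurrent 7B models on a 32 GB card")

# At the quoted worst case of ~15 tokens/s per stream, even a handful of
# concurrent models comfortably covers the 5-15 peak-hour requests described above.
```

The takeaway is that quantization, not raw compute, is what decides how many simultaneous 7B-class streams one of these appliances can hold resident at once.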
On the software side, Qualcomm states that the Cloud AI 100 family already supports thousands of Hugging Face models. The software stack used to drive the hardware is called Qualcomm’s AI Inference Suite, which comes pre-loaded onto the appliances. It should be noted that this is different from Qualcomm’s AI Hub, used for the new Snapdragon X Elite platforms. While both share similarities in architecture, the software stacks differ to cater to the different markets – e.g. quantization is more of a feature on AI Hub than on the Inference Suite. Users can still start with top-level PyTorch/ONNX models but follow different compiler routes for optimization and kernel creation. At some point in the future there may be an effort to unify the two, but not as of yet.
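To make the "start with top-level PyTorch/ONNX models" point concrete, here is a minimal sketch of the generic first step: exporting a Hugging Face model to ONNX. Everything shown is standard PyTorch/transformers; the model name and input shape are arbitrary examples, and the subsequent Cloud AI 100 compile/serve step lives in Qualcomm's own tooling, whose exact commands weren't part of this briefing, so it is only noted in a closing comment:

```python
# Generic PyTorch -> ONNX export; the resulting ONNX file is the kind of
# artifact a vendor compiler stack (such as Qualcomm's) would then consume.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # arbitrary example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
model.config.return_dict = False  # return plain tuples so tracing/export is clean

# Representative input: batch of 1, fixed sequence length of 128.
inputs = tokenizer("The fries were excellent.", return_tensors="pt",
                   padding="max_length", max_length=128)

torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    opset_version=17,
)

# From here, Qualcomm's toolchain would compile model.onnx into a Cloud AI 100
# program and the AI Inference Suite would serve it on the appliance; the exact
# commands for that step weren't covered in the briefing.
```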
With the analysts on our briefing call being technical, the conversation naturally pivoted to the state of the Cloud AI 100 business and hardware, given its origins. The response was positive – there’s a roadmap! New silicon will be on its way at some point, though more details on what that roadmap looks like will be given at a later date. It wasn’t the sort of call on which to ask about the structure of the AI 100 team, but I hope that clarity comes as well.
For anyone interested in trying the hardware, an announcement was made last year that Cloud AI 100 would be made available in the cloud. Users can go to cloudai.cirrascale.com and sign up for Cirrascale’s AI Playground to see it working for free. The playground has a number of different tasks and models; the free tier has a fixed set of inputs, but you can play with some of the output properties, such as temperature. For a few extra $$$ you can customize inputs/outputs and get a broader picture of performance.
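For anyone wondering what that temperature knob actually does, here is a quick generic sketch of how it reshapes next-token probabilities – this is standard sampling math, not anything specific to Cirrascale's playground or the AI 100 stack, and the logit values are made up for illustration:

```python
# Generic illustration of how a temperature setting reshapes next-token
# probabilities; nothing here is specific to the AI 100 or Cirrascale.
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]                # example next-token scores
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))

# Low temperature concentrates probability on the top token (more deterministic);
# high temperature flattens the distribution (more varied output).
```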
Qualcomm’s messaging, it seems, is that if you are interested in developing a platform with the AI 100, you should contact them directly. That’s what the startups have done, that’s what the cloud providers have done, and now the OEMs with these edge appliances. This mostly inbound business strategy isn’t the sort that takes a product to new heights or captures much of the TAM, but it’s what Qualcomm is doing right now. Given the rich engagement we had on the call with the AI 100 senior team, I hope they understand that there could be more excitement if we saw some outbound customer generation as well. Hardware has never been cooler.