Guest Post By George Cozma, Editor-in-Chief of Chips And Cheese
To say that AMD’s story has been a roller coaster would be an understatement - there is a massive contrast between the AMD of 2014 and the AMD of 2024. Where the AMD of a decade ago was floundering, the AMD of today is resurgent and a key player in many markets. As with many other players in this space, AI is a primary focus, with the company building a dedicated internal AI team to cover the full end-to-end strategy for a rapidly growing AI market.
In recent weeks, AMD CEO Lisa Su and Jack Huynh, SVP/GM of Computing and Graphics, both answered questions from industry analysts about the nature of AMD’s AI hardware strategy and how to look at its portfolio.
AMD’s AI hardware strategy comes in three prongs.
The first is AMD’s Instinct series of datacenter GPUs, currently sold as the MI300 series. There are two variants, and the MI300X is focused on AI - it has seen adoption by large cloud players such as Microsoft Azure, along with some smaller AI-centric clouds like TensorWave. In the latest earnings call, Lisa Su commented on expanding demand for these chips, raising the revenue forecast from $2 billion to $3.5 billion by the end of 2024. At launch, AMD compared the MI300X favorably to NVIDIA’s H100, positioning an eight-chip system as equal in ML training but better in ML inference.
The other variant of this series is the MI300A, which offers similar specifications but is a combination CPU/GPU targeted at High Performance Computing. It has been selected for the largest planned supercomputer in the world, El Capitan, which is going to use machine learning models to assist with protecting the United States nuclear stockpile.
Speaking to the adoption of MI300, Lisa stated,
“We have been pleasantly surprised and [it has] been great to see the momentum with MI300, and where that momentum is coming from. Large cloud [customers] often move the fastest - from workload [to workload]. LLMs play very well to MI300 - our memory capacity and memory bandwidth [are market leading]. AI is the leading workload. [We have] quite a broad set of customers that come in with needs - some are training, some are fine tuning, some are mixed. [But our] confidence [comes] from the pattern as we start with customers. [We’ve done] a lot of work with the software environment as well. New customers are [finding it] easier to reach their performance expectations, because ROCm (AMD’s software stack) is getting mature. [Our] largest [MI300] workloads are large language models.”
It should also be noted that AMD recently announced it was expanding its chip-to-chip communication protocol, known as Infinity Fabric, to specific networking partners such as Arista, Broadcom, and Cisco. We expect these companies to build Infinity Fabric switches, enabling chip-to-chip communication for MI300 beyond a single system.
The second prong of AMD’s strategy is their client GPU lineup. This consists of both AMD’s Radeon discrete graphics cards (GPUs) and their APUs, which integrate a GPU onto a client CPU and are mostly used in laptops. Both the first and second prongs of AMD’s AI strategy rely on their compute stack, called ROCm, which is AMD’s competitor to NVIDIA’s CUDA stack. A long-running grumble about ROCm, even in the latest version, is inconsistent support across enterprise and consumer hardware - only AMD’s Instinct GPUs and select discrete GPUs have proper support for ROCm and its associated libraries, whereas CUDA runs on nearly every piece of NVIDIA hardware.
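To make that contrast concrete, here is a minimal sketch of why support breadth matters, assuming a ROCm build of PyTorch (the shapes here are purely illustrative): code written against NVIDIA’s “cuda” device runs unmodified, because ROCm presents itself through the same device string - but only on hardware ROCm actually supports.

```python
# A minimal sketch, assuming a ROCm build of PyTorch: CUDA-style code
# runs unmodified because ROCm presents itself through the "cuda" device.
import torch

# torch.version.hip is a version string on ROCm builds, None on CUDA builds.
backend = "ROCm/HIP" if torch.version.hip else "CUDA"
print(f"Backend: {backend}, GPU available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # lands on the AMD GPU under ROCm
    y = x @ x                                   # dispatched through rocBLAS on ROCm
    print(f"Result checksum: {y.sum().item():.2f}")
```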
However, Jack said in our Q&A,
“We [currently] enable ROCm on our 7900 flagships to allow you to do some AI applications. We are going to expand ROCm more broadly.” “There are schools and universities and startups that maybe can't afford a very high-end GPU, but they want to tinker. We want to enable that community as a developer tool.”
We hope this implies wider ROCm support for current-generation hardware as well as all future releases - more than just the flagship RX 7900 series.
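For tinkerers on consumer cards today, practice often looks like the sketch below. The HSA_OVERRIDE_GFX_VERSION environment variable is a well-known community workaround for cards outside the official support matrix, not an official AMD recommendation, and the version string is our assumption for RDNA3 parts:

```python
# Community-workaround sketch (not official AMD guidance): probe which
# Radeon GPU ROCm sees, overriding the architecture check if needed.
import os

# Tells the ROCm runtime to treat the GPU as a supported architecture
# ("11.0.0" ~ RDNA3 / gfx1100). Must be set before the HIP runtime
# initializes, hence before importing torch.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")

import torch  # noqa: E402 - deliberately imported after the env override

if torch.cuda.is_available():
    # On ROCm builds the reported name is the AMD GPU, e.g. "Radeon RX 7900 XTX".
    print(torch.cuda.get_device_name(0))
else:
    print("No ROCm-visible GPU; check AMD's official support matrix.")
```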
Lisa has also commented on AMD’s software stack, saying,
“The big question [as of] recent has been software. We've made a ton of progress on software. The ROCm 6 software stack was a significant step up. There's still a lot more [to do] on software… we want to address the large opportunities.”
AMD’s third prong is their XDNA AI engines. While the technology comes from Xilinx, this IP was already licensed to AMD before the acquisition. These AI engines are being integrated into laptop processors and will present as an NPU for Microsoft’s AI PC initiative, competing against Intel and Qualcomm offerings. These AI engines are designed for low-power inference rather than the high-throughput inference or training that the higher-power GPUs are capable of.
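How applications reach that NPU is still settling, but AMD’s current Ryzen AI software path goes through ONNX Runtime. Below is a hedged sketch, assuming a quantized model already prepared for the NPU (“model.onnx” is a placeholder) and ONNX Runtime’s Vitis AI execution provider, which falls back to the CPU for unsupported operators:

```python
# Hedged sketch: low-power inference on the XDNA NPU via ONNX Runtime's
# Vitis AI execution provider, with CPU fallback for unsupported ops.
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder for a quantized model prepared for the NPU.
session = ort.InferenceSession(
    "model.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)

# Dummy image-shaped input; real inputs depend on the model.
name = session.get_inputs()[0].name
inputs = {name: np.zeros((1, 3, 224, 224), dtype=np.float32)}

outputs = session.run(None, inputs)  # runs on the NPU where supported
print(outputs[0].shape)
```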
Commenting on the position of NPUs versus GPUs, Lisa said,
“There are places where the AI engines will be more prevalent, such as PC and notebooks. If you‘re looking at large[r] scale, more workstation notebooks, [they’ll] probably use the GPU in that framework.”
AMD sees a future of multiple AI workloads and engines: CPU, GPU and NPU. It is worth noting everyone else in the space is making the same noises.
Jack commented,
“[For the] NPU, MS is driving [it] heavily because of power efficiency. The NPU can still drive experiences, but not hurt battery [life]. We're going to bet on the NPU. We're going to 2x and 3x on AI… The NPU is all about battery life - in a desktop, you tend not to worry about battery, but also custom data formats supported [by the NPU can be brought] into the desktop.”
This three-pronged approach allows AMD to tackle the AI space on various fronts, showing that not all the eggs have to be in the same basket. AMD has already seen some success with this approach - in the datacenter, AMD is considered the closest competitor to NVIDIA, and MI300’s memory capacity and bandwidth put it in good competition against NVIDIA’s H100 hardware (we are still waiting for B100 benchmarks). The NPU space is still too new and fluid to really determine if AMD’s strategy is paying off; however, it is likely that Microsoft will use the NPU for local ML models, such as assistants or ‘co-pilot’ models.
From our perspective, the weakness of AMD’s strategy is on the desktop GPU side, due to the lack of near-universal ROCm support across AMD’s hardware stack. This is an issue that will take time to resolve - one of the downsides of having a split front is a division of resources, and AMD will require strict management to ensure work is not duplicated across the company. However, there are positives, with AMD ever increasing its forecast of datacenter revenue in 2024, claiming the limit is only in demand, not supply.
AMD’s Q1 2024 Financial Results are expected at the end of April (29th or 30th we think, TBC), with the Annual Stockholders Meeting on May 8th.