Any article about IBM’s big mainframe hardware waxes lyrical about the ‘good old days’ of multi-generational, backwards-compatible processing that props up almost every key industry worldwide, thanks to features such as eight-nines (99.999999%) uptime, security, and hardware focused on database and transaction processing. Rather than re-writing a walk down memory lane, let’s focus on the fact that big iron is still here. Not only that, it’s more important and accessible than ever.
In 2021, IBM launched Telum - the processor at the heart of its z16 product line. With eight cores at 5.2 GHz, the chip boasted more L2 SRAM than you can shake a stick at, with virtual caching features that just made sense. The design pivoted away from a more chiplet-based approach, with IBM leveraging speedups ideal for transactional work that scaled across many chips in a single system. The chip also became the centerpiece of the fourth generation of LinuxONE – a series of Linux-focused systems built purely on the Z architecture. Another big focus, for z16 and LinuxONE, was the AI engine built into every chip, allowing focused models to run alongside critical workloads.
This year, we’re seeing the next generation of that chip: Telum II. While the systems it is going into haven’t been announced yet (likely a z16 successor and a LinuxONE Gen 5), IBM has started to speak in depth about what’s under the hood. It’s still eight cores, but built on a Samsung 5nm high-performance process, with the total L2 SRAM increased from 256 MB to 360 MB, still at an average 19-cycle latency. There’s still an AI accelerator onboard, but the big new feature is a built-in DPU that shares the internal memory like a core.
The onboard DPU is how IBM is expanding the I/O capability of its zSystems portfolio going forward. IBM has a rich history of I/O expansion technologies (CAPI and AXON, among others), and with the new platform, a system can scale to 192 PCIe slots.
These slots can be used for GPUs, storage expansion, cryptography, hardware security modules, or the other part of IBM’s new hardware: the IBM Spyre AI Accelerator (that’s the official name, we’ll just call it Spyre).
Spyre is the productization of IBM’s Artificial Intelligence Unit (AIU) PCIe card, which we’ve seen the company showcase for the last 18 months. Spyre is ‘next-gen’, with similar performance levels but enterprise-level features, designed to meet the demands of IBM’s customers. It’s a full-length, half-height PCIe card with an inference chip and some LPDDR5 onboard – 32 cores of the same architecture and software stack as the AI engine built into Telum and Telum II. This allows workloads to migrate and scale as demand requires.
IBM is clear that there’s demand for this – customers adopting the Telum strategy were initially exploring advanced fraud detection methods and similar workloads, which fit the AI engine on the Telum chip, but they are now looking to bigger models for deeper insights and more accurate responses. For example, a small model on Telum could give a confidence level of fraud as an output – if it’s at 75%+ or 90%+, the data can be run through larger, more accurate models connected to Spyre cards through the DPU for another reading. This is one example, but IBM also mentioned things like generative models.
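To make that flow concrete, here’s a minimal sketch of the cascade in Python. The function names, thresholds, and scores are purely illustrative placeholders (not IBM APIs): a small on-chip model scores every transaction, and transactions at the fringes of the confidence spectrum get a second reading from a larger model on Spyre.

```python
# A hedged sketch of the ensemble flow described above - not an IBM API.

def score_on_telum(txn: dict) -> float:
    """Small on-chip model: returns a fraud confidence in [0, 1] (stub)."""
    return 0.93  # placeholder value

def score_on_spyre(txn: dict) -> float:
    """Larger (e.g. BERT-class) model on Spyre cards behind the DPU (stub)."""
    return 0.97  # placeholder value

def fraud_confidence(txn: dict, low: float = 0.10, high: float = 0.90) -> float:
    p = score_on_telum(txn)  # millisecond-latency first pass on the chip
    # The fringes of the confidence spectrum get a second, more accurate reading.
    if p >= high or p <= low:
        return score_on_spyre(txn)
    return p

print(fraud_confidence({"amount": 129.99, "merchant": "example"}))
```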
There are specifications galore for Telum II and Spyre, and I applaud IBM for being open about its design in such detail. I wish we had more information about the microarchitecture of the core, but IBM’s play here is focused on system sales and consulting support - people forget that IBM has over 230,000 employees.
As part of the launch, I was able to sit down and chat with Dr. Christian Jacobi, IBM Fellow and CTO of Systems Development, who was the lead on Telum I and has spent over 20 years at the company in various forms of hardware design.
You can watch the interview in full in this YouTube video, or scroll below for the transcript.
Ian: It’s been a couple of years since the Telum I presentation here at Hot Chips - but we were remote, so you recorded it. How was recording it on stage?
Christian: It was weird and awkward! We did the recording at the auditorium in Yorktown, with only three people in the audience. They were all wearing masks while I did the recording. I'm glad that this is behind us!
Ian: The reaction to Telum, I thought, was pretty good. Everybody was excited to see a new Z architecture-based chip. I wanted to talk a little bit about you and your background, because you've been at IBM, how many years now?
Christian: 22 years! I finished my PhD and went straight to IBM at the lab in Böblingen in Germany.
Ian: You have this title of Fellow - but IBM has a strict definition of a Fellow. What's that all about?
Christian: IBM Fellow is the highest technical level. It's the highest technical appointment at IBM, and there's only about 70 Fellows at IBM right now (for 220,000+ employees). We have broad responsibility, and Fellows exist in all the major business units - from consulting to software to cloud systems. We generally advise on the broad technical direction in which the company should be going. I came up through chip development - I've spent most of my time in chip development, mostly on the Z systems, but I've also done a couple of stints in Power systems.
Ian: What does chip development mean, core architecture, or are we talking about SoC, packaging, or all of the above?
Christian: I've done pretty much all of the above, at one point or the other. I started out in floating-point logic design, then I did cache design. I’ve worked on L2, L1 caches. I've worked on FPGA-based accelerators, I helped with the NEST architecture that we did on Telum I. I've worked on AI accelerators. So I've been around the block and seen a thing or two!
Ian: So do you remember the first chip you worked on?
Christian: The very first chip I worked on was the Cell processor for the PlayStation back then. Long time ago!
Ian: There are so many interesting things about the Cell chip, we'll save that for a different interview! Did you have any key highlights in your career that you're particularly proud of?
Christian: There's a few. Early on in my career, I was the logic lead for the load-store unit. That's one of the really central units of the core and we did the first out-of-order implementation back then. It was a bunch of years ago now, but that was a huge assignment for me and I learned a lot. I actually went on assignment to Poughkeepsie from Germany for that work. Then later I became the core lead on the z14 and the overall chief architect for the Telum processor - obviously that's a big highlight. It's been a good run, and I’ve really enjoyed the 22 years so far - I'm looking forward to many more to come.
Ian: So you were Chief Architect with Telum I, and now you're CTO of Systems. What does that mean?
Christian: Technically I'm the CTO of System Development, so we do obviously a lot of development activity in the different brands like Power systems and z systems. So I'm CTO for that organization, connecting the dots, making sure that we cover the gaps and that we’re connected into the broader IBM strategy. We look into where IBM is going from a broad AI story, and how it ties into some of the consulting work, and how we prepare that side of the business for when we bring out new systems.
Ian: Where is Power right now? We don't hear often from the team.
Christian: Oh, Power is in a really good place. The Power10 system that's out in the market is very successful, both from a volume shipping and from a financial perspective. There's really good work going on for the Power11 chip and the Power12 chip. We're already starting early conversations on what the generation after that will look like.
Ian: Have you ever considered moving from hardware to software?
Christian: It's funny - I've actually moved the other way. I started out as a programmer in a very small company in my hometown. We did enterprise resource planning software development. Then I went to college, the University of Saarland in Saarbrücken, Germany, and in the very first class, 9:15am Monday morning - literally the first class I took - the professor starts talking about AND gates and OR gates and truth tables and adders, and then he shows how you can add multi-digit binary numbers using a chain of full adders. And I was hooked on hardware.
Ian: That's all it took?
Christian: It took a little longer than that in the end, but I was pretty amazed by that.
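As an aside, that first-lecture idea - chaining full adders to add multi-digit binary numbers - fits in a few lines of Python. This is a toy illustration of the textbook ripple-carry adder, nothing to do with IBM's hardware:

```python
# A full adder combines two bits and a carry; chaining them adds binary numbers.

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    s = a ^ b ^ carry_in                        # sum bit
    carry_out = (a & b) | (carry_in & (a ^ b))  # carry to the next digit
    return s, carry_out

def ripple_add(x_bits: list[int], y_bits: list[int]) -> list[int]:
    """Add two little-endian bit lists of equal length."""
    result, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result + [carry]

# 6 (110) + 3 (011) = 9 (1001), bits listed least-significant first
print(ripple_add([0, 1, 1], [1, 1, 0]))  # [1, 0, 0, 1]
```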
Ian: How would you describe the Z architecture and the chips behind it, like Telum, to somebody who has no clue what IBM hardware does?
Christian: So I would say you first have to understand that IBM Z systems are the backbone of IT infrastructure in many large organizations. They run their mission-critical, most sensitive workloads on those platforms, and because of that it is super crucial that they are highly available, secure, and scalable. So there's a lot that comes from just recognizing those use cases. Then the chips that we're building are designed for purpose. As always with that, it starts with “what are the workloads?” and “what are the clients doing with those chips?” That really is deeply infused in the design process as we go through the project from concept phase to execution and release.
Ian: What does high availability mean in this context? I mean, from a basic perspective, I could go to the cloud today and get systems available to me there.
Christian: High availability means eight nines or more of availability. 99.999999%-plus. Just to translate that, it turns into one hour of downtime every 11,400 years. So it's a different standard.
Ian: So a 10th of a second per year?
Christian: Or even less than that.
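For reference, the arithmetic behind those numbers is easy to check - a quick back-of-the-envelope calculation, not an IBM figure:

```python
# Eight-nines availability, translated into downtime.
availability = 0.99999999               # eight nines
hours_per_year = 365.25 * 24            # ~8,766 hours
downtime_hours = (1 - availability) * hours_per_year

print(f"{downtime_hours * 3600:.2f} seconds of downtime per year")    # ~0.32 s
print(f"one hour of downtime every {1 / downtime_hours:,.0f} years")  # ~11,400 years
```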
Ian: Why the change from a chiplet-esque architecture pre-Telum, with compute chips and cache chips with up to 960 MB of cache, to this unified architecture?
Christian: I wouldn’t really call it a chiplet architecture! On one of the Z systems, we had a glass ceramic multi-chip module (MCM) with six processor chips and one cache chip on the module. We then went to the current design point because we could reduce everything into a single chip while still growing performance, still growing the total number of available cores, and adding new functionality like the AI acceleration and post-quantum security and cryptography.
Losing that dedicated cache chip just wasn't a pain point anymore once we came up with a new cache architecture. So from an efficiency perspective, it just made sense to go to the dual-chip module architecture that we have on Telum I, and now continue with on Telum II.
Ian: The way you've pitched it almost sounds like it was purely because you could do the cache on die with virtual cache?
Christian: That was a huge part of it. We had about a gigabyte, like you said, of physical L4 cache. We came up with that virtual cache hierarchy design point, where we use the very large L2 caches as virtual cache. The ups and downs of every core mean that they don't all use their L2 equally - we can use that spare or underutilized L2 capacity as a virtual L3 and a virtual L4. We get to two gigabytes of cache in a drawer of four dual compute modules (8 chips), so we actually ended up having more cache. We didn't need to duplicate as much data anymore.
In the physical hierarchy, we duplicate every line because it’s inclusive. Whatever was in the L2 was also in the L3, and whatever was in the L3 was also in the L4. You ended up not actually getting that much additional cache real estate out of a dedicated chip, so with the virtual cache architecture we ended up growing the effective cache available to the system, and that became a big part of the performance gain on Telum I.
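Putting rough numbers on that, using the per-chip L2 figures mentioned earlier - a back-of-the-envelope sketch, not an official capacity breakdown:

```python
# A drawer holds four dual-chip modules, i.e. eight processor chips.
chips_per_drawer = 8

telum_i_l2_mb = 256   # total L2 per Telum I chip
telum_ii_l2_mb = 360  # total L2 per Telum II chip

# With the virtual hierarchy, all of that L2 can also serve as virtual L3/L4.
print(f"Telum I  drawer: {chips_per_drawer * telum_i_l2_mb / 1024:.1f} GB of cache")   # ~2.0 GB
print(f"Telum II drawer: {chips_per_drawer * telum_ii_l2_mb / 1024:.2f} GB of cache")  # ~2.81 GB
```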
Ian: When you build for things like high reliability, do you have to approach the design of the core resources in a different way than you perhaps would for a normal core?
Christian: I’d say high-availability and resiliency means many things, but in particular, two things. It means you have to catch any error that happens in the system - either because a transistor breaks down due to wear over the lifetime, or you get particle injections, or whatever can happen. You detect the stuff and then you have mechanisms to recover. You can't just add this on top after the design is done, you have to be really thinking about it from the get-go.
I'll just give you an example. When we designed the new L2 cache on Telum I, we spent weeks and weeks figuring out how we could maintain the low latency without an ECC correction circuit in the path between the L2 cache and the L1 cache, because that's three or so added cycles of latency. We didn't want to pay that latency, but we also didn't want to lose the error protection that you get with ECC. So we ended up with a parity scheme plus a sort of RAID scheme, where we run the parity directly into the L1 cache. When something breaks, we can go back to the RAID scheme and recover the data, and with that mechanism, even if an entire SRAM instance breaks - maybe a catastrophic bitline or wordline fail, or a clocking fail - we can still recover the data. But that had to be baked into the base design; that's not something you can add on at the very end.
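To illustrate the idea - and only the idea, this is a toy analogue in Python, not IBM's actual circuit - cheap parity catches errors on the fast path, while an XOR "parity instance" across the SRAM instances lets a whole lost instance be reconstructed:

```python
def parity_bit(word: int) -> int:
    """Cheap detection: a single parity bit per word, checked on the fast path."""
    return bin(word).count("1") & 1

# Pretend a cache line is striped across four SRAM instances.
instances = [0x1111, 0x2222, 0x3333, 0x4444]
stored_parity = [parity_bit(w) for w in instances]

# RAID-style parity computed across the instances (held in a fifth instance).
parity_instance = 0
for word in instances:
    parity_instance ^= word

# Failure: instance 2 becomes unreadable (say, a wordline or clocking fail).
broken = 2

# Recovery: XOR of the survivors with the parity instance rebuilds the data.
recovered = parity_instance
for i, word in enumerate(instances):
    if i != broken:
        recovered ^= word

assert recovered == instances[broken]
assert parity_bit(recovered) == stored_parity[broken]
print(f"recovered word: {recovered:#06x}")
```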
For Z, customers want the strong coherency of inclusive caches. We're strongly ordered, so you don't want too much intervention traffic - you want your last-level cache to know exactly which cache line is where, so it can send interventions to the right place rather than broadcasting them, which is very inefficient. So we've always maintained that full inclusivity: the higher-level cache knows where the cache lines are in the lower-level caches and can target the interventions and cache invalidations. Now with that virtual architecture, we don't really need to replicate cache lines multiple times unless multiple cores want the same cache line at the same time, and that gives us a lot of efficiency in the cache, as I mentioned earlier.
Ian: So when you're talking about the reliability, is it literally down to “each cache line has to be able to support failover and redundancy and reliability”?
Christian: And recovery mechanisms! When we detect an error, we have the architectural state of the processor saved in what we call the R-unit, and every instruction that completes updates that checkpoint with ECC. So if an error gets detected, we can completely reinitialize the core, load the architectural state back in, and keep running. It's those kinds of things, and it is quite unique compared to traditional core design.
Ian: Why build an AI co-processor into Telum I and now Telum II?
Christian: So this goes back to what I said at the very beginning - being designed for a purpose and understanding what that purpose is. Our clients are running enterprise workloads: transaction processing, databases. They want to infuse AI into those workloads, but they are not about ‘just running AI’. So when we decided we wanted to enable clients to really use AI as part of their transactions, and at transaction scale, that means millisecond latency for each inference task. We, like pretty much everybody, thought about how much AI we should add to each core.
When we went and thought that through, we realized that when the cores are running their normal workloads, the AI circuitry sits idle. Then when a core needs to do AI, it only gets its local AI capability. We realized that's not great for utilization, and we would be much better off taking all of that AI circuitry, consolidating it into one large area, and then whenever a core needs to do AI it can access that entire capacity. As a result, with our unified AI compute engine, we're giving access to more compute than we could allocate to each individual core by partitioning.
By exploiting the fact that we're a CISC architecture, we could make it so that the program runs on the main core, and there's a CISC instruction for matrix multiplication, for convolution, and so on. Under the covers, that instruction is executed on the AI accelerator. So in effect the accelerator is part of the core, but only temporarily - when the next core wants to do AI, that big AI engine becomes part of that core, in a certain way.
Ian: As you imply, there are eight cores on the chip, and then a single AI acceleration unit. Is the AI utilization low compared to the rest of the compute?
Christian: I would say that it's not that the utilization of that AI engine necessarily is low - we're seeing great adoption. That utilization is going up, which is why we're investing a lot to beef that capability up and we can talk more about that. But every core only performs AI a certain percentage of its time, and when it performs that AI, we wanted to give it as much AI compute as possible. If you had spread that AI capacity across all eight cores, whenever a core did AI, it would only get one-eighth of the AI capability.
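The utilization argument is simple arithmetic - here's a hedged sketch with purely illustrative numbers (these are not Telum specifications):

```python
# If AI compute is split evenly across the cores, a core doing inference only
# ever sees 1/8 of it; pooled into one engine, the attached core can use all of it.
total_ai_compute = 8.0   # hypothetical units of AI compute on the chip
cores = 8
inference_work = 4.0     # hypothetical units of work for one inference

partitioned_latency = inference_work / (total_ai_compute / cores)  # 4.0 time units
pooled_latency = inference_work / total_ai_compute                 # 0.5 time units

print(f"partitioned: {partitioned_latency} time units per inference")
print(f"pooled:      {pooled_latency} time units per inference")
```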
Ian: Can you do a virtual compute capacity like you do with the caches?
Christian: When a core needs to do AI, it temporarily attaches to the engine and shoves the addresses down to the engine, which is a first-class citizen on the on-chip cache fabric, so it can grab that data real fast and do the compute. When that matrix multiplication is done and another core wants to do AI, that core attaches to the engine for a period of time and basically makes that unit logically part of itself.
Ian: Were your customers coming to you and saying, we need an AI accelerator? Or were you thinking to put some AI in and then teach them how to use it? How does that work?
Christian: We work with our clients very intensively. We have several client councils where we talk through different technology developments and get input from them on where they see the market going. We give them input on what we think might be happening. That happens two to four years in advance for the hardware, and then obviously a little bit closer for the software.
I haven't mentioned this yet, but we're very stack-integrated on the mainframe, so we're really spanning that entire stack and doing that design thinking with our clients. In that conversation with clients, we postulated that AI would be a big thing, but they also looked at it from their side - they brought data scientists and application developers to the table - and agreed. We looked at what kind of use cases were needed and how much compute was needed, and we asked about the kinds of models that would be running.
So it's not like we came up with it, and then told them to “do this”. Rather we asked, “AI is going to be important - do you agree? If you agree, can you bring your AI experts to the table so that we better understand exactly how to build this?”
Ian: What sort of models are run with the on-chip AI? Are we talking tens of millions of parameters, or language models with a few billion parameters? Where does transactional fraud detection fit in that spectrum?
Christian: That’s an interesting development. A few years ago, we were talking about hundreds of thousands of parameters, or millions of parameters - relatively small models compared to what we have today. Those models are somewhat underappreciated today, but they can still deliver a lot of value to our clients in terms of catching most of the transactions - detecting insurance claims fraud or credit card fraud, whatever it may be. They are fast, they are power efficient, and they have super low latency, so they fit within the transaction. But what's also happened is that clients have figured out how to use LLMs, specifically the encoder models, to do fraud detection and things like that with higher accuracy.
Ian: Still a small number of parameters?
Christian: With a BERT-level model, a small number of parameters - a hundred million parameters or so. What we're seeing is the emergence of the ensemble method for AI, where clients run the bulk of the transactions through a relatively simple model that they can do on the chip and get millisecond latency on the inference. The models not only say ‘yea’ or ‘nay’ on fraud, they also give a confidence score. So you use the fringes of the confidence spectrum - the top 10% and bottom 10% of transactions - and run them through a larger model to get an overall better accuracy of prediction. We've actually designed a second chip for this next generation, the Spyre chip, which is optimized for that large language model use case.
Ian: One of the big new features in the Telum II chip this time is the built-in DPU. This is high-speed dedicated networking silicon. How would you describe it?
Christian: It's not only networking! As you can imagine, mainframes process gazillions of transactions every second, right? Terabytes and terabytes of DRAM, so I/O plays a really important part in the overall system structure and value proposition of the system. We were looking at what the next innovation in that space would be, and of course the world is using DPUs now in a number of use cases. We looked at the technology and asked ourselves if this is an innovation space that makes sense for us as well.
As we considered it, we realized the typical DPU is really implemented on the ‘off-processor’ side of the PCIe interface. We thought the technology was great, but it doesn't really make sense for us on the ‘off-processor’ side of PCIe, so we ended up putting it directly on the processor chip. There's an on-board DPU now, which is directly connected to its own 36MB L2 cache. It's a first-class participant in the overall system coherence, and we can run all of our enterprise-class I/O protocols, both for storage and for networking, on that DPU.
There are 32 programmable engines, where that firmware gets implemented, plus a bunch of dedicated I/O acceleration logic. What happens in those enterprise-class protocols is that they communicate with the main processors, where the transaction workload is running, and create address lists describing, for example, where the next block of data needs to be put in the database buffer pool. The DPU can now coherently fetch these address lists and similar structures out of the main memory of the main processing system without needing to go across the PCIe bus. That's just a simple example, but there are many of those interactions.
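As a loose illustration of the kind of structure being described - field names and values here are assumptions for the sketch, not an IBM data format - the workload builds an address list in main memory and the on-chip DPU consumes it coherently rather than over PCIe:

```python
from dataclasses import dataclass

@dataclass
class AddressListEntry:
    target_address: int   # where the next block should land in the buffer pool
    length: int           # bytes to transfer

# Built by the database / transaction workload in main memory...
address_list = [
    AddressListEntry(target_address=0x7F3A_0000, length=4096),
    AddressListEntry(target_address=0x7F3A_1000, length=4096),
]

# ...and consumed by the DPU. Because the Telum II DPU sits on the cache fabric,
# this read is a coherent fetch, not a round trip across the PCIe bus.
def dpu_consume(entries):
    for e in entries:
        print(f"DMA {e.length} bytes into {e.target_address:#x}")

dpu_consume(address_list)
```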
Ian: That may save latency, but you have to pay for it in the die area on the main chip.
Christian: And that's okay. We pay for the die area on the main chip, but we save latency, and again, I come back to design-for-purpose. This is what we're doing. We could have added more cores to the chip, but our clients weren't necessarily looking only for more cores. That's one important dimension, but it's not the only important dimension in that trade-off of what makes the most sense. We decided to put that DPU directly on the processor.
Ian: That does the cryptography, the RAS, and it has direct memory access (DMA)?
Christian: Yeah, exactly.
Ian: IBM seems to be using this primarily to extend the AI capability on the backend with this AI processor you call Spyre?
Christian: That is one part of the PCIe infrastructure. Behind the DPU are I/O expansion drawers. A fully populated system can have 12 I/O expansion drawers, with 16 PCIe slots each. That's 192 PCIe slots in a machine, that's a lot of I/O capability. Clients need that. I'm coming back to the availability and resilience - clients build out fairly massive I/O infrastructures for failover and redundancy in those things, but that also allows us now with the Spyre adapter to build out a very massive AI inferencing capability in the I/O subsystem.
Ian: You could almost put anything in those PCIe slots? The crypto engines, the post-quantum stuff?
Christian: Yeah, we have network adapters, we have storage connections, we have crypto, HSMs, and we’ll have AI adapters and then there's a couple more.
Ian: Spyre looks to me as if it's just the AIU, the artificial intelligence unit from IBM that we’ve seen showcased in the past. What's different from that IBM Research AIU?
Christian: Spyre is the second-generation version of that chip. We've been working very closely with our friends in Yorktown at IBM Research, even back to the Telum I days - the AI engine that's on Telum I, and that we’ve now enhanced for Telum II, came out of that collaboration. In the systems development team, we took that design from research, productized it on the processor, added all of the RAS capabilities, and brought it to the Telum I processor, and so on and so forth.
Ian: When you say added - do you mean it was built and then you had to change it, or were you involved in the conversation, in the same way you have your customer councils?
Christian: Similar, but that was in the early days of AI research, when we weren't even in the game yet - at the same time we were still asking whether we wanted to do AI on this chip or not. I’m going back like five, six, seven, eight years now. It's great to have a research organization that thinks that far ahead, and does it in a way that is close enough to what we're doing in system development - close enough that we can transfer those assets that are not quite product-ready, and then the product team works together with the research team to productize them. The same thing happened with the Spyre chip - there was a prototype that research put out, and when we decided we did need something for large language model capabilities on the Z systems, we worked with the research team. We had a bunch of our own engineers work together with them to put the second generation out, which now has all of the enhancements that you need in an enterprise setting, like virtualization capabilities and such.
Ian: You mentioned LLMs before as the backend check to the fraud detection model. Is that all they're used for?
Christian: No, that is one main use case, but there's a second main use case for the Spyre chip. We have generative use cases on the platform as well - not for general-purpose chatbots or something like that, but very specific uses, for example code assistance or general system admin assistance. Our clients run massive software applications - some clients have hundreds of millions of lines of code - and of course AI is a technology they want to apply when optimizing that code, doing code transformation, explanation, things like that. Their code bases are extremely sensitive. Think about it: if a code base represents how to run a bank, or how to run an insurance company, they have a great interest in keeping that code base on the mainframe. Then when they run AI models for, say, code assistance, they would rather not have that code base flow off to some other system. So we're taking the Spyre adapter, the Spyre chip, and we can cluster eight of them into a generative AI cluster. That then has the compute capacity and the memory bandwidth and all that to get a really good user experience when running generative workloads, like watsonx Code Assistant for Z.
Ian: When can I buy it on Amazon?
Christian: I'll put you in contact with my sales team here!
Ian: IBM's tour around foundry offerings has been well documented. Now you are working with Samsung Foundry for these chips - both of these chips (Telum II and Spyre) are on a 5nm node, but different variants. IBM used to require a super-specialized high-frequency process for Z, so I want to ask, how is the relationship with Samsung on these highly optimized IBM specialized chips?
Christian: It's fantastic. I mean, we come from a background where we did our own development of technology and silicon nodes, along with research and our technology development division in manufacturing - which we now no longer do. We use Samsung, and we're mostly using their high-performance process as is. There are very few tweaks we're doing, like the number of metal layers, and we're really happy with the relationship. They have great performance, great yield. It's a wonderful partnership.
Ian: Are you using AI in your chip design?
Christian: We are getting there. There are several projects going on right now where we're looking at using AI for simulation screening. We're also using AI for a general sort of know-how engine, where you can load a lot of design data, documentation, emails, and selected threads - all of that into a RAG database. Then you can use that as a Q&A machine when you're sitting at 3:00 AM on the test floor, wondering how something works, and you can't find the right person. I'll say it's in its early days, but we're getting there.
What's really interesting is that it's not as easy as just loading everything into a RAG database. That data is very sensitive and security-relevant information. Not everybody has access to everything, so how do you build a RAG database that has user-specific access rights and user-specific credentials to certain documents, but not others? That's a really complicated space, and what excites me about it is that I believe most of our clients will have similar issues when it comes to actually using AI. So in a way we're client zero, figuring out some of those things and then applying them for others in the industry.
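A minimal sketch of what that access-control layer might look like in Python - the store layout, ACL model, and function names are all assumptions for illustration, not an IBM product API. Retrieval results are filtered against the querying user's entitlements before anything reaches the model:

```python
def retrieve(query: str, store: list[dict], top_k: int = 3) -> list[dict]:
    """Stand-in for vector search: score documents by naive term overlap."""
    terms = set(query.lower().split())
    scored = sorted(store, key=lambda d: -len(terms & set(d["text"].lower().split())))
    return scored[:top_k]

def retrieve_for_user(query: str, store: list[dict], user_groups: set[str]) -> list[dict]:
    # Only documents the user is entitled to see are eligible for retrieval at all.
    visible = [d for d in store if d["allowed_groups"] & user_groups]
    return retrieve(query, visible)

docs = [
    {"text": "L2 cache ECC recovery notes", "allowed_groups": {"cache-team"}},
    {"text": "DPU firmware bring-up log",   "allowed_groups": {"io-team"}},
]

print(retrieve_for_user("cache ECC", docs, user_groups={"cache-team"}))
```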
Ian: I was going to say that sounds very much like your IBM consulting product area!
Christian: There's a pipeline here that I think could happen where we figure out some things inside IBM and then take that learning and experience and productize it both as software products and as consulting assets.
Ian: What's been the response to LinuxONE?
Christian: Fantastic.
For clarity, we have basically two brands - we have IBM Z and we have LinuxONE. On IBM Z, you run many of the traditional workloads with z/OS, CICS, IMS, Db2, those kinds of things. But you can also run Linux there, and in fact you can now run Linux inside the z/OS environment as a container extension.
Then we have a second brand, LinuxONE, which is really dedicated to Linux-only applications. Of course we have clients who use both - who run IBM Z and use LinuxONE machines for the Linux side. Overall, Z is growing very healthily, and LinuxONE is the fastest area of growth for us right now.
Ian: Where do you think you spend most of your time, dealing with customers on one or the other?
Christian: What's really interesting is that in those councils we have representatives from both, and as I said, many clients use both. When you look at it, the requirements that one side drives - at least on the hardware side, and often on the software side too - are similar. For example, what you would do to have a high-performance, highly available, highly scalable solution for databases on z/OS is the same stuff that you really need to consolidate many database instances onto a LinuxONE server. So very rarely is there a competition where this side wants this and that side wants that. It really works together very nicely.
Ian: You’ve got some good synergy going there.
Christian: I mean, think about it - if you consolidate thousands of databases into a single server, what do you need? You need availability. You need security. You need scale and performance.
Thanks for the really interesting interview!
Fascinating!