One of my big pushes in the start-up AI space is for companies that are actually making chips and hardware to get it into the hands of developers. Too many start-ups sit there only engaging with one or two key customers, and if there's one thing NVIDIA has shown about how to succeed, it's that everyone can go out and buy a CUDA-enabled AI accelerator today from the local PC hardware shop. This means there's an army of developers out there, some homebrew and some commercial, developing software for those platforms. Thus getting some hardware into the ecosystem is a must, and I'm pushing more AI start-ups to do it as a long-term positive, rather than an immediate revenue gain.
One of the first to do this is Tenstorrent. In December last year, they quietly made their Grayskull PCIe cards, essentially an AI developer's kit, available to the masses. The e75 is the half-height version sporting a 75-watt chip, and it comes with a blower fan so you can at least put it into a PC; the e150 is the higher-powered version with more on-chip memory, designed as a dual-width PCIe card for servers with airflow. The cards are available for $599 and $799 respectively, with one minor stipulation: you have to tell Tenstorrent why you want the hardware. The point of this first run of Grayskull cards is to get developers working on projects and providing feedback on the software stack to improve it in the long term. So this is a small-ish run of hardware, hence the picking and choosing, but anyone with a compelling potential use case, end-customer or homebrew, has the opportunity to go and make their case and buy one.
The e75 is, as mentioned, a 75W PCIe 4.0 card featuring 96 Tensix AI cores running at 1 GHz, with 8 GB of LPDDR4 memory and 102.4 GB/sec of peak bandwidth. The cooling kit is designed more as a server blower, but can be used in a standard PC form factor. On the chip is 96 MB of SRAM, and as it currently stands the card supports Ubuntu 20.04, with support for other operating systems coming soon.
The e150 (on the left) is a fully enabled version of the silicon, with 120 Tensix cores, but running at 150W using the same PCIe 4.0 x16 interface. The clock is bumped up 20% to 1.2 GHz, and those extra cores also provide a total of 120 MB of SRAM. There's still 8 GB of LPDDR4, at a slightly higher 118.4 GB/sec of bandwidth.
Now I can imagine what some of you are thinking – 8 GB is not nearly enough memory for modern AI inference. Tenstorrent does offer a full model support page, which includes BERT, ResNet, YOLOv5, U-Net, and others, however it’s important to note here that Grayskull isn’t the design that will eventually be volume produced. This is purely a developer kit, letting people get used to the hardware configuration and the software stacks before the big hardware comes later in future generations.
The software stacks come in two varieties - a high level and a low level. The high-level stack is called TT-Buda, using higher-level APIs to get things up and running, along with interfaces into modern machine learning frameworks. The lower-level stack is TT-Metalium, which provides fine-grained control over the hardware for custom operators, custom control, and even non-machine learning code. Tenstorrent states that there are no black boxes, no encrypted APIs, and no hidden functions.
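For a sense of what that high-level path looks like in practice, below is a minimal sketch of running a stock Hugging Face model through TT-Buda's Python interface. It assumes the pybuda package exposes a PyTorchModule wrapper and a run_inference call along the lines of Tenstorrent's published demos - treat the exact names, signatures, and model choice as illustrative rather than authoritative.

```python
# Minimal sketch of the TT-Buda 'push button' flow, based on Tenstorrent's public
# demo examples. The pybuda API names and signatures here are assumptions and may
# differ from the shipping release - check the official docs before relying on them.
import pybuda
from transformers import AutoModel, AutoTokenizer

# Load an off-the-shelf model from Hugging Face - no rewriting of the model itself.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Wrap the PyTorch module so the TT-Buda compiler can lower it to Grayskull.
module = pybuda.PyTorchModule("bert_encoder", model)

# Tokenise an input and run inference; compilation for the device happens under the hood.
tokens = tokenizer("Grayskull dev kits are shipping", return_tensors="pt")
output_queue = pybuda.run_inference(module, inputs=[(tokens["input_ids"], tokens["attention_mask"])])
print(output_queue.get())  # results come back via an output queue
```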
As the first official unboxing of the hardware, I went to the Tenstorrent offices and sat down with Dr. Jasmina Vasiljevic, a Fellow at Tenstorrent, to go through how a user might approach the hardware after getting the cards. You can watch the full interview here on YouTube, or read through the transcription below.
Ian Cutress: How has the development of Tenstorrent’s architecture and software progressed?
Jasmina Vasiljevic: We do a lot of hardware/software co-design, and software is one of the key parts of that design. If you just make new hardware and no one can program it, what's the point? So the entry point is really important. Hardware/software co-design is one of the key building blocks of what we do, and keeping that process going drives a lot of things here at Tenstorrent.
IC: So what in your background brought you to Tenstorrent?
JV: I did my doctorate in FPGAs - a little bit of the CAD tools, place and route, and then most of it on the high-level synthesis side. Sometimes in the FPGA industry we joke that the FPGA is the hardest thing to program in the world, which makes our pain tolerance level very high!
IC: [laughs] So you can be needled for a long time?
JV: Yes exactly! It gives us a strong drive to make novel architectures really fun to program, really convenient to program - to create these easy entry points to hello world programs, and then to create hardware and software environments that are fun to tinker with. We want developers to feel empowered and excited about new features coming in, so they can just brainstorm and think about all the cool and fun apps they can build on top.
IC: I speak with a number of the FPGA companies out there, and I always say you guys need to abstract higher and higher and make it more accessible. I feel like we're at that same point with machine learning.
JV: We are!
IC: We're dealing with lots of these frameworks - PyTorch, TensorFlow, ONNX, etc. - and support for those is vital now.
JV: That’s right, and it's really interesting to think of all the entry points and what they're for, and then to think about the levels of abstraction. Then, is it a high-level or low-level entry-point - each one of those is important but for a different use case. So you always want to have a high-level entry point if you don't have one. The entry points are a matter of an ‘and’ not an ‘or’, you never want to go grab the developers by the horns and force them down this path.
IC: You won't force a developer!
JV: You can never force a developer! You will welcome them and you will guide them and have them enjoy all the different paths that lead to Rome, in this case it’s more like Tenstorrent and Tenstorrent hardware. And yeah it’s been really fun to watch sort of the development of the different frameworks.
IC: Before we go down this route, I want to get into one of the reasons why we’re here. I reached out and asked about hardware to unbox, and Tenstorrent said we can unbox these two.
JV: So in these boxes is a product that we call Grayskull, or a chip we call Grayskull. It is our Gen 1 hardware, our Gen 1 architecture, so it's the first one that is going out to customers. It is a developer's kit. There are two different cards, and the e75 is the smaller card...
IC: e75 because it's 75 watts?
JV: 75 watts, and then this other one is e150 - again with reference to the wattage. It's slightly bigger.
IC: We’ve been asked to focus on the e75 - I can already tell it's a bit lighter than the e150.
IC: It's Christmas!
JV: Christmas has come early. (we filmed this in early Dec 2023)
IC: So when people get their hands on these, this is what they’re going to end up with?
JV: Exactly, this is what they’ll experience.
IC: (simulates holy/angelic music) So the minute we open - I know a lot of people in the audience are very familiar with unboxings, and everyone knows what a graphics card is for. So what is a machine learning card for? Here's a handy "everything you need to set up" card.
IC: We already know that this is the PCIe version, a half-height, full-length card. This is a typical ML PCIe card we would see for mass-scale inference, perhaps in a datacentre, but this is the developer version, so there's a bit more branding. You've got a blower fan though - it's a bit hard not to notice!
JV: You’ve got to plug this guy in - it’ll drown out a little bit of noise from your neighbours, but it’s not that loud, and it does fit into your desktop.
IC: Even at 75W you still need some amount of active cooling in this form factor. I mean, if you put this in a dual-width, full-height PCIe form factor, maybe a passive cooler could work.
JV: That's right, there are fun things that we're exploring and optimising with respect to the fans and the cooling, so there'll be more fun announcements coming down the line for that. This is what it looks like today, and we were very eager to get these out and get them into the hands of developers. We're okay with the cooling - it's interesting, we just want people using them. We want them going to the website, downloading the tools, plugging them in, trying things out, and just using the hardware.
IC: For those that are interested, there is a sign-up form on the website, where you'll plead your case for hardware. Somebody will reach out and at least confirm that you're actually a developer, that you're actually going to use it, and that you'll be able to provide feedback.
IC: If I remember correctly the e75 is $599, and the e150 is $799. I know for a lot of entry-level developers that might seem like a lot of money, but the realistic expectation to have is that this is a developer kit, designed for developers to grasp the system. There are going to be small/medium businesses who will see if this is useful for their models, this is the type of price point that it goes at, and it's a low-volume part. For reference, SiFive did a board that was $666, and Qualcomm's kit for their Hexagon DSP is about $600, so this fits in right about that price.
JV: Exactly, the ballpark is around there. We want to make the hardware accessible - we don't want it to hit your pocket too hard, but we want people to be excited to play around with it.
IC: I speak to so many companies in this space, and I keep asking them 'where's the dev kit? Make it accessible!' These guys are actually doing that, because it benefits a company like Tenstorrent to have several thousand developers, even on high-level frameworks, with access to their hardware and optimising for it. But yeah, we actually have the hardware now.
JV: That’s right, you can actually smell it! We’re excited. We’re really excited to get these into the hands of developers, to get feedback. Tenstorrent is very proud to ship hardware and have people give us feedback, good or bad, we want to hear it all.
IC: Let’s go through a bit more what that experience is going to be like. What is it going to look like for developers who get in touch and end up with a card in hand - is it as easy as just a link to the website to download the stuff?
JV: So the card welcomes you and tells you where to go - that's your first entry point. From there you can download the drivers, the tools, and get the basics set up. Those are the table stakes - you can plug this into your desktop, and then install our drivers and tools.
IC: Is it Linux only or are you supporting Windows as well?
JV: We’re not supporting Windows… yet. It's on the roadmap, but it's not there yet.
JV: Once you install the basic tools and drivers, you can check the health of the card. You can see it come up, and there are tools that will give you confirmation that the hardware works, and that your computer recognises what has been plugged in. From there you have a choice of going down one of two paths - one is a software stack that we call BUDA, and it's our compiler.
IC: BUDA, that sounds familiar.
JV: It's on our website - it's the compiler that we've been working on, and it's a really fun way to get models working out of the box. You download models from Hugging Face, BUDA will compile them for you, and that's really fun. We refer to this entry point as a high-level, top-down entry point, because you don't have to change your environment, you don't have to rewrite your model - it's 'push button' and it works. It will run on the hardware and show you what's going on.
JV: The other way is a bottom-up stack that we call our bare metal programming. That one is a lot lower level, so it comes back to the abstraction and the entry points. The use case is a bit different. It requires you to rewrite things in Python APIs. It's not a PyTorch out-of-the-box experience; it is for developers that want fine-grained control over the workloads they're running on our hardware, and want an alternative path to write kernels - all the way down to kernels that run on our RISC-V cores and drive the heavy math logic.
IC: So custom operators and things?
JV: Custom operators, custom data movement, custom explorations with novel ops they want to plug into their LLMs, control flow, you know - fancy caching, new embeddings. All of that is accessible to you as a developer to tinker with, and you are never blocked by a high-level abstraction layer; you can bypass it and go directly to kernels and control the low-level hardware.
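To give a rough feel for what that bottom-up path involves, here is a sketch of the general shape a bare-metal host program takes: open the device, allocate buffers, bind a kernel to a core on the grid, dispatch it, and read the results back. The module and function names below are hypothetical placeholders invented for illustration - they are not the actual TT-Metalium API, which Tenstorrent documents alongside the open-sourced stack.

```python
# Hypothetical sketch only: none of these names are the real TT-Metalium API.
# They illustrate the open-device / allocate / bind-kernel / dispatch / read-back
# pattern described above for driving a single Tensix core directly.
import numpy as np
import tt_bare_metal as tt  # hypothetical host-side Python module

def eltwise_add_on_device(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    device = tt.open_device(device_id=0)                  # open the Grayskull card
    try:
        # Allocate DRAM buffers for the two inputs and the output
        buf_a = tt.allocate_buffer(device, a.nbytes)
        buf_b = tt.allocate_buffer(device, b.nbytes)
        buf_out = tt.allocate_buffer(device, a.nbytes)
        tt.write_buffer(buf_a, a)
        tt.write_buffer(buf_b, b)

        # Bind a compute kernel (written against the low-level kernel APIs)
        # to a single core at (0, 0) on the 2D Tensix grid
        program = tt.create_program()
        tt.add_kernel(program,
                      kernel_source="kernels/eltwise_add.cpp",
                      core=(0, 0),
                      runtime_args=[buf_a, buf_b, buf_out, a.size])

        tt.launch(device, program)                        # dispatch and wait for completion
        return tt.read_buffer(buf_out, shape=a.shape, dtype=a.dtype)
    finally:
        tt.close_device(device)

# Usage: result = eltwise_add_on_device(np.ones((32, 32), np.float32),
#                                       np.ones((32, 32), np.float32))
```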
IC: In order for developers to do that they're going to need to have a deep understanding of the underlying architecture, so are there going to be some disclosures about the Tensix cores soon?
JV: That's right. So the feedback that we got from customers that have looked at our bare metal software stack is that they come in, start to use our hardware, and after a few weeks say 'we understand everything that's in your hardware'. Now that's really fun! We have documentation, and our low level is a non-trivial entry point, but we have documents that explain the view of the architecture and the programming model. We have the 2D grid of cores, the NoC, and so we do set them up with the basics. What we then see happen is that experts will log in and read our low-level programming model. We say this layer is just a reflection, a mere image, of the hardware - so what you see there is what you get. We don't try to package it for you, we don't try to steer you this way or that way. What's in the hardware is what's there, and the bare-metal programming model is just a reflection of how to directly drive everything that's available in the engine, which is really cool. I'm really excited about sharing that with the community.
IC: The developers you've worked with up to today, and the clients - the companies and the partners - have obviously been NDA'd up to the hilt until now. But any developer that gets their hands on this means there will be no NDA in place. The idea will be to go out there, go play, talk, poke, tell us what's wrong, tell us what's right, and such?
JV: Yes, we want that feedback! We want the community to engage with the hardware that we’re going to be shipping. We’re also going to be open sourcing our full bare metal software stack, so that means that you get to see the APIs of course, but you'll also get to see everything under the hood. You get to see the way that the kernels get compiled, the memory allocator, the way the runtime arguments get copied onto the device, how the kernels get dispatched. You get to see all the plumbing and functionality, and it's so cool.
IC: Everything that somebody writes for these cards will be forward-compatible with all future Tenstorrent hardware, right?
JV: So that’s an interesting point. Maybe the right way to think about it is there are two aspects of APIs, because that’s kind of like the API compatibility thought process. So there are host APIs and then there are kernel APIs.
On the host side of the APIs, we looked at OpenCL, we looked at CUDA, and we are very familiar with these low-level programming models. We didn't want to reinvent the wheel there, so we mimicked those APIs to be intuitive and behave very similarly. We wanted them to be very familiar to developers of that world, and so the host APIs are relatively able to stay backwards compatible. I'll probably regret that as soon as I say it, but they're kind of defined and they've matured to a certain degree. The design space is not being wildly explored in those areas.
Then on the kernel API side, there's a strong desire to keep backwards compatibility. That's important to us. However, in reality, if you are allowed to colour outside the lines with next-gen architectures, you can make leaps in performance and functionality. It's a conscious decision to go: well, this is a new microarchitecture, a new gen - do we maintain backwards compatibility, or do we allow ourselves to colour outside the lines and make a leap?
IC: So, that's backwards compatibility, but everything will be forward compatible, anything you write for this gen will work on next-gen?
JV: That’s the goal.
IC: That’s the goal?
JV: That’s the goal.
IC: I know Tenstorrent has been vocal about upcoming roadmaps, especially as the company is taking on new clients and new investors. Some of those things are changing. From your perspective, obviously it's one thing to support this generation, but do you look at it from both the high-level and low-level software layers?
JV: High-level entry points are less susceptible to low-level changes, and developers like them because of that, and because it gives them a quick path to a desired outcome when they stay within that higher-order programming model. We want that, and what we've seen (to go back to what we were saying earlier) is that before, there were a lot of frameworks that developers were using. Over time they all kind of consolidated onto PyTorch, so there was a consolidation effort, and now we see a growth in the number of frameworks again - this time high-level ones. It is interesting to note that developers seem to enjoy things that are for a particular purpose, and we see a lot of specialisation there - if you make a high-level API that aims to do everything under the sun, you usually end up with leaky abstractions and developers that get annoyed. So it seems that at the high level there are APIs that are specialising towards a certain purpose or framework, and we're in that game as well.
With the low-level API, we want to make sure that developers always have access to the hardware, and that nothing is hidden from them. The IP business is a big branch of our business and fairly important, and a customer that is a potential IP customer, or already is an IP customer, wants to know exactly what's in there. They want to control it, they want to drive it, and sometimes they'll get ideas like 'I wish I had this feature or that feature' and they can visualise that in the software and hardware. That entry point is super important for us.
IC: These cards that we've got in front of us are take-home for your workstation, but Tenstorrent has had hardware in the cloud for a little bit as well. How has that been?
JV: It's an easy entry point, and it's super convenient. It's SSH in, and off you go to "hello world". So that's been really fun, and it's the fastest way to get people to access the hardware and try out simple things. Then they try out complicated things! We have customers who are running on the cloud today, and it's a great testbed for us. We make drops to the cloud, and we deliver our software to the cloud, so they are our first internal customers, and that feedback loop has been really important. Our customers so far have enjoyed the very quick turnaround that we can give them in terms of machine access - everything is set up and works, and they just SSH in.
IC: So the reason why you're perhaps not opening a cloud to developers and instead going down the hardware path is because…
JV: We're about 'and's not 'or's - we want both. We have a certain amount of cloud capacity, and today we utilise it to almost 100%. For every single new server we put in, there's a waitlist of folks waiting to try it. We want to make sure that developers can get hardware in their desktop - so they can hear it run, install all the tools, and test that flow as well. I think hardcore, serious developers love being able to touch the hardware, to install it themselves, and to have things in their own hands, as opposed to some distant server somewhere that could go down on them. It's about empowering developers.
IC: So, with the cloud, it's easy to do those fast-track updates, especially if it's a client that's putting in money over time. For individual developers, how are you going to communicate with that community about how updates are being rolled out?
JV: So we have releases for both software stacks - announcements will go together with that release cadence. The underlying tools have their own releases as well, so all of this will be announced on our website and developers can uplift to the latest version as they see fit. Of course, in the cloud this is a little more behind the scenes for them. A bit more fluid.
IC: When this hardware goes out, you’ve already got support staff ready?
JV: That’s right. We’re a pretty small team. On the bare metal programming model we’re not a huge team, we’re a small team of very smart individuals, but super excited and dedicated to shipping the software, to open sourcing it, and kind of showing it to the community. That said, we’re not yet at the stage of being able to service and keep up with large pull requests - I think that’s normal, I think that’s kind of like a growing stage that the community mostly understands. We want to be able to develop in the open, we want to be able to show what's there, and then we have a strong ambition to grow to the point that we could have strong collaborations with the open-source community.
IC: Does it matter that there are several dozen other AI hardware start-ups out there doing their own thing? Did you ever think about competition vs. collaboration or anything like that?
JV: I think it helps that there are other start-ups doing similar things that we do. You know the goal here - we’re in a race against Nvidia, and we want as many players in our court as possible. Of course we race with them, and of course a lot of our colleagues work in other start-ups. We all know each other, so it’s fun. It makes for interesting Thanksgiving dinners.
IC: One family member at one company and another at a different company?
JV: It’s happened, it’s happened. My husband and I worked at Xilinx and Altera at the same time, so it’s not unforeseen.
IC: Oh wow!
JV: I think it helps. I think activities and software stacks that grow the community and make it more diverse increase that design space. It helps. We also learn from each other, and it's a long path ahead - I think it's going to be really fun. Now we see software stacks that are moving away from just PyTorch, right? For us, we have a roadmap item to integrate into PyTorch 2.0 natively, and to generate a pull request to be in the repo. That's on our roadmap, but we also see that there's a trend of software stacks that are being developed specifically for a piece of hardware. This is because you can then control the APIs and allow users to do specific things that are native to that hardware, without forcing developers to go behind this general 'one to rule them all' API and ecosystem that leads to leaky abstractions. If you want to do specific things, I think it helps, and anything that grows the ecosystem is good and fun.
IC: So for somebody that ends up getting this hardware, what's the first thing they should do? And what's the first model they should run to make sure it all works? What's the shakedown procedure?
JV: For both software stacks we have a landing page that takes you through the first five things. For BUDA there are a few models you can just run - you're five clicks and a script away from running. On the bare metal side, there are a few models that are optimised for performance that you can run out of the box, and then there are a few kernels you can run to see how things work end-to-end. The stack also comes with debug tools and kernel performance tools, and we integrate into an open-source tracing tool so you can see a performance profile of what's happening. So it's a question of what level of exposure and deep dive you want to go into, and those first five things take you progressively down that path, all the way down to running kernels and seeing what happens.
IC: How often do you have to report back to Jim on what the community is saying - is that decided yet?
JV: It's moving from hourly to daily. It's frequent - he deeply cares about what the community is doing, and he's driving Tenstorrent towards a strong awareness of community values and software development values, and ensuring that the entry point is really convenient and fun for developers.