More Than Moore

Tesla’s Dojo Supercomputer Deep Dive

Architecture, Silicon, Packaging

Dr. Ian Cutress
Sep 20, 2022

This is the written version of my YouTube video:

One of the best things in the semiconductor industry this decade is the depth and breadth of chip start-ups and innovation. Not only are there new ideas taking hold, funded by venture capital and hundreds of millions of dollars, but the big players in compute-heavy verticals have created their own chip design teams to create something that caters specifically to their own needs, rather than rely on what already exists in the market. Tesla, the electric vehicle manufacturer leading the way in autonomous driving data and training, has poured what is presumably billions into developing a new machine learning supercomputer that couldn't have existed in the market any other way. Today we go through what it is, how it is built, and why Tesla had to go down this route to design it.

ML Recap: Training vs Inference

Two of the key elements to building a fully autonomous self-driving vehicle are the brain and the algorithm. The brain provides the speed, while the algorithm enables both the accuracy and the features. The algorithm needs to be trained and optimized offline, in a big datacenter with lots of data, before it's given to the brain of the car. The two key words here are training and inference. In the data center, the algorithm is trained - it takes lots of labeled data and attempts to predict the best course of action with that data. Over millions of hours of video and billions of hours of compute, the output is an algorithm, a series of calculations with the right numbers, that can take new data and get the right result. When that trained algorithm is put to use - no longer training, but working on new data after it is trained - we call that 'inference': it infers a result from new data. Training is long and complex but works with known data; inference applies the trained algorithm to new data, where the exact result is determined by what was learned in training.
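
To make the distinction concrete, here is a minimal, generic PyTorch sketch of the two phases - it is not Tesla's code, and the tiny model and synthetic data are purely illustrative:

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
  opt = torch.optim.SGD(model.parameters(), lr=1e-2)
  loss_fn = nn.CrossEntropyLoss()

  # Training: labeled data, forward + backward + weight update, repeated.
  model.train()
  for _ in range(100):
      x = torch.randn(64, 16)           # a batch of inputs
      y = torch.randint(0, 4, (64,))    # the matching labels, the 'right answers'
      loss = loss_fn(model(x), y)
      opt.zero_grad()
      loss.backward()
      opt.step()

  # Inference: the trained weights are frozen and applied to new data.
  model.eval()
  with torch.no_grad():
      new_x = torch.randn(1, 16)
      prediction = model(new_x).argmax(dim=1)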

Tesla’s Full Self-Driving hardware, which runs inference

Tesla's self-driving car, conceptually, feels like it just has to have 'the algorithm', but in reality there are probably hundreds of trained neural networks for different elements, each working on new data, each continually retrained back in the datacenter, with updated versions sent to the cars over the air. There are a number of limits to training a neural network - raw data, software, time, and power. At some level, you can speed it up by throwing more compute hardware at it to make training faster. But if you increase the number of GPUs doing that training by 10x, then another 10x, then another 10x, there comes a point where the more you buy, the less time you save. At last year's Tesla AI Day, Elon Musk stated that the GPUs they could buy could do this work, but the cost and time requirements were astronomical. In order to get their self-driving system to work, they had to build their own hardware. Enter Dojo, the broad name for Tesla's new hardware focused on training neural networks through machine learning.

Tesla Dojo D1, The Chip and Core

Dojo is a broad name, as it refers to a number of things in what we're going to talk about today - almost all the fun stuff here is called 'Tesla Dojo "something"', so we're going to start from the bottom, from the most basic design element, a Dojo Processing Core, all the way up to the Dojo Supercomputer.

Actually that's a lie. I'm going to start at the 'chip' level, and something called the Dojo D1.

The Dojo D1 is a 645 mm2 silicon die, built on TSMC's N7 process. 645mm2 is very big, as big as massive server chips and almost as big as the biggest GPUs. There are three elements to the Dojo D1 chip - the cores, the onboard interconnect, and the connections to other chips. Starting with the cores, there are three hundred and fifty four of them.

Normally when we speak about big AI chips in this industry, we are not looking at cores, but repeated vector units, like a GPU. There ends up being a global scheduler to manage the data, and all the magic to process machine learning mathematics is governed by blocks that do the same operations across multiple bits of data. By reducing the control to one focus point, it saves die area and you get a lot more units, at the expense of configurability. On the other side, a big CPU could have 64 cores, and each core can run different mathematics completely, but it can be inefficient if you want them to work on the same data together. Tesla here has gone for something in the middle.

Unlike many AI chips coming to market, the smallest irreducible Tesla Dojo D1 processing unit is a core. Not a compute unit, not a vector array, but a fully-fledged core. The core is, compared to a normal CPU, very simple, but it's still a core. It has its own local SRAM, a dedicated interface to the wider network, and doesn't care what other cores are doing around it. It has rudimentary branch prediction, four fetch buffers, two decode units, scalar schedulers, scalar register files, two address generation units, two arithmetic logic units, a vector scheduler, a vector register file, a SIMD datapath, and four 8x8 Matrix Multiply Units. On the scalar side, the instruction set looks like a RISC-V implementation, but the vector side is custom for machine learning. Clam from Chips and Cheese has said the core behaves more like IBM's Cell processor in the PS3 than a traditional CPU core.
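
As a quick back-of-the-envelope check of what those four 8x8 matrix units imply, here is a short Python sketch. The ~2 GHz clock is an assumption on my part (Tesla has quoted figures around 2 GHz and roughly 362 TFLOPs of BF16 per D1), so treat the output as an estimate, not a spec:

  # Per-core and per-die throughput estimate from the unit counts above.
  matmul_units  = 4
  macs_per_unit = 8 * 8        # one 8x8 tile of multiply-accumulates per cycle
  flops_per_mac = 2            # a multiply and an add
  clock_hz      = 2.0e9        # assumed clock, not a confirmed spec

  flops_per_cycle_per_core = matmul_units * macs_per_unit * flops_per_mac   # 512
  core_tflops = flops_per_cycle_per_core * clock_hz / 1e12                  # ~1.02 TFLOPs
  die_tflops  = core_tflops * 354                                           # ~362 TFLOPs BF16
  print(core_tflops, die_tflops)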

The core uses four-way simultaneous multithreading, with Tesla saying that usually one or two threads are running compute, while one or two threads are managing data flow. The core has a 32 B fetch window holding up to 8 instructions from a small instruction cache, and each decode unit can decode up to four instructions across two threads per cycle. The integer side is 4-wide with 8B load and 8B store, while the vector side is 2-wide but with two 64B loads and one 64B store, fed directly into the execution ports via explicit transfer instructions. The core is designed to be light, so there are limited protection mechanisms to stop threads interfering with each other, as there would be on modern user-facing cores, simply because Tesla knows its workload and, for now, they're the only ones going to use this hardware. Resources between threads are managed by the compiler, which also manages the 1.25 megabytes of SRAM per core. There are no L1/L2 data caches - everything goes through that SRAM, which works like a software-managed scratchpad fed by DMA operations rather than a cache. Keeping the instruction cache consistent is handled by flushing it when the backing SRAM is modified, rather than through an explicit coherency mechanism. As the SRAM isn't a cache, there are no tag and state bits to store, saving space, and tag-lookup latency is removed from loads, at the expense of code complexity. In reality this means this 1.25 MB of SRAM can have L1-cache-like latency, somewhere around 4-6 cycles as an estimate.
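
Since there are no data caches, the pattern of use looks like a software-managed scratchpad: stage a block of data into SRAM with a DMA, compute on it, and overlap the next transfer with the current computation. The Python below is purely a conceptual sketch of that double-buffering idea - none of the function names are Tesla APIs:

  import numpy as np

  TILE = 1024

  def dma_copy(dst, src):
      # Stand-in for an asynchronous DMA into the 1.25 MB local SRAM.
      np.copyto(dst, src)

  def compute(tile):
      # Stand-in for the vector/matrix work done on the staged tile.
      return tile.sum()

  def process(stream):
      # Two SRAM buffers: compute on one while the next tile is staged into the other.
      sram = [np.empty(TILE, dtype=np.float32), np.empty(TILE, dtype=np.float32)]
      total, current = 0.0, 0
      dma_copy(sram[current], stream[0])            # prefetch the first tile
      for i in range(len(stream)):
          nxt = current ^ 1
          if i + 1 < len(stream):
              dma_copy(sram[nxt], stream[i + 1])    # stage the next tile...
          total += compute(sram[current])           # ...while working on this one
          current = nxt
      return total

  data = [np.random.rand(TILE).astype(np.float32) for _ in range(8)]
  print(process(data))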

That SRAM is capable of 400 GB/sec load or 270 GB/sec store, and has a built-in gather engine at 8B and 16B granularity for the vector side of the core. A list parser also assists that 8-wide decode engine, enabling certain instructions to execute in the front end and drop out early, maintaining high throughput. While the core has access to the SRAM, so does the on-chip network - each core also has a 'network on chip' or NOC router that communicates with other cores on the chip. Data can come from the rest of the network, and rather than going through the core and waiting to be placed into SRAM, the router can do it itself at 64 bytes per cycle read or write with a direct memory access, or DMA. That NOC router can communicate with four other NOC routers - up, down, left, or right - at 128 bytes per cycle in each direction, both send and receive. What's interesting here is that Tesla stated last year at AI Day that the core was designed around the physical limits of the NOC router and on-chip network - they wanted to ensure that core-to-core communication from NOC to NOC occurred in only one cycle, and that puts physical limits on the design.

Because Tesla knows what it wants to run on these cores, the full datapath is designed around the specific ways that numbers are represented in binary. Whole numbers are integers, whereas fractions are represented by what we call floating point formats, and there are special machine-learning floating point formats not commonly used in regular CPUs or GPUs. It probably deserves its own video, but Tesla here is focusing on FP32, BF16, CFP8, CFP16, and a few other formats, each of them varying in the range of numbers represented as well as the precision. There are special instructions that Tesla has developed to deal with those formats, as well as a Tesla Dojo instruction set just for this chip.
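
Of those formats, BF16 is the easiest to illustrate: it keeps FP32's 8-bit exponent (so the same range) but truncates the mantissa from 23 bits to 7, trading precision for half the storage. Tesla's CFP8 and CFP16 configurable formats aren't publicly documented, so this short numpy sketch only shows the standard FP32-to-BF16 truncation (a real converter would round rather than truncate):

  import numpy as np

  def fp32_to_bf16_bits(x):
      bits = np.asarray(x, dtype=np.float32).view(np.uint32)
      return (bits >> 16).astype(np.uint16)     # keep sign, exponent, top 7 mantissa bits

  def bf16_bits_to_fp32(b):
      return (b.astype(np.uint32) << 16).view(np.float32)

  x = np.array([3.14159265], dtype=np.float32)
  print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # ~3.140625: close, but less precise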

That's the core and internal interconnect, let's go back to that Dojo D1 Die, the chip at the heart of this.

As I mentioned previously, this chip is 645mm2, and has 354 of those cores arranged in a 2D array. Each core has its SRAM and NOC router, and in total there is around 440 MB of SRAM on the die.

From AI Day 2021: the per-edge bandwidth has halved since this presentation

On the edge of the chip is the external communication hardware - a custom low-power SERDES link that Tesla has developed themselves. A SERDES is a serializer/deserializer, a technology that allows data to be transferred off-chip with fewer wires and connections, making it less complicated. Most off-chip connections on your regular computer will be SERDES links, such as AMD chiplet links or PCIe links. Tesla here has covered all four edges of its chip with these links - 576 bidirectional channels of them - leading to a total of 8 TB/sec of connectivity, or 2 TB/sec on each edge (last year Tesla claimed 4 TB/sec per edge, so this seems to have halved).
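
A quick bit of arithmetic shows how those numbers hang together; the per-channel figure is my own division of the quoted totals, which works out to roughly a 112 Gbps-class SerDes lane:

  channels        = 576
  total_bytes_s   = 8e12                      # 8 TB/sec across all four edges
  per_edge        = total_bytes_s / 4         # 2 TB/sec per edge
  per_channel     = total_bytes_s / channels  # ~13.9 GB/sec per channel
  per_channel_bps = per_channel * 8           # ~111 Gb/sec per channel
  print(per_edge / 1e12, per_channel / 1e9, per_channel_bps / 1e9)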

So that's the three things that make up the Tesla Dojo D1 Chip: the core, the on-chip network, and the off-chip connectivity. Now we have to think bigger, and how these D1 chips are laid out together. This is where the magic really happens.

Chip-on-Wafer Packaging: The Dojo Training Tile

Tesla designs the chips, they get sent to TSMC to be manufactured, and then they are tested. The ones that pass the tests, that can do the work at the frequency and voltage required, are called 'known good die'. The ones that don't work are put to the side for the moment. Tesla (and TSMC) then takes 25 of those known good chips and arranges them in a 5 x 5 square, and puts them on a massive silicon interposer to create a 'Dojo Training Tile'. We'll be talking about Tiles a lot for the rest of this, so it is worth taking this in.

A tile is almost as big as a silicon wafer - in fact, TSMC calls this 'chip on wafer' technology. Those twenty-five known good D1 chips are placed on a big interposer in a five-by-five grid, and the interposer manages the power for each of the D1 chips as well as the chip-to-chip connectivity. Through those SERDES links I was mentioning earlier, each of those 25 chips talks to its neighbors at 2 TB/sec per edge. All of those chips, the big interposer, and all that data also need to be managed within power and thermal requirements.

A full 25-chip Dojo Training Tile uses unique packaging, as the full power is around 15 kW per tile, at 18 thousand amps (about 0.83 volts). If a modern GPU is 400 W, this is almost 40 of those GPUs all in one package. That works out to each of those D1 chips being around 550 W, depending on whether the connectivity power is counted as part of the chip or not. Normally with a chip, the manufacturer has to deal with communication and power in the same dimension, but here Tesla is able to focus connectivity across the Dojo Tile horizontally, and the power vertically - the power comes up from the bottom of the interposer, while the casing around it on the top (and technically the bottom) deals with the cooling. Something very similar happens with Cerebras at over 20 kW, so it's not completely new technology, but still bespoke for this solution.
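
The tile-level power figures check out with simple arithmetic; the split between the compute dies and everything else is my own rough estimate, not a Tesla-published breakdown:

  amps, volts = 18_000, 0.83
  tile_watts  = amps * volts      # ~14,900 W, i.e. roughly 15 kW per tile
  per_die     = tile_watts / 25   # ~600 W if split evenly across the 25 D1 chips
  # Tesla's ~550 W per D1 figure suggests the remainder goes to the interposer,
  # the SERDES links, and power-delivery losses.
  print(tile_watts, per_die)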

Now we have this big tile, with 25 D1 chips, with 8850 Dojo D1 cores, and 11 GB of SRAM. In terms of compute, each tile is rated at 9 PetaFLOPs of BF16 machine learning performance. Because the D1 chips on the edges of that 5x5 grid retain their off-chip connectivity, a single one of these large tiles has 18 TB/sec of tile-to-tile connectivity. Yup, that's right - not only can you have 25 chips per tile, but you can put multiple tiles together. A direct tile-to-tile link is 4.5 TB/sec, but you can arrange tiles in a big 2D array, just like the cores and just like the D1 chips. The idea here is units of scale, and to keep scaling out. A big 2D mesh of Dojo Tiles is called the Dojo system.
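
Aggregating the per-chip numbers up to a tile gives the figures above; the per-die BF16 value used here comes from the per-core estimate earlier (with its assumed ~2 GHz clock), so treat this as a consistency check rather than a spec:

  chips_per_tile   = 25
  cores_per_chip   = 354
  sram_per_core_mb = 1.25

  cores_per_tile   = chips_per_tile * cores_per_chip              # 8,850 cores
  sram_per_tile_gb = cores_per_tile * sram_per_core_mb / 1024     # ~10.8 GB, i.e. ~11 GB
  tile_pflops      = chips_per_tile * 362e12 / 1e15               # ~9.05 PetaFLOPs BF16
  print(cores_per_tile, sram_per_tile_gb, tile_pflops)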

In order to enable that off-tile communication, each 25-chip tile also includes 40 I/O controller dies for point-to-point connectivity, with Tesla saying that the heterogeneous, RDL-optimized design enables high density and high yield.

Included is the extensive power delivery:

And Tesla has already shown one of these tiles up and running on their test system.

You would think that we can now make a large array of Dojo Tiles, say 10-by-10, and get our amazing machine learning supercomputer; however, we now reach a limit of scale. A Tesla Dojo supercomputer actually limits itself based on how quickly it can get data into these chips. In one dimension, that limit is three tiles across. In the other dimension, Tesla has said that right now, 40 is the sweet spot. So you end up with a three-by-forty array, and it looks something like this.

But we're still not finished when talking about a Dojo supercomputer. If this Dojo system is limited to 3 tiles wide, then at the edge of those tiles are 5 individual D1 chips - the 5 on the edge of that five-by-five array. Each one of those five chips is connected to a Dojo Interface Processor, or DIP. This is another chip, simply for managing data.

The Dojo Interface Processor and Networking

The Dojo Interface Processor is a PCIe card that deals with getting data into the Dojo Tile, but also with getting data out and moving it around. Each PCIe card consists of two main chips behind a switch, and each chip is backed by 16 GB of high-bandwidth memory, or HBM. That memory runs at 400 GB/sec, and so each PCIe card with two main chips has 32 GB of HBM and 800 GB/sec of bandwidth. It talks to the tile at 900 GB/sec with a custom 'Tesla Transport Protocol' interface, such that any data inside the HBM can be transferred to the D1 chip it is connected to at maximum HBM bandwidth. As there are five D1 chips on the edge of each tile, there are five PCIe cards per tile, giving a total of 160 GB of HBM per tile edge at 4.5 TB/sec of bandwidth. Each PCIe card can also talk to a special switch, using Tesla Transport Protocol over Ethernet, at 50 GB/sec. Tesla has developed a custom networking switch and networking protocol to minimize traffic across the Dojo system in a very interesting way, using what it calls a Z-plane topology.
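
Adding up the interface processor numbers per tile edge; the figures come straight from the paragraph above, only the arithmetic is mine:

  hbm_per_chip_gb = 16
  chips_per_card  = 2
  cards_per_edge  = 5                                    # one per edge-facing D1 die

  hbm_per_card_gb = hbm_per_chip_gb * chips_per_card     # 32 GB per card
  hbm_per_edge_gb = hbm_per_card_gb * cards_per_edge     # 160 GB of HBM per tile edge
  card_to_tile_bw = 900 * cards_per_edge                 # 4,500 GB/sec, i.e. 4.5 TB/sec
  print(hbm_per_edge_gb, card_to_tile_bw)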

Imagine that 3x40 Dojo supercomputer I was talking about earlier. So three columns of forty tiles. For column one and column three, each one of those 40 tiles is connected to five Dojo Interface Processors, that stuff with the HBM we were just talking about, and each one of those is connected to a special switch with this custom protocol. Now, sometimes in a machine learning workload, you might need one core in one tile to communicate with another core in another tile that might be fifteen or more hops away. Without any external networking, the data would have to go through multiple chips and multiple tiles just to get to where it needs to go. Instead, using the Dojo Interface Processors, the data can leave the tile, go over this 'external network', and get to the core it needs to in as little as four hops. This allows for multiple routes for data to travel across the network, balancing latency and bandwidth and minimizing congestion.
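
The hop-count advantage is easy to see with a toy cost model. This is only an illustration of the idea - the exit/fabric/re-entry costs below are hypothetical, not Tesla's actual routing parameters:

  def mesh_hops(src, dst):
      # Manhattan distance on the tile mesh, using (column, row) coordinates.
      return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

  def zplane_hops(exit_cost=2, fabric_cost=1, entry_cost=1):
      # Hypothetical fixed cost: leave via a DIP, cross the switch fabric, re-enter.
      return exit_cost + fabric_cost + entry_cost

  src, dst = (0, 0), (0, 20)    # two tiles twenty rows apart in the same column
  print(mesh_hops(src, dst))    # 20 hops staying on the tile mesh
  print(zplane_hops())          # ~4 hops via the external Z-plane network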

So now we have an idea of a Tesla Dojo supercomputer. Each D1 chip has 354 cores. 25 D1 chips make up a Dojo Tile. Tiles are arranged in a 3-wide and up-to-40-long 2D array. For columns 1 and 3, each tile has five interface processors that provide data, and those interface processors are connected by a custom Dojo switch.

Now, each of those Dojo switches can also be connected to more traditional host computers, either one or many. In order to do this, Tesla has created a special Dojo Network Interface Card, or DNIC, that goes into the PCIe slot of a standard computer so you can communicate with the Dojo supercomputer. This isn't a standard Ethernet card, but you can think of it as a special network card that gives you direct memory access to any of the more than one million cores on the massive system. What we have here is remote direct memory access for any host that uses a special DNIC to connect to a Dojo switch.

The Tesla Dojo ExaPod

The total Tesla Dojo D1 supercomputer is a hundred and twenty tiles, capable of one ExaFLOP of BF16 machine learning compute, with a total of 1.3 TB of onboard SRAM and a total of 13 TB of interface-processor HBM. Between the D1 chips, the off-tile I/O dies, the tile interposer, the interface processor, the switch, and the network card, I'm counting six different pieces of serious silicon required to make this work.
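
Summing the tile-level numbers up to a full ExaPod; the assumption that only the two outer columns carry interface processors follows the 3x40 description above:

  tiles         = 120
  cores         = tiles * 25 * 354        # 1,062,000 cores
  bf16_exaflops = tiles * 9e15 / 1e18     # ~1.08 ExaFLOPs of BF16
  sram_tb       = tiles * 11 / 1000       # ~1.3 TB of on-die SRAM
  dip_tiles     = 2 * 40                  # columns 1 and 3 only
  hbm_tb        = dip_tiles * 160 / 1000  # ~12.8 TB of interface-processor HBM
  print(cores, bf16_exaflops, sram_tb, hbm_tb)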

So how does Tesla manage defects or hardware issues? I mentioned that the base D1 chip is tested to be working before it ever gets put on a tile, so most manufacturing defects are ironed out before everything gets assembled. But even then, an individual D1 chip might end up good enough to use but with a dead core or three, or the chip-on-wafer packaging might not be perfect, or something during post-packaging might cause an issue. To get around this, because the system is just one big 2D mesh, the system topology is managed at the software level. Dead nodes are avoided by software, and the routing table of each NOC on each D1 die can be configured not only to avoid dead chips but also for link balancing and link recovery. Packet ordering is not guaranteed, but packet counting is used in conjunction with system synchronization through semaphores to ensure atomicity in the cores as well as to avoid starving cores of data. Feeding the beast gets harder as you scale, especially if you can't get the data into the cores - this is perhaps why the three-wide arrangement strikes a balance with the off-tile interface processors and all that high-bandwidth memory.
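
Routing around a dead node on a 2D mesh is conceptually just a graph search. The sketch below is a generic breadth-first search over a 5x5 chip grid with one dead die - it is not Tesla's routing-table algorithm, just an illustration of the principle:

  from collections import deque

  def route(width, height, src, dst, dead):
      # Find a hop-by-hop path from src to dst that avoids any node in 'dead'.
      frontier, seen = deque([(src, [src])]), {src}
      while frontier:
          (x, y), path = frontier.popleft()
          if (x, y) == dst:
              return path
          for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
              nx, ny = nxt
              if 0 <= nx < width and 0 <= ny < height and nxt not in dead and nxt not in seen:
                  seen.add(nxt)
                  frontier.append((nxt, path + [nxt]))
      return None   # unreachable if the dead nodes partition the mesh

  # A 5x5 grid with one dead die at (2, 0); the path detours around it.
  print(route(5, 5, (0, 0), (4, 0), dead={(2, 0)}))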

On the software stack, here Tesla is focusing on its own workloads first - its own models, extensions for Dojo in PyTorch, custom JIT NN compiler with an LLVM backend, drivers, instruction set architectures. That being said, Tesla has said that its instruction set architecture is compiler friendly, and there's a growing industry need for distributed compiler technology.
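
Tesla hasn't released that PyTorch extension publicly, so here is only a hypothetical sketch of what an out-of-tree accelerator backend typically looks like to the user - the module name, device string, and compile backend below are invented for illustration:

  import torch

  model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
  x = torch.randn(8, 1024)

  # With a vendor backend installed, the usual pattern is an extension import,
  # a device move, and an optional compile step - all hypothetical names here:
  #   import torch_dojo
  #   model = model.to("dojo")
  #   x = x.to("dojo")
  #   model = torch.compile(model, backend="dojo_jit")

  out = model(x)    # runs on the CPU in this standalone sketch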

So, in terms of use: I stated that the Tesla Dojo D1 supercomputer, called an ExaPod when it hits one ExaFLOP, had only one focus when it was designed - to run Tesla's machine learning training workloads at a scale not possible with GPUs without significant software overhead, significant power cost, and significant time cost. The Dojo design is meant to address those three vectors as best as possible. Tesla is focusing on its self-driving networks, especially convolutional neural networks, and by focusing solely on what it is doing, it was also able to cut out some of the extra layers needed by other AI chip companies that want to sell to the cloud or to multiple customers. Tesla has fielded questions multiple times about 'beyond Tesla' use, and the answer has been that at some point Tesla could consider building cloud systems for remote use; however, the focus today is simply on getting their own workloads running at scale before they wonder about how to enable non-Tesla use. The idea isn't off the table, but it's in the low-priority non-Tesla 'to-do' pile. That being said, there's a glimmer of hope - a Dojo ExaPod can support training multiple networks at once, with a disaggregated model system to balance different training workloads. This means that, for example, if multiple customers had instance time on the same machine, it could be split between them depending on what they need. However, the Tesla Dojo relies on physical memory addresses rather than virtual memory addresses, making multi-user tenancy almost impossible, but with the side benefit of reducing latency, core area, and core design complexity.

I should point out the ‘irreducible’ unit of the system: you can’t simply have one Dojo D1 chip, unless Tesla puts it on a PCIe card in the future. The minimum unit of use is going to be a full 25-chip tile, paired with at least one Dojo Interface Processor and a host using the Dojo TTP networking protocol. Most AI chips require a host or source with high DRAM capabilities, including GPUs, but there’s an extra level of complexity here to get a Dojo system up and running.

Comparison to Cerebras

If we compare Dojo to other AI chips in the ecosystem, the only one on the same sort of scale is Cerebras, and one might argue that Cerebras goes even further in that direction. While Dojo's traditional individual unit is the D1 chip and we get 25 per tile, Cerebras just uses one big wafer-sized chip. Where a single Dojo D1 has 354 cores, and a tile has 8850 cores, Cerebras' smallest unit of compute has almost a hundred times that - the trade-off is control vs. compute. Cerebras' cores are 50% SRAM, and its 850,000 cores are also connected in a 2D mesh, with the lightweight cores enabling stencil compute and not just matrix multiply engines. Cerebras relies on full-chip networking, whereas Dojo has to manage chip-to-chip networking, and as a result Cerebras has made engineering strides in what we call 'cross-reticle' interconnects to be able to design a single chip that big. Cerebras also has redundancy built in, such that if there are any defects (based on TSMC numbers, there should be about thirty per chip, but Cerebras says in reality there are a lot fewer), the software can route around them, leading to 100% yield. The chip-on-wafer technology from Tesla, I'd argue, especially with all those SERDES links, is more expensive to produce. Tesla has put a lot of work into how the rest of the system is connected and communicates. Cerebras has its own scale-out platform too, up to one hundred and ninety-two wafer scale engines at twenty-four kilowatts each, going above Tesla's one hundred and twenty tiles. However, Cerebras is selling to dozens of customers, whereas Tesla only has to sell to itself. Cerebras is already in the market with its second-generation hardware and software, and has happy customers. Tesla is still in the process of ramping to its internal workloads, but we'll find out more at AI Day.

Tesla AI Day, September 30th

On September 30th, Tesla is having its AI Day in the Bay Area, where we're expecting to hear more about the Dojo development cycle. I'm reporting today solely on what we learned from last year's AI Day and the disclosures at Hot Chips a couple of weeks ago, and at that last conference they did say that some information was being kept back for AI Day - it sounded like detail about performance and utility. I'm actually in the Bay Area on that day, and I've been sending out feelers to get an invite to the event. Multiple Tesla employees have told me that the event isn't meant to be a marketing exercise - it's well known that Elon isn't a fan of wide-scale press relations - but is meant to be more of a hiring showcase. They bring in engineers who might not believe the company is capable of what it claims, get them to ask whatever questions they want, and sell them the idea that what Tesla does is not only possible, but is pushing whichever specific industry forward. With that in mind, the event isn't aimed at people like me in particular, but I'd still love to see how the sausage is made. A couple of my contacts have put my name forward for an invite, but I don't have one yet. Short of knowing the people with the invite list, or getting a tweet off Elon saying to come on over, I'm crossing my fingers. This isn't a call to action by any means, but I'll have to wait and see.

I will add one note about how Tesla is developing its self-driving networks. Right now it has a lot of customer vehicles in the field, and a lot of those cars (if not all) can run those networks in ‘watch’ or passive mode. The car monitors the driving and does all the inference and predictions it would normally do, but nothing happens because the driver is in control. If the driver does something different to what the network thought the driver would do - perhaps because the neural network hasn’t seen or been trained on that situation before, or the driver did something really wrong - then the video feed leading up to that incident is labeled as an ‘interesting event’. Either continuously or in batches, Tesla can request those interesting clips back to HQ. Those clips can then be examined, checked to see if they really were interesting events, compared against what the algorithm thought was happening, and then perhaps labeled and used to train future models, or used as the basis for generating a similar dataset that then trains future models. When you have a fleet of vehicles in the field that can do this every day, or every hour of every day, terabytes upon terabytes of data can be used to update those models, and the cycle repeats. One of the main issues with machine learning (of which there can be many) is simply the volume of good data needed for training. Then again, the question of what is good data, biased or otherwise, is something that technologists, anthropologists, ethicists, futurologists, and even politicians have been debating for years. There’s no point training your network on one particular city, driven by one type of middle-class driver who can afford those vehicles, and thinking it’ll be suitable elsewhere in the world where the roads are different, the signs are different, the other drivers are different, the weather is different, the hardware might be different due to cost, and everything else in between.

Well, I hope you've enjoyed this deeper dive into the Tesla Dojo architecture, chip design, packaging, and system scale-out. Each one of these segments could go much, much deeper, although those deeper dives are probably best saved for their particular audiences. Put your thoughts, comments, and - probably most important - corrections and things I've missed in the comments below.

Thank you for reading More Than Moore. This post is public so feel free to share it.

Sources:

  1. Tesla AI Day 2021: YouTube

  2. Tesla Keynote at Hot Chips 2022: YouTube

  3. Tesla Presentation 1 at Hot Chips 2022: Will be public in December

  4. Tesla Presentation 2 at Hot Chips 2022: Will be public in December
