More Than Moore


Hot Chips 2022: More Than Moore Recap

It's an engineer's conference. That's why I love it.

Dr. Ian Cutress
Aug 29, 2022

We've just had two full days and Hot Chips is complete for another year. For those new to the term, Hot Chips is a yearly conference covering leading-edge silicon without the fluff of marketing. It's an engineers' conference. Hot Chips this year focused on a few key areas - GPUs and High Performance Computing, Chip-to-Chip Integration, Academic Chips, Machine Learning, Networking, and finally Edge and Mobile processors. As requested, here are my key takeaways from the event.

Keynotes – Tesla and Intel

There were two keynotes: the first from Pat Gelsinger at Intel, on how semiconductors run the world, and the second from Tesla, on the scale and ecosystem of its Dojo processor.

To be honest, the biggest highlight of the show was Tesla's set of Dojo disclosures. It warrants a piece all of its own, but at Hot Chips Tesla had two 30-minute presentations as well as an hour-long keynote, which is quite rare for anyone not named Intel. In the presentations, Tesla spoke in detail about its new Dojo AI supercomputer, or Exapod.

An Exapod is formed from 120 of Tesla's Dojo Training Tiles, each requiring 15 kW of power. Each training tile has 25 Dojo D1 chips, connected using Chip-on-Wafer technology from TSMC, and each chip has 354 processing cores on board. Multiplying back out, that's 354 × 25 × 120 = 1,062,000 cores per Exapod, which lines up with the 'million cores' figure Tesla quoted in its slides. But rather than each 'core' being a simple execution port with some memory, as in most AI chips on the market, these are fully fledged cores with 4-way multithreading: two threads take care of the data traffic, and two threads enable the compute.
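
For those who like to check the maths, here's the multiplication in Python, using the same figures from the talk:

```python
# Back-of-envelope core count for a Dojo Exapod, using the figures
# from Tesla's Hot Chips talk.
cores_per_d1 = 354       # processing cores per Dojo D1 die
d1_per_tile = 25         # D1 dies per Dojo Training Tile
tiles_per_exapod = 120   # tiles in a full Exapod

total = cores_per_d1 * d1_per_tile * tiles_per_exapod
print(f"{total:,} cores per Exapod")  # 1,062,000
```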

Speaking of networking, each of the Dojo D1 chips (remember, 25 per tile) has 2 TB/sec of bandwidth to speak to other D1 chips, and each tile has 4.5 TB/sec to speak to the rest of the network. The network is 3 tiles wide and as many tiles long, with each D1 on the edge connecting to two Dojo Interface Processors (DIPs), each of which has 32 GB of HBM and implements 900 GB/sec of a custom Tesla Transport Protocol over Ethernet.

This is then all managed through special Dojo Network Interface Cards (DNICs) connected to regular CPUs. Using the BF16 format standard in AI training, Tesla is claiming 1 Exa-OP of compute for a 120-tile system, with 1.3 terabytes of internal SRAM and 13 terabytes of high-bandwidth DRAM. Tesla also spoke about the models they're training, the disaggregated scalable system design, and the software. This warrants a full-blown deep dive, which I'll likely publish on my YouTube channel.
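
As a rough sanity check - my own division, not a Tesla-published figure - the headline compute spread across the cores looks like this:

```python
# Rough per-core throughput implied by the headline numbers:
# 1 Exa-OP of BF16 spread over ~1.06 million cores.
exa_ops = 1e18
total_cores = 354 * 25 * 120  # 1,062,000

per_core = exa_ops / total_cores
print(f"~{per_core / 1e12:.2f} BF16 TOPS per core")  # ~0.94
```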

Moving on, I should address the other keynote, from Pat at Intel. To be honest, this one was a bit of a dud, at least for me. Intel's upper management, especially Pat, are treading a very careful road when it comes to the company's future, whether that's financials such as Q2, products such as Arc and Sapphire Rapids, or manufacturing such as the build-out of new fabs. The company has been plugging an almost identical story to the one Pat presented last year as part of his vision for Intel: the need for semiconductors, the importance of being less reliant on others, and investment in the US.

The one thing that might have been new was Intel Foundry Services positioning itself as a 'systems foundry' - marketing for enabling a full-stack solution for customers, from chips to SoCs to power to packaging. He also mentioned an 'Intel Chiplet Studio Suite', which I thought was interesting: rather than taking and licensing IP, customers could license whole chiplets already validated in Intel's portfolio, tied together with a universal interconnect like UCIe and Intel packaging. Essentially, 'The Moore You Buy, The Moore You Save' (that's one of my creations, he didn't say that). He did say, however, that a new class of electronic design automation (EDA) tools would be required.

If you follow Intel closely, there wasn't anything exciting here, but I suppose it was kind of nice to have it all wrapped up in one place? Unfortunately I've heard that wrap-up more than a dozen times now, and my calls with Wall Street are looking for action rather than more platitudes.

Beyond the keynotes, we had some surprisingly good talks worth highlighting.

Highlighted Talks – Lightmatter Passage

Lightmatter CEO Nicholas Harris presented the company's latest silicon photonics technology, which blows the whole area of interconnects wide open. Again, this is something I should probably portion out into its own piece of content, but the long and short of it is that electrical signalling on silicon can only carry so much bandwidth. By implementing optical chip-to-chip connections, rather than being limited to the 2 TB/sec of an electrical link, light could scale to multiple terabytes per second and offer a better match to on-silicon interconnects. A wire carries a single electrical signal, whereas a light waveguide in silicon can carry 8 or more different wavelengths of light, each at the bandwidth of 1024 wires.
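
The arithmetic behind that claim is simple but striking, treating the talk's ratios as illustrative rather than datasheet numbers:

```python
# One waveguide carrying 8 wavelengths, each worth ~1024 wires of
# bandwidth, versus a copper wire carrying a single signal.
wavelengths = 8
wire_equivalent_per_wavelength = 1024

print(f"1 waveguide ~= {wavelengths * wire_equivalent_per_wavelength:,} wires")
# 1 waveguide ~= 8,192 wires
```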

By using in-silicon waveguides, the Lightmatter Passage product as presented is, for lack of a better term, a big 2.5D interposer. On this interposer a customer would place their chiplets - and they could be any chiplets: compute, AI, memory, more compute - and the interposer transmits data between them through light rather than electrons. Lightmatter uses special cross-reticle technology, meaning the interposer can be up to 48 full reticle tiles in size - basically as big as a Dojo Training Tile (hint hint).

In order to go off-interposer, Passage offers fiber attach per edge chiplet, at up to 128 terabits per second, for ultimate data-center scale-out. Perhaps the best way to think about this is to imagine a standard dual-socket server, but instead of Infinity Fabric, PCIe, or QPI connecting the two processors, you put both of them on the interposer - the connection between the two chips would be far wider and faster than anything today, and off-chip attachments could then be made to memory, PCIe, CXL, or even other sockets in other systems.

Lightmatter says their solution can manage 700 W per reticle, and the biggest 48-reticle version itself consumes under 50 W nominally, transferring data at 1 pJ/bit. Software reconfiguration also allows those 48 reticles to run different network topologies at 1 ms granularity. Again, this is something I could do a full article on.
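
To put that 1 pJ/bit in context, link power is just energy per bit multiplied by bit rate. A minimal sketch, using the bandwidth figures mentioned above as example inputs:

```python
# Link power = energy per bit x bit rate.
def link_power_watts(terabytes_per_s: float, pj_per_bit: float = 1.0) -> float:
    """Power needed to move data at the given rate and energy cost."""
    bits_per_s = terabytes_per_s * 1e12 * 8
    return bits_per_s * pj_per_bit * 1e-12

print(link_power_watts(2.0))    # 2 TB/sec at 1 pJ/bit: 16.0 W
print(link_power_watts(16.0))   # the 128 Tb/s fiber attach: 128.0 W
```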

Highlighted Talks – Biren BR100

In the world of HPC, the show had an interesting talk from a new entrant in the GPGPU space called Biren Technology. They introduced the BR100, a GPGPU capable of 2 Peta-OPs of INT8 machine learning compute, with support for CXL. The BR100 is actually a dual-chip design, but unlike the AMD MI250X, which is dual-GPU with each chip addressed individually, Biren says their solution appears to the system as one device, much like Apple's M1 Ultra. Between the two chips is a 900 GB/sec chip-to-chip interconnect.

This means a standard system has 8 BR100 packages, with 16 GPGPU dies in total. Each of the BR100 packages has 64 GB of HBM2e and 8 'B-link' connections, allowing an all-to-all topology within the system at 2.3 TB/sec of bandwidth. The Hearten Server, built with one of the Chinese OEMs, puts eight of these in a single system using the OAM form factor at 550 W each. There will be a PCIe version as well.
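
A quick topology sanity check - this is my own arithmetic on the quoted numbers, not Biren's breakdown, and it assumes the 2.3 TB/sec is the aggregate across all eight links:

```python
# My own arithmetic on the quoted numbers, not Biren's breakdown.
packages = 8
b_links = 8
aggregate_tb_s = 2.3

links_for_all_to_all = packages - 1          # one link to each peer: 7
spare = b_links - links_for_all_to_all       # 1 left over (host/spare?)
per_link_gb_s = aggregate_tb_s * 1000 / b_links
print(f"{links_for_all_to_all} links used, ~{per_link_gb_s:.0f} GB/s per link")
```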

Each chip is split into 16 vector cores and 1 tensor core, with the tensor core being a 2D systolic array equivalent to a 64-by-64 matrix multiply. Biren is currently working with Tencent on the Biren SUPA programming model, with C++ support. Built at TSMC with CoWoS packaging, the chip targets a 1 GHz frequency, with a focus on training.
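
To build some intuition for what a single 64-by-64 systolic array provides per clock - an illustrative calculation only, as Biren didn't break down how the 2 Peta-OPs headline is composed:

```python
# What one 64x64 systolic array delivers per clock, counting a
# multiply-accumulate as two ops.
rows = cols = 64
macs_per_cycle = rows * cols           # 4,096
ops_per_cycle = macs_per_cycle * 2     # 8,192
freq_hz = 1e9                          # the 1 GHz target

print(f"~{ops_per_cycle * freq_hz / 1e12:.1f} TOPS per array")  # ~8.2
```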

Highlighted Talks – NODAR

One unexpected talk of the show came from a company called NODAR. NODAR is nominally a software company, but it works in the field of generating 3D point clouds for use in autonomous vehicles. The concept behind a 3D point cloud is similar to that of our eyes - each eye takes an image, and the brain works out how far away things are through depth perception, using just those two images and a well-trained brain.

This is often called stereo vision, and humans have fairly narrow-baseline stereo vision. There's a trade-off: the closer together your eyes are, the lower the limit on how far away you can judge distance, but then you don't want a 15-meter-wide face either. There's also a secondary issue - depth perception requires those two eyes to be fairly stable, and only once you're used to walking, running, and driving can you tolerate the vibration between the two eyes and stay accurate. Well, it turns out cars have a similar problem. You can install a stereo camera on a car, and it will create a depth map or 3D point cloud based on what it can see. To get the best data, these stereo vehicle cameras need to be as far apart as possible - this allows them to see large distances, which is important when travelling at 70 miles per hour (or more in Germany).
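
NODAR didn't present it this way, but the geometry underneath all of this is the classic stereo depth relation, depth = focal length × baseline / disparity, which is exactly why a wider baseline sees further. A toy sketch with a hypothetical focal length:

```python
# Stereo depth: z = f * B / d, with f = focal length (pixels),
# B = baseline (metres), d = disparity (pixels). The smallest
# resolvable disparity caps how far out depth can be measured.
def max_range_m(focal_px: float, baseline_m: float,
                min_disparity_px: float = 1.0) -> float:
    """Farthest distance at which depth is still resolvable."""
    return focal_px * baseline_m / min_disparity_px

f_px = 2000.0                     # hypothetical lens/sensor combination
print(max_range_m(f_px, 0.06))    # eye-like 6 cm baseline:   120 m
print(max_range_m(f_px, 1.5))     # 1.5 m across a car:      3000 m
```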

The problem is that if you mount the cameras far apart, they become less tolerant of the bumps and judders of the vehicle, even with the best damping and stabilization technology. What the NODAR team is doing is applying machine learning to this issue. It's also tougher than I've made it sound here - the camera rig has six degrees of freedom (x, y, z, pitch, roll, yaw), but each camera lens has nine more (focal lengths and distortions, as no two lenses are exactly the same). This makes it a 24-dimensional optimization problem. Good luck solving that with multi-megapixel cameras at 30 to 90 fps within a small power budget!
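
For the curious, the dimension count tallies up if you assume a standard pinhole-plus-distortion lens model - the parameter labels below are my own, not NODAR's:

```python
# 6 rig-pose parameters plus 9 per lens across two lenses = 24.
rig_pose = ["x", "y", "z", "pitch", "roll", "yaw"]
per_lens = ["fx", "fy", "cx", "cy",            # focal lengths, centre
            "k1", "k2", "k3", "p1", "p2"]      # distortion coefficients

print(len(rig_pose) + 2 * len(per_lens))  # 24
```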

NODAR says it has used datasets to solve this issue - at least at 5 frames per second with 5-megapixel cameras, for around 50 to 100 watts on NVIDIA hardware. The goal for next year is to bring that down to the 20 W level for use in large drones and last-mile delivery use cases.

Quick Fire Highlights

Intel also spoke about Meteor Lake, its first EUV-enabled and chiplet-based product. Based on the images, it's confirmed that Intel has a base tile, a compute tile, a GPU tile, an SoC tile, and an IO tile. The base tile is passive, transferring power and data, and the chips are connected using a 2 GHz Foveros interface at a 36-micron bump pitch. Intel showed a diagram of the compute tile, built on Intel 4, with 6 P-cores and 8 E-cores. On a different slide, this was the 'mid-range' product, probably 45 W for mobile, and the desktop versions will simply use longer compute tiles with more cores. During the livestream I had a rant about this, which I should probably spin out into a piece.

Intel were able to say that initially their chiplet design didn't clock as high as monolithic designs, but over time they have improved the manufacturing such that it's now expected to clock higher. Arrow Lake will use a similar design but with an Intel 20A compute tile and the same 36-micron Foveros pitch, whereas Lunar Lake after that will be on 'Intel Next' with a 25-micron Foveros pitch. We're not sure if this is Foveros Omni or regular Foveros, as Intel has stated both will go down to 25 microns.
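
Connection density scales with the inverse square of the bump pitch, so the jump from 36 to 25 microns is worth roughly double the connections per unit area:

```python
# Connection density scales with the inverse square of bump pitch.
old_pitch_um = 36
new_pitch_um = 25

gain = (old_pitch_um / new_pitch_um) ** 2
print(f"{gain:.2f}x connections per unit area")  # ~2.07x
```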

Cerebras, after showing its wafer-scale engine for a number of years now, went into detail on the core architecture. Half of each of its 850,000 cores is SRAM! Each core draws around 30 mW, which works out to roughly 25 kW per wafer-scale engine. Sean Lie of Cerebras also spoke about how their compiler maps networks to the chip, and how execution on most chips is limited to matrix-matrix operations, whereas thanks to its scale Cerebras can enable full vector and scalar compute at all BLAS levels. The chip has also been used for stencil compute, and the company now has over 25 paying customers it can publicly mention.
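
The wafer power figure falls straight out of the per-core number:

```python
# 850,000 cores at ~30 mW each.
cores = 850_000
watts_per_core = 0.030

print(f"~{cores * watts_per_core / 1000:.1f} kW per wafer")  # ~25.5 kW
```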

NVIDIA had a number of talks this year, firstly on its Orin chip enabling next-gen autonomous vehicles as an upgrade to Xavier, but it also dedicated a talk to its new Arm-based Grace CPU. With 72 cores and a custom scalable data fabric, Grace is a pure scale-out play, as it embeds NVLink directly on the chip, allowing for use in Grace+Hopper NVSwitch designs. Two Grace chips can be connected with a 900 GB/sec link, with quad-socket support. Each Grace can be paired with 512 gigabytes of LPDDR5X, for almost 550 gigabytes per second of bandwidth. Grace is designed to be both a compute partner and a memory-capacity partner to Hopper, allowing larger datasets to be trained.

Intel spoke about Ponte Vecchio - not much new, but they did go into cache bandwidth numbers, as well as confirming that the RAMBO cache is effectively a large pool of extra L2 cache for the chip. The L1 and L2 support multiple modes, such as write-through, write-back, write-streaming, and uncached. One of the big targets here was supporting AI models with variable parallelism, and Intel had a few workload performance results ready to show.

CXL was a big part of the show this year, as I expect it will be next year. The first tutorial session was a complete run-through of CXL 2.0 and CXL 3.0, explaining a lot about the memory models involved and how they work. Samsung were at the event talking about their CXL Memory Expander, which I've covered in a video on my YouTube channel previously.

Arm also came to the show with a talk on its Morello SoC, used to validate new security designs for future silicon. Arm cited that 70% of security issues on modern chips are memory related, and this SoC implements the CHERI architecture, adding new capability metadata alongside regular data such that normal programming operations can be verified as legal or flagged as compromised/illegal. The SoC was built on 7nm and runs at 2.5 GHz, but it is more about function than performance - a proof of concept rather than a development platform. According to Arm, those using the platform have seen around a 73% reduction in memory issues at the expense of 10% core performance (less than 10% if you consider the cache too). Learnings from the platform will be implemented in future Arm chip designs.
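
To give a flavour of the CHERI idea in code - this is purely a conceptual sketch in Python, whereas real CHERI enforces these checks in hardware with tagged capabilities - every pointer carries bounds and permission metadata that gets checked on every access:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """A pointer plus metadata: the region it may touch and how."""
    base: int
    length: int
    writable: bool

def load(mem: bytearray, cap: Capability, offset: int) -> int:
    if not 0 <= offset < cap.length:
        raise MemoryError("capability bounds violated")  # trap, don't corrupt
    return mem[cap.base + offset]

def store(mem: bytearray, cap: Capability, offset: int, value: int) -> None:
    if not cap.writable:
        raise MemoryError("capability lacks write permission")
    if not 0 <= offset < cap.length:
        raise MemoryError("capability bounds violated")
    mem[cap.base + offset] = value

mem = bytearray(64)
cap = Capability(base=16, length=8, writable=False)
print(load(mem, cap, 4))       # in bounds: fine
try:
    load(mem, cap, 12)         # classic overread: trapped, not silently read
except MemoryError as err:
    print(err)
```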

AMD was present, but didn't present anything new. One talk on the Instinct MI200-series GPUs was a refresh of what we already knew, and the talk on the Ryzen 6000 Mobile SoC likewise covered everything we've seen before. Perhaps the most contentious part of the event was AMD's Ryzen 6000 Q&A, where the speaker absolutely refused to answer any questions from the audience. The questions all related to security and the integration of elements such as Pluton, and for whatever reason, rather than explain why he wouldn't answer, he simply stated they were 'outside of the scope of the presentation', despite the topic being part of one of the slides. At one point he said something like 'as agreed, I won't answer questions on this', to which the session chair just nodded - I'm not sure who he agreed that with, perhaps the Hot Chips organizers, but the audience certainly wasn't informed, and it was clearly a point of contention given that every question was deflected that way. I'm really surprised he went into that talk with those sorts of answers to those questions. It was a really poor showing, and the event suffered a little because of it.

I should also give notable mentions to good talks from Untether AI (a 1,000+ core RISC-V inference chip), Ranovus (silicon photonics), and Juniper (chiplet switch ASICs), all of which were very enjoyable to listen to.

Final Thoughts

In the end, both in my own view and that of a few peers I've spoken to, this was an average year for Hot Chips, boosted by the presence of Tesla Dojo. Normally at Hot Chips we'd get a big CPU or GPU architecture disclosure, but this year we didn't really have that - no server x86 CPUs, and no consumer x86 CPUs or GPUs we didn't already know about. NVIDIA Grace was more of an SoC overview than an architecture deep dive, so we missed the big one-two punch. The Intel keynote underdelivered, if I'm honest, whereas previous keynotes from the likes of TSMC, or even Raja's for Intel, were better. Given the cycle we're in, with both Intel and AMD about to launch next-gen CPUs and the big three launching GPUs, I suspect the cycles for next year won't align either.

This ultimately brings up the question - did you go to Hot Chips? What do you want to see this time next year?

As a minimum, I can already see one big topic worth addressing next year: there are now enough silicon photonics companies with silicon in the works that it's getting exciting - a session with Ayar Labs, Lightmatter, Lightelligence, Intel, or something along those lines would be good. Not the star of the show, but a good session. Perhaps we could also get some of the new Arm server chips from Ampere, Alibaba, Tencent, and Amazon in for a session. I wouldn't mind hearing about AmpereOne, Yitian, perhaps a Graviton 4, or a more detailed insight into Grace as well. There might even be a chip or two from others not yet announced worth sticking in there. If you're watching, start getting the sign-off to present at Hot Chips 2023 today.
