4 Comments
Tanj

Really well done post! You take good photos.

On the network blade, the 6 connectors are not the same size as the 4 connectors on the GPU blade. Also, there are 6 switch blades for 18 GPU blades? That suggests the connectors are a different density: double the density and 1.5x the number of connectors?

If you ask Perplexity "in LLMs, what are the patterns that show different scales can be applied to groups of weights and activations? What research papers have reported on how to use microscaling effectively?", look at the various papers it finds to see the basis for microscaling.
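Roughly, microscaling means one shared scale per small group of values rather than one scale per tensor. Here is a minimal NumPy sketch of that idea; the block size of 32, the 8-bit element width, and the power-of-two shared scale are illustrative assumptions, not the exact MX spec:

```python
import numpy as np

def mx_quantize(x, block_size=32, bits=8):
    """Toy microscaling quantizer: one shared power-of-two scale per block
    of `block_size` values, each value stored as a low-bit integer.
    Block size, bit width, and the power-of-two scale are illustrative."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for 8-bit
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)
    exp = np.ceil(np.log2(amax / qmax))              # shared exponent per block
    scale = 2.0 ** exp
    q = np.clip(np.round(blocks / scale), -qmax, qmax).astype(np.int8)
    return q, scale, pad

def mx_dequantize(q, scale, pad):
    out = (q.astype(np.float32) * scale).reshape(-1)
    return out[:len(out) - pad] if pad else out

w = np.random.randn(100).astype(np.float32) * 0.05
q, s, pad = mx_quantize(w)
print("max abs error:", np.abs(w - mx_dequantize(q, s, pad)).max())
```

The point the papers make is that a per-group scale tracks local dynamic range much better than one scale for the whole tensor, which is what lets such narrow element formats work.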

amnon izhar

The GPU sleds are slimmer and only show 4 diff pairs per row, whereas the switch sleds are thicker and can support 8 diff pairs per row. The 1.5x multiplier is because the switch blade shows 3 NVLink switches per blade, versus the current NVL72, which supports 2 NVLink switches per sled. A quick back-of-envelope check on those numbers is below.
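Using only the figures from these two comments, and assuming (my assumption) the same number of rows per connector on both sides, the totals balance:

```python
# Back-of-envelope check of the connector accounting from the two comments above.
# Blade counts, connector counts, and diff pairs per row come from the comments;
# equal rows per connector on both sides is an assumption for illustration.

gpu_blades, gpu_connectors, gpu_pairs_per_row = 18, 4, 4
switch_blades, switch_connectors, switch_pairs_per_row = 6, 6, 8

gpu_side = gpu_blades * gpu_connectors * gpu_pairs_per_row              # 288 pairs per connector row
switch_side = switch_blades * switch_connectors * switch_pairs_per_row  # 288 pairs per connector row

print(gpu_side, switch_side, gpu_side == switch_side)  # 288 288 True
```

So fewer, denser connectors on the switch blades can carry the same number of differential pairs as the GPU blades, which fits the 6-vs-18 blade ratio.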

Peter W.

+1 on this being a great writeup!

My comment is more about Nvidia getting into the inferencing and hosting business themselves: Jensen Huang has already alluded to hyperscalers such as MS (Azure) and AWS as "competitors," and not only as clients for Nvidia's hardware. While I don't think that Nvidia will try to be the next OpenAI or Grok etc. (all still money-losing businesses), getting into the AI hardware hosting business makes a lot of sense for them. Why leave the mark-ups that hyperscalers charge their customers for using Nvidia hardware on the table?

Conversely, I would bet that this prospect is already intensifying efforts by those players to come up with the next generation of their own custom hardware (ASICs) that can inference roughly as well as Nvidia's B200 and B300. But, as of today, none of the players in the US has managed that feat yet, and we (okay, I) don't know just how capable the Chinese AI accelerators really are.

IMHO, the biggest potential challenge to Nvidia's dominance and bottom line would be some kind of joint effort by, for example, MS and AWS putting their heads together, maybe joined by Google or Meta. Of course, the probability of that happening is more in the "when hell freezes over" range, but, given how much they resent and fear their dependence on Nvidia, who knows?

And, maybe, maybe, AMD's next generation Instincts will surprise us, too, but it doesn't look like it right now.

Sherman

> As the response to your ChatGPT question is being built out word-by-word, the model needs to reference all the previous words in the conversation to decide what comes next. It doesn’t recompute everything from scratch; instead, it looks up cached intermediate results (the key and value vectors) from earlier in the conversation. These lookups involve many small reads from random places each time a new word is generated.

No? A request will only pull the contiguous KV cache for its sequence once (assuming it isn't already in HBM), and it stays resident until its generation stops.
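For what it's worth, here is a minimal sketch of the mechanism both the post and this comment are describing: during decode, each new token appends one key/value pair and then attends over the whole cached sequence, rather than recomputing K and V for every previous token. Shapes, the single head, and the random weights are illustrative assumptions, not any particular model:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []   # grows by one entry per generated token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # append this token's key
    v_cache.append(x @ Wv)   # append this token's value
    K = np.stack(k_cache)    # (seq_len, d) -- read back from the cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V       # attention output for the new token

for t in range(5):
    x = rng.standard_normal(d)   # stand-in for the token's hidden state
    decode_step(x)
    print(f"step {t}: attended over {len(k_cache)} cached positions")
```

Whether those cache reads look like "many small random reads" or one contiguous pull per request is a question of how the cache is laid out and paged in memory, which is exactly the point of disagreement here.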
