Nvidia has updated the technical brief and now lists 80 TFLOPs of FP32 per Blackwell GPU. That figure is for the HGX configuration. For comparison, H100 HGX is quoted at 60 TFLOPs versus 67 TFLOPs for the H100 SXM on its own, so the HGX number is roughly 10% lower. I will assume the same applies to Blackwell, since its figure is also quoted for HGX.
More importantly, assuming the rumored (but not yet confirmed) 8*10 structure of GB100, which makes 160 SM per chip, we can start guessing at the clock speed improvements. This is made harder by the fact that neither H100 nor A100 ever shipped with its full 144 or 128 SM enabled, so even if the rumored specs are true, we don't know how many units are active. Secondly, the full Blackwell package uses 1000W for two chips, so I guess each chip has a 500W TDP? That heavily limits possible clock speeds in a way that gaming GPUs don't yet have to worry about. I will do a sort of worst-case guess, using all 160 SM per chip in my calculations and ignoring the effect a limited TDP could have on clock speeds.
H100 HGX = 60 TFLOPs
B100 HGX = 80 TFLOPs
The difference is 33% when comparing marketed HGX spec to HGX spec. Comparing B100 HGX to the H100 SXM figure (67 TFLOPs) instead, the difference is 19%.
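For anyone who wants to check the percentages, here is a quick sketch in Python. The three inputs are the marketed FP32 figures quoted above; the helper name `pct_diff` is made up for illustration.

```python
# Marketed FP32 throughput in TFLOPs, as quoted in this post.
H100_SXM = 67.0
H100_HGX = 60.0
B100_HGX = 80.0


def pct_diff(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old` (negative if it is lower)."""
    return (new / old - 1.0) * 100.0


# H100 HGX relative to H100 SXM (the HGX "discount").
hgx_vs_sxm = pct_diff(H100_HGX, H100_SXM)
# B100 HGX relative to H100 HGX (like for like).
b100_vs_h100_hgx = pct_diff(B100_HGX, H100_HGX)
# B100 HGX relative to H100 SXM (the pessimistic comparison).
b100_vs_h100_sxm = pct_diff(B100_HGX, H100_SXM)

print(f"H100 HGX vs H100 SXM: {hgx_vs_sxm:+.1f}%")
print(f"B100 HGX vs H100 HGX: {b100_vs_h100_hgx:+.1f}%")
print(f"B100 HGX vs H100 SXM: {b100_vs_h100_sxm:+.1f}%")
```

The last line is the conservative 19% figure; the like-for-like HGX comparison gives the 33%.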
So, taking 19% as the clock speed gain, using the latest GPC:TPC speculation from Kopite7kimi, and assuming no real changes to the SM other than perhaps cache, I am guessing:
GB202 (+20%): 192 SM at 3.12 GHz max gives 153 TFLOPs vs 82 TFLOPs for the 4090 (but maybe also at 600W). A 160 SM 5090 at 3.12 GHz is 55% higher. Real performance around 40% over the 4090.
GB202 (+30%): 192 SM at 3.3 GHz max gives 162 TFLOPs vs 82 TFLOPs for the 4090 (but maybe also at 600W), which is almost 100% higher. A 160 SM 5090 at 3.3 GHz would be roughly 64% higher. Real performance around 60% over the 4090. (I think GB202 will have trouble scaling in general.)
GB203 (+20%): 84 SM at 3.12 GHz gives 67 TFLOPs vs 52 TFLOPs for the RTX 4080 Super. That is 30% higher, or around 25% higher real performance, which lands at about 4090 -5%. An 80 SM SKU would be firmly below the 4090, in between the 7900 XTX and the 4090.
GB203 (+30%): 84 SM at 3.3 GHz gives about 71 TFLOPs vs 52 TFLOPs for the 4080S. That is roughly 35% higher, or around 30% higher real performance, which lands at about the 4090. An 80 SM SKU would hover closer to the 4090 than to the 7900 XTX.
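A rough sketch of the arithmetic behind these TFLOPs figures. It assumes 128 FP32 lanes per SM (the Ada value, carried over to Blackwell as an assumption in line with "no real changes to the SM") and 2 FLOPs per lane per clock for a fused multiply-add. The 4090 and 4080 Super baselines use their official SM counts and boost clocks. The function and variable names are made up for illustration, and the Blackwell SM counts and clocks are the speculative ones from this post.

```python
FP32_LANES_PER_SM = 128       # Ada value; ASSUMED unchanged for Blackwell
FLOPS_PER_LANE_PER_CLOCK = 2  # one fused multiply-add counts as 2 FLOPs


def fp32_tflops(sm_count: int, clock_ghz: float) -> float:
    """Peak FP32 throughput in TFLOPs for a given SM count and clock."""
    gflops = sm_count * FP32_LANES_PER_SM * FLOPS_PER_LANE_PER_CLOCK * clock_ghz
    return gflops / 1000.0


# Shipping cards: official SM counts and boost clocks.
rtx_4090 = fp32_tflops(128, 2.52)
rtx_4080_super = fp32_tflops(80, 2.55)

# Speculative Blackwell configurations from this post: (label, SMs, GHz, baseline).
scenarios = [
    ("GB202 192 SM, +20%", 192, 3.12, rtx_4090),
    ("GB202 160 SM, +20%", 160, 3.12, rtx_4090),
    ("GB202 192 SM, +30%", 192, 3.30, rtx_4090),
    ("GB202 160 SM, +30%", 160, 3.30, rtx_4090),
    ("GB203 84 SM, +20%", 84, 3.12, rtx_4080_super),
    ("GB203 84 SM, +30%", 84, 3.30, rtx_4080_super),
]

for label, sms, ghz, baseline in scenarios:
    tflops = fp32_tflops(sms, ghz)
    gain = (tflops / baseline - 1.0) * 100.0
    print(f"{label}: {tflops:6.1f} TFLOPs, {gain:+.0f}% vs baseline")
```

These are peak paper numbers only. The "real performance" estimates are my own guesses on top of them, not something this formula produces.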
Clocks alone are not enough for GB203 to pull ahead of the 4090, even in the 30% higher clocks case. If Blackwell only has 144 SM active, then the clock improvement would actually be an impressive 32-45% instead, which clearly would not come from the process alone but from an architectural perf/watt increase. I have seen posts and videos by AGF and RGT about architectural improvements in the SM, the structure of the TPC, and/or the links between the TPC structures, so hopefully those translate into performance for the overall products.
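The 32-45% range can be reproduced with a small sketch, assuming peak throughput scales with SM count times clock, so that the same throughput from fewer active SMs has to come from proportionally higher clocks. The function name and its defaults are made up for illustration; 160 and 144 are the rumored full and hypothetical active SM counts.

```python
def implied_clock_uplift(assumed_uplift_pct: float,
                         assumed_sms: int = 160,
                         active_sms: int = 144) -> float:
    """Clock uplift (%) needed if only `active_sms` of `assumed_sms` are enabled.

    ASSUMPTION: throughput is proportional to SM count x clock, so the
    uplift estimated for `assumed_sms` is rescaled by assumed_sms / active_sms.
    """
    scale = (1.0 + assumed_uplift_pct / 100.0) * assumed_sms / active_sms
    return (scale - 1.0) * 100.0


low = implied_clock_uplift(19.0)   # starting from the 19% estimate
high = implied_clock_uplift(30.0)  # starting from the optimistic 30% case

print(f"Implied clock uplift with 144 SM active: {low:.0f}% to {high:.0f}%")
```

This is only a proportional rescale. It ignores the 500W per chip limit mentioned earlier, which would push real clocks the other way.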
NVIDIA Blackwell Architecture Technical Overview
NVIDIA's Blackwell GPU architecture revolutionizes AI with unparalleled performance, scalability and efficiency. Anchored by the Grace Blackwell GB200 superchip and GB200 NVL72, it boasts 30X more performance and 25X more energy efficiency over its predecessor.
resources.nvidia.com