An in-depth look at Google’s first Tensor Processing Unit (TPU) (Developed in 2015-2016)

Bacon1 · May 14, 2017

https://cloud.google.com/blog/big-d...k-at-googles-first-tensor-processing-unit-tpu

The road to TPUs
Although Google considered building an Application-Specific Integrated Circuit (ASIC) for neural networks as early as 2006, the situation became urgent in 2013. That’s when we realized that the fast-growing computational demands of neural networks could require us to double the number of data centers we operate.

Usually, ASIC development takes several years. In the case of the TPU, however, we designed, verified, built and deployed the processor to our data centers in just 15 months. Norm Jouppi, the tech lead for the TPU project (also one of the principal architects of the MIPS processor) described the sprint this way:

“We did a very fast chip design. It was really quite remarkable. We started shipping the first silicon with no bug fixes or mask changes. Considering we were hiring the team as we were building the chip, then hiring RTL (circuitry design) people and rushing to hire design verification people, it was hectic.”

(from First in-depth look at Google's TPU architecture, The Next Platform)

The TPU ASIC is built on a 28nm process, runs at 700MHz and consumes 40W when running. Because we needed to deploy the TPU to Google's existing servers as fast as possible, we chose to package the processor as an external accelerator card that fits into an SATA hard disk slot for drop-in installation. The TPU is connected to its host via a PCIe Gen3 x16 bus that provides 12.5GB/s of effective bandwidth.

Pretty in depth article, but with all the buzz around GV100 figured you might want to see what's already out there for those specific tasks.

itsmydamnation · May 14, 2017

Thanks for that, interesting that they are only using int8. Do we know if GV100 can do packed 4x int8? We know Vega can. It would be interesting to know if int8 is really good enough for most inputs/weights.

itsmydamnation · May 14, 2017

I'll just add the post I made in the comment section of that link.

The question was:

Since you have access to Nvidia Tesla V100 cards, could you give us a comparison in efficiency of utilizing solely the Tensor Cores on a GV100 vs your 28nm and 14nm TPUs?

I would really like to know how your TPUs stack up against Nvidia's latest and greatest.

My answer:

I'll give your question a go.

Given that GV100 has 672 tensor cores that are 4x4, you end up with 21504 ALUs. We don't know the GV100 clock speed, but if we assume it is around GP100's (~1400MHz) you end up with 60 teraops per second. But it's also working with FP16 inputs, so you're going to get more resolution in the outputs or more range (how much FP16/32 matters over int8 I have no idea).

I don't think the idea of GV100 is to go head to head in matrix multiply, but to have good enough dedicated hardware for it so you're not wasting clock cycles doing this work with your more flexible FP32/64 and INT32 units.

For example, Vega is supposed to be able to do 4x int8 packed math, so you would end up with 4096x4x16x4x1400 (compute unit x vector unit x SIMD width x packed int8 x clock) = 2936 teraops a sec (peak)!!! But that uses all 4096 compute units, while GV100 still has 5376 "CUDA cores" outside of the tensor cores to do other "stuff".
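As a rough check, the peak figures above can be reproduced with a few lines of Python. This is only a sketch: the function name `peak_tops` is made up for illustration, and the factor of 2 ops per lane per clock (counting a multiply-add as two operations) is an assumption that the post does not state but that recovers its 60 and 2936 numbers. The last case rests on a separate assumption: Vega 10 is generally described as 64 compute units holding 4096 stream processors in total, so it uses 64 as the first factor instead of 4096.

```python
def peak_tops(lanes, clock_mhz, ops_per_lane_per_clock=2):
    """Peak tera-operations per second for `lanes` parallel ALU lanes.

    ops_per_lane_per_clock=2 is an assumption: one multiply-add counted as 2 ops.
    """
    return lanes * ops_per_lane_per_clock * clock_mhz * 1e6 / 1e12


# GV100 as described in the post: 21504 ALUs at an assumed ~1400 MHz.
gv100_tops = peak_tops(lanes=21504, clock_mhz=1400)

# Vega exactly as written in the post: 4096 x 4 x 16 x 4 lanes.
vega_as_posted_tops = peak_tops(lanes=4096 * 4 * 16 * 4, clock_mhz=1400)

# Vega with 64 compute units as the first factor (assumption, see above).
vega_64cu_tops = peak_tops(lanes=64 * 4 * 16 * 4, clock_mhz=1400)

print(f"GV100 tensor cores:  {gv100_tops:.1f} TOPS")
print(f"Vega as posted:      {vega_as_posted_tops:.1f} TOPS")
print(f"Vega with 64 CUs:    {vega_64cu_tops:.1f} TOPS")
```

Under the 64-compute-unit reading the Vega peak drops by a factor of 64, which puts it in the same range as the GV100 estimate instead of far above it.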

William Gaatjes · May 14, 2017

Interesting article. Thank you.

Headfoot · May 15, 2017

Very interesting tidbit from the abstract of the paper:

Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view

In other words, there is a lot of untapped potential in a Rev 2 if it gets more bandwidth.

antihelten · May 18, 2017

Google just announced the next version of their TPU, TPU 2.0 or Cloud TPU.

Performance-wise, each board is capable of 180 TFLOPS and contains 4 chips (so 45 TFLOPS per chip). It is unclear whether this is FP16 or FP32. The previous generation was capable of 92 TOPS using INT8. Given that TPU 2.0 runs FP16 or FP32 and TPU 1.0 runs INT8, a direct comparison cannot really be made, but seeing as a 1:2 ratio between INT8 and FP16 is often the norm, one could argue that a theoretical TPU 2.0 chip capable of running INT8 at said ratio would be capable of 90 TOPS if its listed performance is FP16 and 180 TOPS if it's FP32. So either the same as TPU 1.0 or double that of TPU 1.0.
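The INT8-equivalent estimate above is simple enough to write down as a sketch. The rule that each halving of operand width doubles throughput is the post's assumption, not a published TPU 2.0 figure, and `int8_equivalent_tops` is a hypothetical helper name.

```python
def int8_equivalent_tops(tflops, precision):
    """Scale a floating-point rating to a hypothetical INT8 rating.

    Assumption from the post: throughput doubles each time the operand
    width is halved (FP32 -> FP16 -> INT8).
    """
    halvings = {"fp16": 1, "fp32": 2}[precision]
    return tflops * 2 ** halvings


board_tflops = 180
chips_per_board = 4
per_chip_tflops = board_tflops / chips_per_board  # 45 TFLOPS per chip

tpu1_int8_tops = 92  # TPU 1.0 figure quoted in the post

if_fp16 = int8_equivalent_tops(per_chip_tflops, "fp16")
if_fp32 = int8_equivalent_tops(per_chip_tflops, "fp32")

print(f"Per chip: {per_chip_tflops} TFLOPS")
print(f"INT8-equivalent if rating is FP16: {if_fp16} TOPS")
print(f"INT8-equivalent if rating is FP32: {if_fp32} TOPS")
print(f"TPU 1.0: {tpu1_int8_tops} TOPS")
```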

GV100 is 120 TFLOPS using FP16, but running the calculations at FP32, so all in all they are close enough to each other that it will probably come down to efficiency more than anything else (TPU 1.0 was 40 watts, so TPU 2.0 could possibly be 160 watts per board with 4 chips per board; GV100 is 300W).

It is also worth noting that you can't actually buy TPU 2.0, so the only area where it competes directly with GV100 is via the cloud (Google Cloud Compute and Nvidia GPU Cloud respectively).

xpea · May 18, 2017

Well, looking at the massive heat sinks, I highly doubt Google TPU2 is still 40W. My guess is around 80W, which is very poor efficiency for a dedicated chip of this performance (45 TFLOPS).

for comparison, the first TPU rated at 40W looks like that:

antihelten · May 18, 2017

xpea said:
Well, looking at the massive heat sinks, I highly doubt Google TPU2 is still 40W. My guess is around 80W, which is very poor efficiency for a dedicated chip of this performance (45 TFLOPS).

for comparison, the first TPU rated at 40W looks like that:

Quite possible, I really have no idea what the wattage is. Either way, though, I think it's a bit misleading to say that 45 TFLOPS at 80W is very poor efficiency, seeing as it would still be comparable to GV100 (120 TFLOPS at 300W), and equal to or better than TPU 1.0 (92 INT8 TOPS at 40W, assuming that you treat 2 INT8 ops as equal to 1 FP16 op).
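To put numbers on that comparison, here is a small sketch that converts everything to FP16-equivalent TFLOPS per watt. The 80W figure for TPU 2.0 is xpea's guess, not a published spec. The conversion of 2 INT8 ops to 1 FP16 op is the assumption stated above. The FP32 reading of the 45 TFLOPS figure is included only because the precision is unknown.

```python
def fp16_tflops_per_watt(throughput, watts, fp16_equiv_factor=1.0):
    """FP16-equivalent throughput per watt.

    fp16_equiv_factor rescales the rating to FP16 terms (assumption):
    0.5 for an INT8 rating, 1.0 for FP16, 2.0 for FP32.
    """
    return throughput * fp16_equiv_factor / watts


# 80W is a guess from the thread, not a published number.
tpu2_if_fp16 = fp16_tflops_per_watt(45, 80, 1.0)
tpu2_if_fp32 = fp16_tflops_per_watt(45, 80, 2.0)
gv100 = fp16_tflops_per_watt(120, 300, 1.0)
tpu1 = fp16_tflops_per_watt(92, 40, 0.5)

for name, value in [
    ("TPU 2.0 (rating read as FP16, 80W guess)", tpu2_if_fp16),
    ("TPU 2.0 (rating read as FP32, 80W guess)", tpu2_if_fp32),
    ("GV100 (120 TFLOPS, 300W)", gv100),
    ("TPU 1.0 (92 INT8 TOPS, 40W)", tpu1),
]:
    print(f"{name}: {value:.2f} FP16-equivalent TFLOPS/W")
```

Under these assumptions TPU 2.0 at 80W comes out ahead of GV100 on either reading, but it lands close to TPU 1.0 only if the 45 TFLOPS rating is FP32.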

xpea · May 18, 2017

antihelten said:
Quite possible, I really have no idea what the wattage is. Either way, though, I think it's a bit misleading to say that 45 TFLOPS at 80W is very poor efficiency, seeing as it would still be comparable to GV100 (120 TFLOPS at 300W), and equal to or better than TPU 1.0 (92 INT8 TOPS at 40W, assuming that you treat 2 INT8 ops as equal to 1 FP16 op).

No, it's very poor when compared to GV100, because the latter's 300W includes the rasterizer, geometry engine, texture units, FP64, FP32, INT32 and hardware scheduler, whereas TPU2 only does FP16 matrix math. In other words, GV100 would be far below 300W if it had only tensor cores...

antihelten · May 18, 2017

xpea said:
No, it's very poor when compared to GV100, because the latter's 300W includes the rasterizer, geometry engine, texture units, FP64, FP32, INT32 and hardware scheduler, whereas TPU2 only does FP16 matrix math. In other words, GV100 would be far below 300W if it had only tensor cores...

So what if it includes all of those things? They won't be active when running tensor-type workloads, and thus won't contribute to the power usage (except for the hardware scheduler, which would possibly still be active with tensor workloads). And you have no idea how much power GV100 will be using when only running tensor operations (unless I missed some announcement from Nvidia on this).
