No and no.
TPUs are small-ish FP16 ALUs. The wiring probably costs more die area than the ALUs themselves.
But FP64 eats a LOT of die area.
Looking deeper into this, the additional L2 cache is almost nothing in terms of die area, but the TPUs look like they consume close to the same amount of die area as the FP64 cores.
Each TPU has to multiply two 4x4 FP16 matrices and add the result to another 4x4 matrix (which can be FP32). That works out to 64 FP16 multiplications and 64 FP additions per cycle: each of the 16 output elements needs 4 products (64 multiplies total), 3 adds to reduce those products (48 FP16 adds), and 1 add into the accumulator (16 adds, FP16 or FP32). That cannot be done with a trivial amount of die area.
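A quick sketch of that operation count (this just tallies the scalar ops in one 4x4 fused multiply-accumulate, D = A*B + C; the loop structure is illustrative, not how the hardware is wired):

```python
# Count scalar operations in one 4x4 FMA: D = A @ B + C,
# the per-cycle work of a single TPU as described above.
N = 4
muls = 0
adds = 0
for i in range(N):
    for j in range(N):
        muls += N      # 4 products a[i][k] * b[k][j]
        adds += N - 1  # 3 adds to reduce the 4 products (FP16)
        adds += 1      # 1 add into the accumulator C (FP16 or FP32)
print(muls, adds)  # 64 64
```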
Given that the number of gates required for a multiplier scales with the square of the precision (from what I remember, high-speed logic arrays are used in FPUs), a TPU doing 64 FP16 multiplications should be about 4x bigger than an FP64 core: FP64 is 4x the width of FP16, so one FP64 multiplier costs roughly 16x one FP16 multiplier, and 64/16 = 4. The V100 has 4x as many FP64 cores as TPUs (32 vs 8 per SM), so the area hit from the TPUs should be close to the area hit from the FP64 cores.
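Spelling out that ratio, under the stated assumption that multiplier area scales with the square of the operand width (a rough model, ignoring the adder trees and control logic):

```python
# Assumption from the argument above: multiplier area ~ (operand width)^2.
fp64_vs_fp16_area = (64 / 16) ** 2   # 64-bit is 4x the width -> ~16x the area

# One TPU does 64 FP16 multiplies -> equivalent to ~4 FP64 multipliers.
tpu_in_fp64_equivalents = 64 / fp64_vs_fp16_area

# V100 SM: 32 FP64 cores vs 8 TPUs.
fp64_cores_per_sm = 32
tpus_per_sm = 8
ratio = (tpus_per_sm * tpu_in_fp64_equivalents) / fp64_cores_per_sm
print(ratio)  # 1.0 -> TPUs and FP64 cores cost roughly equal area per SM
```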
An approximation of the die hit from adding TPUs can also be derived by comparing the area per shader module of the GP100 with that of the GV100, while accounting for the increased density of 12nm. This approximation yields 103mm^2 of GV100 die dedicated to the TPUs (815mm^2 - 5376 cores/3840 cores * 610mm^2 / 1.2x density = 103mm^2).

The die hit for FP64 cores can be approximated by comparing the die sizes of GP100 and GP102. This yields that the 1920 FP64 cores on GP100 consume 139mm^2 of die (610mm^2 - 471mm^2 = 139mm^2), or 0.072mm^2 per FP64 core. On 12nm this translates to 0.060mm^2 per FP64 core, so the hit for FP64 on GV100 should be about 154mm^2 (2560 cores * 0.060mm^2 = 154mm^2).

So it looks to me like the TPUs are not smallish at all; they take up close to the same amount of die area as the FP64 cores.
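The same back-of-the-envelope estimate, written out (die sizes and core counts are the published figures used above; the 1.2x density gain for 16nm -> 12nm is an assumption):

```python
# Published die sizes (mm^2) and FP32 core counts used in the estimate above.
gv100_die, gp100_die, gp102_die = 815.0, 610.0, 471.0
gv100_cores, gp100_cores = 5376, 3840
density_gain = 1.2  # assumed 16nm -> 12nm density improvement

# TPU area: GV100 die minus a core-count-scaled, density-adjusted GP100.
tpu_area = gv100_die - (gv100_cores / gp100_cores) * gp100_die / density_gain
print(round(tpu_area))  # ~103 mm^2

# FP64 area: GP100 (with FP64) vs GP102 (without), per core, rescaled to 12nm.
fp64_per_core_16nm = (gp100_die - gp102_die) / 1920   # ~0.072 mm^2
fp64_per_core_12nm = fp64_per_core_16nm / density_gain  # ~0.060 mm^2
gv100_fp64_area = 2560 * fp64_per_core_12nm
print(round(gv100_fp64_area))  # ~154 mm^2
```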