Yes, that's true, but when Intel started using DDR4 they abandoned the L4 cache. I never said it doesn't have its uses, I just said that an L4 cache is usually there to mitigate a bandwidth problem.
Yes, it helps bandwidth and is primarily used for graphics, but it does reduce latency compared to going out to main memory... that's why it would have to be big: if your working set is much larger than the L4 and you constantly have to check the L4 before missing out to main memory, the extra lookup actually increases latency.
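A rough average-memory-access-time sketch shows the trade-off; the latency figures and hit rates below are illustrative assumptions, not measurements from any particular chip:

```python
# Rough AMAT sketch: a serial-lookup L4 only pays off if its hit rate is
# high enough to cover the extra probe cost on every miss.
L4_LOOKUP_NS = 40.0    # assumed L4 (eDRAM-style) access latency
DRAM_NS = 90.0         # assumed main-memory latency

def amat_with_l4(hit_rate):
    # Every access probes the L4; misses then pay DRAM latency on top.
    return L4_LOOKUP_NS + (1.0 - hit_rate) * DRAM_NS

print(amat_with_l4(0.9))   # ~49 ns, better than DRAM alone
print(amat_with_l4(0.2))   # ~112 ns, worse than just going to DRAM (90 ns)
```

So a small L4 with a low hit rate only adds a step to the path; it has to be big enough to catch most of the traffic before it helps.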
The reason it is not included more often is that it is expensive, uses power, and may affect clocking potential... so on desktop it is only used for iGPU graphics; otherwise the die area is better spent on more cores, or saved to keep the die small and the clocks high.
In enterprise it has more potential, since there you are looking at throughput and efficiency: lots of low-clocked cores with wide vectors and plenty of memory-sensitive workloads, whether latency- or bandwidth-bound. As we get better multi-chip techniques, higher-density nodes and less per-core scaling, we will see big caches come into play.
The reason it has not been used in enterprise and HPC yet is that it has been more beneficial to spend that die area on more execution units, and there has been enough bandwidth to get the job done.
Adding more than 32 cores to a quad-channel DDR4 bus, with some dies not having direct IMC access, is useless except in corner cases; per-core bandwidth just gets too thin, rough numbers below.
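Back-of-envelope numbers, assuming DDR4-2666 across four channels at theoretical peak (actual speeds and efficiency vary):

```python
# Per-core bandwidth on a quad-channel DDR4 socket.
# Transfer rate and core counts are assumptions for illustration.
CHANNELS = 4
MT_PER_S = 2666e6        # DDR4-2666 transfers per second
BYTES_PER_TRANSFER = 8   # 64-bit channel width

peak_gbps = CHANNELS * MT_PER_S * BYTES_PER_TRANSFER / 1e9  # ~85 GB/s total

for cores in (16, 32, 64):
    print(f"{cores} cores -> ~{peak_gbps / cores:.1f} GB/s per core")
# 16 cores -> ~5.3, 32 -> ~2.7, 64 -> ~1.3 GB/s per core
```

Past 32 cores you're down to roughly a gigabyte and a bit per core, which is where bandwidth-hungry workloads start starving.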
A new topology and better IF will help, as will DDR5... but going much beyond 32 cores on that socket won't be effective without some large L4 cache showing up in the future.
Just my take.