I thought the increase in die-to-die bandwidth in Threadripper came from repurposing the inter-die links that would have gone to dies 3 and 4 in EPYC: those links are instead routed between dies 1 and 2, giving a major increase in bandwidth between the two remaining dies.
As for the thread topic, I still can't see AMD tearing up much of the processor socket to add more CCXs to the Zeppelin dies in Zen+ or Zen 2. In 2P EPYC, there are direct CCX-to-CCX links between the two processors, which means that if the number of CCX units per Zeppelin die changes, those links will have to increase in number. That would require a redo of the processor socket, which AMD specifically said it would not do for several years. This leads me to believe one of two things will happen:
1) The number of cores per die stays the same, and AMD focuses on improving IPC with expanded caches, process improvements, and tweaks to performance-critical circuits in the cores.
2) The number of cores increases by expanding the number of cores per CCX. This could be accomplished by abandoning the direct-connect scheme that currently links the individual cores and replacing it with an integrated data switch in each CCX. You could then expand to 6 cores, route the inter-core links to the data switch, and let it do the heavy lifting of establishing inter-core connections and forwarding the data. That unfortunately adds a bit of inter-core latency, but it is still much faster than leaving the CCX. We are also talking about something running on a faster-performing process node, meaning that if the current layout were retained, overall CCX latency measured in ns would drop from current levels. A 6-core switched CCX could therefore end up with average latency, in ns, similar to the existing 4-core CCX. Expanding the CCX is not insurmountable: as long as the cores themselves are left alone, save for process-node tweaks and critical-path cleanup, development costs can be kept down.
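A quick back-of-the-envelope sketch of why a switch helps here (this is just combinatorics, not anything AMD has disclosed): a fully direct-connected CCX needs a link for every pair of cores, while a switched CCX needs only one port per core, so the wiring cost of the mesh grows quadratically as cores are added.

```python
def mesh_links(cores: int) -> int:
    """Point-to-point links needed to directly connect every core to every other."""
    return cores * (cores - 1) // 2

def switch_ports(cores: int) -> int:
    """Ports needed when every core instead hangs off one central data switch."""
    return cores

for n in (4, 6, 8):
    print(f"{n} cores: {mesh_links(n)} mesh links vs {switch_ports(n)} switch ports")
```

At 4 cores the direct mesh is cheap (6 links), but at 6 cores it already needs 15 links versus 6 switch ports, which is the wiring argument for moving to a switch when growing the CCX.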
There is one other possibility. Instead of going to 6-core CCXs, or a wholesale relayout of the existing die, they could use the 7nm node to go with 4 CCX units per die, with the CCXs paired with respect to die-to-die and processor-to-processor communication links. Each pair of CCX units would be joined by a multi-port high-speed data switch that handles L3 cache access, the CCXs' exterior data links, etc. To the rest of the die, a pair would look and act just like the existing single CCX (save for some core-addressing control logic). The CCX units themselves stay the same apart from process tweaks, addressing-logic changes, etc. Intra-CCX latency stays consistent, while inter-CCX latency bifurcates into paired-CCX latency and exterior-CCX latency. NUMA code would need a little tweaking, but by then more software should be NUMA-aware. This would essentially turn an existing Ryzen processor into a low-I/O Threadripper, and turn Threadripper into a faster 1P EPYC with lower memory bandwidth. EPYC itself would, I think, be a mess of inter-die bottlenecks unless they can find a way to make a major improvement in inter-die communication bandwidth.
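To make the bifurcated-latency idea concrete, here is a tiny model of the average core-to-core latency on such a die. Every latency figure below is a made-up placeholder purely for illustration; only the topology (4 CCXs of 4 cores, joined in two pairs) comes from the scenario described above.

```python
# All latency figures are hypothetical placeholders, not measured Zen numbers.
INTRA_CCX_NS = 40     # core to core inside one CCX
PAIRED_CCX_NS = 80    # through the shared switch joining a CCX pair
EXTERIOR_CCX_NS = 130 # across the die-level fabric to the other CCX pair

CORES_PER_CCX = 4
CCX_PER_DIE = 4  # arranged as two pairs

def average_remote_latency_ns() -> float:
    """Average latency from one core to a random *other* core on the die."""
    intra = CORES_PER_CCX - 1                   # 3 partners in the same CCX
    paired = CORES_PER_CCX                      # 4 partners in the paired CCX
    exterior = CORES_PER_CCX * (CCX_PER_DIE - 2)  # 8 partners in the other pair
    total = intra + paired + exterior           # 15 other cores
    return (intra * INTRA_CCX_NS
            + paired * PAIRED_CCX_NS
            + exterior * EXTERIOR_CCX_NS) / total

print(round(average_remote_latency_ns(), 1))
```

The point of the model is that only 8 of the 15 possible partners pay the exterior-fabric cost, so the pairing scheme keeps the die-wide average well below the worst-case hop even though inter-CCX latency is now split into two tiers.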