As someone who has speculated about this a fair amount, my biggest concern is the large energy cost of moving data across longer distances and how to mitigate that issue. Why do I think it's solvable? You can already get a 4096-bit HBM2 interface to communicate with a GPU.
All that really matters are the lanes of communication.
When you look at GPU uarchs, they consist of blocks communicating with cache and with each other. Within each block there are TMUs, ROPs, and shader cores. That's why vendors often cut a few blocks out to make a lower-end SKU.
A multi-chip approach wouldn't work without an interposer providing the lanes of communication it needs.
We'll see it with Navi, in late 2018 or whenever 10/7nm arrives early on.
If we assume that the best games going forward will use DX12, we might see, as an intermediate step, the use of explicit multi-adapter techniques across the multiple dies. Maybe 1 bank of HBM2 per die, with no duplication of textures and the like, instead of every die needing access to all the memory. Worst case, a DX11 game would see each small die as sufficient on its own to deliver good frame rates. This would also allow great memory scaling.
"Small die" in this case is relative, as interposers allow the total GPU area to exceed the past practical single-die limit of 600mm^2.
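The memory-scaling point can be made concrete with a back-of-the-envelope sketch. The die count and per-stack capacity below are made-up illustrative numbers, not figures from any real product; the comparison is between classic AFR-style mirroring (SLI/CrossFire, where every GPU holds a full copy of all resources) and the DX12 explicit-multi-adapter model from the paragraph above, where each die keeps only its own share:

```python
# Hypothetical sketch: usable VRAM under two multi-die memory models.
# 4 dies with one 8 GB HBM2 stack each are assumed purely for illustration.

def usable_memory_gb(num_dies: int, gb_per_stack: int, duplicated: bool) -> int:
    """Total memory capacity visible to the application.

    duplicated=True models AFR-style mirroring (SLI/CrossFire), where
    every die holds a full copy of all resources, so capacity does not
    scale with die count. duplicated=False models DX12 explicit
    multi-adapter, where resources are partitioned and capacities sum.
    """
    if duplicated:
        return gb_per_stack           # mirrored: only one copy's worth is usable
    return num_dies * gb_per_stack    # distributed: capacities add up

print(usable_memory_gb(4, 8, duplicated=True))   # -> 8
print(usable_memory_gb(4, 8, duplicated=False))  # -> 32
```

So under these assumed numbers, the same four stacks give the application 8 GB with mirroring but 32 GB with explicit multi-adapter, which is the "great memory scaling" claimed above.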