The modular approach would be the way to go, similar to the CCX design in Zen. If I recall correctly, the blocks in AMD's GPUs are already somewhat modular; the PS4 Pro and Scorpio SoCs are proof of this. The only caveat would be the cost of making masks for these dies: they would need to move a good bit of volume to make the masks worth it financially.
The problem with this approach is that ROPs, CUs, and cache don't always scale linearly into performance. Simply doubling things gets really inefficient, because the bottleneck moves to a different part of the design as you add more CUs or need more bandwidth, or, as Fury showed us, when you can't keep all the shaders fed. This design method would take an extreme amount of retooling, much like what Keller had the CPU division do. That said, being able to find these bottlenecks quickly and adjust accordingly would let AMD optimize its own designs faster and would possibly help time to market.
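
To put rough numbers on the scaling point, here is a toy roofline-style sketch in Python. Everything in it is an assumption for illustration: the function names, the per-CU throughput, the bandwidth, and the FLOPs-per-byte figure are made up and do not describe any real AMD part. It only models compute versus memory bandwidth, so it ignores ROPs, cache, and front-end limits.

```python
def attainable_throughput(cus, per_cu_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style estimate in GFLOP/s: the lower of the compute and memory limits."""
    compute_limit = cus * per_cu_gflops            # grows with every CU added
    memory_limit = bandwidth_gbs * flops_per_byte  # fixed unless bandwidth grows too
    return min(compute_limit, memory_limit)


def scaling_table(cu_counts, per_cu_gflops=100.0, bandwidth_gbs=500.0, flops_per_byte=8.0):
    """Return (CU count, speedup relative to the first entry) pairs.

    The default numbers are hypothetical, chosen only to show where the curve flattens.
    """
    base = attainable_throughput(cu_counts[0], per_cu_gflops, bandwidth_gbs, flops_per_byte)
    return [
        (cus, attainable_throughput(cus, per_cu_gflops, bandwidth_gbs, flops_per_byte) / base)
        for cus in cu_counts
    ]


if __name__ == "__main__":
    for cus, speedup in scaling_table([16, 32, 64, 128]):
        print(f"{cus:4d} CUs -> {speedup:.2f}x")
```

With these made-up numbers, going from 16 to 32 CUs doubles throughput, but 64 CUs only reaches 2.5x and 128 CUs adds nothing more, because the memory limit has become the bottleneck. A real design would have several such limits, each taking over at a different point.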