The "simplest" Multi-Chip approach is that what Nvidia does with B100/200: Just fuse two GPUs together.
Easy for compute; graphics is a whole other story due to the serial nature of the APIs.
Also, the bigger you go, the more steeply the demands on interconnect bandwidth climb.
That makes a ~300-350 mm² GPU a little bigger (you need an additional chip-to-chip interface) but does not add other cost. Smaller dies in the same portfolio are not affected either. The big GPU, which uses two of the chiplets (~600-700 mm² total), would have doubled media engines etc., but for a halo part that is not a huge problem. You can also put those to use for prosumers etc. (e.g. GB202 has more video encoders/decoders than the smaller Blackwell GPUs).
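Rough sanity check on the area argument: a back-of-envelope yield comparison of two ~330 mm² chiplets against one ~660 mm² monolithic die, using a simple Poisson yield model with an assumed 0.1 defects/cm². All numbers are illustrative, not tied to any real product.

```python
import math

def poisson_yield(area_mm2: float, defect_density_per_cm2: float = 0.1) -> float:
    """Fraction of good dies under a simple Poisson defect model (assumed D0)."""
    return math.exp(-defect_density_per_cm2 * area_mm2 / 100.0)

chiplet_area = 330.0      # mm², one compute chiplet incl. the extra chip-to-chip interface
monolithic_area = 660.0   # mm², hypothetical single-die equivalent

y_chiplet = poisson_yield(chiplet_area)   # ~72%
y_mono = poisson_yield(monolithic_area)   # ~52%

# Silicon cost scales roughly with area / yield (ignoring wafer-edge effects,
# binning/salvage, and the substrate/packaging cost discussed below).
cost_two_chiplets = 2 * chiplet_area / y_chiplet
cost_monolithic = monolithic_area / y_mono

print(f"chiplet yield:    {y_chiplet:.1%}")
print(f"monolithic yield: {y_mono:.1%}")
print(f"silicon cost, 2x chiplet vs monolithic: {cost_two_chiplets / cost_monolithic:.2f}x")
```

Under those assumptions the two-chiplet top SKU comes out at roughly three quarters of the monolithic silicon cost; that saving is the budget the substrate and packaging then have to stay under.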
Thing is, you need a big, expensive substrate that is in high demand, and at that point you could just build a big enough monolithic part.
Packaging cost should also not increase too much. I think RDNA3-like organic "Infinity Fanout Links" are enough for a few TB/s of bandwidth.
Not between compute engines; you need CoWoS-L to get enough bandwidth to sync L3 + memory, roughly 5 TB/s bidirectional. L2? Forget about it; you would need to make two GPUs work together as one through all sorts of tricks.
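To put that ~5 TB/s in context, a quick back-of-envelope. The VRAM and L3 bandwidths below are my assumptions for a hypothetical two-chiplet part; only the ~5.3 TB/s RDNA3 fanout-link figure is AMD's own number.

```python
# Rough context for the ~5 TB/s figure. All numbers are assumptions for a
# hypothetical two-chiplet client GPU, not measured or announced figures.

vram_bw_tbs = 1.3   # e.g. 28 Gbps GDDR7 on a 384-bit bus is ~1.34 TB/s
l3_bw_tbs = 3.5     # Infinity-Cache-class L3, a few TB/s aggregate (assumed)

# Worst case: a chiplet misses its local L2 and the line lives on the far side,
# so roughly half of the L3 + VRAM traffic crosses the link, in both directions.
cross_fraction = 0.5
link_bw_needed = (vram_bw_tbs + l3_bw_tbs) * cross_fraction * 2  # bidirectional

print(f"die-to-die bandwidth needed: ~{link_bw_needed:.1f} TB/s bidirectional")
# ~4.8 TB/s, the same ballpark as the ~5 TB/s quoted above. For reference,
# Navi 31's organic Infinity Fanout Links were rated at about 5.3 TB/s in
# aggregate, so "a few TB/s" over organic fanout is plausible, just without
# much headroom.
```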
CoWoS is a nonstarter for client, same reasons as HBM.
Much easier to build one big compute engine and stack it on top of cache and memory PHYs.
For VRAM and Infinity Cache bandwidth, surely enough. The question is what happens with the L2$. I assume it needs to be some sort of "private" per chiplet. I do not see how you could route much, let alone all, of the L2$ traffic over the chip-to-chip interface and keep the cost reasonable.
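For a feel of why shared L2 over the link looks hopeless while L3 + VRAM just about fits, compare the assumed bandwidth ratios. These are order-of-magnitude guesses, not measurements.

```python
# Why L2 almost certainly stays private per chiplet: the bandwidth the shader
# engines pull out of L2 is far above what an organic die-to-die link carries.
# All figures are assumed orders of magnitude, not measurements.

l2_bw_tbs = 15.0         # aggregate L2 bandwidth of a big client GPU (assumed)
l3_vram_bw_tbs = 5.0     # L3 + VRAM traffic, see the estimate further up
organic_link_tbs = 5.0   # RDNA3-style fanout links, aggregate (assumed)

print(f"shared L2 over the link: {l2_bw_tbs / organic_link_tbs:.0f}x over budget")
print(f"L3 + VRAM over the link: {l3_vram_bw_tbs / organic_link_tbs:.1f}x (just fits)")
```

So the natural cut is below the L2: keep it chiplet-private and let the interface carry only L3/memory traffic.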
Well, N4C had L2 private to each SED; even with SoIC-X, having coherent L2 across multiple chiplets is a very tall ask.
So once again the best idea, without going beyond reticle-scale base dimensions, is a simple 3D stack: frontend + compute + L2 up top, L3 + memory below.
And I guess a MID or two connected to the base with fanouts for IO. The thing is expensive enough as it is.
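For what it's worth, the partitioning laid out explicitly as I read it. The grouping, and what ends up on the MIDs, is my interpretation of the two posts above, nothing official.

```python
# Sketch of the proposed 3D stack, as described in the last two posts.
# The grouping and the MID contents are interpretation, not anything announced.
stack = {
    "top die (leading-edge logic)": [
        "front end",
        "compute / shader engines",
        "L2 (private to the top die)",
    ],
    "base die (cheaper node)": [
        "L3 / Infinity Cache SRAM",
        "memory controllers + PHYs",
    ],
    "MIDs on organic fanout": [
        "IO (PCIe, display, possibly media)",
    ],
}

for die, blocks in stack.items():
    print(die)
    for block in blocks:
        print(f"  - {block}")
```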