Sounds reasonable, other than stating that the memory controller is only on the first chiplet. There is a memory controller in each chiplet, and each one serves memory access requests for a portion of the address space on behalf of all the chiplets (routed through the shared L3). What is exclusive to the first chiplet is the PCIe interface to the CPU, which handles CPU memory access requests.
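To make the interleaving concrete, here is a minimal sketch of how striping the physical address space across per-chiplet memory controllers could look. The chiplet count, stripe size, and names are all my assumptions, not any vendor's actual scheme:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical parameters: 4 chiplets, 256-byte interleave stripes.
constexpr uint64_t kNumChiplets     = 4;
constexpr uint64_t kInterleaveShift = 8;  // 2^8 = 256-byte stripes

// Route a physical address to the chiplet whose memory controller
// owns that stripe of the address space.
uint64_t OwningChiplet(uint64_t physAddr) {
    return (physAddr >> kInterleaveShift) % kNumChiplets;
}

int main() {
    // Consecutive stripes rotate across chiplets, so bandwidth from a
    // linear access pattern spreads over every memory controller.
    for (uint64_t addr : {0x0000ull, 0x0100ull, 0x0200ull, 0x0300ull, 0x0400ull}) {
        std::printf("addr 0x%04llx -> chiplet %llu\n",
                    (unsigned long long)addr,
                    (unsigned long long)OwningChiplet(addr));
    }
}
```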
Regarding checkerboard pixel processing, it is very similar to how current monolithic GPUs contain multiple rasterizers, each processing a subset of screen space. Large triangles go to multiple rasterizers; smaller ones might fit in a single one (with proper alignment).
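As an illustration of the checkerboard idea (tile size, grid layout, and names all hypothetical), mapping a pixel's tile to an owning rasterizer and broadcasting a triangle's bounding box might look roughly like this:

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Hypothetical parameters: 32x32-pixel tiles checkerboarded across a
// 2x2 grid of rasterizers (e.g. two chiplets with two rasterizers each).
constexpr uint32_t kTileSize   = 32;
constexpr uint32_t kGridWidth  = 2;
constexpr uint32_t kGridHeight = 2;

// Map a pixel to the rasterizer that owns its screen-space tile.
uint32_t OwningRasterizer(uint32_t x, uint32_t y) {
    uint32_t tx = (x / kTileSize) % kGridWidth;
    uint32_t ty = (y / kTileSize) % kGridHeight;
    return ty * kGridWidth + tx;
}

// Which rasterizers does a triangle's screen-space bounding box touch?
std::set<uint32_t> RasterizersForBBox(uint32_t x0, uint32_t y0,
                                      uint32_t x1, uint32_t y1) {
    std::set<uint32_t> owners;
    for (uint32_t ty = y0 / kTileSize; ty <= y1 / kTileSize; ++ty)
        for (uint32_t tx = x0 / kTileSize; tx <= x1 / kTileSize; ++tx)
            owners.insert(OwningRasterizer(tx * kTileSize, ty * kTileSize));
    return owners;
}

int main() {
    // A large triangle spans all four rasterizers; a small, tile-aligned
    // one lands in exactly one.
    std::printf("large bbox touches %zu rasterizers\n",
                RasterizersForBBox(0, 0, 200, 200).size());
    std::printf("small bbox touches %zu rasterizers\n",
                RasterizersForBBox(4, 4, 20, 20).size());
}
```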
Regarding synchronization, the GPU pipeline is already mostly unsynchronized, with a lot of parallel work happening on the fly. The majority of synchronization happens per-pixel, where the ordering of pixel blending is important, but with checkerboard partitioning that synchronization happens within each chiplet (actually, within each of the multiple rasterizers per chiplet). When cross-pixel synchronization does need to happen (e.g. render-to-texture), some GPU commands might indeed stall, but asynchronous compute tasks might fill in the gap.
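Conceptually, that per-pixel ordering can be enforced with something like a per-tile reorder buffer: since every tile has exactly one owning rasterizer, no cross-chiplet handshake is needed for blending. This is just a software model of the idea, with made-up names, not how any real GPU implements it:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// A fragment tagged with the API submission order of its source primitive.
struct Fragment {
    uint64_t primSeq;   // global primitive submission order
    uint32_t x, y;
    float    color[4];
};

struct SeqOrder {
    bool operator()(const Fragment& a, const Fragment& b) const {
        return a.primSeq > b.primSeq;  // min-heap: oldest primitive on top
    }
};

// Per-tile reorder buffer: fragments may arrive out of order from the
// shader cores, but blending is released strictly in primitive order.
// Because each tile belongs to exactly one rasterizer/chiplet, this
// ordering is enforced locally, without cross-chiplet synchronization.
class TileBlendQueue {
public:
    void Submit(const Fragment& f) { pending_.push(f); }

    // Release every fragment whose predecessors have all retired.
    void Drain(uint64_t retiredUpTo, std::vector<Fragment>& out) {
        while (!pending_.empty() && pending_.top().primSeq <= retiredUpTo) {
            out.push_back(pending_.top());
            pending_.pop();
        }
    }
private:
    std::priority_queue<Fragment, std::vector<Fragment>, SeqOrder> pending_;
};

int main() {
    TileBlendQueue q;
    q.Submit({2, 0, 0, {1, 0, 0, 1}});  // fragment from triangle #2 arrives first
    q.Submit({1, 0, 0, {0, 1, 0, 1}});
    std::vector<Fragment> ready;
    q.Drain(/*retiredUpTo=*/2, ready);  // released as #1 then #2
}
```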
I wonder how per-vertex (or per-primitive, or per-patch) processing will be distributed, though. Each triangle must (unless out-of-order rasterization is enabled) be rasterized in order, so that blending happens in order at the end of the pipeline. Maybe binning (where triangles are distributed among screen-space buckets) will make a comeback? Or there will be a different process to sort primitives from multiple chiplets before rasterization? They already must deal with this issue with multi-SA monolithic GPUs, so the shared L3 should help...
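If binning does come back, the key trick would be stamping each primitive with a global sequence number before distribution, so each screen-space bucket can restore API order locally. A rough sketch, with all sizes and names made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Triangle {
    uint64_t seq;           // API submission order, stamped before distribution
    uint32_t minX, minY;    // screen-space bounding box
    uint32_t maxX, maxY;
};

constexpr uint32_t kBinSize    = 64;  // hypothetical 64x64-pixel bins
constexpr uint32_t kBinsPerRow = 16;  // hypothetical 1024-pixel-wide target

// Scatter triangles (possibly shaded on different chiplets) into
// screen-space bins, then restore API order inside each bin so the
// downstream rasterizer/blender sees primitives in order.
std::vector<std::vector<Triangle>> BinAndSort(const std::vector<Triangle>& tris,
                                              uint32_t numBins) {
    std::vector<std::vector<Triangle>> bins(numBins);
    for (const Triangle& t : tris) {
        for (uint32_t by = t.minY / kBinSize; by <= t.maxY / kBinSize; ++by)
            for (uint32_t bx = t.minX / kBinSize; bx <= t.maxX / kBinSize; ++bx) {
                uint32_t bin = by * kBinsPerRow + bx;
                if (bin < numBins) bins[bin].push_back(t);
            }
    }
    // Per-bin sort by the global sequence number recovers submission
    // order without any cross-bin or cross-chiplet synchronization.
    for (auto& bin : bins)
        std::sort(bin.begin(), bin.end(),
                  [](const Triangle& a, const Triangle& b) { return a.seq < b.seq; });
    return bins;
}

int main() {
    std::vector<Triangle> tris = {
        {0, 10, 10, 20, 20},    // small: lands in one bin
        {1, 100, 10, 300, 50},  // wide: spans several bins
    };
    auto bins = BinAndSort(tris, kBinsPerRow * kBinsPerRow);
    std::printf("bin 0 holds %zu triangle(s)\n", bins[0].size());
}
```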