Very well written! Hopefully, these insights are allowed past the marketing department, managers, and other bureaucratic mechanisms that "protect" the developers. Hopefully, the driver team has enough bandwidth to address this, compared to working on new features, new hardware, or fixing other bugs.
Some of the issues you can see in the graphs indicate architectural issues, such as batching strategy and DMA chunk sizes. Other things could be hardware choices, such implementing less of the DMA mastering on the GPU side (i.e., using Windows paging to transfer data to the GPU instead of sending a physical address to the GPU DMA engine).