On dual-CCD Ryzens, as well as on Threadrippers and EPYCs, those of us active in citizen science/scientific computing occasionally use something more targeted than "core parking": when a multithreaded application heavily shares data between threads, we assign CPU affinity to thread groups (usually to processes, a.k.a. tasks) so that the heavy sharing is confined to a single CCX (i.e., one last-level cache domain) common to that thread group. On EPYCs, and on old Zen 2 CPUs, we also sometimes arrange such sharing across two CCXs if we want more cores per thread group. Maybe that's what you would call micro-management, but in our case it is set-and-forget. IOW, it's trivial to set up and operate once the performance characteristics of the application are known.
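On Linux, this kind of pinning can be done from the shell with taskset, or programmatically. Here is a minimal Python sketch using os.sched_setaffinity; the set of core IDs for "CCX 0" is an assumption for one particular topology (e.g., a Zen 3 CCD where cores 0-7 share one L3), so check yours with lscpu or sysfs before copying it.

```python
import os

# Hypothetical core set for one CCX; adjust to your topology.
# You can read the real grouping from, e.g.,
# /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
ccx0 = {0, 1, 2, 3, 4, 5, 6, 7}

# Intersect with the CPUs actually available to this process,
# so the sketch also runs on smaller machines.
mask = ccx0 & os.sched_getaffinity(0)

# Pin the calling process (pid 0) and all of its threads to that CCX.
os.sched_setaffinity(0, mask)

print(sorted(os.sched_getaffinity(0)))  # now a subset of CCX 0
```

For a process you didn't start yourself, `taskset -cp 0-7 <pid>` achieves the same thing; the set-and-forget part is just putting such a line in the job's launch script.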
We have applications in which this nets only a moderate percentage of performance and power-efficiency gain. But we also have edge-case applications in which the gain is well into double-digit percentages on Ryzen, and even higher on EPYCs with more CCXs. This happens specifically when vector arithmetic is performed on datasets that must be kept synchronized between threads: without scheduling hints, the cores spend a lot of time waiting on RAM accesses, and burn extra task energy while doing so.
However, I don't know whether your use case (DL training) has similar requirements.
Edit, PS:
Now this raises the question: why not use CPUs from AMD's competitor, which have a single cache domain per processor? The answer is of course that AMD's CPUs are currently just so much faster and more power-efficient in compute-heavy tasks, and AMD's splitting of larger CPUs into CCXs is one of the very reasons for this advantage. And by that I specifically mean CCXs, not even CCDs.