We already have accelerators. GPUs are accelerators, and sound cards pretty much were too. As for low-latency sound mixing specifically, I honestly think that ship sailed ages ago. We had excellent hardware mixing of sound already, but then Microsoft came along and, instead of standardizing hardware mixing, made all the driver models incompatible with the new Windows OS release. Since then, cheap onboard sound and software mixing have been considered good enough for like 99.99% of all use cases.
I used to have a Creative Soundblaster Platinum, which had an external logic box to keep the noise level down. It was outstanding. I now use a firefly red, but honestly, A/B-ing the two (when I still had the Platinum) through a high-quality PA, the old Platinum had better sound. So yeah, I think we have definitely gone backwards there.
Still, we have P cores, E cores, GPUs, and AI processors today. I remember reading something about cores designed to run JVM code ultra fast (although any interpreted language and "fast execution" should never be used in the same sentence). I suspect that, moving forward, dedicated hardware will be used more and more to improve performance by orders of magnitude. This will be especially true as process technology advances grind to a screeching halt and the cost of new nodes continues on its exponential curve.
Or maybe software developers will be able to target specific cores in the CPU for specific routines in their code.
This takes me back to the early '90s, when I was using the CardD and Software Audio Workshop (SAW), written by Bob Lentini in assembly. It was absolutely miraculous that you could multitrack record and mix on a Pentium-class computer. I was at the Javits Center in NY way back when and saw the demo of the CardD and SAW. I bought them within a week, convincing the band we HAD to have them. We also had a small recording studio, where I used it instead of a DAT recorder to digitize the final mixdown and for mastering. It was fabulous for the time and even holds up today. SAW, coded in assembly, was a few MB and ran straight from the exe file. It was stable, tiny, fast, and brilliant. I remember Bob had studied the Windows API for a bit and decided he couldn't do this through Windows; it was going to have to happen in assembly, and he just made it happen.
Good software can be the way "around" insufficient hardware and it is the more elegant approach. We've already seen what good game drivers can do.
I have written plenty of code where you can specify core affinity in Windows. If you are careful, you can also get a time-critical thread devoted to a core. I suspect that in the future, you will be able to dedicate tasks to a specific kind of core in addition to a specific core (if you can't already).
I have found that good C code (and even C++) will run as quickly as assembly. In fact, in many cases (nearly all cases for me) the compiler knows cool tricks you don't know about for different processors and produces a more efficient binary than you can get writing assembly by hand. I actually tried this in the '90s. Once I found the compiler wrote better assembly than me (using the option to output the asm file for the C code), I never wrote another asm program again, but instead used inline assembly in C to do the low-level dirty work.
Nowadays, software engineers don't have the first clue how low-level instructions are created or carried out. Ask them what a linker does, or God forbid, how to set one up for an embedded system (when the OS doesn't do all the work for you), and they are lost.
Shoot, most of them only program in Python .... which only barely ranks in my book as a real programming language, and only then because of the crazy extensive data analysis libraries it has.
Including a larger front end: an 8x wider predictor, 8-wide decode, 64KB L1-I, a 5250-entry uOP cache (L0), a 192-entry queue, a 576-entry ROB, a 180-entry branch order buffer, an FP scheduler with 114 entries and a ~400-entry PRF, an INT scheduler with 97 entries and a 290-entry PRF, execution units on 10 ports (4x FP + 6x ALU) instead of 5 (3x FP/ALU + 2x ALU), a non-scheduling-queue buffer, 48KB L0-D, 192KB L1-D, and 3MB L2, plus all the resource-control logic.
Also, splitting the unified FP/ALU scheduler (97 entries for 5 execution ports) into a separate scheduler for the 6 ALU units (97 entries) and another for the 4 FP units (114 entries) required significant resources.
I recall many a post where each of those individual elements was estimated to provide some % of IPC uplift. When all of them were combined, numbers like 30% were being thrown about.
Sooooo. There are really only a couple of explanations I can see: 1) Intel has some awful flaw in the Lion Cove architecture that needs fixing, or 2) all of those core improvements are implemented horribly .... I mean really horribly.
I personally am going with #1. I think Intel is going to seriously surprise some people with the NEXT improvement to the Lion Cove core. I'm not sure I'm buying 30% IPC, but 20% might easily be the case.