Aigo is almost there. (Cheers, my friend.)
It's not about "normal scheduling."
There aren't enough terms to describe Intel's original doubling-speed optimizations. They were done in thirds. Go figure. Intel would return to the unoriginal divide-by-four, drop one for overhead. Remember the original DX33? The cheat was that it quadrupled speed, dropped one crank for "engineering overhead," and added four memory sticks, dropping one for higher speed thresholds.
Starting to sound like a mantra?
Your primary scheduler is actually unit zero. Then BD core 1 becomes the odd man out. Start by scheduling core 2 (really #3 in schedule order), then skip to core 5 (BD core 4), and so on. Once you get out of base-10 thinking, you start getting closer to the truth. It's base-3 thinking. Look at other results. Enumerate by thirds, not tenths. Hence the need for different thinking. Go back to when Intel first did three-stick memory and figure out why: headroom for speed (in a very basic way). A quad controller will be more effective if it's only controlling 3 channels. It will have buffer flows looking genius-level when run at simpler loads. Run modern threading on older spatial scheduling and you will swamp it like a first-year programmer's first try at hyperthreading.
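A minimal sketch of the enumeration above, assuming the common Bulldozer layout where logical cores pair up into two-core modules (0-1, 2-3, 4-5, 6-7) that share one FPU and front end. The function names and the even/odd core numbering are my own illustration; on real hardware, verify the pairing with `/proc/cpuinfo` ("core id") or `lstopo` before pinning anything.

```python
def one_core_per_module(n_cores: int) -> list[int]:
    """Pick one core (the even-numbered one, by assumption) from each
    two-core module, so FP-heavy threads never share a module's FPU."""
    return [c for c in range(n_cores) if c % 2 == 0]

def pin_current_thread(cores: list[int]) -> None:
    """Restrict the calling thread to the given cores (Linux only)."""
    import os
    os.sched_setaffinity(0, set(cores))  # pid 0 = calling thread

if __name__ == "__main__":
    print(one_core_per_module(8))
```

The point is not the four lines of code, it's the placement policy: hand the OS an affinity mask that already respects the module pairing instead of letting it scatter threads across both halves of a shared FPU.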
Mind its thermal threshold when scheduling: core 2, then 5, on to 8, back to 3. Which quads off the base root are you waking from cold, and which already have heat soaked in from neighboring cores? Want to bet it makes a difference? 95% of programmers don't have a clue. They take the easiest, quickest route instead of providing leeway. Try working around the shortcomings of Intel's compilers (study Dan Corbit for lifelong lessons), or AMD's for fairness' sake, and write them yourself. Not MS's, or anybody else's. Figure it out by yourself. You're close, Aigomorla.
It's not about 4 cores vs. 8. Which 4 cores? Why not 5? Or, more importantly, which # out of the 8? Doesn't it depend on how it's scheduled and WHAT is scheduled where (think layers of issues of importance) at a semi-stacked level? Say you popped a stack onto a virtual pile: which gets the cooled-off integer integrator, versus running parallel threading on the other three base cores? So one BD core is smoking from lots of heavy floating point; why bury it with threading on its "twin pair" while it's cooling back off? Too many people don't bother thinking about how it actually goes through. Instead of allowing a convenient cooling-off period to lower TDPs, they throw the kitchen sink at the toilet.
I can look at certain results knowing how they will play out. Keep throwing water at a candle and it will eventually wear down the base until the sides slump and put out the wick at the top.
Run 12 threads on the same 6/8 cores and they will still bottleneck somewhere. Put some organization into how and where they get run: does it make a difference? Can you increase performance by crossing partial threading to opposite sides so buffers clear before rescheduling? Why not run an L2 buffer overflow to a mirrored cache and let it flow back in the same L3 stacking profile? How many programmers used to individual cores think like twin pairs and allow that transcendence?
A whole new genre of tweaks is yet to come. How is another animal.
Will they think outside a 15-year-old box? For many of these guys, that's most of their "sentient aware" lives. Can they learn from the past, project forward, and find someone else's unsaid truths? I don't have one in my hands yet... I'll wait for the "new" prices to come down first. Then I'll trash a few kernels until I see what makes it sing. Will tickling the middle toe be more effective than smashing the big toe with a three-pound hammer? Maybe. Prolly not, lol.

I still have a Thunderbird desktop CPU that was never made to fit in a laptop (mine was some kind of demo) that was "tickled" to do more than laptop or desktop was meant to. Why? The BIOS dev thought outside the box. He passed the buck to me to figure out how to keep it from going tone deaf, and for 7 years it sang operas while its sisters tried to figure out how to queen pawns. Move-order theory hadn't gotten past negative progressions until null-move theory gained steam. Skip a cell of root moves to advance a negative? Novel. Out of the box, even. I killed a 6-month continuous data feed to play with it. I found out why the Thunderbird died a quick death when a dual-core oppositional stance on the same integer line made a huge difference scheduling both sides of the same sine curve with an expectant integer eval. Bi-directional doesn't work on a bad single FPU without a true "FPU co-processor" to shed the excess null voids overlapping on single stacks. Parity is not the strong suit of two-dimensional execution of three- or four-dimensional thinking (hard, messy lesson).
Take my cell phone, for example. Dual-core A9 Dx2. DroidFish running single-threaded on a single core found a mate-in-6 slower than my 6-year-old nephew. Give it multi-threading and it finds it faster than me. Contact the dev, and all of a sudden he turns on core awareness and it smokes four times as fast. Why? It should only have doubled. Six different iterations of theorems changing the next levels down, causing greater gains through negative parsings. Each layer shed the slough of unlikely returns through the raw trail. Each built on the others' failures that were successes. The CCC archives (Dr. Bob, Ed Schröder, etc., 20+ years ago) taught about positive failures, if you can figure out the pattern of cascading leaks across different chipsets. They unleashed the compiling monster that became Dan Corbit. Tweak it depending on looping thresholds and you can surpass the norm with triviality.

Some high-paid twit (now, not then, lol) decided it was better to keep his long-term prospects open than to tick his boss off with a better idea than his. Maybe he will let his pass go as little noticed as possible. Can your recompile of his ss2 sub dep go better without his letdowns? Most of you who have read this far can guess the answer might be yes. Depends on what your motivations are. Hence why I find more satisfaction in the Linux circles. They aren't waiting for some mega-giant corp to come up with a better driver. Go do it differently. I'll be happy to Alpha/Beta for you if you have a spec you need run.