Intel microarchitecture: Nehalem v. Skylake

Makaveli · Apr 13, 2016

This thread is a great read!!!

intangir · Apr 14, 2016

Dresdenboy said:
Regarding Theseus' ship: Of course, the components don't survive process related and microarchitectrual changes (even an ALU has to change with a new scheduler or result bus), but at the high level it continues to be a ship of about the same class, just being faster while consuming less, being less ugly, whatever. It doesn't become a Titanic instead. Does adding a HW divider and increasing some buffers make the Husky core (Llano) a new microarchitecture? Yes, as it's not the same K10 anymore. But it's clearly been derived from it. So at which level or at which size do changes need to happen to call it totally new (like created from scratch), or just new (not the same anymore)?

Well, perhaps not if the replacement components have the exact same specifications. But the point of new design projects that aren't "dumb" shrinks is that they're not simply re-implementing the same circuits with smaller transistors, they're actually changing the functionality and capability of the units they're working on. So it's not a perfect translation of the Ship of Theseus analogy. Perhaps a wooden boat does become a luxury cruise liner after enough steps!

(I put dumb in quotes because nowadays, the transistor design rules are changing so much between processes that it takes a lot of know-how and ingenuity just to shrink the exact same logical functionality onto a smaller process and have it work electrically. Schematic designers have to run very hard just to remain in place, and they're asked to improve timing cycles and power consumption on top of that!)

But you raise a good point in that replacing a single component at a time keeps the high-level chip structure mostly the same. Any changes to the big picture are necessarily gradual. But I think they do add up after a while. It's all still recognizably a P6-style structure: you have instruction decoders that feed instruction dispatchers that feed execution units that need to get their data from registers and memory and write back CPU state, possibly redirecting the instruction decoders, with various optimizations like data pipeline forwarding, branch prediction, and memory disambiguation adding various additional connections between units. And changing that would require a new compute paradigm replacing OOO execution. But unless you want to call all such microarchitectures both from Intel and AMD "the same" since the Pentium Pro in 1995, I think you can recognize that they do become new microarchitectures after a while.

hrga225 said:
Well,if you look at processor floorplans,I would say that Intel's designs haven't changed radically in 20 years.Or to recap: Enough steps in evolution after some time look like revolution.

I couldn't phrase it better myself. This is how evolution works, and it bears repeating that all that is necessary for macroevolution is microevolution continuing over a longer span of time. Does that apply to computer architecture? I say yes!

Dresdenboy said:
Interesting: The ARM developers wrote the ISA model and a microarchitectural simulator in BASIC back then. But this was a simple uarch. Alpha was complex. And as someone said a while ago, the best simulator of a chip is the chip itself.

Oh, sure. I believe I heard that in the first 2-3 hours of power-on, even at the 1 or 2 GHz an A0 stepping chip might manage, it runs through more cycles than all the simulations done in the three years of preceding work to create that first set of photomasks.

Dresdenboy said:
But I think, that situation has changed, also indicated by Keller's statements regarding availability of x86 and ARM traces, LinkedIn profiles, papers, use of FPGAs, etc. While in the past the logic complexity of big chips was way beyond what could have been simulated in a cycle exact uarch simulator or even significant parts of them in SPICE, the computing performance continued to grow exponentially, while microarchitectures got features added in a more linear fashion. Colwells "big head" argument actually describes what happens without these options.

The growth might also help to manage the testability. The improvements and complexity can only grow as much as the design team can handle it. So there is a shift from creating abstract "mind models" of the uarch to think through known use cases (while simulating small parts to support this analysis) to simulating increasing amounts of the uarch's components at increasing detail.

I don't know about that. Microarchitects have been working with instruction traces for quite a while; they've been constantly updating their library as the set of important software applications and their performance kernels have evolved over the years. Colwell himself alluded to the one they used at Intel to make the decision to develop Itanium in that Stanford lecture that you can view and I can't anymore, and that decision must have been made in the early '90s?

And I know a lot of work has been done on cutting-edge performance validation techniques, such as emulation with FPGAs and the like. Moore's Law cuts many ways. It helps with the nitty-gritty experimentation and data gathering, but it still doesn't make the analysis of results and engineering decision-making any easier. Like Colwell said, even the smartest people's heads aren't big enough to hold everything they need to think about at once.

If I could dig up one of those histogram plots giving you the performance impact of a uarch tweak on a set of instruction traces run through a performance simulator, I would've dummied up an example of how a microarchitect has to go from perfsim results to actual design decisions. But I couldn't find such a plot in my initial Googling and this is already turning into my usual novel-length posts, so I'll spare you that for now.

But yeah, had I such a graph to show you, it would probably be on the order of ~100 core instruction traces representing the strategic workloads you're targeting on your design project, simulated and graphed as percentage performance gained against the baseline uarch you've chosen (probably the n-1 gen microarchitecture), and reduced to near illegibility as a set of tiny little bars sorted by height and ranging from maybe 10 or 15% for the workloads it's really good for, to the negative 5% for the (hopefully small) set of workloads that inevitably find a way to run slower no matter what kind of change you're making. And then based on that you decide whether that change is worth it, can you tweak any parameters in that one feature that might improve its histogram curve, consider if you combine it with another uarch feature, do you get performance divots, and so on and so forth.

Oh, and again from Colwell's lecture, you might get a performance problem when it's discovered later in implementation that they can't actually design a circuit that gets the needed bit over from one unit to the other fast enough, so you have to hack in a heroic change that saves the day, but replaces that one big problem with a little problem and a mystery problem that you only saw one little sign of and thought "huh, that's weird. But it's only one little thing and I can ignore it", but over the remaining course of the project, the mystery grows and grows, until one day you say, aw, we were better off with the original big problem. Such is the life of a CPU designer.

Dresdenboy said:
P.S.: Oh, I had to add to the newer Colwell video, that the relevant part about using AMD to force development of arch/uarch improvements at Intel begins at minute 26. BTW Dave Ditzel sat in the auditorium during that other talk at Stanford. This all is a very valuable input to my chipdesign game. ^^

Glad you found value in it! But I find it ironic that Colwell's overall point in that newer video is that there are things companies just can't do, either because it costs too much money or because the companies are driven by competition, which makes it hard for them to work together on essential technologies such as finding a replacement for CMOS. I mean, think of all the billions of dollars that Samsung et al. pour into transistor process development, just to duplicate what Intel has already done. How wasteful! And that's why we need government-funded research, because historically that's the only way such essential leaps in technology have been made.

cytg111 · Apr 14, 2016

Makaveli said:
This thread is a great read!!!

One of the best read in years.

Deders · Apr 14, 2016

cytg111 said:
One of the best read in years.

'Like'

intangir · Apr 14, 2016

nenforcer said:
Don't forget the progressive TDP improvements which have been made since the original Conroe Core 2 Quads we have seen go from

95W .45nm Quad Core Core 2 Quad ->
130W .32nm Hex Core Gulftown ->
95W .32nm Quad Core Sandy Bridge ->
77W .22nm Quad Core Ivy Bridge ->
88W .22nm Quad Core Haswell ->
65W .14nm Quad Core Broadwell ->
65(91)W .14nm Quad Core Skylake

Yes, and it's another way that the benefits of server chip development continue to trickle downmarket into the desktop space. Four cores at 65W means eight cores at 130W! Not to mention it enables nifty quieter, more mobile systems in the space most of us care about.

Dresdenboy said:
I often have the feeling, that many discussions going on here revolve around hidden wrong or simply missed assumptions, due to the increasing complexity. How often is energy efficiency left out of uarch discussions, while under the presence of a power wall this even helps high performance processors?! Simulation is the key to standardizing the evaluation of ideas and handle more complex scenarios. We use them for ADAS too. It's just not possible to recreate lots of realistic traffic conditions in a few architected NCAP/NHTSA tests. And nobody wants to go out and try to provoke some crashes to test new ideas (at least not here ).

Oh, indeed. I see a lot of posts around here saying "true enthusiasts" don't care how much power their desktop systems burn. But there are practical limits for power consumption; system reliability and materials engineering depend on those limits being adhered to, and we're already at the highest practical TDP in most situations. Burning any more wattage requires exotic engineering solutions for cooling, and how much of the market really wants to deal with liquid cooling systems?

So if CPU designers can get the same performance for less power through better power-saving features, that translates to more performance at maximum TDP. And in what circumstances are CPUs power-limited these days? All of them.

majord · Apr 14, 2016

Dresdenboy said:
Thanks. I'll add the changes soon. I also have the annotated Nehalem core, partly annotated Sandy Bridge and Core (the latter to show some new features). Somewhere I should also have a more abstract view of the core layout of Core in my archive.

I'll post it on my blog, if that is OK with you.

of course OK. Be good like to see the others you have

Dresdenboy · Apr 14, 2016

majord said:
of course OK. Be good like to see the others you have

Fine. I'll post it soon as there isn't much to write.

Now to some images, which might help a bit:

Slide 113 here: http://en.slideshare.net/sssuhas/01-intel-processor-architecture-core

Album: http://imgur.com/a/aozXy

BTW, when Intel presented the changes of SNB vs. NHM, it clearly marked the "new" things (yellow, red), indicating, that the other stuff in the core is "old" or somehow inherited, see for example
http://arstechnica.com/business/2010/09/intels-next-must-have-upgrade-a-look-at-sandy-bridge/
and
http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/2

Also interesting:

majord · Apr 15, 2016

Nice! I remember seeing some of them now, but has been ages! good thing you archived them

Never seen the 2nd last or last one though. Or at least I don't remember. Last NHM floorplan is brilliant. Can only make out some of the labels though

Sad how little they divulge now. Skylake has been the worst by far when it comes to details. They didn't even release official die sizes AFAIK. Why the later I couldn't guess, given it's not exactly a secret when its stuck to a package

Dresdenboy · Apr 15, 2016

majord said:
Nice! I remember seeing some of them now, but has been ages! good thing you archived them

Never seen the 2nd last or last one though. Or at least I don't remember. Last NHM floorplan is brilliant. Can only make out some of the labels though

Sad how little they divulge now. Skylake has been the worst by far when it comes to details. They didn't even release official die sizes AFAIK. Why the later I couldn't guess, given it's not exactly a secret when its stuck to a package

They'll have their reasons, or Zen moments. Who knows...

And did you mean the left or the right part of the NHM floorplan?

Here's another interesting paper describing, how Nehalem has been synthesized on multiple FPGAs for simulation:
https://www.researchgate.net/profil...thesizable/links/55d7bb1708aed6a199a69296.pdf

Intel microarchitecture: Nehalem v. Skylake

Makaveli

Diamond Member

intangir

Member

cytg111

Lifer

Deders

Platinum Member

intangir

Member

majord

Senior member

Dresdenboy

Golden Member

majord

Senior member

Dresdenboy

Golden Member

TRENDING THREADS