One guy can't design a CPU. It takes thousands of people.

Oh really? See here: http://www.megaprocessor.com/progress.html
One guy.
Or this: https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/

There's a report about Parallella that details the work done. Impressive.

There must be some kind of unknown advantage with Zen, though. It is designed by Keller, after all.
I'm just hoping he transferred some of the ideas of the Apple Ax(X) cores to the Zen and K12 core designs, as those cores have some impressive single-thread results: http://browser.primatelabs.com/ios-benchmarks
(Calculating it to IPC doesn't make it any less impressive, only that the OS isn't comparable.)

As far as I know, Keller was project manager/team leader on the Zen project, not a design engineer or something. Though I'm sure he had a few good inputs for the Zen design.
I can imagine that Apple would sue everyone for using transistors in a chip... But I'm not saying that AMD needs to copy stuff from others, just look at good examples.

Apple would sue AMD into oblivion if any of Apple's IP ended up in Zen. It's the norm for employment contracts to contain provisions that say something along the lines of "anything you invent while working for us, we own".
Most of the discussion about this already happened back then, but it is an interesting thought.

I am confident that Keller has gone for an Apple Cyclone-like design, given that he architected Cyclone and that Zen and K12 are going to be similar. In fact, there are a few similarities between Zen/K12 and Cyclone:
http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed
Zen has 6 integer pipelines and 3 FP pipelines. Cyclone has 4 integer ALU pipes, 2 load/store pipes and 3 FP pipes. I am sure Zen has a similar design. I am also thinking that Zen can execute and retire up to 6 instructions/micro-ops per clock. As for decode and issue, I think Zen might be 4-5 wide while K12 will be 5-6. I am willing to bet Zen is going to be very high IPC and competitive with Skylake.
That high level stuff should be irrelevant (state of the art). Or did someone get sued for building a car with 6 tyres instead of 4?
However, going back in time, this post from @raghu78 from 1.5 years ago could have some truth in it.
-Those figures are from where? I would love to see some evidence if you have any...

If you think that in a process with 1/6 the leakage, -20% capacitance, 1.5x conductance and 20% less Vcore, you can't do an 8-core CPU in 95W at over 3.2GHz, well, I am not trying to convince you anymore... I quit. BELIEVE (this is a correct term here) what you want...
RWT Bulldozer said:
AMD was unwilling to share any specifics on gate delays, although some discussions at comp.arch suggest a target of ~17 gate delays vs. ~23 for Istanbul.
Yea, we can all remember this 1.15V 2.8GHz 95W CPU:

RWT Barcelona said:
The device described at ISSCC was targeted at 2.2-2.8GHz at 1.15V, while operating within a 95W maximum thermal envelope. AMD claims that their 65nm process has a 15ps FO4 inversion delay, which suggests that Barcelona's pipeline is just a little less than 24 FO4 delays.
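The arithmetic in that RWT quote is easy to check: cycle time is FO4 depth per stage times the FO4 inversion delay. A quick sketch, using only the two figures from the quote above (15ps FO4 delay, ~24 FO4 per stage):

```python
# Cycle-time arithmetic from the RWT Barcelona quote:
# cycle time = FO4 depth per stage * FO4 inversion delay.
fo4_delay_ps = 15.0   # AMD's claimed FO4 inversion delay on 65nm
fo4_depth = 24        # ~24 FO4 delays per pipeline stage

cycle_time_ps = fo4_depth * fo4_delay_ps   # 360 ps per cycle
max_clock_ghz = 1000.0 / cycle_time_ps     # ps -> GHz

print(round(max_clock_ghz, 2))  # ~2.78, right at the quoted 2.2-2.8GHz target
```

So the quoted FO4 figures and the 2.8GHz target are mutually consistent.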
Where's the evidence for that?

It's clearly for the whole frequency range.
-1/6 leakage, etc... at what frequency?
1/6 leakage was from the comparison of a NEON FPU implemented on 28nm bulk HP (the same used for XV) and on 14nm FF LPP, at the same 330mW total. Leakage went from about 100mW (of 330mW total) to 18mW (of 330mW total). At this power the NEON FPU went from 1.something GHz at about 1V to 2.41GHz at about 0.9V. I don't have a link at hand; you can search yourself.
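The quoted figures are at least internally consistent; a quick check using only the numbers from the post above:

```python
# Check the quoted NEON FPU leakage figures (28nm bulk HP vs 14nm FF LPP)
total_mw = 330.0        # same total power budget in both cases
leak_28nm_mw = 100.0    # quoted leakage on 28nm bulk HP
leak_14nm_mw = 18.0     # quoted leakage on 14nm FF LPP

ratio = leak_14nm_mw / leak_28nm_mw    # ~0.18, close to the claimed 1/6
share_28 = leak_28nm_mw / total_mw     # ~30% of the budget lost to leakage
share_14 = leak_14nm_mw / total_mw     # ~5.5% of the budget lost to leakage

print(round(ratio, 2), round(share_28, 2), round(share_14, 3))
```

In other words, at the same 330mW budget the leakage share drops from roughly a third to a twentieth, which is where the "1/6" comes from.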
-These process metric figures have always scaled as such or better. Look at Intel's. They are transistor-level figures. They have never directly translated from transistor level to chip level, as you are incorrectly assuming.
-Less Vcore means nothing on its own. It's a function of smaller processes, as current isn't scaling with it (due to leakage problems). Power used to scale due to Vcore and current scaling. Now, best case, they lower one and raise the other, so current requirements are growing much higher than they used to be. This limits clock speeds/heat.
-Leakage is in amps, i.e. current. And a chip has different types that constitute power draw to reach anywhere near the published TDP values. Even a hugely exaggerated 1/6 of... 20mA of idle leakage is only 3.3mA of leakage.
-LPP is traditionally for good leakage power control and lower clocks, using denser cells... for mobile. That's by design.
-Where do you get your gate delay info from?
-Gate delays can hint at clocks only IF the process is maturing, yielding and performing well.
-BD was a speed demon design. Are you saying Zen is a design for higher clocks?
Looking at the 40% IPC claim, this cannot be the case, as a lower delay per stage would mean more pipeline stages, and so higher instruction latencies. That would mean lower IPC (more gate depth). It's either one or the other; you can't have both. I would assume at least a 20 FO4 logic depth delay in the critical timing paths for Zen. Willamette was 16-ish.
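The tradeoff being argued here can be made concrete with a toy model: hold the total logic depth of some operation fixed, shrink the per-stage FO4 budget, and the stage count (i.e. latency in cycles) grows even as the clock ceiling rises. All numbers below are illustrative, not actual Zen figures:

```python
# Toy model: fixed total logic depth, shrinking per-stage FO4 budget.
# All numbers are illustrative - not actual Zen figures.
import math

TOTAL_LOGIC_FO4 = 180     # total FO4 of logic for some fixed operation
LATCH_OVERHEAD_FO4 = 3    # per-stage latch/clocking overhead, in FO4
FO4_DELAY_PS = 15.0       # assumed FO4 inversion delay

rows = []
for per_stage_fo4 in (23, 20, 16):
    logic_per_stage = per_stage_fo4 - LATCH_OVERHEAD_FO4
    stages = math.ceil(TOTAL_LOGIC_FO4 / logic_per_stage)   # latency in cycles
    freq_ghz = 1000.0 / (per_stage_fo4 * FO4_DELAY_PS)      # clock ceiling
    rows.append((per_stage_fo4, stages, round(freq_ghz, 2)))

for row in rows:
    print(row)   # smaller FO4 budget -> higher clock but more stages
```

Going from a 23 FO4 budget to 16 lifts the clock ceiling by ~40% but adds five stages of latency to the same operation, which is exactly the frequency-vs-IPC tension described above.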
-If they fail to get higher than 3.2GHz at launch, what will that mean? Is that a disaster in your books?
The rest of your post was complete pseudoscience garbage. Like saying Mars looks black today, so it will definitely be 50°C in Berlin.
Sent from HTC 10
(Opinions are own)
The fact that a low FO4 Intel design had poor IPC was due to two small 16-bit double-pumped ALUs (awful with 64-bit code), few FPU pipelines, the lack of an L1 instruction cache, a too-small L0 uop cache... and other things that maybe I have forgotten.

Just to point out, support for 64-bit code was added on the P4's Prescott, which had 32-bit ALUs, as well as double the L1 data cache and double the L2 cache.
Leakage is a function of temperature and voltage. In the case that interests us, if leakage is 1/6 at, say, 0.9V, it will also be 1/6 at 1V, and so on for the whole operating range...
And frequency. Because higher frequency requires higher voltage/current, which causes more leakage, and so more heat.
Not frequency, only voltage... You can supply a gate with, say, 1V and it will drain a current. If you don't clock the circuit, so the frequency is 0 (since it's static), it will leak the same as if it were clocked at, say, 1GHz...
His point is that if and when you need to clock a transistor lower, you can also lower the Vcore... It would be inefficient not to do so...
Also, the voltage domains are not that many, so if we e.g. clock-gate an FPU during an INT-only calculation, that FPU is fed with the high Vcore anyway and so it leaks more...

Of course, but if the circuit is not clocked the leakage will still remain, so it's independent of frequency. That one needs to increase voltage, and hence leakage, to clock at a higher frequency is another matter...
This is the direct voltage -> leakage relationship. KTE actually added the frequency -> voltage(min) relationship. Both make f -> V_min -> I_leak.
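That summary (f -> V_min -> I_leak) fits a one-line toy model: dynamic power scales as C·V²·f, while leakage scales with V (and temperature) but has no direct f term. A minimal sketch, with every coefficient invented for illustration:

```python
# Minimal power model: P = C*V^2*f (switching) + V*I_leak (static).
# All coefficients are invented for illustration; only the structure
# (leakage independent of f at fixed V, f raising V_min) is the point.
C = 1.0e-9      # effective switched capacitance in farads (illustrative)
I_LEAK = 2.0    # leakage current in amps at this corner (illustrative)

def v_min(f_ghz):
    """Assumed linear frequency -> minimum stable voltage relationship."""
    return 0.7 + 0.1 * f_ghz

def power_w(f_ghz, v):
    """Return (switching, static) power in watts."""
    switching = C * v * v * f_ghz * 1e9
    static = v * I_LEAK       # note: no f term at a fixed voltage
    return switching, static

# At a fixed 1.0V the static term is identical whether idle or at 3GHz:
assert power_w(0.0, 1.0)[1] == power_w(3.0, 1.0)[1]

# But chasing 4GHz instead of 3GHz raises V_min, and leakage with it:
for f in (3.0, 4.0):
    v = v_min(f)
    switching, static = power_w(f, v)
    print(f, round(v, 2), round(switching, 2), round(static, 2))
```

So both posters are right in their own terms: leakage is frequency-independent at a fixed voltage, but reaching a higher frequency forces a higher minimum voltage, which drags leakage up with it.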
The scalability you have with finFETs is really quite a large range because it has very little leakage. When you turn off your clocks—when you are not doing active work—you can get very close to nil energy, and leakage is lower than previous technologies.
Yes, but that is all isolated FET-level performance in unknown conditions vs. a mass-production chip of over a billion transistors under standard conditions. The two do not correlate directly. There are far too many other factors, too.

For the other figures, they were given by another user in this thread a few pages ago... Maybe you missed them... I don't know the source, but it seems reasonable, given the HUGE differences between a bulk process (gate on one side, very thin conducting layer) and a FinFET process (gate on three sides, almost fully conductive channel)... It is not necessary to be an electronics engineer to predict more transconductance (if you know what it is)... And lower capacitance is normal if you make everything smaller: capacitance is proportional to transistor area...
Does not correlate to the performance metrics of a full equivalent processor.

Intel 45nm to 32nm said:
The decreased oxide thickness and reduced gate length enable a >22% transistor performance gain in terms of drive current. These transistors provide the highest drive currents and tightest gate pitch reported in the industry. Leakage current can also be optimized for a >5x reduction in leakage over 45nm for NMOS transistors, and a >10x reduction in leakage for PMOS transistors.
For the FO4 again... If I increased the latencies and the number of stages, I'd have to be an awful engineer to also increase the FO4...
It's perfectly feasible to do a low FO4 and medium IPC design.
Certainly, but leakage is very low with FinFETs. Typically leakage is about 10^-6 times the switching current; it's low, but it involves all the transistors in the CPU, while switching does not. Hence with planar transistors leakage will be something like 20-30% of the losses, for instance.

Leakage has many different components. Static and dynamic, just to oversimplify.
-Low IPC cores in mW is a completely different matter to dealing with high IPC Cores at 1-20W.
-mW gain is noise at this level with a CPU at 95W.
-Such gains also do not scale with size/speed/power. Every process has a sweet range.
-LP low-IPC chips at low GHz are making the same frequency/efficiency gains which x86 made 10-20 years ago. That's the low-hanging fruit, and it was A LOT easier to attain than at today's x86 level. HP processors made these same gains from 1980-2000. But those gains are COMPLETELY incomparable to today's uarch/process changes.
Hence why the A15 is at higher/near Bobcat power with similar performance but FAR lower efficiency than XV/SKYL. It's a bit of a DUH moment for x86.
Yea, if you add pipe stages in the critical timing logic, the latencies automatically increase. If you began with a 23 FO4 design, for instance, you would add stages/latches to increase clocks, with a bit of timing overhead, which means less logic per pipeline stage, so the FO4 logic depth per stage would decrease (lower than 23). Remember also the extra inverters for buffering/saving clock and data signals at each stage. That's the IPC vs. frequency compromise.
Depth latencies are in ps depending on the GHz... 2.5 FO4 is what is seen as the minimum for a flop, around 5 for a P4 ALU, for example. Research on optimal FO4 at current nodes tends to be outdated, being based on Willamette and Alpha examples using SPEC95/2000 (Alpha achieved a 27.8x SPECint95 performance improvement in 6 years, with an 8.3x improvement in cycle time and 3.5x in architecture).
The IBM study you are quoting (A. Hartstein et al.?), which used time per instruction execution to define performance, was again using outdated workloads and architectures. It also assumed an infinite cache model and used workloads with a lot of ILP and minimal stalls. They also found the optimum to be very different for the given workloads (10-28 stages for traditional vs. modern vs. old SPEC).
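The shape of that optimum-depth result is easy to reproduce with a toy sweep: deeper pipelines shorten the cycle (less logic per stage) but pay a flush penalty roughly proportional to depth, so time-per-instruction has a minimum somewhere in between. The constants below are invented for illustration and are not taken from the Hartstein et al. paper:

```python
# Toy pipeline-depth sweep: per-stage overhead vs hazard (flush) penalty.
# Constants are invented for illustration, not taken from the IBM paper.
T_LOGIC_PS = 360.0      # total logic delay around the critical loop
T_OVERHEAD_PS = 30.0    # latch/skew overhead added per stage
MISPREDICT_RATE = 0.03  # mispredictions per instruction (illustrative)

best = None
for stages in range(4, 40):
    cycle_ps = T_LOGIC_PS / stages + T_OVERHEAD_PS
    # a flush costs roughly one pipeline depth worth of cycles
    cycles_per_instr = 1.0 + MISPREDICT_RATE * stages
    time_per_instr = cycle_ps * cycles_per_instr
    if best is None or time_per_instr < best[1]:
        best = (stages, time_per_instr)

print(best[0])  # prints 20 with these constants
```

With these made-up constants the optimum lands at 20 stages, inside the 10-28 range the study reported; change the mispredict rate or per-stage overhead and the optimum moves, which is exactly why the result is workload-dependent.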
What is low FO4 and what is medium IPC?
Because for us, high and low IPC are relative to the competition and era. Skylake would be high right now. And IPC is also heavily influenced by caches, trace/uop caches and branch prediction.