Nvidia Denver... finally here... and it looks good

xpea · Aug 11, 2014

From Hot Chips 2014: Nvidia's long-awaited custom 64-bit ARMv8 CPU, built on TSMC 28nm and running at 2.5GHz, is claimed to be faster than Intel's 22nm Haswell Celeron 2955U (1.4GHz base clock) on the SPECint bench.

details on the uarch:
http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/

whitepaper here: http://www.tiriasresearch.com/downloads/nvidia-charts-its-own-path-to-armv8/
(free registration)

Khato · Aug 11, 2014

Will be interesting to see if there was any more information provided than what's in the whitepaper. Since the whitepaper, well, it goes to the level of detail that I've come to expect from NVIDIA, which is to say not much.

As for the performance figures in the whitepaper... Depends entirely upon the power consumption necessary to reach that frequency. And given the fact that they went from 4x A15 to 2x Denver I'm not exactly optimistic with respect to the power consumption figures. A 2.5 GHz Denver trading blows with a 1.4 GHz Haswell Celeron isn't exactly a stunning success unless it does so at Silvermont power levels. (As a note, it looks like the single-threaded performance ends up around 50% higher than a 2.4 GHz Silvermont.)

xpea · Aug 11, 2014

Khato said:
A 2.5 GHz Denver trading blows with a 1.4 GHz Haswell Celeron isn't exactly a stunning success unless it does so at Silvermont power levels. (As a note, it looks like the single-threaded performance ends up around 50% higher than a 2.4 GHz Silvermont.)

well, it's the first time that an ARM CPU can touch intel big Core performance and it's already an amazing achievement.
For power level, I don't see it higher than the 4 A15s in K1-32

Sweepr · Aug 11, 2014

A8 vs Denver should be interesting, I really like the two fast cores approach. Note that a 1.4GHz capped Haswell (2MB L3) isn't the most accurate comparison. If leaked specs are correct then fastest 4.5W TDP Broadwell-Y runs @ up to 2.6GHz (1.1GHz base, 4 threads, 4MB L3) + 24 EUs @ 850MHz GPU. Should be out even sooner than Denver.

Arachnotronic · Aug 11, 2014

Sweepr said:
A8 vs Denver should be interesting, I really like the two fast cores approach. Note that a 1.4GHz capped Haswell (2MB L3) isn't the most accurate comparison. If leaked specs are accurate fastest 4.5W TDP Broadwell-Y runs @ up to 2.6GHz (1.1GHz base) + 24 EUs @ 850MHz.

A8, Denver, and Broadwell-Y...should be a nice year for cool CPUs

Khato · Aug 11, 2014

xpea said:
well, it's the first time that an ARM CPU can touch intel big Core performance and it's already an amazing achievement.
For power level, I don't see it higher than the 4 A15s in K1-32

Eh, I wouldn't call it that amazing given how far they have to lower the goal posts in order to achieve such. Granted they at least managed to trade blows with Haswell running at 56% the frequency instead of having to step down to the lowest 1.1 GHz model.
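For reference, the 56% figure is just the ratio of the two clocks quoted in this thread. A quick sketch of that arithmetic follows; the "equal score" premise in the comment is an assumption for illustration, not a measured result.

```python
denver_ghz = 2.5    # Denver clock quoted in the thread
haswell_ghz = 1.4   # Celeron 2955U clock quoted in the thread

# Haswell's clock expressed as a fraction of Denver's clock.
clock_ratio = haswell_ghz / denver_ghz
print(f"Haswell clock as a fraction of Denver's: {clock_ratio:.0%}")

# Assumption: if both chips posted the same score on some benchmark,
# Denver's per-clock throughput on that benchmark would be this
# fraction of Haswell's.
per_clock_vs_haswell = clock_ratio
```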

As for power, maybe 2x Denver is lower than 4x A15, or maybe not. I don't believe I've seen any claim from NVIDIA in that respect, which likely means it isn't good news.

Enigmoid · Aug 11, 2014

Note that the Celeron 2955U is locked to 1.4 GHz.

Also note that the Celeron 2955U gets around 2200-2400 in Geekbench 3 multicore, well below what the S800 and the Tegra 4 would get (up to 3000 unthrottled). Beating the 2955U is therefore not terribly impressive.

xpea · Aug 11, 2014

Intel17 said:
A8, Denver, and Broadwell-Y...should be a nice year for cool CPUs

too bad they are not direct competitors, so it will be very difficult to evaluate their own merit. But at least they are the best of each ecosystem (iOS / Android / Windows)

xpea · Aug 11, 2014

Khato said:
Eh, I wouldn't call it that amazing given how far they have to lower the goal posts in order to achieve such. Granted they at least managed to trade blows with Haswell running at 56% the frequency instead of having to step down to the lowest 1.1 GHz model.

So why has nobody done it before? How would you have done it better?
And you forget that little huuuuuuuuuge Intel process advantage: 22nm FinFET vs. 28nm planar.
Do you really expect someone to beat Intel in CPUs, knowing their insane manufacturing advantage?

TuxDave · Aug 11, 2014

xpea said:
So why has nobody done it before? How would you have done it better?
And you forget that little huuuuuuuuuge Intel process advantage: 22nm FinFET vs. 28nm planar.
Do you really expect someone to beat Intel in CPUs, knowing their insane manufacturing advantage?

Well.... AMD did a pretty good job back in the P4 days.

FatherMurphy · Aug 11, 2014

I wonder if this Transmeta binary code translation is the real deal, the "secret sauce" that might be Nvidia's edge in a very competitive market, where its competitors in the CPU field are far more equipped/experienced to produce custom designs. If it works so well, why isn't anyone else doing it? Or is this a one-off, niche type design that won't serve Nvidia well in the long-term?

Anyone capable of explaining the binary code translation to a layman like me? Based on my reading of the whitepaper and other sites, the Transmeta method failed because, years ago when Transmeta implemented its technology, hardware was not fast enough to take advantage of it. That is, the software overhead of translation resulted in a net efficiency loss. But now hardware is much more powerful yet power-efficient, and (according to Nvidia) the Transmeta method, refined, now results in a net efficiency gain.

Interesting stuff. I look forward to Anandtech's analysis.

Khato · Aug 11, 2014

xpea said:
So why has nobody done it before? How would you have done it better?
And you forget that little huuuuuuuuuge Intel process advantage: 22nm FinFET vs. 28nm planar.
Do you really expect someone to beat Intel in CPUs, knowing their insane manufacturing advantage?

Because it's well outside of the performance targets that ARM has been interested in.

And no, I don't expect anyone to beat Intel, quite the opposite actually. But declaring how awesome an architecture is merely because of it being capable of trading blows with Haswell running at a fraction of its design point without any indication of power consumption while doing so is questionable at best. If Denver is providing that level of performance on half a watt of power consumption then yes, it'd be an awesome architecture set for world domination. Whereas if it's burning through five watts it's a non-story and NVIDIA would've been better off sticking to the ARM designed cores.

xpea · Aug 11, 2014

Khato said:
Because it's well outside of the performance targets that ARM has been interested in.

And no, I don't expect anyone to beat Intel, quite the opposite actually. But declaring how awesome an architecture is merely because of it being capable of trading blows with Haswell running at a fraction of its design point without any indication of power consumption while doing so is questionable at best. If Denver is providing that level of performance on half a watt of power consumption then yes, it'd be an awesome architecture set for world domination. Whereas if it's burning through five watts it's a non-story and NVIDIA would've been better off sticking to the ARM designed cores.

For power consumption, let's wait for the reviews, shall we...
Regarding the other things, I totally disagree with you. What's the point of Nvidia making another A57 implementation? Doing the same as Mediatek and Allwinner? Useless, they would be crushed on price with no differentiation whatsoever. And they would miss a critical window this fall.
Need I remind you that Denver will be here at least 6 months before Qualcomm's own 64-bit uarch, with perfect timing to put them in the spotlight for this year's hot season and the Android L release? The new Nexus tablet will have TK1-64bit.
No, really, I disagree with you. Despite all the delays, Nvidia is in a very good position with Denver to make a hit in Android TV / tablet / Chromebook devices this Xmas.

xpea · Aug 11, 2014

Of course, consumers will probably care more about performance improvements, with NVIDIA also claiming a bump in battery life too. The first devices to use Tegra K1 "Denver" will hit the market later this year, NVIDIA says

source: http://www.slashgear.com/nvidia-details-tegra-k1-denver-tomorrows-android-superchip-11340691/

tviceman · Aug 11, 2014

Sounds interesting. Hopefully it delivers!

Madpacket · Aug 11, 2014

Hmm, faster game emulation would be awesome. Hopefully nvidia releases a new Shield handheld with this processor, leave the 32 bit stop-gap one for the tablet.

ams23 · Aug 11, 2014

Sweepr said:
A8 vs Denver should be interesting, I really like the two fast cores approach.

The A8 CPU performance should get close to Denver but may not match Denver in some benchmarks. The Google Octane v2.0 score for Denver is reportedly more than 2x higher than A7-Cyclone!

ams23 · Aug 11, 2014

Enigmoid said:
Also note that the Celeron 2955U gets around 2200-2400 in Geekbench 3 multicore, well below what the S800 and the Tegra 4 would get (up to 3000 unthrottled). Beating the 2955U is therefore not terribly impressive.

Look carefully at the graph above that NVIDIA presented. The Haswell [2955U] core that they benchmarked scores much higher than both Cortex A15 (in Tegra 4 and Tegra K1 32-bit variant) and Krait 400 (in S800) in Geekbench 3 Single-Core. In several of the benchmark tests, this Haswell [2955U] core is superior in performance to the A7-Cyclone core.

S800, Tegra 4, Tegra K1 32-bit variant all have twice as many cores as Haswell [2955U], A7-Cyclone, and Denver, so naturally they would look good in comparison using Multi-Core benchmarks. That said, even in Multi-Core benchmarks, dual-core Denver should come close to matching quad-core Cortex A15.

tviceman · Aug 11, 2014

ams23 said:
The A8 CPU performance should get close to Denver but will probably not match Denver in most benchmarks. The Geekbench 3 Single-Core and Google Octane v2.0 scores for Denver are reportedly more than 2x higher than A7-Cyclone!

Gotta wonder about that power consumption when under full load. It'd be nice if it was LESS than Tegra K1's 4 A15 cores.

jdubs03 · Aug 11, 2014

ams23 said:
Look carefully at the graph above that NVIDIA presented. The Haswell [2955U] core that they benchmarked scores much higher than both Cortex A15 (in Tegra 4 and Tegra K1 32-bit variant) and Krait 400 (in S800) in Geekbench 3 Single-Core. In several of the benchmark tests, this Haswell [2955U] core is superior in performance to the A7-Cyclone core.

S800, Tegra 4, Tegra K1 32-bit variant all have twice as many cores as Haswell [2955U], A7-Cyclone, and Denver, so naturally they would look good in comparison using Multi-Core benchmarks. That said, even in Multi-Core benchmarks, dual-core Denver should come close to matching quad-core Cortex A15.

In terms of single-thread performance (and multi), the 2955U gets a bit less than the A7 in 64-bit performance, though for 32-bit it does score considerably (~200 points single, ~500 points multi) higher. If Denver is only at ~2955U levels, I would be massively disappointed.

This information confuses me too, because the ST score for TK1-32bit is ~1120, with multi at ~3450. How could Denver only score ~200 points higher than the 32-bit part? Doesn't make sense to me.

If Denver were to match the multi-thread score of K1-32bit (which it should, or else I consider that another disappointment), I would expect the single-thread to be significantly higher than 1300-1400. It looks like I may have been wrong with my prediction of a ST score of 2000+.

ams23 said:
The A8 CPU performance should get close to Denver but will probably not match Denver in most benchmarks. The Geekbench 3 Single-Core and Google Octane v2.0 scores for Denver are reportedly more than 2x higher than A7-Cyclone!

Source for this? I haven't fully looked at this stuff, but if the single-thread score was almost 2x the score of A7, the score would be around ~2700. And I don't think the A8 is going to get that close, though I think the A8 could get to around ~2100.

ams23 · Aug 11, 2014

tviceman said:
Gotta wonder about that power consumption when under full load. It'd be nice if it was LESS than Tegra K1's 4 A15 cores.

Who knows, but it's pretty safe to say that CPU single-core perf. per watt is way better with Denver than R3 Cortex A15.

Here is some general info from the paper:

The most unique aspect of Denver is the dynamic code optimization. The core microarchitecture of the CPU is unique in that it has an in-order pipeline, but uses special software to reorder and optimize instruction traces. During repetitive code sequences, the Denver CPU collects dynamic runtime information during code execution and passes this information to the dynamic code optimizer; enabling the optimizer to assess more optimized ways for the code to be executed. The CPU uses hidden time slices to run the optimizer or can use the second core for optimizations for the active core.

The dynamic optimizer runs in its own private and protected state and is not visible to the operating system or any user code. The signed and encrypted dynamic optimizer code loads at boot into a protected part of main memory. By performing the reordering and register renaming in software, Denver eliminates the power hungry out-of-order control logic and yet it can achieve comparable results.

The profiler gathers info on program flow such as branch results (such as taken, not taken, strongly taken, and strongly not taken) and other hardware statistics tables and counters. The optimizer (Figure 1) recognizes opportunities to improve execution and then can rename registers, reorder loads and stores, improve control flow, remove redundant code, hoist redundant computations, perform loop unrolling, and other common optimizations. Because the run-time software performs optimization, the profiler can look over a much larger instruction window than is typically found in hardware out-of-order (OoO) designs. Denver could optimize over a 1,000 instruction window, while most OoO hardware is limited to a 192 instruction window or smaller. The dynamic code optimizer will continue to evaluate profile data and can perform additional optimizations on the fly.
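To make the profile-then-optimize loop described above a bit more concrete, here is a toy sketch. It is not NVIDIA's implementation: the names (`DynamicOptimizer`, `HOT_THRESHOLD`, `optimize`) and the threshold value are hypothetical, and the "optimization" is just dropping back-to-back duplicate ops as a stand-in for the real transformations (renaming, reordering, unrolling, and so on).

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical: runs before a trace counts as "hot"


def optimize(trace):
    """Stand-in optimizer: drop back-to-back duplicate ops."""
    out = []
    for op in trace:
        if not out or out[-1] != op:
            out.append(op)
    return tuple(out)


class DynamicOptimizer:
    def __init__(self):
        self.counts = Counter()  # profiler: how often each trace has run
        self.cache = {}          # optimized traces kept for reuse

    def run(self, trace):
        """Return (path_taken, ops_executed) for one execution of a trace."""
        trace = tuple(trace)
        if trace in self.cache:
            # Hot code: execute the previously optimized version.
            return "optimized", self.cache[trace]
        # Cold code: execute directly and keep profiling.
        self.counts[trace] += 1
        if self.counts[trace] >= HOT_THRESHOLD:
            self.cache[trace] = optimize(trace)
        return "direct", trace


opt = DynamicOptimizer()
loop_body = ["load r1", "load r1", "add r2", "store r2"]
paths = [opt.run(loop_body)[0] for _ in range(5)]
print(paths)
```

In this toy, the first few runs of a repeated sequence take the direct path while the profiler counts them, and later runs reuse the cached optimized version.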

ams23 · Aug 11, 2014

jdubs03 said:
In terms of single-thread performance (and multi), the 2955U gets a bit less than the A7 in 64-bit performance, though for 32-bit it does score much higher. If Denver is only at ~2955U levels, I would be massively disappointed.
If Denver were to match the multi-thread score of K1-32bit (which it should, or else I consider that another disappointment), I would expect the single-thread to be considerably higher than 1300-1400. It looks like I may have been wrong with my prediction of a ST score of 2000+.

Source for this? I haven't fully looked at this stuff, but if the single-thread score was almost 2x the score of A7, the score would be around ~2700. And I don't think the A8 is going to get that close, though I think the A8 could get to around ~2100.

See the graph I linked to a few posts above from the paper: http://forums.anandtech.com/showpost.php?p=36609897&postcount=17

Denver should be significantly ahead of A7-Cyclone with most benchmarks. Looking at the Haswell variant that NVIDIA used, it is well ahead of A7-Cyclone in most of the benchmarks too other than Geekbench 3 Single-Core (where it is roughly equal, if perhaps just slightly behind), and DMIPS and Memset benchmarks (where it is behind).

That said, A8 CPU should have fairly similar performance to this Denver CPU.

tviceman · Aug 11, 2014

ams23 said:
Who knows, but it's pretty safe to say that CPU single-core perf. per watt is way better with Denver than R3 Cortex A15.

We'll see. The low-level software translation layer gives me worry. Transmeta did this, and to my limited knowledge, didn't have the best of success when it came to final product performance. Of course, we're talking ex Transmeta engineers creating Denver now, so who knows if they've been able to significantly improve it in the 7+ years since they last did it.

What is interesting to me is that it sounds like Denver is instruction set agnostic. If Nvidia were to ever acquire an x86 license, it'd be on. :wishful thinking:

ams23 · Aug 11, 2014

tviceman said:
We'll see. The low-level software translation layer gives me worry. Transmeta did this, and to my limited knowledge, didn't have the best of success when it came to final product performance. Of course, we're talking ex Transmeta engineers creating Denver now, so who knows if they've been able to significantly improve it in the 7+ years since they last did it.

What is interesting to me is that it sounds like Denver is instruction set agnostic. If Nvidia were to ever acquire an x86 license, it'd be on. :wishful thinking:

Heh, yes, that is wishful thinking

According to NVIDIA:

The slight overhead of the dynamic optimization process is outweighed by the performance gains of already having optimized code ready to execute. In cases where code may not be frequently reused, Denver can process those ARM instructions directly without going through the dynamic optimization process, delivering the best of both worlds.

Dynamic Code Optimization works with all standard ARM-based applications, requiring no customization from developers, and without added power consumption versus other ARM mobile processors. That’s because the 7-wide superscalar design allows faster throughput than would otherwise be possible at the same clock speed.
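The "overhead is outweighed by the gains" claim is really a break-even argument: optimizing only pays off if the code is reused enough times. A rough sketch of that arithmetic, with entirely made-up cost numbers (the function name and all the values are assumptions, not figures from NVIDIA):

```python
import math


def break_even_runs(optimize_cost, direct_cost, optimized_cost):
    """Smallest run count at which optimizing first is cheaper overall.

    Optimizing wins when
        optimize_cost + n * optimized_cost < n * direct_cost.
    Returns None if the optimized code is no faster per run.
    """
    saving = direct_cost - optimized_cost
    if saving <= 0:
        return None
    return math.floor(optimize_cost / saving) + 1


# Hypothetical costs in arbitrary units (e.g. cycles).
runs_needed = break_even_runs(optimize_cost=1000, direct_cost=10, optimized_cost=6)
print(runs_needed)
```

Code that runs fewer times than the break-even count is better off on the direct path, which is the case the quote says Denver handles without going through the optimizer.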

jdubs03 · Aug 11, 2014

ams23 said:
Heh, yes, that is wishful thinking

According to NVIDIA:

Alright yeah, I was only going off 2955U ST perf; I didn't even glance at that chart (abnormal occurrence there) or notice that it's quite a bit higher. My apologies.

Here is a bit of extrapolation data:

I had to post this on Facebook to get this to work lol. The ST perf is 33.33% higher than the A7. So ~1800-1950 (which makes sense as 1120*1.625 = 1820), not too far away from my 2000+ guess. It can give you an idea of the other metrics too.
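Spelling out that extrapolation as a quick sketch: the TK1-32bit score and both multipliers are the ones used in this thread, while the A7 score is an assumption (~1350, implied by the earlier "2x A7 would be ~2700" remark), not a number read off the chart.

```python
K1_32_ST = 1120        # TK1-32bit single-thread score quoted in the thread
A7_ST_ASSUMED = 1350   # assumption: implied by "2x A7 would be ~2700"

# Estimate 1: scale the TK1-32bit score by the 1.625x factor used above.
from_k1 = K1_32_ST * 1.625

# Estimate 2: 33.33% higher than the assumed A7 score.
from_a7 = A7_ST_ASSUMED * (1 + 1 / 3)

print(f"from TK1-32bit: ~{from_k1:.0f}, from A7: ~{from_a7:.0f}")
```

Both routes land at the low end of the ~1800-1950 range under these assumptions.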

To think that they can get this kind of performance at 28nm planar, while challenging Core M and its 14nm 2nd gen tri-gate (even with the impressive reduction from 11.5W TDP to 4.5W TDP, with ~+performance) is pretty outstanding. I would also assume the power consumption is on par with TK1-32bit.
