Nvidia Denver... finally here... and it looks good


tviceman

Diamond Member
Mar 25, 2008
Apple will likely beat Denver in CPU performance with the A8, but being on 20nm gives it a significant advantage, and it's not entirely Nvidia's concern anyway, since no one besides Apple uses Apple's SoCs.
 

Fox5

Diamond Member
Jan 31, 2005
I wonder if this Transmeta-style binary code translation is the real deal, the "secret sauce" that might be Nvidia's edge in a very competitive market where its competitors in the CPU field are far more equipped and experienced to produce custom designs. If it works so well, why isn't anyone else doing it? Or is this a one-off, niche type of design that won't serve Nvidia well in the long term?

Can anyone explain the binary code translation to a layman like me? Based on my reading of the whitepaper and other sites, the Transmeta method failed because, years ago when Transmeta implemented its technology, hardware was not fast enough to take advantage of the approach. That is, the software overhead of translation resulted in a net efficiency loss. But hardware is now much more powerful while remaining power-efficient, and (according to Nvidia) the refined Transmeta method now yields a positive efficiency gain.
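As I understand it, it comes down to whether the one-time translation cost gets amortized by faster re-execution of the translated code. A toy break-even sketch (my own made-up numbers, not anything from the whitepaper):

```python
# Toy break-even model for dynamic binary translation (illustrative numbers only,
# not from NVIDIA's whitepaper). A block is worth translating once the cycles
# saved per execution, times the number of executions, exceed the one-time
# translation cost.

def break_even_runs(translate_cost, interpreted_cost, translated_cost):
    """Executions needed before translating a code block pays for itself."""
    savings_per_run = interpreted_cost - translated_cost
    if savings_per_run <= 0:
        return float("inf")   # translation never pays off for this block
    return translate_cost / savings_per_run

# Hypothetical hot block: 10,000 cycles to translate/optimize, 500 cycles per
# run through the slow path, 100 cycles per run once translated.
print(break_even_runs(10_000, 500, 100))   # -> 25.0 runs to break even
```

Hot loops that run thousands of times pay the translation cost back almost immediately; one-shot code never does, which is presumably why Denver keeps a hardware ARM decoder around for cold code.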

Interesting stuff. I look forward to Anandtech's analysis.

I don't understand why this dynamic recompilation is necessary. Most Android apps already run in the Dalvik VM, which should be able to do similar dynamic recompilation. The only exception would be if the ARM instruction set is just poorly designed for high performance.
That said, being that the cores are in-order, they should be pretty power efficient, and maybe nvidia was able to add some extra hardware to speed up the recompilation. Still, it seems like they're duplicating effort to be going from Dalvik -> ARM -> nvidia ucode.
 

ams23

Senior member
Feb 18, 2013
Ok, here is roughly what the graph shows regarding performance relative to R3 Cortex A15 in Tegra K1:

DMIPS
Baytrail (Celeron N2910): 0.45x
S800 (Krait 400 8974AA): 0.95x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.30x
Haswell (Celeron 2955U): 1.00x
Tegra K1 (Denver): 1.80x

SPECInt 2K
Baytrail (Celeron N2910): 0.70x
S800 (Krait 400 8974AA): 0.60x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.90x
Haswell (Celeron 2955U): 1.30x
Tegra K1 (Denver): 1.45x

SPECFP 2K
Baytrail (Celeron N2910): 0.85x
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): N/A
Haswell (Celeron 2955U): 1.95x
Tegra K1 (Denver): 1.75x

AnTuTu 4
Baytrail (Celeron N2910): N/A
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.70x
Haswell (Celeron 2955U): N/A
Tegra K1 (Denver): 1.00x

Geekbench 3 Single-Core
Baytrail (Celeron N2910): 0.65x
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.20x
Haswell (Celeron 2955U): 1.20x
Tegra K1 (Denver): 1.65x

Google Octane v2.0
Baytrail (Celeron N2910): 0.70x
S800 (Krait 400 8974AA): 0.65x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.70x
Haswell (Celeron 2955U): 1.45x
Tegra K1 (Denver): 1.30x

16MB Memcpy (GB/s)
Baytrail (Celeron N2910): 0.85x
S800 (Krait 400 8974AA): 0.80x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.15x
Haswell (Celeron 2955U): 1.55x
Tegra K1 (Denver): 1.40x

16MB Memset (GB/s)
Baytrail (Celeron N2910): 0.40x
S800 (Krait 400 8974AA): 0.75x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 0.80x
Haswell (Celeron 2955U): 0.65x
Tegra K1 (Denver): 1.05x

16MB Memread (GB/s)
Baytrail (Celeron N2910): 1.25x
S800 (Krait 400 8974AA): 1.55x
Tegra K1 (R3 Cortex A15): 1.00x
A7 (Cyclone): 1.85x
Haswell (Celeron 2955U): 2.55x
Tegra K1 (Denver): 2.60x


So on average, TK1-Denver delivers ~1.45x the performance of A7-Cyclone.

The A8 CPU is expected to be clocked ~ 40% higher than the A7-Cyclone CPU, so the overall performance should be similar to the TK1-Denver CPU.
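For reference, the ~1.45x above is roughly the arithmetic mean of the Denver/Cyclone ratios for the benchmarks where Cyclone has a result (assuming a plain average; the graph itself doesn't spell out the method):

```python
# Sanity check on the ~1.45x average: Denver vs. Cyclone ratios from the
# numbers above, arithmetic mean over the benchmarks where Cyclone has a result.
denver  = {"DMIPS": 1.80, "SPECint2k": 1.45, "AnTuTu4": 1.00, "GB3": 1.65,
           "Octane": 1.30, "memcpy": 1.40, "memset": 1.05, "memread": 2.60}
cyclone = {"DMIPS": 1.30, "SPECint2k": 0.90, "AnTuTu4": 0.70, "GB3": 1.20,
           "Octane": 0.70, "memcpy": 1.15, "memset": 0.80, "memread": 1.85}

ratios = [denver[k] / cyclone[k] for k in denver]
print(sum(ratios) / len(ratios))   # ~1.45

# A Cyclone-like core clocked ~40% higher would land at ~1.40x of Cyclone,
# i.e. right around Denver's ~1.45x average, hence the A8 comparison above.
```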
 

sefsefsefsef

Senior member
Jun 21, 2007
I don't understand why this dynamic recompilation is necessary. Most Android apps already run in the Dalvik VM, which should be able to do similar dynamic recompilation. The only exception would be if the ARM instruction set is just poorly designed for high performance.
That said, being that the cores are in-order, they should be pretty power efficient, and maybe nvidia was able to add some extra hardware to speed up the recompilation. Still, it seems like they're duplicating effort to be going from Dalvik -> ARM -> nvidia ucode.

It's recompiling from ARMv8 to a proprietary VLIW instruction format (meaning it's not out-of-order; all the instruction-level parallelism is encoded into the instructions themselves). Confirmed in January by Anand:

https://twitter.com/TheKanter/status/420283744847548416

I'm quite interested to see a new VLIW product. Maybe we will see The Mill some day. I can only dream.
 

jdubs03

Senior member
Oct 1, 2013
Maybe an out-of-order implementation once they get to node shrinks? To get even more performance out of those cores without so much worry about power dissipation.
 

sefsefsefsef

Senior member
Jun 21, 2007
Maybe an out-of-order implementation once they get to node shrinks? To get even more performance out of those cores without so much worry about power dissipation.

That goes against the very idea of VLIW. If there was more ILP to be extracted, then your compiler/recompiler should have found it. The advantage of VLIW is the highest IPC possible while still being an in-order processor. It's supposed to be the best of both worlds (highest performance and lowest power), as long as the compiler can do its job well enough.
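To make that concrete, here's a toy bundle packer (my own simplification, not Denver's actual microcode format): independent ops get packed into one wide instruction, and a dependency chain is exactly what stops the compiler from filling the slots.

```python
# Toy VLIW bundle packer (my own simplification; not Denver's actual microcode
# format). Ops are (name, set_of_inputs). Independent ops can share a bundle;
# an op whose input is produced in the current bundle has to wait for the next.

def pack_bundles(ops, width=7):           # Denver is described as 7-wide
    bundles, done = [], set()
    remaining = list(ops)
    while remaining:
        bundle, produced = [], set()
        for op, deps in list(remaining):
            if len(bundle) == width:
                break
            if deps <= done:              # all inputs computed in earlier bundles
                bundle.append(op)
                produced.add(op)
                remaining.remove((op, deps))
        bundles.append(bundle)
        done |= produced
    return bundles

ops = [("a", set()), ("b", set()), ("c", {"a"}), ("d", {"c"}), ("e", set())]
print(pack_bundles(ops))   # [['a', 'b', 'e'], ['c'], ['d']]
# a, b, e issue together; c then d serialize, so the last two bundles stay mostly empty.
```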
 

Khato

Golden Member
Jul 15, 2001
Maybe an out-of-order implementation once they get to node shrinks? To get even more performance out of those cores without so much worry about power dissipation.

That kinda goes against the idea of such an architecture - tailor the code to the architecture via pre-compiling such that out of order execution isn't even necessary.

NVIDIA is basically trying to follow in the footsteps of Transmeta and Itanium (Itanium being the same concept, except that it forces code to be compiled that way ahead of time rather than translating at runtime). Which isn't necessarily a bad thing, since if you can get it right it provides excellent efficiency and performance, from what I recall of the theory - please correct me if I'm remembering incorrectly, as that's from something like ten years ago now.
 

mavere

Member
Mar 2, 2005
That kinda goes against the idea of such an architecture - tailor the code to the architecture via pre-compiling such that out of order execution isn't even necessary.

The dynamic optimizer might make sense for Nvidia's 1st custom design, but I don't think that tool explicitly excludes any other tool.

It's still a software system, which means there's automatically room for dedicated logic to improve efficiency. It's still a JIT system, which means there's automatically room for static+guided analysis during native compilation to improve efficiency.
 

witeken

Diamond Member
Dec 25, 2013
I think the IPC is less than a lot of people expected; it's not comparable at all with Core or Cyclone. It makes up for the lower IPC with a higher clock speed, but Core M has a turbo that is comparable to that. Compared to Krait, Cortex-A, and Atom, though, it seems very good, as does its GPU performance. Power remains to be seen, and I don't expect very low prices either. Design wins and time to market are also unclear to me.
 

NTMBK

Lifer
Nov 14, 2011
Any word whether the optimization engine can be patched/updated after the SoC has shipped? I wonder whether we will see GPU-style driver updates, with specialised code paths for high profile software/commonly benchmarked games.
 

ams23

Senior member
Feb 18, 2013
According to the paper, Denver's Dynamic Code Optimization does not rely on any benchmark-specific code to achieve its measured performance.
 

ams23

Senior member
Feb 18, 2013
I think the IPC is less than a lot of people expected; it's not comparable at all with Core or Cyclone. It makes up for the lower IPC with a higher clock speed, but Core M has a turbo that is comparable to that. Compared to Krait, Cortex-A, and Atom, though, it seems very good, as does its GPU performance. Power remains to be seen, and I don't expect very low prices either. Design wins and time to market are also unclear to me.

What ultimately matters most is real-world performance per watt at real-world operating frequencies, and that is still to be determined.

TK1-Denver is rumored to power the very first 64-bit Android tablet this fall running the Android L OS in the form of a Google Nexus 8.9" tablet built by HTC.
 

Exophase

Diamond Member
Apr 19, 2012
I don't understand why this dynamic recompilation is necessary. Most Android apps already run in the Dalvik VM, which should be able to do similar dynamic recompilation. The only exception would be if the ARM instruction set is just poorly designed for high performance.
That said, being that the cores are in-order, they should be pretty power efficient, and maybe nvidia was able to add some extra hardware to speed up the recompilation. Still, it seems like they're duplicating effort to be going from Dalvik -> ARM -> nvidia ucode.

On the contrary, a huge portion of Android apps contains NDK code running native ARM, x86, or MIPS instructions, especially if you bias the list towards the most popular and most performance-demanding apps.

Besides that, Android has already been transitioning Dalvik bytecode compilation from JIT to AOT (on install) with ART.

One other thing that you're missing is that a big part of the translation/optimization quality for Denver is its ability to exploit runtime information from performance counters, both before first compilation and for recompilation. If it's anything like Transmeta's CPUs, it also has some mechanisms for speculation where recovery will often involve changing how the blocks are compiled. So dynamic translation is a big element of the design.
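Very roughly, that feedback loop looks something like this (my own sketch; the thresholds, names, and structure are made up for illustration, not anything NVIDIA has published):

```python
# Rough sketch of profile-driven (re)translation. Thresholds, names, and
# structure are invented for illustration; nothing here comes from NVIDIA.

HOT_THRESHOLD = 1000     # executions before a region is optimized aggressively
DEOPT_THRESHOLD = 50     # speculation failures before recompiling conservatively

def translate(arm_code, speculative):
    # Stand-in for the real optimizer; returns a fake "native" trace.
    return ("native", arm_code, speculative)

class Region:
    def __init__(self, arm_code):
        self.arm_code = arm_code
        self.exec_count = 0          # fed by hardware performance counters
        self.guard_failures = 0      # bumped whenever a speculation guard trips
        self.native = None

    def run(self):
        self.exec_count += 1
        if self.native is None and self.exec_count > HOT_THRESHOLD:
            # First translation: use the profile gathered so far (branch bias,
            # load behavior, ...) to speculate aggressively.
            self.native = translate(self.arm_code, speculative=True)
        elif self.native and self.guard_failures > DEOPT_THRESHOLD:
            # Speculation keeps failing: throw the trace away and recompile
            # with fewer assumptions baked in.
            self.native = translate(self.arm_code, speculative=False)
            self.guard_failures = 0

region = Region("...hot ARM loop...")
for _ in range(2000):
    region.run()
print(region.native)   # translated once the region crossed HOT_THRESHOLD
```

The point is that the same counters feed both the decision to translate a region in the first place and the decision to throw a translation away and redo it with less aggressive speculation.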
 

monstercameron

Diamond Member
Feb 12, 2013
Any word whether the optimization engine can be patched/updated after the SoC has shipped? I wonder whether we will see GPU-style driver updates, with specialised code paths for high profile software/commonly benchmarked games.

interesting thought.
 

Phynaz

Lifer
Mar 13, 2006
I remember when Nvidia claimed one of their chips was faster than Core 2. Turned out to be a doctored benchmark.
 

DrMrLordX

Lifer
Apr 27, 2000
I don't understand why this dynamic recompilation is necessary. Most Android apps already run in the Dalvik VM, which should be able to do similar dynamic recompilation. The only exception would be if the ARM instruction set is just poorly designed for high performance.

That depends on whom you ask. Opinions differ, of course.

Bottom line, though, is that there are some applications out there using native code instead of relying 100% on Dalvik. Such apps may be well-represented within the suite of software that one would reasonably expect to run on an Android device.
 

itsmydamnation

Platinum Member
Feb 6, 2011
Nice to see something new and exciting in CPU land... not the boring, continuous march of Intel. Will be interesting to see how it performs.
 