Intel Skylake / Kaby Lake

Drazick · Nov 2, 2016

wingman04 said:
I would be interested to see Data / Signal / Image processing dual channel bench test vs quad channel bench test?

It is simple, think of taking an array, very large array (25,000 x 25,000) and just add it a constant to each element.
Think of having 2 of those and just multiplying element by element.
Apply 2D convolution on it with very small kernel (5 x 5).

Let's do the math for the first one.
The CPU has 4 cores each has 2 AVX units (If I remember correctly).
Assuming the array is float, the AVX can handle 8 elements per cycle.

At 3 [GHz] it means:

8 [Element] * 4 [Bytes / (Element * AVX Unit)] * 2 [AVX Unit / Core] * 4 [Core] * 3 [GHz] = 256 [Bytes] * 3 [GHz] = 768 [GBytes].

Namely for highly dense AVX operation (As Signal / Data processing is) we'd need 768 [GBytes] of bandiwdth.
Of course it is unrealistic, and we have memory hierarchy but as you can imagine, with large arrays and the small caches it means that at some point the CPU will be waiting for data from memory.

Hence any increase of that will benefit the overall performance.

By the way, for instant, Gaussian Blur in Photoshop on 48 [MP] image is totally Memory Bounded.

witeken · Nov 2, 2016

mikk said:
Almost twice as fast at the same clock speed.

They tested notebooks with a poor cooling (therefore CPU performance down in some cases), but what's interesting that the GPU Turbo stays so high. They said that a newer driver gives higher priority to the GPU over CPU and that gave them a boost in games on their Kabylake devices. That explains the surprising gaming results on 15W Kabylake. Because in the past for this increase Intel required a new GPU generation.

I'm not sure my XPS 13 with Kaby Lake is that much faster than SKL, but I haven't done any graphics benchmarks on it.

witeken · Nov 2, 2016

Has anyone actually looked at the die size of Kabylake? Is the density meaningfully different ?

Arachnotronic · Nov 2, 2016

witeken said:
Has anyone actually looked at the die size of Kabylake? Is the density meaningfully different ?

You are welcome to take your system apart to measure the die size for us! Consider it a public service

But seriously, I'm sure once the desktop parts are out they will be delidded and somebody will measure them.

wingman04 · Nov 2, 2016

Drazick said:
It is simple, think of taking an array, very large array (25,000 x 25,000) and just add it a constant to each element.
Think of having 2 of those and just multiplying element by element.
Apply 2D convolution on it with very small kernel (5 x 5).

Let's do the math for the first one.
The CPU has 4 cores each has 2 AVX units (If I remember correctly).
Assuming the array is float, the AVX can handle 8 elements per cycle.

At 3 [GHz] it means:

8 [Element] * 4 [Bytes / (Element * AVX Unit)] * 2 [AVX Unit / Core] * 4 [Core] * 3 [GHz] = 256 [Bytes] * 3 [GHz] = 768 [GBytes].

Namely for highly dense AVX operation (As Signal / Data processing is) we'd need 768 [GBytes] of bandiwdth.
Of course it is unrealistic, and we have memory hierarchy but as you can imagine, with large arrays and the small caches it means that at some point the CPU will be waiting for data from memory.

Hence any increase of that will benefit the overall performance.

By the way, for instant, Gaussian Blur in Photoshop on 48 [MP] image is totally Memory Bounded.

Drazick I think you are wrong.
AVX2 Is advanced Vector Extensions are extensions to the x86 instruction set architecture. Think of AVX like a buffer for Registers commands to the 64bit architecture do to the lack of memory speed. Modern processors uses 64bit adder per clock cycle, 64-bit computing is the use of processor that have datapath widths, integer size, and memory address widths of 64 bits. The 64 Bit bottleneck is why memory bandwidth does not help much in any benchmarks. With 64bit processors, faster 64 bit memory speed is needed.

Drazick · Nov 2, 2016

@wingman04 , It has nothing to do with 64 Bit or anything else.
It has to do with the throughput of data getting into CPU and the throughput of the CPU computation.

Modern i7 can make data processing at a rate which is order of magnitude faster of the memory BW.
In our days where data is large it becomes a real bottleneck.

That's why GPU's has much larger Memory Throughput (FPGA's as well).

For data science the number one issue with current CPU's is the bandwidth.
We want more eDRAM, Cache and Memory Throughput.

mikk · Nov 2, 2016

i5-7600k test: http://diy.pconline.com.cn/851/8515185_all.html

They are using only DDR4-2133 for Kabylake.

Sweepr · Nov 2, 2016

NotebookCheck MacBook Pro 13'' (Late 2016) Review

http://www.notebookcheck.com/Test-A...-2-GHz-i5-ohne-Touch-Bar-Laptop.182091.0.html

Yet another chinese Core i5-7600K review. 9% faster than i5-6600K at stock, curiously 1% faster at equal clocks.

http://diy.pconline.com.cn/851/8515185_all.html

mikk said:
Almost twice as fast at the same clock speed.

https://www.computerbase.de/2016-11/intel-core-i7-7500u-test/2/

They tested notebooks with a poor cooling (therefore CPU performance down in some cases), but what's interesting that the GPU Turbo stays so high. They said that a newer driver gives higher priority to the GPU over CPU and that gave them a boost in games on their Kabylake devices. That explains the surprising gaming results on 15W Kabylake. Because in the past for this increase Intel required a new GPU generation.

That's very useful, thanks. Looking at ComputerBase data:

- Kaby Lake-U can sustain iGPU Turbo after a 30-min gaming session, unlike Skylake-U
- Biggest improvement is seen when both CPU and iGPU are taxed - 'U' parts with Iris 640/650 will benefit immensely from this
- Gaming performance is up by >20-30% as a result (HD 620 vs HD 520)
- Plays their ''Jellyfish'' HEVC Main10 (Ultra HD, 400 Mbit/s) test video smooth with 2-3% CPU usage, while i7-6700K @ 4.5 GHz is at 75% and i7-6700K @ 4.5 GHz + RX 460 is at 5%

wingman04 · Nov 2, 2016

Drazick said:
@wingman04 , It has nothing to do with 64 Bit or anything else.
It has to do with the throughput of data getting into CPU and the throughput of the CPU computation.

Modern i7 can make data processing at a rate which is order of magnitude faster of the memory BW.
In our days where data is large it becomes a real bottleneck.

That's why GPU's has much larger Memory Throughput (FPGA's as well).

For data science the number one issue with current CPU's is the bandwidth.
We want more eDRAM, Cache and Memory Throughput.

A CPU cache is faster memory which stores copies of the data from frequently used main memory locations, that is why it does not help much with more or less cache because the data has to be frequently used over again like Prime95 instruction sets for finding prime numbers. If it is new data that is needed, the CPU fetches the data from system memory speed.

It does not mater if the data set is larger. The the 64 bit CPU that can only handle 64 bits of data at one time, The subject is about x86 not FPGA or Graphics eDRAM

Modern i7 can make data processing at a rate which is order of magnitude faster of the memory BW.

That is what I'm saying CPU processing has more to do with the fetch speed of small 64bit chunks than 128 bit chunks.

The memory controller can only retrieve blocks of data, whether the all the data is needed or not. The memory controller dumps the unneeded data if the CPU only needs 64bit or less of the data that was fetched, then the memory controller rereads again. In the old days we had memory defragmenting programs to speed up operations, like a clean boot.

Folks confuse big data sets with memory bandwidth problems, when they just need to think of it in simple terms, the CPU is much faster then memory, however the processor can not calculate big data sets at once, they have to be 64bit wide or less for the CPU to calculate, So in a perfect world the memory would be as fast as the CPU 64 bit speed.

Hopefully the future nanotube memory will help improve memory speed. Instead of latency of ns it is picoseconds

Don't get me wrong here, wide memory bus on the x86 64 bit has it's place. When the CPU calls for swapping memory address data to and from things like video cards.

Quad-channel RAM vs. dual-channel RAM: The shocking truth about their performance
http://www.pcworld.com/article/2982...ing-truth-about-their-performance.html?page=2

Quad channel vs dual channel TESTING DDR 4 in Photoshop, Sony Vegas, Power Director
https://www.youtube.com/watch?v=lq6f9LNOd64

jelome1989 · Nov 2, 2016

@Sweepr
What games are benchmarked in that test because it got destroyed by the 6700k in one game. 95 fps vs 73 fps?

imported_ats · Nov 3, 2016

Drazick said:
It is simple, think of taking an array, very large array (25,000 x 25,000) and just add it a constant to each element.
Think of having 2 of those and just multiplying element by element.
Apply 2D convolution on it with very small kernel (5 x 5).

Let's do the math for the first one.
The CPU has 4 cores each has 2 AVX units (If I remember correctly).
Assuming the array is float, the AVX can handle 8 elements per cycle.

At 3 [GHz] it means:

8 [Element] * 4 [Bytes / (Element * AVX Unit)] * 2 [AVX Unit / Core] * 4 [Core] * 3 [GHz] = 256 [Bytes] * 3 [GHz] = 768 [GBytes].

Namely for highly dense AVX operation (As Signal / Data processing is) we'd need 768 [GBytes] of bandiwdth.
Of course it is unrealistic, and we have memory hierarchy but as you can imagine, with large arrays and the small caches it means that at some point the CPU will be waiting for data from memory.

Most stand DLA and SP libraries make extensive use of cache blocking that greatly reduces required memory bandwidth. There are of course workloads that require lots of memory bandwidth but basic DLA/SP are well handled by modern cache systems and cache blocking.

Drazick · Nov 3, 2016

@wingman04 , The processor being 64 Bit has nothing to do with how much data it can handle at once.
Nothing to do.
There are DSP's which are 16 Bit and access much more data than an Intel CPU.

Again, CPU's has muli layer memory hierarchy.
As closer we get to the CPU the faster and smaller the memory is.
The problem, since it gets smaller (Register smaller than L1, smaller than L2, smaller than L3 which is smaller than memory).
Yet if the data being processed is large while the operation on the data is simple (Which the CPU can handle using its full power) we're bottlenecked by the main memory system.

This is why faster DDR4 are showing gains in SkyLake.
Yet since the speed can't be multiplied by 2 it would be easier to double the number of channels.
Again, whether the CPU is 64 Bit or 32 Bit doesn't have any affect on how the data throughput.
It only means how are the memory address is.

@imported_ats , You are right.
In order to mitigate the limited throughput the algorithm designers are very careful with the locality of the operation.
For example in Matrix Multiplication once the CPU fetch some data form memory they make most use of it.
But again, if the data large enough and the operation is simple at some point the memory bandwidth will limit the speed.

As I wrote before, try Gaussian Blur in Photoshop.

Jelf · Nov 3, 2016

OK, color me confused.
It is way past time for me to build a new system.

I have read a bunch but it is not clear to me if there is going to be a Skylake for us DIY types that supports DDR4, is 65W and had Iris Pro graphics.

Can anyone shed light?

VirtualLarry · Nov 3, 2016

I think that the only Iris Pro Skylake CPUs are in BGA form-factor. (Not for desktop.)

Jelf · Nov 3, 2016

Thanks Larry
Looks like I will be using the i5-6600

LTC8K6 · Nov 3, 2016

Seems like you could wait for KL now, which has better graphics than SL.

witeken · Nov 3, 2016

i5 7500K review

http://diy.pconline.com.cn/851/8515185_all.html

What you see is what you get: 7% faster transistors, 7-8% better performance .

~No IGP difference.

Power: 3W idle more.
Idle to load delta: 3% more.

Dave2150 · Nov 3, 2016

witeken said:
i5 7500K review

http://diy.pconline.com.cn/851/8515185_all.html

What you see is what you get: 7% faster transistors, 7-8% better performance .

~No IGP difference.

Power: 3W idle more.
Idle to load delta: 3% more.

So basically a factory overclocked Skylake with 0 IPC improvement.

witeken · Nov 3, 2016

Dave2150 said:
So basically a factory overclocked Skylake with 0 IPC improvement.

No because its build on a different (derivative) process node. And it doesn't really make sense to talk about "overclocked", does it.

Dave2150 · Nov 3, 2016

witeken said:
No because its build on a different (derivative) process node. And it doesn't really make sense to talk about "overclocked", does it.

Tbh all that matters is how well it overclocks. Unless the average kabylake clocks higher than the average skylake, then it's basically Skylake's 'Devil's Canyon'.

JoeRambo · Nov 3, 2016

witeken said:
No because its build on a different (derivative) process node.

Given that in the past Intel extracted much nicer perf and power usage gains from CPU steppings ( like Nehalem C0 -> D0 stepping ), one has to wonder how much of a tweak this "derivative" process is. In my opinion Intel is simply selling what in the past was a new stepping of existing products as completely new product to properly milk the market.

witeken · Nov 3, 2016

JoeRambo said:
Given that in the past Intel extracted much nicer perf and power usage gains from CPU steppings ( like Nehalem C0 -> D0 stepping ), one has to wonder how much of a tweak this "derivative" process is. In my opinion Intel is simply selling what in the past was a new stepping of existing products as completely new product to properly milk the market.

Steppings are also derivative process nodes. But this is more than a small update because they did do pretty decent changes, relatively speaking.

https://en.wikipedia.org/wiki/Stepping_level

Arachnotronic · Nov 3, 2016

JoeRambo said:
Given that in the past Intel extracted much nicer perf and power usage gains from CPU steppings ( like Nehalem C0 -> D0 stepping ), one has to wonder how much of a tweak this "derivative" process is. In my opinion Intel is simply selling what in the past was a new stepping of existing products as completely new product to properly milk the market.

No, there were substantial changes made to 14nm+ over 14nm. Taller fins for better drive current is the big one.

witeken · Nov 3, 2016

Arachnotronic said:
No, there were substantial changes made to 14nm+ over 14nm. Taller fins for better drive current is the big one.

Don't forget the improved strained silicon. I bet that Intel's strained silicon performance is completely unmatched in the industry.

And don't forget the bigger gates.

When it comes to managing power and performance, circuit designers are well aware of the Dennard scaling, but he reminded that unfortunately, we aren’t in that regime any longer. “That period is gone. In any of the modern nodes, in fact, all things being equal, I will often be able to take a technology node — call it 7nm — make the gate pitch bigger, make the cells bigger, and end up with a smaller chip that runs faster. We are technically in the reverse-Dennard era now.”

http://semiengineering.com/moores-law-debate-continues/

witeken · Nov 3, 2016

Newsflash.

ASML to invest $1B for 25% in Carl Zeiss subsidiary to propel the development of high NA (= numerical aperture) EUV.

The way optics work is that your effective wavelength is physical wavelength divided by the numerical aperture. Immersion lithography has NA=1.35, which is as high as it can get for lithography. It used to be a lot lower, but improved steadily over time, last jump was with immersion lithography from 1 to 1.35 (at 32nm).

Right now, EUV NA is 0.33. It used to be 0.25 a few system iterations ago. Now they want to propel it to >0.5.

Right now 0.4 is the next in the pipeline afaik.

https://www.asml.com/press/press-re...ography-due-in-early-2020s/en/s5869?rid=54430

Edit: And ASML has paused their share buybacks for this. What a great and lofty decision! Intel should also stop wasting their money on these nonsensical costs.

Intel Skylake / Kaby Lake

Member

Diamond Member

Diamond Member

Lifer

Senior member

Member

Diamond Member

Diamond Member

Senior member

Junior Member

Senior member

Member

Junior Member

No Lifer

Junior Member

Lifer

Diamond Member

Senior member

Diamond Member

Senior member

Golden Member

Diamond Member

Lifer

Diamond Member

Diamond Member