Question Intel Mont thread

igor_kavinski · Feb 20, 2023

https://store.acer.com/en-us/aspire-3-laptop-a315-510p-3905

First Gracemont laptop available for sale.

Anybody got a disposable $500 to buy and test this laptop?

soresu · Jun 13, 2024

igor_kavinski said:
I was thinking of the lovely first party Nintendo Switch games when I wrote that

Ninty has always been able to mine gold from out-of-date hardware.

Kosusko · Jun 16, 2024

Intel Details Skymont
source: https://chipsandcheese.com/2024/06/15/intel-details-skymont/

Kosusko · Jun 20, 2024

Architecture All Access: Live at Lunar Lake ITT: Next Gen E-core Skymont

Kosusko · Jun 24, 2024

Skymont E-core Architecture Explained by Intel Fellow | Talking Tech | Intel Technology

DavidC1 · Jun 30, 2024

According to Chief Architect of Skymont and Intel Fellow Stephen Robinson, the 3x3 setup is slightly larger than a hypothetical 2x4 setup.

So at least for this particular setup, decoder size doesn't grow exponentially with width.

Decoder sizes:
3-wide = 3^2 = 9
4-wide = 4^2 = 16
5-wide = 5^2 = 25

3x3 = 27
2x4 = 32

So not exponential then. Let's continue?

"Slightly smaller"
3^1.5 x 3 = 15.6
4^1.5 x 2 = 16

3^1.4 x 3 = 13.97
4^1.4 x 2 = 13.93

Too small of a gain for him to make a comment.

3^1.3 x 3 = 12.51
4^1.3 x 2 = 12.12

That's it.

Lion Cove = 8^1.3 = 14.92
Hypothetical 9-wide = 17.4
Hypothetical 4x 3-wide = 16.7
Hypothetical 10-wide = 20
Hypothetical 5x 3-wide = 20.9

7-wide traditional x86 decoder takes up similar transistor count(power/area) as Skymont's 3x3 setup. 10-wide is a little smaller than 5x 3-wide cluster setup.

Looking further:
20-wide traditional = 49.12
12x 3-wide cluster = 50.05
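
For anyone who wants to replay the arithmetic above, here is a minimal Python sketch. It assumes, as the post does, that the cost of one decoder scales as width^p and that clusters simply add up; `decoder_cost` and `crossover_exponent` are names made up for illustration, and the model says nothing about real transistor counts.

```python
import math


def decoder_cost(width: int, clusters: int = 1, exponent: float = 1.3) -> float:
    """Relative cost of `clusters` decoders, each `width` wide (toy model)."""
    return clusters * width ** exponent


def crossover_exponent(w_a: int, n_a: int, w_b: int, n_b: int) -> float:
    """Exponent p at which n_a decoders of width w_a cost the same as n_b of width w_b.

    Solves n_a * w_a**p == n_b * w_b**p for p; the two widths must differ.
    """
    return math.log(n_a / n_b) / math.log(w_b / w_a)


if __name__ == "__main__":
    # Compare three 3-wide clusters against two 4-wide clusters for each guess of p.
    for p in (2.0, 1.5, 1.4, 1.3):
        print(f"p={p}: 3x3 = {decoder_cost(3, 3, p):.2f}, 2x4 = {decoder_cost(4, 2, p):.2f}")
    print(f"7-wide monolithic (p=1.3) = {decoder_cost(7):.2f}")
    print(f"break-even exponent, 3x3 vs 2x4 = {crossover_exponent(3, 3, 4, 2):.3f}")
```

Under this model the two layouts break even at an exponent of roughly 1.41. Below that, 3x3 comes out slightly larger than 2x4, which is the direction Robinson described.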

soresu · Jun 30, 2024

DavidC1 said:
3x3 = 27
2x4 = 32

Don't you mean 3x3x3 and 2x4x4?

(or 3³ and 2x4² if you want to be more efficient in expression)

soresu · Jun 30, 2024

I wonder if this is the overall direction that both Intel and AMD are going with all their CPU µArchs going forward, or if Intel is content to continue Cove and Mont in very separate lanes for the foreseeable future.

soresu · Jun 30, 2024

DavidC1 said:
Looking further:
20-wide traditional = 49.12
12x 3-wide cluster = 50.05

I strongly suspect that at those kinds of numbers there would be more overhead involved that significantly increases the area to less than favorable levels.

There is likely a sweet spot to target, and I'd warrant 12x any width is way out of that area.

DavidC1 · Jul 1, 2024

soresu said:
Don't you mean 3x3x3 and 2x4x4?

(or 3³ and 2x4² if you want to be more efficient in expression)

No, I'm saying three 3-wide decoders would come to 3 x 3^2 = 27 and two 4-wide decoders to 2 x 4^2 = 32 if the scaling were exponential, which is unlikely.

soresu said:
I strongly suspect that at those kind of numbers there would be more overhead involved that significantly increases the area to less than favorable levels.

It probably is but the problem happens on the traditional decoder setup too.

It's a lot of generations between Skymont and a hypothetical 12x cluster decoder so other innovations will likely happen.

DavidC1 · Jul 5, 2024

While they have 8 ALUs in Skymont, not all are of the same capability. Stephen Robinson said they were "cheap to add" and help with peaks.

According to the C&C diagram, still only two of the ALUs are capable of doing INT MUL and INT DIV; the rest are not. So you could say they added "simple" ALUs.

Interesting line of thought I had after I read this again:

Skymont’s address generation setup is peculiar because there are four AGUs for store address generation even though the data cache only has enough write bandwidth to service two stores per cycle. Again this feels unbalanced, but having more store AGUs lets Skymont figure out where stores are going sooner. That helps the core figure out whether a load needs to get data from cache or a prior store. Of course Skymont will try to predict whether loads are independent of prior in-flight stores, but figuring out the true result faster helps correct an incorrect prediction before too much work gets wasted.

I think this may be unique to, or at least more beneficial for, Clustered Decode architectures, because they need to figure things out further ahead.

A great description of Gracemont/Skymont's Clustered Decode and how it really works (it's not just for branches!):

Intel's actual approach is way more clever; They run the branch predictor ahead of the decoders by at least 3 branches (probably more). The branch predictor can spit out a new prediction every cycle, and it just plops them on a queue.

Each of the three decoders pops a branch prediction off the queue and starts decoding there. At any time, all three decoders will each be decoding a different basic block. A basic block that the branch predictor has predicted that the program counter is about to flow through. The three decoders are leap frogging each other. The decoding of each basic block is limited to a throughput of three instructions per cycle, but Skymont is decoding three basic blocks in parallel.

The decoded uops get pushed onto three independent queues, and the re-namer/dispatcher merges these three queues back together in original program order before dispatching to the backend. Each decoder can only push three uops per cycle onto its queue, but the re-namer/dispatcher can pull them off a single queue at the rate of 9 uops per cycle. The other two queues will continue to fill up while one queue is being drained.

The branch prediction result will always land on an instruction boundary, so this design allows the three decoders to combine their efforts and maintain a throughput of 9 uops per cycle, as long as the code is branchy enough. It works on loops too. As far as I'm aware, Intel doesn't even have a loop stream buffer on this design; the three decoders will be decoding the exact same instructions in parallel for loop bodies.

But Intel have a neat trick to make this work even on code without branches or loops. The branch predictor actually inserts fake branches into the middle of long basic blocks. The branch predictor isn't actually checking an address to see if it has a branch. Instead it predicts the gap between branches, and they simply have a limit for the size of those gaps. Looks like that limit for Skymont is 64 bytes (was previously 32 bytes for Crestmont)
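
The description above can be turned into a small toy model. This is only a sketch under loud assumptions: one uop per instruction, no pipeline latency between decode and rename, unlimited queue depth, and a block-size cap expressed in instructions rather than the 64 bytes mentioned in the quote. `split_long_blocks` and `cycles_to_dispatch` are hypothetical names, not anything Intel has documented.

```python
from collections import deque


def split_long_blocks(block_lengths, limit):
    """Mimic the 'fake branch' trick: cap every basic block at `limit` instructions."""
    out = []
    for length in block_lengths:
        while length > limit:
            out.append(limit)
            length -= limit
        if length:
            out.append(length)
    return out


def cycles_to_dispatch(block_lengths, n_decoders=3, decode_width=3, rename_width=9):
    """Cycles needed to decode and dispatch all blocks in program order (toy model)."""
    predictions = deque(enumerate(block_lengths))    # predictor runs ahead of decode
    work = [None] * n_decoders                       # [block_id, uops_left] per decoder
    queues = [deque() for _ in range(n_decoders)]    # one uop queue per decoder
    total = sum(block_lengths)
    dispatched = 0
    next_block = 0        # block the renamer must drain next
    done_in_block = 0     # uops of next_block already dispatched
    cycles = 0
    while dispatched < total:
        cycles += 1
        # Decode stage: each idle decoder pops a prediction, then decodes its block.
        for d in range(n_decoders):
            if work[d] is None and predictions:
                work[d] = list(predictions.popleft())
            if work[d] is not None:
                block_id, left = work[d]
                n = min(decode_width, left)
                queues[d].extend([block_id] * n)
                left -= n
                work[d] = [block_id, left] if left else None
        # Rename stage: merge the queues back into original program order.
        budget = rename_width
        while budget > 0 and next_block < len(block_lengths):
            source = next((q for q in queues if q and q[0] == next_block), None)
            if source is None:
                break  # the next block in program order is not decoded yet
            source.popleft()
            budget -= 1
            dispatched += 1
            done_in_block += 1
            if done_in_block == block_lengths[next_block]:
                next_block += 1
                done_in_block = 0
    return cycles


if __name__ == "__main__":
    straight_line = [180]  # one long basic block with no branches
    print("no splitting:", cycles_to_dispatch(straight_line))
    print("split at 12: ", cycles_to_dispatch(split_long_blocks(straight_line, 12)))
```

In this model a single unsplit block is stuck at one decoder's 3 per cycle, while the split version keeps all three decoders busy, which is the point of the fake-branch trick.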

So you could say the E-core architectures follow a philosophy of breaking units down to be as simple and dedicated as possible, and then having many, many of them working in parallel.

Not having an LSD is interesting too. Not only is it a totally different line of thought from the P cores (dating back to P6, basically), but maybe they also saw it as a way to save transistors.

DavidC1 · Jul 5, 2024

Wow, Crestmont has more changes than I thought:

• Increased Branch Prediction Bandwidth (128B/cycle max from 32B/cycle on Gracemont).
• Larger Branch Target Buffer (6K entry from 5K) with Enhanced Path Based Branch Prediction.
• Wider allocation width (6-wide from 5-wide).
• Larger second-level TLB and larger dedicated 1GB page TLB.
• 48-bit VA with 52-bit PA used for MKTME keys.
• 2x SIMD integer multiply units, faster integer divide units.
• VEX-based AVX-NE-CONVERT, AVX-VNNI-INT8 and AVX-IFMA ISA extensions.
• ECC protected Data Cache (in server products).
• Linear address masking (LAM), Linear Address Space Separation (LASS), Secure Arbitration Mode (SEAM), and Trust Domain Extensions (TDX) ISA extensions.
• Performance Monitoring enhancements include eight general-purpose counters (from six), precise distribution support for general-purpose counter 1 (totaling three counters), timed PEBS support, LBR event logging support, and multiple new events.

Shivansps · Jul 23, 2024

Radxa launched the X4 SBC, a RPI sized SBC with the Intel N100 at RPI5 prices.
It has a 2.5gbe nic and x4 3.0 M.2 slot.

Radxa X4

Credit Card Size, Big Performance: Intel N100 and RP2040 Inside

radxa.com

This might be a paper launch because it is out of stock everywhere.

gdansk · Jul 23, 2024

Shivansps said:
Radxa launched the X4 SBC, a RPI sized SBC with the Intel N100 at RPI5 prices.
It has a 2.5gbe nic and x4 3.0 M.2 slot.

Radxa X4

Credit Card Size, Big Performance: Intel N100 and RP2040 Inside

radxa.com

This might be a paper launch because it is out of stock everywhere.

Wow, that really seems to blow away the Raspberry Pi 5. If they can make any for the price they claim.

SteinFG · Jul 24, 2024

80 dollars for an 8GB N100 "mini pc" is insane. Though I wonder why they aren't showing its heatsink (is it even included?)

Shivansps · Jul 25, 2024

It looks like it is sold separately; it is listed on Arace Tech for $15: https://arace.tech/products/radxa-x4

I particularly like that the M.2 slot is x4 3.0. Most boards with the N100 usually have x2 M.2 / PCIe, even the ones in ITX/mATX form factor, completely killing the possibility of using a dGPU for gaming.
Not that x4 3.0 will be great, but it is the bare minimum you need to get playable perf out of the LP cards like the RX 6400 and A380 LP 6GB.
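
To put the x2 versus x4 difference in numbers, here is a rough sketch of theoretical PCIe 3.0 link bandwidth. It uses the standard 8 GT/s per lane and 128b/130b encoding figures and ignores packet and protocol overhead, so real throughput will be lower; `pcie3_bandwidth_gb_s` is a name made up for this example.

```python
def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s, before protocol overhead."""
    transfers_per_s = 8e9   # 8 GT/s per lane
    encoding = 128 / 130    # 128b/130b line coding
    bits_per_byte = 8
    return lanes * transfers_per_s * encoding / bits_per_byte / 1e9


if __name__ == "__main__":
    for lanes in (2, 4, 8, 16):
        print(f"PCIe 3.0 x{lanes}: {pcie3_bandwidth_gb_s(lanes):.2f} GB/s")
```

So an x4 3.0 slot offers a little under 4 GB/s each way, twice what the usual x2 N100 boards provide.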

igor_kavinski · Oct 19, 2024

DrMrLordX said:
How much do those mini PCs cost?

Topton Intel i3 N305 8-Core 4x i226-V 2.5G Firewall Mini PC Alder Lake 12th Generation N200 N100 DDR5 4800 MHz Fanless Soft Router Proxmo - AliExpress 7

Smarter Shopping, Better Living! Aliexpress.com

www.aliexpress.com

soresu · Oct 19, 2024

Shivansps said:
Radxa launched the X4 SBC, a RPI sized SBC with the Intel N100 at RPI5 prices.
It has a 2.5gbe nic and x4 3.0 M.2 slot.

Radxa X4

Credit Card Size, Big Performance: Intel N100 and RP2040 Inside

radxa.com

This might be a paper launch because it is out of stock everywhere.

Oh huh, never heard of this Amston Lake before.

Basically ADL-N with some upgrades including the 2.5Gbe nic which also has PoE support.

igor_kavinski · Oct 19, 2024

soresu said:
Basically ADL-N

No, this is Atom 2024

Intel seems to have a problem letting go of legacy brand names.

AMD hasn't looked back since Ryzen, other than the exception of cheap Athlons. However, AMD isn't above sullying the Ryzen brand (7520U in 2024 is just pathetic).

The common culprit seems to be negative wit marketing...

DrMrLordX · Oct 19, 2024

@igor_kavinski

32GB of RAM and 512GB storage for ~$450 is pretty good. Not in love with the Alder Lake-N SoCs but still.

Kosusko · Oct 26, 2024

Intel Core Ultra 7 265K @ Kocicak CPU-Z 2.11.2 x64 (Version 17.01.64)

Paradigm shift: 12c .LITTLE Atom Skymont E-cores beat 8C big Lion Cove P-cores!

source:

Share your CPUZ Benchmarks!

www.techpowerup.com

Thank you, Kocicak, alias BoggledBeagle on the TechPowerUp forum.

igor_kavinski · Oct 26, 2024

Yeah, Thanks @Kocicak !!!

Looks like PERFECT scaling for the Skymonts!

Skymonts are 45% faster than Lion Cove cores with 50% more cores (8P vs. 12E).

igor_kavinski · Oct 26, 2024

Seriously, Intel should release 4P+24E SKU for Arrow Lake NOW!

LightningZ71 · Oct 26, 2024

This is going to be HIGHLY dependent on cache residency. If whatever task you are running can live in 1MB of L2, then the Skymont chip should easily beat the P cores. If it spills over 1MB of L2, the Skymont cores will rapidly deteriorate in performance as they fight for L3 access on 4 way shared interfaces and suffer the very high L3 latency that Arrow Lake has.

What I would be interested in is a 2 P core + 24 e core design with two major changes: double the L2 to 8MB for the Skymont clusters and increase the L2 on the P cores to 4MB. That would hide the L3 cache even more and make the E cores more load agnostic.

Kosusko · Oct 26, 2024

Atom Skymont vs Zen5c

12c Atom Skymont (Intel Core Ultra 7 265K, 20C/20T)

• CPU Multi Thread: 8916.10 (+68.42%)
• CPU Single Thread: 740.6 (+42.70%)

8c Zen5c (AMD Ryzen AI 9 HX 370, 12C/24T)

• CPU Multi Thread: 5294 (59.38%)
• CPU Single Thread: 519 (70.08%)

source: https://diit.cz/clanek/preview-asus-zenbook-s16-s-amd-ryzen-ai-9-hx-370-je-v-redakci
source: https://www.techpowerup.com/forums/threads/share-your-cpuz-benchmarks.216765/post-5360549
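
The percentages in this post can be reproduced with a few lines of Python. This is only a check of the arithmetic on the four scores quoted above; `percent_faster` and `percent_of` are hypothetical helper names, and nothing here accounts for the different core counts, thread counts, or power limits of the two chips.

```python
def percent_faster(a: float, b: float) -> float:
    """How much higher score `a` is than score `b`, in percent."""
    return (a / b - 1) * 100


def percent_of(a: float, b: float) -> float:
    """Score `a` expressed as a percentage of score `b`."""
    return a / b * 100


if __name__ == "__main__":
    skymont_mt, skymont_st = 8916.10, 740.6   # 12c Skymont, scores from the post
    zen5c_mt, zen5c_st = 5294.0, 519.0        # 8c Zen5c, scores from the post

    print(f"Skymont MT lead: +{percent_faster(skymont_mt, zen5c_mt):.2f}%")
    print(f"Skymont ST lead: +{percent_faster(skymont_st, zen5c_st):.2f}%")
    print(f"Zen5c MT as share of Skymont: {percent_of(zen5c_mt, skymont_mt):.2f}%")
    print(f"Zen5c ST as share of Skymont: {percent_of(zen5c_st, skymont_st):.2f}%")
```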

DrMrLordX · Oct 26, 2024

igor_kavinski said:
Seriously, Intel should release 4P+24E SKU for Arrow Lake NOW!

They would need to tape out an entirely new compute chiplet for that.
