Question Intel Mont thread


DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
According to Chief Architect of Skymont and Intel Fellow Stephen Robinson, the 3x3 setup is slightly larger than a hypothetical 2x4 setup.

So at least for this particular setup, the increase in decoder width isn't an exponential increase in area.

Decoder sizes, if area scaled with the square of width:
3-wide = 3^2 = 9
4-wide = 4^2 = 16
5-wide = 5^2 = 25

3x3 = 27
2x4 = 32

So not exponential then. Let's continue?

At an exponent of 1.5, the 3x3 comes out slightly smaller, which contradicts his statement:
3^1.5 x 3 = 15.6
4^1.5 x 2 = 16

3^1.4 x 3 = 13.97
4^1.4 x 2 = 13.93

At 1.4 the difference is too small for him to have commented on it.

3^1.3 x 3 = 12.51
4^1.3 x 2 = 12.12

An exponent of ~1.3 fits: the 3x3 comes out slightly larger.

Lion Cove = 8^1.3 = 14.92
Hypothetical 9-wide = 17.4
Hypothetical 4x 3-wide = 16.7
Hypothetical 10-wide = 20
Hypothetical 5x 3-wide = 20.9

A 7-wide traditional x86 decoder takes up a similar transistor count (power/area) to Skymont's 3x3 setup. A 10-wide is a little smaller than a 5x 3-wide cluster setup.

Looking further:
20-wide traditional = 49.12
12x 3-wide cluster = 50.05
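As a sanity check on the numbers above, here's a quick sketch of the width^k area model. The exponent 1.3 is the fitted assumption from this thread, not an Intel figure:

```python
# Back-of-envelope decoder area model: per-decoder area ~ width ** k.
# k = 1.3 is the exponent fitted above so that a 3x3 cluster comes out
# slightly larger than a 2x4 cluster, matching Stephen Robinson's remark.

def cluster_area(width: int, clusters: int = 1, k: float = 1.3) -> float:
    """Total relative area of `clusters` decoders, each `width`-wide."""
    return clusters * width ** k

# Skymont's 3x3 vs the hypothetical 2x4
print(round(cluster_area(3, 3), 2))    # ~12.51 (slightly larger)
print(round(cluster_area(4, 2), 2))    # ~12.13

# Monolithic decoders vs clusters at the same exponent
print(round(cluster_area(8), 2))       # Lion Cove 8-wide, ~14.93
print(round(cluster_area(20), 2))      # 20-wide traditional, ~49.13
print(round(cluster_area(3, 12), 2))   # 12x 3-wide cluster, ~50.05
```

The relative ordering (3x3 slightly larger than 2x4, 20-wide slightly smaller than 12x 3-wide) is what matters here; the absolute numbers are unitless.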
 

soresu

Diamond Member
Dec 19, 2014
3,323
2,599
136
I wonder if this is the overall direction that both Intel and AMD are going with all their CPU µArchs going forward, or if Intel is content to keep Cove and Mont in very separate lanes for the foreseeable future.
 

soresu

Diamond Member
Dec 19, 2014
3,323
2,599
136
Looking further:
20-wide traditional = 49.12
12x 3-wide cluster = 50.05
I strongly suspect that at those kinds of numbers there would be more overhead involved, which would push the area to less-than-favorable levels.

There is likely a sweet spot to target, and I'd warrant 12x any width is well outside it.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
Don't you mean 3x3x3 and 2x4x4?

(or 3³ and 2x4² if you want to be more efficient in expression)
No, I'm saying a 3x3 decoder setup would come to 27 and a 2x4 setup to 32 if the scaling really were exponential, which is unlikely.
I strongly suspect that at those kinds of numbers there would be more overhead involved, which would push the area to less-than-favorable levels.
It probably does, but the same problem applies to the traditional decoder setup too.

It's a lot of generations between Skymont and a hypothetical 12x cluster decoder so other innovations will likely happen.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
While they have 8 ALUs in Skymont, not all are of the same capability. Stephen Robinson said it was "cheap to add" and helps with peaks.

According to the C&C diagram, still only two of the ALUs are capable of doing INT MUL and INT DIV, the rest are not. So it could be said to add "Simple" ALUs.

Interesting line of thought I had after I read this again:
Skymont’s address generation setup is peculiar because there are four AGUs for store address generation even though the data cache only has enough write bandwidth to service two stores per cycle. Again this feels unbalanced, but having more store AGUs lets Skymont figure out where stores are going sooner. That helps the core figure out whether a load needs to get data from cache or a prior store. Of course Skymont will try to predict whether loads are independent of prior in-flight stores, but figuring out the true result faster helps correct an incorrect prediction before too much work gets wasted.
I think this is something that may be unique to, or more beneficial for, clustered-decode architectures, because the front end needs to figure out where execution is going further ahead.
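To illustrate the point with a toy example (mine, not Intel's actual mechanism): a load predicted to be independent of older in-flight stores can only be confirmed, or squashed early, once those stores' addresses have been generated, so more store AGUs shrink the "unknown" window:

```python
# Toy illustration of why generating store addresses early helps: a load
# predicted independent of older in-flight stores can be confirmed or
# squashed only once those stores' addresses are known. More store AGUs
# mean the addresses are known sooner, so fewer "unknown" cycles.

def check_load(load_addr, older_store_addrs):
    """older_store_addrs: addresses of older in-flight stores, with None
    for stores whose address hasn't been generated yet."""
    for addr in older_store_addrs:
        if addr is None:
            return "unknown"   # can't prove independence yet; keep speculating
        if addr == load_addr:
            return "squash"    # the prediction was wrong; flush early
    return "ok"                # prediction confirmed; speculative work is safe

print(check_load(0x100, [None, 0x200]))   # unknown -> wasted work can pile up
print(check_load(0x100, [0x180, 0x200]))  # ok      -> confirmed quickly
print(check_load(0x100, [0x100, 0x200]))  # squash  -> caught before more waste
```

Real hardware checks address ranges and overlap, not simple equality, but the shape of the problem is the same.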

A great description of Gracemont/Skymont's clustered decode and how it really works (it's not just for branches!):
Intel's actual approach is way more clever; They run the branch predictor ahead of the decoders by at least 3 branches (probably more). The branch predictor can spit out a new prediction every cycle, and it just plops them on a queue.

Each of the three decoders pops a branch prediction off the queue and starts decoding there. At any time, all three decoders will each be decoding a different basic block. A basic block that the branch predictor has predicted that the program counter is about to flow through. The three decoders are leap frogging each other. The decoding of each basic block is limited to a throughput of three instructions per cycle, but Skymont is decoding three basic blocks in parallel.

The decoded uops get pushed onto three independent queues, and the re-namer/dispatcher merges these three queues back together in original program order before dispatching to the backend. Each decoder can only push three uops per cycle onto its queue, but the re-namer/dispatcher can pull them off a single queue at the rate of 9 uops per cycle. The other two queues will continue to fill up while one queue is being drained.

The branch prediction result will always land on an instruction boundary, so this design allows the three decoders to combine their efforts and maintain a throughput of 9 uops per cycle, as long as the code is branchy enough. It works on loops too; as far as I'm aware, Intel doesn't even have a loop stream buffer on this design. The three decoders will be decoding the exact same instructions in parallel for loop bodies.

But Intel have a neat trick to make this work even on code without branches or loops. The branch predictor actually inserts fake branches into the middle of long basic blocks. The branch predictor isn't actually checking an address to see if it has a branch. Instead it predicts the gap between branches, and they simply have a limit for the size of those gaps. Looks like that limit for Skymont is 64 bytes (was previously 32 bytes for Crestmont)
So you could say the E-core architectures have a mindset of breaking units down to be as simple and dedicated as possible, then replicating many of them to run in parallel.

Not having an LSD is interesting too. Not only is it a totally different line of thought from the P cores (dating back to the P6, basically), they may also have seen it as a way to save transistors.
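The leapfrogging scheme described in the quote can be sketched as a toy model. The widths (3-wide decoders, 9-wide rename) are from the description; the queue behavior and single-cycle timing are my simplifications:

```python
from collections import deque

DECODE_WIDTH = 3   # instructions each decoder handles per cycle
NUM_DECODERS = 3
RENAME_WIDTH = 9   # uops the renamer/dispatcher can pull per cycle

def clustered_decode(basic_blocks):
    """basic_blocks: predicted basic-block lengths (instructions), in
    program order, as handed out by the branch-prediction queue.
    Returns cycles to decode and rename everything (1 uop per inst)."""
    pending = deque(enumerate(basic_blocks))  # blocks waiting for a decoder
    decoders = [None] * NUM_DECODERS          # (block_id, insts_left) or None
    queues = {}                               # block_id -> uops ready to rename
    next_block = 0                            # renamer restores program order
    renamed, total, cycles = 0, sum(basic_blocks), 0
    while renamed < total:
        cycles += 1
        # Idle decoders pop the next predicted block and start decoding it;
        # each decoder emits at most DECODE_WIDTH uops per cycle.
        for i in range(NUM_DECODERS):
            if decoders[i] is None and pending:
                bid, n = pending.popleft()
                decoders[i] = (bid, n)
                queues[bid] = 0
            if decoders[i] is not None:
                bid, left = decoders[i]
                step = min(DECODE_WIDTH, left)
                queues[bid] += step
                decoders[i] = (bid, left - step) if left > step else None
        # Renamer drains the per-decoder queues strictly in program order.
        budget = RENAME_WIDTH
        while budget > 0 and next_block in queues:
            take = min(budget, queues[next_block])
            queues[next_block] -= take
            renamed += take
            budget -= take
            still_decoding = any(d and d[0] == next_block for d in decoders)
            if queues[next_block] == 0 and not still_decoding:
                del queues[next_block]
                next_block += 1
            else:
                break
    return cycles

# Branchy code keeps all three decoders busy on different basic blocks;
# one long straight-line block is limited to a single decoder's width
# (absent the fake-branch trick that splits long blocks).
print(clustered_decode([3] * 6))  # 18 insts in small blocks -> 2 cycles
print(clustered_decode([18]))     # same 18 insts, one block -> 6 cycles
```

The 64-byte fake-branch insertion mentioned above would effectively cap how long a single block in this model can get, which is what rescues straight-line code.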
 
Reactions: moinmoin

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
Wow, Crestmont has more changes than I thought:

• Increased Branch Prediction Bandwidth (128B/cycle max from 32B/cycle on Gracemont).
• Larger Branch Target Buffer (6K entry from 5K) with Enhanced Path Based Branch Prediction.
• Wider allocation width (6-wide from 5-wide).
• Larger second-level TLB and larger dedicated 1GB page TLB.
• 48-bit VA with 52-bit PA used for MKTME keys.
• 2x SIMD integer multiply units, faster integer divide units.
• VEX-based AVX-NE-CONVERT, AVX-VNNI-INT8 and AVX-IFMA ISA extensions.
• ECC protected Data Cache (in server products).
• Linear Address Masking (LAM), Linear Address Space Separation (LASS), Secure Arbitration Mode (SEAM), and Trust Domain Extensions (TDX) ISA extensions.
• Performance Monitoring enhancements include eight general-purpose counters (from six), precise distribution support for general-purpose counter 1 (totaling three counters), timed PEBS support, LBR event logging support, and multiple new events.
 
Reactions: 511 and Tlh97

Shivansps

Diamond Member
Sep 11, 2013
3,875
1,530
136
Radxa launched the X4, an RPi-sized SBC with the Intel N100 at RPi 5 prices.
It has a 2.5GbE NIC and an x4 PCIe 3.0 M.2 slot.

This might be a paper launch because it is out of stock everywhere.
 
Reactions: Tlh97 and gdansk

gdansk

Diamond Member
Feb 8, 2011
3,276
5,186
136
Radxa launched the X4, an RPi-sized SBC with the Intel N100 at RPi 5 prices.
It has a 2.5GbE NIC and an x4 PCIe 3.0 M.2 slot.

This might be a paper launch because it is out of stock everywhere.
Wow, that really seems to blow away the Raspberry Pi 5, if they can actually make any at the price they claim.
 

Shivansps

Diamond Member
Sep 11, 2013
3,875
1,530
136
It looks like it is sold separately; it is listed on Arace Tech for $15: https://arace.tech/products/radxa-x4

I particularly like that the M.2 slot is x4 PCIe 3.0. Most boards with the N100 only have x2 M.2/PCIe, even the ones in ITX/mATX form factors, which completely kills the possibility of using a dGPU for gaming.
Not that x4 3.0 will be great, but it is the bare minimum you need to get playable performance out of LP cards like the RX 6400 and the A380 LP 6GB.
 
Last edited:
Reactions: SteinFG

soresu

Diamond Member
Dec 19, 2014
3,323
2,599
136
Radxa launched the X4, an RPi-sized SBC with the Intel N100 at RPi 5 prices.
It has a 2.5GbE NIC and an x4 PCIe 3.0 M.2 slot.

This might be a paper launch because it is out of stock everywhere.
Oh huh, I'd never heard of this Amston Lake before.

Basically ADL-N with some upgrades, including the 2.5GbE NIC, which also has PoE support.
 
Jul 27, 2020
20,902
14,489
146
Basically ADL-N
No, this is Atom 2024

Intel seems to have a problem letting go of legacy brand names.

AMD hasn't looked back since Ryzen, with the exception of cheap Athlons. However, AMD isn't above sullying the Ryzen brand (the 7520U in 2024 is just pathetic).

The common culprit seems to be negative wit marketing...
 

Kosusko

Member
Nov 10, 2019
194
175
116
Intel Core Ultra 7 265K @ Kocicak CPU-Z 2.11.2 x64 (Version 17.01.64)

Paradigm shift: 12C .LITTLE Atom Skymont E-cores beat 8C big Lion Cove P-cores!


source:
Thank you Kocicak alias BoggledBeagle on techpowerup forum.
 
Last edited:

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
This is going to be HIGHLY dependent on cache residency. If whatever task you are running can live in 1MB of L2, then the Skymont chip should easily beat the P cores. If it spills over 1MB of L2, the Skymont cores will rapidly deteriorate in performance as they fight for L3 access on 4 way shared interfaces and suffer the very high L3 latency that Arrow Lake has.

What I would be interested in is a 2 P-core + 24 E-core design with two major changes: double the L2 to 8MB for the Skymont clusters and increase the L2 on the P cores to 4MB. That would hide the L3 even more and make the E cores more load-agnostic.
 

Kosusko

Member
Nov 10, 2019
194
175
116
Last edited: