Intel Pentium N4200 (Apollo Lake/Goldmont) AIDA64 CPUID dump
Intel Pentium N4200 (Apollo Lake/Goldmont) Instruction latency dump
Airmont vs Goldmont:
Caches:
- L1D: same characteristics (24 KB, 6 way, 3 clk latency), but independent 16B load + 16B store ports instead of a shared 16B load/store port (inferred from REP MOVS* throughput)
- L1I: same characteristics (32 KB, 8 way)
- L2: same characteristics (1 MB / 2 core, 16 way, ??? clk latency)
- 4K DTLB doubled (256 -> 512)
ISA extensions:
- SHA
- CLFLUSHOPT
- RDSEED
- PT
- MPX
- SMAP
- FSGSBASE
- still no ISA extensions that require VEX encoding, e.g. BMI, AVX, F16C
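
Most of the new extensions above are reported in CPUID leaf 7 (EAX=07H, ECX=0), register EBX. A minimal sketch for checking them in a raw CPUID dump follows. The bit positions are the ones Intel documents for that leaf; the Memory Protection Extensions bit is labelled MPX here, following Intel's naming. The helper name `decode_leaf7_ebx` and the example register value are made up for illustration, the value is not copied from the N4200 dump.

```python
# CPUID.(EAX=07H, ECX=0):EBX bit positions, as documented by Intel.
LEAF7_EBX_BITS = {
    "FSGSBASE": 0,
    "MPX": 14,
    "RDSEED": 18,
    "SMAP": 20,
    "CLFLUSHOPT": 23,
    "PT": 25,
    "SHA": 29,
}


def decode_leaf7_ebx(ebx):
    """Return the names of the listed extensions whose bit is set in EBX."""
    return [name for name, bit in LEAF7_EBX_BITS.items() if (ebx >> bit) & 1]


# Hypothetical register value, built by setting every bit from the table.
# Replace it with the EBX value of leaf 7 from a real dump.
example_ebx = sum(1 << bit for bit in LEAF7_EBX_BITS.values())
print(hex(example_ebx), decode_leaf7_ebx(example_ebx))
```

AVX and F16C are reported in leaf 1 (ECX) rather than leaf 7, so they are not covered by this table.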
Integer:
- Triple decode, triple issue. Goldmont can sustain 3 basic ALU instructions (MOV, ADD, SUB, CMP, TEST, AND, OR, XOR, NEG, NOT, NOP) per clock
- MOV elimination works for 32 and 64 bit GPRs
- ~2.5x faster 64b IDIV (107 -> 44)
- 2 operand write (I)MUL (7 -> 6)
- REP MOVS 16 -> 32 B/clk peak
- CRC32 6|6 -> 3|1
Vector FP:
- Out-of-Order(*), fully pipelined, SP-DP L|T symmetrical
- MOV elimination works for MOVA|UPS|D
- ADDPS 3|1 -> 3|1
- MULPS 5|2 -> 4|1
- ADDPD 4|2 -> 3|1
- MULPD 7|4 -> 4|1
- shuffle, pack 1|1 -> 1|0.5
- CVT*, EXTRACT*, INSERT* fully pipelined
Vector int:
- MOV elimination works for MOVDQA|U
- shuffle, pack, shift, PMOV* 1|1 -> 1|0.5
- PSHUFB: 5|5 -> 1|1
- Quadword add/sub 4|4 -> 2|1
- PMUL* 5|2 -> 4|1
- AESENC/DEC 9|5 -> 6|2
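
To put the L|T pairs above into perspective, here is a small sketch that turns a few of the "before -> after" entries into speedup factors. It assumes the first number is latency and the second is reciprocal throughput, both in clocks, so a smaller value is better in both columns. The function names are just for illustration.

```python
# Entries copied from the lists above: "name L|T (Airmont) -> L|T (Goldmont)".
ENTRIES = [
    "MULPS 5|2 -> 4|1",
    "ADDPD 4|2 -> 3|1",
    "MULPD 7|4 -> 4|1",
    "PSHUFB 5|5 -> 1|1",
    "AESENC/DEC 9|5 -> 6|2",
]


def parse_pair(text):
    """Turn 'L|T' into a (latency, reciprocal throughput) tuple of floats."""
    latency, throughput = text.strip().split("|")
    return float(latency), float(throughput)


def speedups(entry):
    """Return (name, latency ratio, throughput ratio); above 1.0 means Goldmont is faster."""
    name, rest = entry.split(" ", 1)
    before, after = rest.split("->")
    (l0, t0), (l1, t1) = parse_pair(before), parse_pair(after)
    return name, l0 / l1, t0 / t1


for entry in ENTRIES:
    name, lat, thr = speedups(entry)
    print(f"{name:<11} latency x{lat:.2f}  throughput x{thr:.2f}")
```

The throughput column shows the effect of full pipelining most clearly: MULPD, for example, improves far more in throughput than in latency.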
HW SHA instructions:
SHA1RNDS4 5|2
SHA1NEXTE 3|1
SHA1MSG1 3|1
SHA1MSG2 3|1
SHA256RNDS2 8|4
SHA256MSG1 3|1
SHA256MSG2 3|1
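
A rough back-of-the-envelope sketch of what the round instruction latencies above mean for hashing a single stream. It assumes the first number in each pair is latency in clocks and that the round instructions form one serial dependency chain per block (each one needs the state produced by the previous one). The message schedule, SHA1NEXTE, adds and loads are ignored, so treat the results as lower bounds, not as measurements. The round counts (80 for SHA-1, 64 for SHA-256) and the 64-byte block size come from the hash definitions; SHA1RNDS4 performs 4 rounds and SHA256RNDS2 performs 2.

```python
BLOCK_BYTES = 64  # SHA-1 and SHA-256 both process 64-byte message blocks


def round_chain_clocks(total_rounds, rounds_per_instr, latency):
    """Clocks for one block if every round instruction waits for the previous one."""
    return (total_rounds // rounds_per_instr) * latency


# Latencies taken from the list above.
sha1_clocks = round_chain_clocks(80, 4, 5)     # SHA1RNDS4: 4 rounds, 5 clk latency
sha256_clocks = round_chain_clocks(64, 2, 8)   # SHA256RNDS2: 2 rounds, 8 clk latency

print("SHA-1  :", sha1_clocks, "clk/block,", sha1_clocks / BLOCK_BYTES, "clk/byte")
print("SHA-256:", sha256_clocks, "clk/block,", sha256_clocks / BLOCK_BYTES, "clk/byte")
```

Interleaving two or more independent streams could hide part of that latency, since the throughput numbers are lower than the latencies.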
x87:
- FXCH doesn't block
- some microcoded instructions are faster
ROB size:
- 48 -> ???
Intel removed the 6W TDP Apollo Lake parts from ark.intel.com database (Pentium N4200, Celeron N3450, N3350)
AFAIK Intel Denverton server (Atom C3xxx) and probably Knights Mill and Knights Hill will be based on this core
(*) The current form of the instruction latency dump doesn't demonstrate the OoO property, but I've seen a written statement about it from an Intel engineer
I hope I can provide measured evidence of the OoO FPU, the L2 latency and the ROB size later