Intel Data-Centric Innovation Summit

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Intel Data-Centric Innovation Summit

Videos and PDFs: http://intelstudios.edgesuite.net/180808_dcis/index.html
Intel Newsroom: https://newsroom.intel.com/editorials/data-centric-innovation-summit/

Navin Shenoy: Opening Keynote

*Will give update on roadmap and “how we will win”.
*~50% of revenue is from data-centric and growing double digits
*90% of world’s data created in last 2 years
*example: automotive -> example of end-to-end compute problem (AI, cloud, edge, networking)
*Intel TAM: forecast shared at IM2017 (2017 investor meeting): $160B data-centric TAM for 2021 // forecast NOW: $200B data-centric https://images.anandtech.com/doci/13192/1533744704225791588519.jpg
*2/3 of cloud is TAM expansion (growing at 40% in H1’18)
*Cloud. 50% of CSP (cloud service provider) CPU volume is a custom CPU, up from 18% in 2013 (“highly customized products for their needs”)
*Networking: cloudification of network (core, access and edge). 20% MSS today of $24B 2022 silicon TAM. ISA optimizations, software, ecosystem, low-power hardware, etc. Example: 1 million base stations sold every year -> imagine Xeons in every one of them.
*AI. $2.5B market in 2017, growing 30% CAGR through 2022 to up to $10B -> investing heavily, biggest incremental investment of Intel
*DC network traffic (e.g. data moving from rack to rack): growing 25% CAGR -> connectivity becoming bottleneck -> full portfolio ($4B TAM growing to $11B in 2022): omni-path fabric, ethernet (#1 MSS, adding “SmartNIC” Cascade Glacier in 2019 combining FPGA and ethernet portfolio), and silicon photonics (Intel’s unique advantage: integrate laser in silicon)
*Moving data: Optane SSD and DIMMs (persistent memory) -> changing memory hierarchy. First QLC consumer and data center solution. Optane SSDs with 40x lower latency and higher endurance. Optane persistent memory.
*Optane persistent memory: $10B 2022 SAM (subset of the 2022 DC DRAM market): big performance increase in certain workloads (8-11x). Persistence: reboot times go from minutes to seconds, availability from three 9s to five 9s. A lot of work required, like software and "Revamp the IMC on the CPU" (see the libpmem sketch after this list).
*Google: SAP HANA workload with persistent memory and Cascade Lake
*Announces first production modules of Optane PM shipped yesterday out of the factory -> Optane PM entering production!
*Congratulates engineers working on Optane PM, some working for over a decade to make it reality
*Process everything: Xeon, FPGA and Nervana ASIC.
*20th anniversary of the Xeon CPU family -> industry-leading processor for data center processing everywhere: 220M Xeons, $130B revenue. Tailor Xeon for customer workloads and continue advancements. https://images.anandtech.com/doci/13192/15337460999211370399119.jpg
*Xeon Scalable stats: https://images.anandtech.com/doci/13192/1533746164839496564550.jpg.
*Performance leadership: up to 1.48x per core, 1.72x for L3 packet forwarding, 3.2x Linpack, 1.85x database, 1.45x memory caching
*Reinventing Xeon for AI: inference and training improved by >200x with hardware and software innovations from HSW to SKL
*Through software optimizations in last year: inference improved by 5.4x (INT8) since July’17
*2017: >$1B in Xeon AI revenue, with big growth
*Next-gen: Cascade Lake. Ship in late 2018. New integrated memory controller for Optane PM. Spectre and Meltdown protections. New cache, new instructions.
*Adding new AI extensions: Intel Deep Learning Boost -> extends AVX-512 -> doubles performance (this is already in Knights Mill as the VNNI instructions; see the VNNI sketch after this list) https://images.anandtech.com/doci/13192/15337465541401312204600.jpg
*Google demo of 11x speed-up from SKX to CSX. A 10x improvement in one year through HW/SW is pretty good.
*Intel Select Solutions: verified configurations: engineer, validate and test solutions at the HW and SW level -> makes it easy to deploy complex solutions for faster TTM -> for analytics, network, HPC,… Announcement: new Select Solutions around AI, blockchain and SAP HANA (Optane PM)
*The Intel differentiation: harness all of Intel’s assets (from transistor, architecture, memory, interconnect, security to software) -> welcome Jim Keller to stage, SVP of Engineering
*Came to Intel for the amazing opportunity, scale, and excellence of engineering; met many people, likes the diversity of technology. Working on now: focus on 14nm and 10nm products coming to market, getting up to speed on the process, performance, yield.
*Roadmap talk. End of 2019: Cooper Lake on 14nm with "next-gen DL Boost": bfloat16 (already announced by Naveen at the AI day). 2020 ("fast follow-on to CPL"): Ice Lake. So CPL/ICL is one platform. This for instance means CPL has 8ch memory. Note: as was already leaked, Cooper Lake will be a 3-die MCP; I presume this means 3×24 cores for 72 cores. For those taking notes, that is more than Rome's 64. https://twitter.com/Ashraf__Eassa/status/1027236588911124480
*Cooper Lake has “significantly better generation on generation improvement”.
*Navin: “Maintain our leadership position in the data center.”
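
To make the "a lot of work required, like software" point on Optane PM concrete, here is a minimal sketch of what persistence-aware code looks like with PMDK's libpmem (the path and size are made up for illustration, and error handling is trimmed):

Code:
/* Minimal libpmem sketch: map a file on a DAX-mounted pmem filesystem,
   write to it, and flush CPU caches so the store is durable.
   Path and size are illustrative. Build with -lpmem. */
#include <libpmem.h>
#include <string.h>

int main(void) {
    size_t mapped_len;
    int is_pmem;
    char *buf = pmem_map_file("/mnt/pmem0/example", 4096,
                              PMEM_FILE_CREATE, 0666,
                              &mapped_len, &is_pmem);
    if (buf == NULL)
        return 1;

    strcpy(buf, "survives a reboot");

    if (is_pmem)
        pmem_persist(buf, mapped_len);   /* cache-flush path for real pmem */
    else
        pmem_msync(buf, mapped_len);     /* msync fallback on regular storage */

    pmem_unmap(buf, mapped_len);
    return 0;
}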
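
On the DL Boost/VNNI point above: vpdpbusd fuses the three-instruction INT8 multiply-accumulate chain (vpmaddubsw + vpmaddwd + vpaddd) into one instruction, which is where the "doubles performance" claim comes from. A rough intrinsics sketch (the function name is mine; the tail when n isn't a multiple of 64 is omitted for brevity):

Code:
/* INT8 dot product with AVX512-VNNI: one vpdpbusd per 64 bytes.
   Requires an AVX512-VNNI CPU (e.g. Cascade Lake); compile with -mavx512vnni. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

int32_t dot_u8s8(const uint8_t *a, const int8_t *w, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i + 64 <= n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vw = _mm512_loadu_si512(w + i);
        /* 4-way u8*s8 products summed into each of 16 int32 lanes,
           in a single instruction. */
        acc = _mm512_dpbusd_epi32(acc, va, vw);
    }
    return _mm512_reduce_add_epi32(acc);  /* horizontal sum of the lanes */
}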

Introduces Naveen Rao!

Naveen Rao: Unlocking Data Value with AI

*Want to make AI more broadly adopted
*Market: $2.5B in 2017, $8-10B in 2022: about evenly split between training/inference, but the line between inference/training will become more blurred
*2012: 16,000 CPUs for a 2-week proof of concept; now applying AI to real problems
*Intel’s heterogenous portfolio for AI from edge to data center, with shared IP portfolio https://images.anandtech.com/doci/13192/15337477631951431458216.jpg
*Reiterates AI is a $1B business for Intel today (only Xeon in data center, number doesn’t include FPGA and IoT)
*AI development lifecycle. 30% of development time is training, done by GPUs today. https://images.anandtech.com/doci/13192/1533748089180597043906.jpg
*AI roadmap. Debunks the GPU myth: "Reality is almost all inference in the world runs on Xeon today". Reiterates 5.4x inference performance since July'17 (and 1.4x for training), and 11x with Cascade Lake -> focus on lower precision (bfloat16 with Cooper Lake; see the bfloat16 sketch after this list)
*NNP for inference: in development.
*NNP for training, coming in late 2019: NNP L-1000, with bfloat16, 3-4x the performance of Lake Crest, high-bandwidth, low-latency interconnects -> what really matters is performance (time to train) and efficiency https://images.anandtech.com/doci/13192/1533748403816342061948.jpg
*50% of AIPG (AI Products Group) is dedicated to software (libraries like MKL-DNN) -> but wants to make AI accessible to broad enterprise customers -> even higher level of abstraction for edge: OpenVINO toolkit and Movidius SDK https://images.anandtech.com/doci/13192/1533748664844850855111.jpg
*nGraph: deep learning compiler that connects all hardware platforms (Xeon, Movidius, FPGA, GPU, etc.) to all software platforms (TensorFlow, etc.) https://images.anandtech.com/doci/13192/15337487256991267737489.jpg
*Example: Novartis drug discovery (1028x1280x3 images): much higher resolution than ImageNet (224x224x3) -> >6x increase in performance going from 1 to 8 CPUs (using Omni-Path), and 20x improvement with SW improvements
*Xeon has TCO benefit because it can be amortized over other things than just purely AI workload
*AI Builders Program with "vibrant ecosystem" https://images.anandtech.com/doci/13192/1533749058963243119428.jpg
*Open source community: NLP Architect, nGraph, Coach, Distiller
*AI Academy and DevCloud (110K developers, 150k users each month)
*Expanding AI DevCon (introduced in May)
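
On the bfloat16 point in the roadmap above: bfloat16 is just the top 16 bits of an IEEE float32, so the fp32 exponent range is preserved and conversion is nearly free. A minimal sketch (real hardware does proper round-to-nearest-even; the rounding here is approximate and ignores NaN edge cases):

Code:
/* bfloat16 <-> float32: keep sign, 8-bit exponent, 7 mantissa bits. */
#include <stdint.h>
#include <string.h>

uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* type-pun without UB */
    bits += 0x8000;                    /* crude round-to-nearest */
    return (uint16_t)(bits >> 16);
}

float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16; /* low mantissa bits become zero */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}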

RaeJeanne Skillern (GM of CSP Group): Cloud: Driving Transformation & Growth


*Three learnings: (1) One size does not fit all. (2) More than just better efficiency: differentiation. (3) Long-term partnerships at scale take time.
*Cloud revenue stats (43% of revenue in H1’18): https://images.anandtech.com/doci/13192/15337495883751562984527.jpg
*Makes sure all of the Xeon compute capabilities are exploited by the CSPs
*In addition to standard roadmap: custom SKUs (>50% are custom). Xeon-D created out of joint innovation with Facebook. Nearly 30 off-roadmap, customized SKUs (for instance custom MCPs, custom logic in-silicon). Example: z1d instance with high-frequency CPU with all-core turbo of 4GHz.
*Semi-custom products: integrate IP of customers, use packaging technologies
*Deliver fully custom ASICs for customers
*See 2015 slide: https://www.fool.com/investing/general/2014/12/08/in-the-datacenter-intel-corporations-business-mode.aspx
*200 HW/SW engineers actively engaged with CSPs -> hands-on, side-by-side engineering, working on 150 CSP projects in 2018
*Example: Tencent WeChat app (1B users) -> allows partner apps to run on top of it -> AI performance is critical -> worked with Tencent to increase performance 16x
*Cloud Insider program with 650+ members -> get access to Intel engineers and sandbox development for new hardware -> 30% revenue uplift from the engagements by designing balanced solution at the platform level
*>$100M/year on co-marketing: https://twitter.com/Ashraf__Eassa/status/1027251771209445377
*Summary: https://images.anandtech.com/doci/13192/1533750777348666280219.jpg

(posting this thread during the break... after the break, there's a talk about the enterprise, about networking and a Q&A... and maybe even more talks after lunch)
 
Aug 11, 2008
10,451
642
126
Nice information, if a bit overwhelming and of course, to be taken with a grain of salt since it came directly from Intel.
I really wish they would just come clean with 10nm. Tell us what they were trying to accomplish, what went wrong, and how they tried/are trying to fix it.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
Ice Lake 2020 confirmed. AMD has one hell of an opportunity here.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Lots of marketing.
Lots of artificial segmentation and an overall scattered product line that is facing competition across the whole sector.
So, let's break this down easily; I'll work off their own major points:

Navin Shenoy: Opening Keynote
- Data Center/Cloud (Classic compute)

Will face fierce competition from AMD, as it's quite a generic/commoditized market ATM.
The only things that keep Intel established are brand premium and existing relationships.
AMD dominates Intel on value and has solidified relationships with OEMs like Dell, etc.
- Optane
While more performant, it is overpriced and proprietary (a game Intel can't keep playing). Meanwhile, in the broader market, SSD/NVMe tech is accelerating heavily based on open standards and ARM processors. Intel won't dominate this and will only get a niche slice.
- Network hardware boxes
The whole industry has shifted towards L2 in the data center/cloud and new L2-like L3 protocols. The hardware that runs L2 doesn't need Intel's Xeon chips at all. L3 routing is also set for a shakeup but hasn't been fully fleshed out. As a result, there are much lower orders for L3-based equipment, and for Intel hardware by extension. AMD will begin attacking this market, but it's a tough one. One of the biggest network hardware companies has already started experimenting with AMD hardware across their equipment. So, Intel is facing headwinds here.
- Meme Learning

Nothing is promised to anyone. Intel in no way, shape or form has any significant portion of this market. It's stretched across GPUs/custom ASICs/FPGAs and has no solidified standards, as the software is still in great flux. Google has spun its own TPUs (for inference) and has no interest in Intel's overpriced HW. The trend has been big cloud/data centers cooperating in open consortiums to disrupt premium-charging hardware companies like Intel with whitebox Chinese HW solutions based on open standards and non-proprietary ecosystems. This will go down like Intel's embedded systems push (which failed) IMO. They acquired Nervana and Altera and have been fumbling the integration and rollout. They then applied their ever-failing premium pricing and artificial segmentation. Meanwhile, in the broader industry, people are rolling their own FPGAs and tackling the problems themselves at significantly less cost.

Naveen Rao: Unlocking Data Value with AI
Intel has no establishment in AI. The scales are still tilted towards GPUs, open compute, and custom-ASIC TPUs. Microsoft has been using FPGAs in a customized fashion for Azure since 2015.

RaeJeanne Skillern (GM of CSP Group): Cloud: Driving Transformation & Growth
The drive is towards cheaper equipment. AMD has the edge here. No one wants to pay insane premiums for hardware anymore. It's become commoditized and cobbled together. Intel faces lower margins and lower sales IMO. People are moving towards SAN-based flash arrays without CPUs (the compute for data access is being pushed down to smart NICs (many-core ARM/FPGA on a NIC) and custom ASICs).

TL;DR: A bunch of marketing... Intel will face a fight for its life in the coming years across all of its market segments, from just about everyone that is capable. Storage access, network traffic, and general I/O compute are moving away from CPUs down to custom ASICs and ARM-based sub-compute. AMD is waging a palpable war against them across their product line for general-purpose computing. HW is becoming commoditized and it will be hard for anyone to charge a premium. Lower margins. Lower sales. New and existing players will begin eating up their pie. To stay alive, they'll have to cut their prices, cut the proprietary BS, adopt open standards, and simplify their ridiculously artificially segmented product lines. Great news for the consumer, progress, and innovation. The fire is going to be under everyone's feet.
 

Det0x

Golden Member
Sep 11, 2014
1,059
3,097
136
Intel's Reckoning is at Hand
wccftech said:
Cascade Lake 14nm: The launch will take place at the end of Q4 2018 for top-tier customers focused on hyper-scale compute, while general availability is expected in Q1 2019.
wccftech said:
Cooper Lake 14nm: The Cooper Lake-SP family will be introduced at the end of 2019.
wccftech said:
Ice Lake 10nm: Later, in mid-2020, Intel would introduce their first 10nm Scalable processor family, the Ice Lake-SP.
And that's Intel's best-case scenario, if everything goes according to plan from here on out... and that's a big IF.

https://wccftech.com/intel-xeon-scalable-family-roadmap-cooper-lake-14nm-2019-ice-lake-10nm-2020/

I will leave you with this image

So, yes, Intel, I think the AMD engineers who have developed the Zen architecture from the ground-up would take issue with that. Especially when AMD's "Glued-together" dies actually wipe the proverbial floor with the blue company's chips in power-performance ratios, and deliver much better multi-threaded performance than Intel's offerings. Not bad for a "Glued-together" solution, I'd say.

GL with jerry-rigging the core architecture into an MCP design with Cooper and Ice Lake
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Det0x said:
Intel's Reckoning is at Hand ... GL with jerry-rigging the core architecture into an MCP design with Cooper and Ice Lake

TBQH, I have no clue what the hell they were thinking going with a massive monolithic mesh architecture for general-purpose CPU cores. This kind of architecture is only suited for lots of very low-feature cores, like GPUs, with a very slim and low-feature interconnect. The interconnect is far too heavy for general-purpose CPU cores. The insane power utilization of Intel's chips actually comes from the mesh interconnect, btw. It always has to be fully hot for potential flows.

Every one of those green tiles takes up space and power, including the red lines that interconnect them. Those are inter-core traffic routers and their connection lines. All of those must be hot at all times. It's a rather insanely overcomplicated and inefficient architecture that you can't scale on a monolithic die. They must have been so salty about Nvidia cutting into their market share that they tried to implement a GPU architecture in a CPU, to cross over into their turf at some point. So many glaring mistakes on their behalf. MCM is actually the best approach from a cooling/efficiency/production-cost and performance standpoint. This is the whole reasoning behind: https://en.wikipedia.org/wiki/Non-uniform_memory_access

You should be splitting your processing into clusters of cores, not idiotically letting it get mapped to a flat paradigm of cores. Also, what they don't like to specify is that the latency between cores is not uniform in a mesh architecture. If a core at the top left needs to communicate with one at the bottom right, the latency is completely different from the best case (neighboring cores). This architecture is a joke for this reason. Instead of sticking with NUMA and MCM, they went full retard with a one-off mesh architecture that costs them boatloads to produce, is power inefficient, and a logistic nightmare.
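
To put rough numbers on that non-uniformity: on an X-Y routed mesh, hop count is the Manhattan distance between tiles, so best and worst case diverge with grid size. A back-of-envelope sketch (the grid dimensions and cycles-per-hop are assumed, not Intel's real figures):

Code:
/* Mesh latency spread as Manhattan distance on an assumed tile grid. */
#include <stdio.h>

int main(void) {
    const int cols = 6, rows = 5;   /* assumed ~28-core tile grid */
    const int cyc_per_hop = 1;      /* assumed cost per mesh hop */

    int best  = 1 * cyc_per_hop;                          /* adjacent tiles */
    int worst = ((cols - 1) + (rows - 1)) * cyc_per_hop;  /* opposite corners */

    printf("best case: %d hop(s), worst case: %d hops\n", best, worst);
    return 0;
}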

Whoever thought of this was thinking in a bubble and not across the whole range of considerations... Companies are completely lacking in people who think across a broad range of considerations, which is why they end up with such billion-dollar gaffes.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,748
14,780
136
Det0x said:
Intel's Reckoning is at Hand ... GL with jerry-rigging the core architecture into an MCP design with Cooper and Ice Lake
What angers me about that slide is that AMD took a scalable SERVER core and broke it down to be used by desktop, not the other way around. And the "glued together" crap is also ridiculous; they are embarrassing themselves.
 

jpiniero

Lifer
Oct 1, 2010
14,838
5,456
136
Also, what they don't like to specify is that the latency between cores is not uniform in a mesh architecture.

https://www.pcper.com/files/imageca...review/2017-06-16/latency-pingtimes-7900x.png

It's extremely close, actually. My theory has been that most of the time is essentially spent deciding what route to take; the actual traveling doesn't take long at all. The core latency is probably not much different on the 28-core model.
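
For reference, core-to-core "ping" charts like that one are typically produced by pinning two threads to chosen cores and bouncing an atomic flag between them. A minimal Linux-specific sketch (core IDs are illustrative; compile with -pthread):

Code:
/* Round-trip latency between two pinned cores via an atomic flag. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000
static atomic_int flag;             /* 0 = ponged, 1 = pinged */

static void pin(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *ponger(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                   /* spin until pinged */
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int core_a = 0, core_b = 9;     /* illustrative pair of cores */
    pthread_t t;
    pthread_create(&t, NULL, ponger, &core_b);
    pin(core_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                   /* spin until ponged back */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("cores %d<->%d: %.1f ns per round trip\n", core_a, core_b, ns / ROUNDS);
    return 0;
}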

And in case you haven't noticed, they have been pushing HPC/AI workloads.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
https://www.pcper.com/files/imageca...review/2017-06-16/latency-pingtimes-7900x.png

It's extremely close actually. My theory has been that the most of the time is essentially deciding on what route to take; the actual traveling doesn't take long at all. The core latency is probably not much different on the 28 core model.

And in case you haven't noticed, they have been pushing HPC/AI workloads.
A 7900X only has 10 cores... Those are actually pretty bad ping times, btw, and quite noisy (varied), which is a big problem.

Compared to :

which gives you distinct domains of performance: very low ping latency local to a CCX, with a tax for crossing the bridge to the other.


To compensate and try to stabilize the wildly varying latencies, Intel does the following:


This is an overcomplicated mess, and many do not appreciate the huge amount of space and power this big arse mesh network + router at every junction [in blue] takes up on a monolithic die. You don't flatten the cores to a mesh network and then virtualize NUMA... You instantiate NUMA at the hardware level using MCM/CCX, thereby capturing the reduction in production cost/complexity/power utilization at the hardware level that NUMA necessitates. What Intel did is completely backwards thinking: flatten the core connectivity and then virtualize NUMA... LOL, what's the point of this if you still have to cluster/virtualize? It's better to do 4-core blocks (like AMD did) and connect them with a higher-level interconnect highway. I hope you can clearly appreciate why AMD split the cores into 4-core CCXes. Intel went full retard and meshed every core together even though it's stupid for cores to talk to each other in such a way. Someone was hitting the crack pipe too hard during and after Intel Xeon Phi.
 

jpiniero

Lifer
Oct 1, 2010
14,838
5,456
136
A 7900X only has 10 cores... Those are actually pretty bad ping times, btw, and quite noisy (varied), which is a big problem.

Except you'll notice that it's still faster than cross-CCX. The problem with the mesh is more that the L3 performance had to be reduced since it's on the mesh (well, everything is). There's probably room for improvement in speeding up the mesh beyond just higher clocks; I guess we will have to see.

NUMA I dunno, doesn't seem like it would be that useful with the mesh.

I hope you can clearly appreciate why AMD split the cores into 4-core CCXes

Save R&D time basically, so it could be reused with Raven Ridge.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Except you'll notice that it's still faster than cross-CCX. The problem with the mesh is more that the L3 performance had to be reduced since it's on the mesh (well, everything is). There's probably room for improvement in speeding up the mesh beyond just higher clocks; I guess we will have to see.
Intel's mesh has nowhere near the intra-CCX latency (Intel's latency is almost double). Intel has less latency than inter-CCX communication, but that only matters once you've saturated 4 cores in one AMD CCX and need to coordinate with another. The more sound design is low latency within a NUMA cluster and a tax to go outside of it. This is the biggest issue with mesh... You get dinged with an overall much higher average latency across all cores, which still varies depending on which core is trying to reach another.

Mesh is much more of an architectural issue than a proper NUMA HW implementation. This actually has big consequences for software, which is why Intel virtualizes NUMA atop the mesh. This is basic comp arch, and it's why I think it's such a gaffe that Intel went this route. This is just on the performance side. On the power utilization side, mesh is an outright power hog. You then have to package this on a monolithic die, which is insanely costly, and you now hit power envelope issues. Everything about it is a mistake.

NUMA I dunno, doesn't seem like it would be that useful with the mesh.
When you write software across many cores, you need consistency. You can't have wildly varying performance based upon how things get routed through the mesh. So NUMA most definitely has to be virtualized, even atop mesh, for any software package. Again, this is why mesh is foolish: you can't break the NUMA paradigm at scale. You'll need to aggregate cores somehow into a logical cluster. A GPU is a completely different paradigm, and even they use 'NUMA' in the sense of warp blocks:


So yeah, Intel was smoking crack. You learn not to do the shit they just pulled in Multi-Core Architecture 101.

Save R&D time basically, so it could be reused with Raven Ridge.
AMD has the more sound design. It wasn't about saving R&D time... It was that they had several visionaries who remembered their schooling. AMD's design team wasn't hitting a crack pipe and had people who respected established computer architecture principles. Intel seems like they had far too much R&D money and went full retard with a completely flawed architecture. If anything, Intel was trying to save on R&D and actually believed their Xeon Phi architecture to be the future of how they should steer things:


https://insidehpc.com/2017/01/intel-xeon-phi-processor-programming-nutshell/
Xeon Phi was a flaming turd and a nightmare to code for (NUMA strikes again). The same can be said of taking that architecture and scaling it to a monolithic desktop CPU.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
This for instance means CPL has 8ch memory. Note: as was already leaked, Cooper Lake will be a 3-die MCP; I presume this means 3×24 cores for 72 cores. For those taking notes, that is more than Rome's 64.

S|A says Cooper Lake will roughly halve the performance gap with Rome compared to Skylake. They also claim 40% faster for Cooper Lake in a 350W, 3-die version. I'm assuming, based on that, it uses 3× 18 cores, with reduced clocks to fit within the TDP limit.

I actually believe S|A over Intel PPT presentations. They were right on the 10nm delay. They were right on the Ice Lake delay. They were the ones to release juicy performance details on Conroe. This is what being on the same architecture for 3 years results in: you go from being ahead to being massively behind.

Intel has had absolutely terrible management. You can't go on without paying the price forever; eventually you do. Publicly they've been denying they had problems for years. No doubt that denial extended internally too. To get out of that phase, some sort of shock is required. A substantial market-share loss and revenue decline is what's needed to wake them up. Of course, they will put up quite a good fight considering what they have.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Intel's mesh has nowhere near the intra-CCX latency (Intel's latency is almost double). Intel has less latency across the chip than inter-CCX communication.

This needs to be pointed out more often.

While there is a penalty for inter-CCX communication, the intra-CCX latency, with 4 direct-connected cores, can approach the theoretical minimum. While the latency is less uniform than a mesh, CCX clustering may give better or competitive average latency depending on the workload and partitioning of cores.

Thanks for the great technical discussion.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Intel's mesh has nowhere near the intra-CCX latency (Intel's latency is almost double).

Only intra-CCX latencies are low (and at a size of 4 cores, that is a very small unit of computing). Between CCXes on the same chip, the latencies are already high (140 versus 100 in your posted pics) and higher than Intel's on the same chip. Between chips in the same socket, latencies are very comparable to what Intel has between sockets (STH tested a 4-socket system; we have 2-socket systems that have less latency in our tests). And sub-NUMA clustering? It is for very specialized, NUMA-aware software. We run multiple JVMs that are forced onto NUMA nodes, but there was no benefit in going to sub-NUMA clustering in our testing. The drawbacks of half the memory per node and half the memory BW (it's more complicated than that, as an ideal OS could/would allocate memory on the sibling NUMA node on the chip first and maybe get nice perf, but the way things are for us, it's all or nothing*) are huge compared to tiny gains in latency.

Your blabbering about "quite noisy latencies" and sub-NUMA clustering is like your other posts here: you read something somewhere and post it as a supposed AMD superiority item, when in fact the opposite is true.
There are people on this board who actually run server-grade hardware (including me, on a smaller scale and not top-end items). And our findings do not agree with your BS.


*It is all or nothing, because we use numactl to bind CPUs and nodes (as in numactl --cpunodebind=0 --membind=0) and we force the JVM to allocate all memory with THPs: -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages. We found this workload and setup fits the Skylake-EP architecture like a glove, as the JVMs allocate and work with a ton of memory, and everything stays local due to the large L2 and non-inclusive L3. With the previous architecture we had a lot of problems with response-time variability if there was more than 1 JVM running on a CPU and one (or more) started to garbage collect.
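
The same binding can be done programmatically; a minimal libnuma sketch of the numactl invocation above (node 0 is illustrative; link with -lnuma):

Code:
/* Equivalent of "numactl --cpunodebind=0 --membind=0" from inside
   the process, via libnuma. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);

    numa_run_on_node(0);       /* CPU binding, like --cpunodebind=0 */
    numa_set_membind(nodes);   /* memory binding, like --membind=0  */
    numa_free_nodemask(nodes);

    /* From here on, threads stay on node 0's cores and allocations come
       from node 0's memory, so cross-node latency never enters the hot path. */
    void *buf = numa_alloc_onnode(1 << 20, 0);  /* 1 MiB explicitly on node 0 */
    if (buf)
        numa_free(buf, 1 << 20);
    return 0;
}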
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Only intra-CCX latencies are low (and at a size of 4 cores, that is a very small unit of computing).
Small for whom exactly? A large number of CPUs (thanks, Intel) still only have 4 cores / 8 threads; it's actually a pretty standardized number for clustering, and Intel's biggest-selling sub-segment of processors in Xeon Silver is a 4-core part. So, a 4-core CCX is a perfect building block from desktop to enterprise. AMD baking this directly into their architecture via the CCX, whereby you can achieve much lower latencies than Intel's mesh, is one of the best features of AMD's approach, not the worst, as you seem to claim based on your boutique all-or-nothing JVM workload.

Between CCXes on the same chip, the latencies are already high (140 versus 100 in your posted pics) and higher than Intel's on the same chip.
Yes, this is the tradeoff of CCX/MCM. Lower cost of production and scalability have their drawbacks. With AMD's very low intra-CCX latencies, the performance is established. How you scale this relates to software design, which is the whole principle of NUMA. You can't simply speak from an anecdotal developer's perspective, and that of a narrow class of software, and make claims about what kind of architecture should be pursued. This architectural gaffe is going to cost Intel tens if not hundreds of billions of dollars.

Between chips in the same socket, latencies are very comparable to what Intel has between sockets (STH tested a 4-socket system; we have 2-socket systems that have less latency in our tests).
If it isn't broke, don't fix it. With gate shrinks, you can pack more cores on a CCX, and this downside becomes muted with time. You can spend more time on software development and mute the impact of the high-latency cross-CCX transactions. Not to mention, as I have stated multiple times, mesh has variable latency depending on which core talks to another, and it approaches AMD's MCM latency when this occurs:
Meanwhile the worst case scenario – getting data from the right top node to the bottom left node – should demand around 13 cycles. And before you get too concerned with that number, keep in mind that it compares very favorably with any off die communication that has to happen between different dies in (AMD's) Multi Chip Module (MCM)

While it still has better latency than MCM, the issue is the wide-ranging variance. This might not be an issue for your development, but it's a nightmare for others.

And sub-NUMA clustering? It is for very specialized, NUMA-aware software.
Welcome to reality and the future.

We run multiple JVMs that are forced onto NUMA nodes, but there was no benefit in going to sub-NUMA clustering in our testing. The drawbacks of half the memory per node and half the memory BW (it's more complicated than that, as an ideal OS could/would allocate memory on the sibling NUMA node on the chip first and maybe get nice perf, but the way things are for us, it's all or nothing*) are huge compared to tiny gains in latency.

I'm in the bare-metal C/ASM camp. I don't deny the heightened latency in the AMD approach, but it is also far more consistent and performant for my use cases. There's a segment of benchmarks where Ryzen actually beats Intel. There's a reason for this, and it's the future IMO.

Your blabbering about "quite noisy latencies" and sub-NUMA clustering is like your other posts here: you read something somewhere and post it as a supposed AMD superiority item, when in fact the opposite is true.
There are people on this board who actually run server-grade hardware (including me, on a smaller scale and not top-end items). And our findings do not agree with your BS.
There's nothing blabbering about my post. I posted sub-NUMA clustering for a reason. Do you deny why it exists? Or are you going to pretend there's no use case? You run JVMs on server-grade hardware? I work at a much lower level. The findings of those people relate to a broad scope of software packages that are more performant on Intel's architecture. I have a different set of considerations, and Ryzen is more performant. If mesh weren't such a gaffe, Intel wouldn't be reorganizing for MCM architectures.

In my prior work, I worked at the hardware level on custom ASICs and solutions that have throughput and performance requirements that make even the highest-end Xeon look like a joke. AMD tapped into this ecosystem and understood its potential in scaling a CCX. Intel went off into la-la land and yet again arrived at an expensive architecture that hits performance requirements for a segment at a ridiculous cost point and with many cons. The whole ecosystem of computing, and even HPC, is shifting wildly to a new architecture beyond AMD's and Intel's influence. Storage is becoming very low latency and is taking a front seat. Software design will evolve with it.

Your particular company and others found better performance with Intel's current class of server chips. Congrats! It is going to cost you a pretty penny to settle in on it, which is why Intel stays in business and why server sales will be a hard thing for AMD to crack into. That's a completely tangential discussion from the technicals I highlighted. Everything isn't glory with Intel's mesh architecture. It is an expensive monolithic design with a large number of shortcomings beyond performance... and in some cases it underperforms AMD's design. So there is 0% blabbering in my post. Your use case isn't everyone's, and I am likely far more seasoned on both the hardware and low-level software than you are, which is why I cut right to the point and past the bullshit. Not everything is about your performance use case, which is why far bigger companies are investing in EPYC platforms.

*It is all or nothing, because we use numactl to bind CPUs and nodes (as in numactl --cpunodebind=0 --membind=0) and we force the JVM to allocate all memory with THPs: -XX:+AlwaysPreTouch -XX:+UseTransparentHugePages. We found this workload and setup fits the Skylake-EP architecture like a glove, as the JVMs allocate and work with a ton of memory, and everything stays local due to the large L2 and non-inclusive L3. With the previous architecture we had a lot of problems with response-time variability if there was more than 1 JVM running on a CPU and one (or more) started to garbage collect.

https://www.intel.com/content/www/us/en/products/processors/xeon/scalable/silver-processors.html
Xeon Silver makes a lot of sales, especially the Silver 4112. What's the core count? 4 cores...
What's the core count of a CCX?

You thought I was someone without knowledge or industry experience. You're wrong. I don't need to drill into super-technical details because I'm seasoned enough to pick out the relevant ones and talk about them at a high level. I restate: Intel's mesh architecture is an absolute gaffe as an overall approach and baseline for chip design. It's why they're completely realigning resources to pursue an MCM solution.

I have no doubt that some boutique software groups compare every nanosecond of performance they can squeeze out of a particular chip and have no concern about how much it will cost them to invest in such a platform. I'm speaking about the broader and evolving market, and the much more serious compute loads where AMD's architecture actually beats Intel's.

You tried to insult someone who's in a completely different league, and you're factually wrong in a number of ways.
https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf
 

Mopetar

Diamond Member
Jan 31, 2011
8,005
6,453
136
I wonder if we're going to see a core-count war similar to the MHz race of the past. I don't know how well Intel's 72-core chip will perform, but it's clear that they're really pushing it.

I know that Rome is slated to be 64, but didn’t AMD have that future design with CPU, GPU, and HBM all on a massive interposer that had some even more ridiculous number of cores?
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
Wall of text with random tangents, like discussing 4-core parts when people are discussing server CPUs in a data center thread, etc. Good that this forum has ignore

Since the discussion is about the relevancy of a 4C building block as the basis for bigger CPUs, I'm not sure who is going tangential, if not completely out of phase...

What the guy says is that if Intel gets to MCM designs, they will eventually have to adopt a solution close to AMD's CCX, somewhat like their HyperTransport copycat QPI, aka Quick Plagiarism to Innovation...
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Wall of text with random tangents, like discussing 4-core parts when people are discussing server CPUs in a data center thread, etc. Good that this forum has ignore
A wall of structured retorts to your wall of rambling, to unequivocally nullify your baseless rant,
ending with a sponsored white paper that backs every single thing I stated.
I mentioned the Xeon Silver 4112 because it's actually one of the chips moving in volume. You see, beyond all of the yapping you're doing, people in the industry don't pony up for Xeon Gold or Platinum, nor for the higher core counts within the lower tiers. You'd know this if you were as informed as you claim to be. So yes, a 4-core Xeon Silver actually moves a tremendous amount of volume, because that's what you optimally need in a storage server.

I am happy for the ignore feature. I have put it to use once recently and am now putting it to use for a second time.
If you had any valid arguments beyond your anecdotal performance preferences using an all-or-nothing JVM deployment, you would have provided them in response.

TL;DR : https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf
EPYC is the preferred solution for serious HPC workloads and data center density, due to the fact that it outstrips Intel's Xeon Skylake line in I/O and overall performance capability.

Meanwhile, Rambo has his anecdotes and adds people to his ignore list when he's put in his place.
Reality is hard to stomach for many people. I'm sure Intel is going to have a tough go of the reality they'll face in the coming years, even as their loyal Bronze/Silver Xeon customers scream that they have the best performance for serious workloads. Who'd have thought: a very expensive architectural decision that results in absurd pricing results in people buying the lower-tier product lines they can afford.

You're Mr. Big Time and Intel is the future...
https://www.cnbc.com/2018/08/06/int...t-downgrades-chipmaker-due-to-competitio.html

Image removed by moderator

* reaches out for the ignore button


*Reaches for his mod button.
We don't allow personal attacks, or
insulting memes in the tech forums.

Put the person on ignore if you wish,
and leave it at that.

AT Mod Usandthem
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
What the guy says is that if Intel gets to MCM designs, they will eventually have to adopt a solution close to AMD's CCX, somewhat like their HyperTransport copycat QPI, aka Quick Plagiarism to Innovation...
Somebody knows their history ...

Also, the big elephant in the room beyond the tech talk is cost/physics/production capability:
The EPYC package consists of four 213 mm2 Zeppelin die. The aggregate of those four die is 852 mm2 of silicon area per package. That die size is actually too big to build using today’s optical lithography techniques. AMD estimates that if EPYC was built as a (hypothetical) monolithic die, it could remove some of the inter-die IF and PHY, and some additional logic for a ~10% size savings. Removing about 10% from the 852 mm2 theoretical die reduces it to about 777 mm2 , which can fit inside an optical reticle. Still the 777 mm2 die would have relatively low yields, because there is an inverse-exponential reduction in yield with larger die size. Using AMD’s historical yield model and production defect density, AMD estimated that the four smaller die were less than 60% of the cost of the one large die (Figure 2). Using multiple smaller die has an inherent higher yield and, thereby, a cost advantage.
source : https://www.amd.com/system/files/2018-03/AMD-Optimizes-EPYC-Memory-With-NUMA.pdf

So there is no escape from MCM. Meanwhile, Intel has priced itself out of sales due to a ridiculously expensive and low-yielding process. Intel "hit it out of the park" and neither they nor their customers can afford it.
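
The quoted yield argument can be reproduced with the standard Poisson yield model, Y = exp(-D0 × A). The defect density below is assumed purely for illustration (AMD's actual model and numbers are not public), but it lands near their "<60% of the cost" figure:

Code:
/* Poisson yield: 4x 213 mm^2 dies vs. one hypothetical 777 mm^2 die. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double d0    = 0.1;    /* assumed defects per cm^2 */
    const double small = 2.13;   /* one Zeppelin die, cm^2 */
    const double mono  = 7.77;   /* hypothetical monolith, cm^2 */

    double y_small = exp(-d0 * small);   /* ~81% */
    double y_mono  = exp(-d0 * mono);    /* ~46% */

    /* Silicon cost scales roughly with (area sold) / yield. */
    double cost_mcm  = 4.0 * small / y_small;
    double cost_mono = mono / y_mono;

    printf("yield: small die %.0f%%, monolith %.0f%%\n", 100 * y_small, 100 * y_mono);
    printf("MCM cost vs monolith: %.0f%%\n", 100 * cost_mcm / cost_mono);
    return 0;
}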
 

jpiniero

Lifer
Oct 1, 2010
14,838
5,456
136
So there is no escape from MCM. Meanwhile, Intel has priced itself out of sales due to a ridiculously expensive and low-yielding process. Intel "hit it out of the park" and neither they nor their customers can afford it.

You should read this patent. This is pretty much where Intel is going. Until then, you will see multiple dies connected via EMIB, yes.

https://patents.google.com/patent/US20160092396A1/en

The problem with Skylake-SP, more than anything, is really the bloat caused by the extra AVX-512 unit and the giant L2. And of course it's now compounded by the inability to fix 10nm.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
The problem with Skylake-SP, more than anything, is really the bloat caused by the extra AVX-512 unit and the giant L2. And of course it's now compounded by the inability to fix 10nm.

This^^. The giant L2 is actually OK; without it, and with the non-inclusive L3, things would suffer. Intel used to have band-aids like "cache partitioning blabla" technologies that were insanely hard to set up; right now SKL-EP works out of the box, and a memory-heavy workload on one core will not destroy the working set of others.

AVX-512, the way it was done, was a mistake on 14nm. The instruction set is great and well designed, but no one really asked for that extra unit (and those who did are running Voltas anyway, I guess). If AVX-512 clocks were equal to AVX clocks, a single unit would probably give 60% of the full two-unit performance, if not more.

I agree that a functioning 10nm would have fixed everything automagically for Intel, allowing for the inclusion of AVX-512 on client and "cheapening" the costs on server (probably by throwing more L3 and cores at the problem).
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
You should read this patent. This is pretty much where Intel is going. Until then, you will see multiple dies connected via EMIB, yes.

https://patents.google.com/patent/US20160092396A1/en

The problem with Skylake-SP, more than anything, is really the bloat caused by the extra AVX-512 unit and the giant L2. And of course it's now compounded by the inability to fix 10nm.
"Bloat". In terms of die area, it is way more efficient to do AVX-512 then to double the core count. So for vector workloads, AVX-512 is the way to go. The VNNI instructions will double deep learning performance yet again.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
You should read this patent. This is pretty much where Intel is going. Until then, you will see multiple dies connected via EMIB, yes.

https://patents.google.com/patent/US20160092396A1/en

The problem with Skylake-SP, more than anything, is really the bloat caused by the extra AVX-512 unit and the giant L2. And of course it's now compounded by the inability to fix 10nm.

I am familiar with EMIB, and it looks promising.
They claim it is cheaper than the industry-standard 2.5D interposer methodology, and that it affords them far more flexibility with respect to connecting different process nodes on the same chip. Lots of claims about cost reductions, etc. They've already put it to use in various shipped products. The issue is the pricing. No doubt they have stellar performance and a very interesting roadmap ahead; I'd hope so. The question for me on the consumer side is: what do these features do for me, and at what cost, compatibility, and long-term supportability? Also, what kind of platform am I investing in? Will I encounter lots of artificial segmentation and a whammy of configurations that force me to make brand-new considerations and purchase decisions? Will the end result be a flaming pile of dung like Xeon Phi, which this mesh architecture and more is based on? What cost will that have if I become married to such a platform, versus what the broader, more open-standard industry is trending towards?

The problem here is whether or not Intel can K.I.S.S. and provide good performance/value, both for their own sake in terms of process cost and for the sake of whether their solutions appeal to their consumers. You have to make a lot more decisions about what road to take beyond just engineering; there are cost tradeoffs, complexity tradeoffs, and a whole slew of business considerations that weigh heavily on the success of a particular approach. So, they had EMIB, and have had it for some time... Why the wait in terms of using it BROADLY? Why did AMD beat them to market in a big way? I look at Intel like a number of giant tech companies that start losing focus, shooting blanks, and engaging in a number of costly moves that don't translate to their customers, simply because they have too much money/success, and this results in a lack of streamlined focus.

Outside of their approach is a broader industry in a great deal of flux. There are big changes happening in the data center and in architectures. A good deal of tasks originally needing a CPU are being pushed down to open-standard protocols on sub-compute modules closer to things like storage. FPGAs are ubiquitous, and everyone's using them everywhere. Outside Intel, the industry-standard open fabs have better processes. HBM2 is being used everywhere. So, the question ultimately comes down to: can they deliver, compete, and deliver better value/performance? They're no longer a premium brand IMO. They're no longer even the leader in a particularly innovative approach. Non-Intel fabs are going to ship a smaller process before them. I have no doubt about their huge baskets of competitive technology. The issue is: they've slowly become an island, whereas the industry is ever more unified on standardized and even more performant processes, designs, and protocols.

Mesh + EMIB seems like Intel had this dream that they'd be the only game in town, creating this heavily proprietary, scalable, chained SoC paradigm down the road. This is 100% wrong. Competition has actually increased substantially, and the industry is consolidating on open standards and interconnects.

So, where does that leave Intel? Behind the ball on just about everything, and at a huge price premium...

"Bloat". In terms of die area, it is way more efficient to do AVX-512 then to double the core count. So for vector workloads, AVX-512 is the way to go. The VNNI instructions will double deep learning performance yet again.

https://www.xcelerit.com/computing-...l-xeon-scalable-processor-vs-nvidia-v100-gpu/
https://news.ycombinator.com/item?id=15987796
If you look at Skylake-SP architecture vs. recent GPUs, the chip design at first glance doesn't seem so different anymore between these two, CPUs are just much less focused, which pays a 2x price in theoretical performance for the same die space, even using Intel's superior process technology. Now that being said, I think the GPU/SIMT model of vector computing is just much smarter. Why let me jump through all these hoops of masking and compiler optimizations if all I want is a branch and an early exit for a specific set of values? GPU schedulers and drivers make this easy to use and with somewhat predictable performance results. Furthermore (and probably more importantly), why is Intel putting this amount of compute power on a CPU without significantly upgrading memory bandwidth? A 28 core Skylake-SP using full vectorization now has 3x (!) the FLOP/Byte system balance compared to NVIDIA P100. Seriously? System balance was once an argument against GPUs, but not anymore apparently...
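
That system-balance claim is easy to sanity-check with spec-sheet numbers; the exact ratio depends on the AVX-512 all-core clock you assume, so treat this as an estimate landing in the 2-3x range rather than a measurement:

Code:
/* FLOP/byte balance: 28-core Skylake-SP (assumed 2.0 GHz AVX-512 all-core
   clock) vs. NVIDIA P100 datasheet numbers. */
#include <stdio.h>

int main(void) {
    /* 28 cores x 2 FMA units x 16 fp32 lanes x 2 ops (mul+add) x clock */
    double skl_flops = 28 * 2 * 16 * 2 * 2.0e9;   /* ~3.6 TFLOP/s fp32 */
    double skl_bw    = 6 * 2.666e9 * 8;           /* 6ch DDR4-2666, ~128 GB/s */

    double p100_flops = 9.3e12;                   /* ~9.3 TFLOP/s fp32 */
    double p100_bw    = 732e9;                    /* 732 GB/s HBM2 */

    printf("SKL-SP: %.0f FLOP/byte\n", skl_flops / skl_bw);     /* ~28 */
    printf("P100  : %.0f FLOP/byte\n", p100_flops / p100_bw);   /* ~13 */
    return 0;
}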

Fun fact: AVX-512 came from the design that is now known as Xeon Phi.

The Flaming turd Phi meme lives on.

I changed some Golang code to AVX in my last project. In isolation that code ran like 2-4x faster, but as part of the full program, the program was 5% slower overall. Could never make sense of it. Any thoughts on how to determine the cause?

AVX code is known to make the CPU run way hotter than usual. Perhaps that caused throttling that made general code running at the same time, or within a short span thereafter, perform worse?
The famed inferno performance of AVX

Running AVX-512 on all cores will lower the all-core operating frequency, in order to maintain the TDP of the processor.
When you run it in isolation, there's more headroom, since only one core runs at the higher frequency and uses the AVX-512 registers.
Also, the power license ensures that cores running AVX-512 code run at a lower frequency.
Make sure to guard your AVX512 code block with VZEROUPPER when exiting it
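
A minimal sketch of that advice (modern compilers usually insert vzeroupper at function boundaries automatically; it is explicit here only to illustrate the transition-penalty issue the posters describe):

Code:
/* Scale an array with AVX-512, then clear dirty upper vector state so
   later legacy-SSE code doesn't pay transition penalties.
   Compile with -mavx512f. */
#include <immintrin.h>
#include <stddef.h>

void scale_f32(float *dst, const float *src, float k, size_t n) {
    __m512 vk = _mm512_set1_ps(k);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i,
                         _mm512_mul_ps(_mm512_loadu_ps(src + i), vk));
    for (; i < n; i++)
        dst[i] = src[i] * k;   /* scalar tail */
    _mm256_zeroupper();        /* emits vzeroupper */
}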


At some point, Intel is going to have to get its head out of the clouds, propped up by many years of success and vaults of cash, and look to the practicality of things and the broader trend of the open and collaborative market. I wish them the best, but they're nowhere on the radar of consideration in the years ahead. There is far too much happening elsewhere that is more performant, practical, supported, and cheaper. They're becoming a meme... a premium meme in every aspect of their business and process.
 