adroc_thurston
Diamond Member
> If AMD can compete with N3P on client, why would they pay for N2?
Cuz it's better.
> Of course having lots of VRAM would be better than having CXL backed VRAM.
Unified Memory rules.
> The concept is much simpler. In CXL.mem, the memory of the CXL device can be mapped into another device's address space, so it appears as ordinary memory there.
Do you need CXL for that? I think in the NVIDIA control panel you can set an option that is functionally equivalent, and llama.cpp probably documents what to set so that a model too large for the GPU spills to system memory instead of throwing an out-of-memory error, but I would need to dig through the docs to double-check that I am not hallucinating anything.
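To make the quoted CXL.mem point concrete: on Linux, memory on a CXL expander typically shows up as an extra, CPU-less NUMA node, so ordinary allocation APIs can hand it out like any other RAM. A minimal sketch with libnuma; treating the CXL capacity as NUMA node 1 and the 1 GiB size are illustrative assumptions, not something stated in the thread:

```c
/* Minimal sketch: allocate from a CXL.mem-backed NUMA node via libnuma.
 * Assumes the CXL expander shows up as node 1 (check with `numactl -H`).
 * Build: gcc cxl_alloc.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    const int cxl_node = 1;          /* assumption: CXL memory = node 1 */
    const size_t size = 1ull << 30;  /* 1 GiB */

    /* The pointer that comes back is ordinary memory; loads and stores
     * are simply routed across the link, like any remote NUMA access. */
    void *buf = numa_alloc_onnode(size, cxl_node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0xAB, size);         /* touch it like normal memory */
    printf("1 GiB allocated and written on node %d\n", cxl_node);

    numa_free(buf, size);
    return 0;
}
```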
> Unified Memory rules.
Currently there is no cheap but fast solution.
> Do you need CXL for that? I think in the NVIDIA control panel you can set an option that is functionally equivalent...
CXL allows normal memory mapping without software tricks, and it works for any GPU workload.
Do you have a link where someone is running this on DT CPUs with client GPUs?
> This is what I have been able to find quickly, but I have not tried to check the CUDA docs to see what the limitations are. I guess CXL is much more flexible.
Guilty of being exclusively red.
> Not sure I agree that N2 is worth the cost bump over N3 for all applications. If AMD can compete with N3P on client, why would they pay for N2?
Every other client SoC of worth is using N2 in some form in the same timeframe.
Of course having lots of VRAM would be better than having CXL backed VRAM.
But it does not swap. A memory access to any address mapped to another device is simply routed; it is just like any NUMA access.
This is not my own explanation, the standard simply works that way.
Also, most competitive models these days use MoE, so the entire model is not activated; DeepSeek is a case in point. There are hotspots, but you don't need to access everything: the whole model has to be loaded, yet only a fraction of it gets accessed, depending on which experts are engaged.
The main advantage is being able to load the full model without getting distilled to oblivion.
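To put rough numbers on the MoE point above: the whole model has to be resident, but only a small slice is read for any given token. A back-of-the-envelope sketch; the 671B/37B split follows the published DeepSeek-V3 figures, while the 1-byte-per-parameter (FP8) assumption is mine:

```c
/* Back-of-the-envelope: resident size vs. per-token traffic for an MoE.
 * DeepSeek-V3 has ~671B total parameters but only ~37B activated per
 * token; 1 byte/param assumes FP8 weights (an assumption).
 */
#include <stdio.h>

int main(void) {
    const double total_params    = 671e9; /* must be loaded somewhere */
    const double active_params   = 37e9;  /* actually read per token  */
    const double bytes_per_param = 1.0;   /* FP8 (assumption)         */

    double resident_gb  = total_params  * bytes_per_param / 1e9;
    double per_token_gb = active_params * bytes_per_param / 1e9;

    printf("Resident weights : %.0f GB\n", resident_gb);
    printf("Read per token   : %.0f GB\n", per_token_gb);
    printf("Fraction touched : %.1f %%\n",
           100.0 * active_params / total_params);
    return 0;
}
```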
> Do you need CXL for that? I think in the NVIDIA control panel you can set an option that is functionally equivalent...
NV has had proprietary implementations of a UVM since before the AI craze started, though I was under the impression that you wanted/needed NVLink in hardware to make it work.
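For reference, the non-NVLink face of that UVM is CUDA managed memory: on Linux with a Pascal-or-newer GPU, a cudaMallocManaged allocation can exceed VRAM and pages migrate over plain PCIe on demand. A minimal sketch; the 32 GiB size is an arbitrary assumption picked to overshoot a typical card's VRAM:

```c
/* Minimal sketch: oversubscribing VRAM with CUDA managed memory (UVM).
 * On Linux with a Pascal-or-newer GPU the allocation may exceed VRAM;
 * pages migrate between host and device on demand over PCIe.
 * The 32 GiB size is an arbitrary assumption chosen to exceed VRAM.
 * Build: nvcc uvm_oversub.c -o uvm_oversub
 */
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const size_t size = 32ull << 30;   /* 32 GiB, likely larger than VRAM */
    unsigned char *buf = NULL;

    cudaError_t err = cudaMallocManaged((void **)&buf, size, cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }

    memset(buf, 0x42, size);           /* populate from the CPU side */

    /* Hint the driver to migrate pages to GPU 0; whatever does not fit
     * ends up back in system RAM and is faulted in when the GPU touches it. */
    cudaMemPrefetchAsync(buf, size, 0, 0);
    cudaDeviceSynchronize();

    printf("managed allocation of %zu bytes survived oversubscription\n", size);
    cudaFree(buf);
    return 0;
}
```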
> Ahh... I am the head of a software department (and hardware). Exactly what percentage of computer sales do you think people like you and I represent?
> In my office, we have mostly people using Windows PCs... even in development. Note, we are heavily embedded vs cloud and mobile apps.
> The mobile group does use mostly Macs (because Apple is such a PITA and won't cross-compile like EVERY other OS on the planet). THIS and THIS alone is why our mobile developers use a Mac. The cloud group is a mix of Mac, Windows and Linux. Embedded is 100% PC, and client computing is about 80% Windows.
> Still, the VAST majority of business computer users are Windows. Look it up (around 85% IIRC).
I didn't say anything about total Mac vs Windows business market share. All I said was that software development on Windows sucks.
> Do you need CXL for that? I think in the NVIDIA control panel you can set an option that is functionally equivalent...
AFAIK (could be wrong here) that would require the CPU to be involved, whereas with CXL the device can talk to the memory directly, bypassing the CPU completely.
> Currently there is no cheap but fast solution.
I do not believe there will EVER be a solution that is both "cheap" and "fast".
A 1024-bit-wide APU to attach the 1 TB+ needed for the likes of full-blown DeepSeek models ❌ not AMD's style, the GPU on an APU is too small
An MI325X-style interconnect to get 768 GB ❌ too expensive
Before CXL can be bottlenecked by PCIe 5.0 x16 bandwidth (~64 GB/s per direction, ~128 GB/s both ways), the ~128 GB/s system-RAM bottleneck of dual-channel DDR5-8000 is already hit.
Need wider buses
The poor man's LLM is obviously poor in performance.
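The link-versus-DRAM comparison above is easy to sanity-check with peak numbers; a quick sketch (theoretical maxima only, sustained bandwidth will be lower):

```c
/* Quick sketch: theoretical peak bandwidths being compared in this thread.
 * PCIe 5.0 = 32 GT/s per lane with 128b/130b encoding; DDR5 moves 8 bytes
 * per transfer per channel.
 */
#include <stdio.h>

int main(void) {
    /* PCIe 5.0 x16, one direction */
    double pcie5_lane_gbps = 32.0 * 128.0 / 130.0;         /* ~31.5 Gbit/s */
    double pcie5_x16_gBps  = pcie5_lane_gbps * 16.0 / 8.0; /* ~63 GB/s     */

    /* Dual-channel DDR5-8000 */
    double ddr5_gBps = 8000e6 * 8.0 * 2.0 / 1e9;           /* 128 GB/s     */

    printf("PCIe 5.0 x16 (per direction): %.0f GB/s\n", pcie5_x16_gBps);
    printf("PCIe 5.0 x16 (both ways)    : %.0f GB/s\n", 2.0 * pcie5_x16_gBps);
    printf("Dual-channel DDR5-8000      : %.0f GB/s\n", ddr5_gBps);
    return 0;
}
```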
CXL allows normal memory mapping without software tricks, and it works for any GPU workload.
On MI300C/X with a matching Turin it should spill to RAM normally (it has IF 4.0) when things don't fit in the HBM cache. It has Unified Memory, after all.
Do you have a link where someone is running this on DT CPUs with client GPUs?
All I have seen is either purely CPU inference or people running super-distilled stuff.
Cuz it's better.
> Every other client SoC of worth is using N2 in some form in the same timeframe.
If AMD felt this strategy was a good path, Zen 5 desktop would be on N3B, not N4P.
N3 is N20, N2 is N16, it is betterer and gooder with no downsides, perf/$ is at least equal.
> If AMD felt this strategy was a good path, Zen 5 desktop would be on N3B.
N3B was delayed (and was crap too), so they had to de-risk and change to N4; that's most likely why Zen 5 fell behind the expected gains. It would have been great on N3E.
> If AMD felt this strategy was a good path, Zen 5 desktop would be on N3B, not N4P.
N3B is silly and uneconomical.
> But how about Zen5c, which supposedly uses N3E?
The C design had to use N3, as it otherwise wouldn't be dense enough. Since it had to be exactly the same architecture, they could not implement new stuff in it versus the N4 version; they shrunk what they had on hand, which was a design that had to work on N4.
> But how about Zen5c, which supposedly uses N3E?
Favela parts have more comp pressure from ARM communists.