I think future models should separate out the "Intelligence" and "Knowledge" parts, rather than ever-increasing model sizes. The knowledge part can be terabytes and should not require training; it only needs processing into a machine-optimized format/database. Something like Brain + Library/...
A Ryzen 5950X does 7 tokens/sec for 8B and 4 tokens/sec for 14B models. LM Studio seems to limit the CPU thread count to 16; as I remember, I was able to set 32 earlier.
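Those numbers line up with CPU token generation being memory-bandwidth-bound rather than compute-bound. A rough sketch of the ceiling, assuming dual-channel DDR4-3200 (~51.2 GB/s theoretical peak) and roughly Q4 quantization (~0.56 bytes/parameter including overhead) — both figures are assumptions, not measurements:

```python
# Back-of-envelope: each generated token streams roughly the whole model
# through memory once, so tokens/sec <= bandwidth / model size in bytes.

def est_tokens_per_sec(params_billions: float,
                       bytes_per_param: float,
                       bandwidth_gbps: float) -> float:
    """Upper-bound estimate assuming one full pass over the weights per token."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / model_bytes

BW = 51.2          # assumed dual-channel DDR4-3200 peak, GB/s
BYTES_PER_PARAM = 0.56  # assumed Q4-ish quantization with overhead

for params in (8, 14):
    ceiling = est_tokens_per_sec(params, BYTES_PER_PARAM, BW)
    print(f"{params}B model: ~{ceiling:.1f} tok/s theoretical ceiling")
```

That gives roughly 11 tok/s for 8B and 6.5 tok/s for 14B as ceilings; observed 7 and 4 tok/s are plausible once real-world bandwidth falls short of peak, which also suggests why more than 16 threads wouldn't help much here.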
I think AMD missed a big free marketing opportunity with the delay of RDNA4. I would assume RDNA4 would have run circles around the 7900 XTX running local LLMs.
I don't disagree with your statement. However, I guess what we view as acceptable speed is different :) . For an 8B model I am getting 40 tokens/sec; for 14B, 7 tokens/sec.
The thing I love most about these models is not the actual answer but the thought process. It is very insightful.
I guess you used an x64 guest OS on an x64 host OS. For the most part this will run at native/near-native speed (it is virtualization, not emulation). However, it is not the same when you run an x64 OS on top of an ARM OS, where x64 instructions need to be translated to ARM instructions.
AMD should go one up on NVidia by adding custom fixed-function hardware to do 8x frame-gen :) . This would do similar damage to NVidia as Intel's QuickSync did to AMD's hopes for APUs in video compression.