Is there a nice article that shows how performance depends on the number of parameters in the model? Like, what's the smallest reasonable model?
I don't know of an article.
https://old.reddit.com/r/LocalLLaMA/ is very active and if you read it for a bit you get used to the lingo and the latest models.
On performance: one of the other folk mentions coding below. There is a series of models dedicated to coding, the Qwen 2.5 Coder family. They come in sizes from 0.5b up to 32b, and the 32b is quite good. Folk are running it on a single RTX 3090 at a decent quant (see below) to fit within the VRAM limit with some context as well.
One aspect of performance is the tokens generated per second. Inference is mostly a memory-bandwidth-limited workload: the entire model needs to be exercised for every token. So a 32b model at q4 (https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF) needs to stream roughly 20GB of weights for every token. So a rough rule of thumb is that the theoretical upper limit on performance is VRAM memory bandwidth / model size = tokens/s. There is overhead of course, and then the optimisation of the actual backend you are using. There is also prompt processing, which needs to run before tokens start to be generated.
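To make that rule of thumb concrete, here is a back-of-envelope sketch. The bandwidth number is the RTX 3090's spec-sheet figure and the 20GB is the approximate size of the q4 file linked above; both are ballpark assumptions, and real backends land below the ceiling.

```python
# Back-of-envelope decode-speed ceiling: every generated token streams the
# full set of weights through the GPU's memory bus once, so bandwidth / size
# bounds tokens per second. Numbers below are rough spec-sheet assumptions.

def max_tokens_per_second(membw_gb_s: float, model_size_gb: float) -> float:
    """Theoretical upper limit on decode speed for a memory-bound workload."""
    return membw_gb_s / model_size_gb

rtx_3090_membw = 936.0   # GB/s, spec-sheet memory bandwidth of a single RTX 3090
qwen_32b_q4 = 20.0       # GB, approximate size of the q4 GGUF mentioned above

print(f"ceiling: ~{max_tokens_per_second(rtx_3090_membw, qwen_32b_q4):.0f} tokens/s")
# Real-world throughput is lower once backend overhead, KV-cache reads and
# prompt processing are accounted for.
```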
I think there is a lot of exaggeration out there. From what I've read they are just combining already-known efficiency techniques. It probably requires a lot more human labor to set it up, but being in China, where human labor is cheaper and AI hardware is restricted, it's a perfect fit. In the West they may keep doing more brute force.
Nothing to do with human labour. It's entirely due to the rumour that DeepSeek is a side project of a Chinese quant company: lots of very, very smart, math-heavy folk with access to some nice GPUs that sit idle overnight. The company is already profitable from its quant trading, so it doesn't need a business model for its AI models. The reported number of GPU hours is so low that if you were to rent them from a cloud provider to train the latest DeepSeek R1, it would cost about $5.5mil. But this is a rumour. Could just be a brag.
There is a cut-down 7 billion parameter DeepSeek model that will fit in 16GB, just as there are other cut-down models...
Commonly, models are not run with the weights in BF16 format, but rather at 8 bits or lower: the weights are quantised. So a 7b model can run in 7GB of VRAM + overhead + context, or at q4 in about 3.5GB of VRAM but with lower quality. It depends on what you need.
The rule of thumb is that the larger the model, the smaller the quant you can run without losing quality. So 72b models are often run at q4, spread across a couple of GPUs. A 3b model at q4 may perform terribly, i.e. generate nonsense, get lost in its own narrative, or lose sight of the context and prompt the user passed in.
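As a rough sizing sketch, the weights take roughly parameters times bits per weight. Note that practical "q4" GGUF quants (e.g. Q4_K_M) keep some tensors at higher precision, so real files land a bit above the pure 4-bit figure, which is why the 32b file upthread is closer to 20GB than 16GB.

```python
# Rough weight-memory estimate for a quantised model: parameter count times
# bits per weight. KV cache and runtime overhead come on top of this, and
# mixed-precision quants (e.g. Q4_K_M) land somewhat above the pure figure.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the quantised weights alone, in GB."""
    return params_billion * bits_per_weight / 8

for params, bits in [(7, 8), (7, 4), (32, 4), (72, 4)]:
    print(f"{params}b at q{bits}: ~{weights_gb(params, bits):.1f} GB of weights")
# 7b at q8: ~7.0 GB, 7b at q4: ~3.5 GB, 32b at q4: ~16.0 GB, 72b at q4: ~36.0 GB
```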
DeepSeek enables smaller but very capable models to run on very few GPUs. Someone on X/Twitter said he now runs his own instance locally on his 4090 and it does code gen. It's nothing short of a miracle for a developer: no internet connection needed, no subscription needed, and no query limits. Like, it's totally free. And as capable as o1.
This is nothing new. Folk have been running AI models locally for a long while. To run the full DeepSeek R1 at home, you would need a serious amount of hardware and then be happy with a serious power bill. Most likely they are running a smaller distilled version, which doesn't have R1's full quality but is still worth running. But software developers have been using AI models for a while now to help scale their own personal productivity.
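For scale, the same parameters-times-bits arithmetic shows why the full R1 is out of reach for a single consumer card (the 671b figure is the publicly stated total parameter count; treat the rest as ballpark):

```python
# Why running the full DeepSeek R1 at home needs serious hardware: at 671b
# total parameters, even an aggressive q4 quant of the weights alone dwarfs
# a 24GB consumer GPU. The distilled 7b to 70b variants are what most people
# actually run locally.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

print(f"R1 full, q4: ~{weights_gb(671, 4):.0f} GB")   # ~336 GB of weights
print(f"R1 full, q8: ~{weights_gb(671, 8):.0f} GB")   # ~671 GB of weights
```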
I've been reading LocalLLaMA every day for the last 6 months or so, and AI models leapfrog each other every month in capability terms. China is very strong in AI, but so is the West. The Nvidia bubble was always going to pop at some point, and it seems a mere rumour of a $5.5mil training cost was enough to call capital expenditure on Nvidia hardware into question.
Seems investors are skittish. Good.