Looking at this another way, how helpful is SME in a consumer benchmark? Helpful as in, "Is CPU Y faster in these workloads than CPU X?".
- Where does Geekbench use SME?
- Do users commonly use those workloads?
- How often do those workloads activate the CPU?
- How much "better" is a CPU that can run these workloads 2x or 10x or 100x faster?
Question 1: Where does Geekbench use SME?
Geekbench 6.3 (now the current release) uses Arm's SME in three subtests: Photo Library, Object Detection, and Background Blur. An important follow-up question is whether Geekbench's usage of SME is representative of how consumer applications and OSes use SME: that I do not know.
Question 2: Do users commonly use those workloads?
On-device object detection, photo library classification, and background blur do get used frequently, especially in mobile. These are not rare workloads:
- Photo capture: face recognition, scene recognition, object tracking
- Security: facial recognition
- Videoconferencing: background blur, object detection
- Photo library classification
- Apple's "remove background" feature
Question 3: How often do those workloads activate the CPU (vs GPU or NPU)?
I don't know, and this is the crucial piece, IMO. For some reference: on both Android and iOS, even though we've had NPUs for generations, the CPU still remains part of the puzzle. But everyone is a little vague about how much they rely on it, and on the face of it we'd expect much of this work to have shifted to NPUs already. The little info I've found:
Qualcomm's take (for short bursts of small, latency-sensitive models, use the CPU):
As previously mentioned, most generative AI use cases can be categorized into on-demand, sustained, or pervasive. For on-demand applications, latency is the KPI since users do not want to wait. When these applications use small models, the CPU is usually the right choice. When models get bigger (e.g., billions of parameters), the GPU and NPU tend to be more appropriate.
A personal assistant that offers a natural voice user interface (UI) to improve productivity and enhance user experiences is expected to be a popular generative AI application. The speech recognition, LLM, and speech models must all run with some concurrency, so it is desirable to split the models between the NPU, GPU, CPU, and the sensor processor. For PCs, agents are expected to run pervasively (always-on), so as much of it as possible should run on the NPU for performance and power efficiency.
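Read literally, that guidance boils down to a small dispatch heuristic. Here is a hypothetical sketch of it in Swift; the types, names, and the one-billion-parameter threshold are mine (taken from the wording of the quote), not any Qualcomm API:

```swift
// Hypothetical sketch of Qualcomm's guidance; not a real SDK API.
enum ComputeTarget { case cpu, gpu, npu }

struct Workload {
    let parameterCount: Int       // rough model size
    let isAlwaysOn: Bool          // "pervasive" use case (e.g. an always-on agent)
    let isLatencySensitive: Bool  // "on-demand" use case where the user is waiting
}

func pickTarget(for w: Workload) -> ComputeTarget {
    if w.isAlwaysOn { return .npu }                    // pervasive: favor power efficiency
    if w.isLatencySensitive && w.parameterCount < 1_000_000_000 {
        return .cpu                                    // small, bursty, latency-bound: CPU
    }
    return .gpu                                        // billions of parameters: GPU/NPU territory
}
```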
Apple's take: for apps that run in the background or run other GPU-intensive tasks, use the CPU (why not CPU+NPU, a.k.a. the ANE? I don't know):
Use MLComputeUnits.cpuOnly to restrict the model to the CPU, if your app might run in the background or runs other GPU intensive tasks.
Notably, Apple's Core ML requires that the CPU always be allowed to run the workload; the CPU cannot be excluded. Developers can selectively exclude the GPU and NPU, however.
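In practice this is a one-line configuration setting. A minimal sketch: the MLModelConfiguration/MLComputeUnits API is Apple's, but the SomeModel class name is a hypothetical stand-in for whatever class Xcode generates from your model file:

```swift
import CoreML

// Restrict inference to the CPU, per Apple's guidance for background / GPU-heavy apps.
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly  // alternatives: .all, .cpuAndGPU, .cpuAndNeuralEngine

do {
    // "SomeModel" is hypothetical; substitute your generated Core ML model class.
    let model = try SomeModel(configuration: config)
    // ... run predictions; Core ML will keep the work on the CPU.
} catch {
    print("Failed to load model: \(error)")
}
```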
I'd really like to see Google's take for Android (since Android will be the majority of ML workloads by volume of users and workloads). Unfortunately, I haven't found much. For now, this older 2019 paper (which uses even older CPUs) is only slightly helpful:
Even though most mobile inference workloads run on CPUs, optimizations of ML workloads with accelerators hordes most of the attention. There is a lot of room for optimizations on mobile CPUs to enable ML applications across different mobile platforms. [based on this even older 2018 data]
CPUs provide both the worst energy-efficiency as well as the worst throughput among all components. Still, they are critical for inferencing because they are commonly present across all mobile devices. Low-end mobile SoCs would lack accelerators like NPU. They may contain a low-end GPU, but maybe missing OpenCL support and thereby lack any inferencing capability. Network inference on CPU is inevitable and demands optimization considerations.
Question 4: How much "better" is a CPU that can run these workloads 2x or 10x or 100x faster?
Another crucial question, and I don't know; it depends on the answer to #3. If these workloads hit the CPU only once every 100 workloads (let's call one workload = one action in an app), being 100x faster on them verges on irrelevance. And given that Geekbench seemingly weights each subtest equally, would we make the same call? Is Photo Library really worth the same weight as HTML5? I don't think so.
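To make that "once every 100 workloads" intuition concrete, here's a back-of-the-envelope, Amdahl's-law-style sketch. The 1% share and 100x speedup are illustrative assumptions, not measurements:

```swift
// If SME-accelerated workloads are only 1% of what a user does, even a 100x
// speedup on them barely moves the overall experience.
let acceleratedShare = 0.01   // assumed fraction of total workload time
let speedup = 100.0           // assumed speedup on that fraction
let overall = 1.0 / ((1.0 - acceleratedShare) + acceleratedShare / speedup)
print("Overall speedup: \(overall)x")  // ≈ 1.01x, i.e. about 1% faster overall
```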