In my tests, llama.cpp is 26.8% faster than Ollama. I upgraded both to their latest versions and ran the same DeepSeek R1 Distill 1.5B model with the same settings on the same hardware, so it's an apples-to-apples comparison.
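For anyone wanting to reproduce this kind of comparison: measure tokens/second in each runtime (e.g. llama.cpp ships a `llama-bench` tool, and `ollama run --verbose` reports an eval rate), then compare the two throughputs. A minimal sketch of the arithmetic, with hypothetical throughput numbers of my own choosing (only the 26.8% figure comes from the actual runs):

```python
# Hypothetical tokens/second numbers, for illustration only.
llama_cpp_tps = 63.4   # assumed llama.cpp throughput
ollama_tps = 50.0      # assumed Ollama throughput

# Relative speedup of llama.cpp over Ollama, in percent.
speedup_pct = (llama_cpp_tps / ollama_tps - 1) * 100
print(f"llama.cpp is {speedup_pct:.1f}% faster")
```

Note that a fair comparison also needs matched settings: same quantization, same context size, same number of GPU-offloaded layers, and same sampling parameters.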
On-device AI reasoning (ODA-R) using speculative decoding: DeepSeek-R1-Distill-Qwen-1.5B as the draft model and DeepSeek-R1-Distill-Qwen-32B as the target. DSPy compiles the reasoning prompts for math, engineering, code...
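The mechanics of speculative decoding can be sketched with toy stand-in "models" (the function names and integer models below are mine, not from llama.cpp or any library): the cheap draft proposes a short run of tokens, and the target verifies them, keeping the longest agreeing prefix plus one corrected token.

```python
def greedy(model, prompt, n):
    """Plain greedy decoding with the target model alone (baseline)."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]

def speculative_decode(target_next, draft_next, prompt, n_draft=4, max_new=12):
    """Toy greedy speculative decoding.

    In a real implementation the target verifies the whole draft in a
    single batched forward pass; that batching is where the speedup
    comes from. This sketch only shows the accept/reject logic.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft model cheaply proposes a short run of tokens.
        ctx = list(out)
        proposed = []
        for _ in range(n_draft):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target checks each proposed token in order.
        for t in proposed:
            expected = target_next(out)
            if t == expected:
                out.append(t)          # accepted: a "free" token
            else:
                out.append(expected)   # rejected: keep target's token
                break
    return out[len(prompt):len(prompt) + max_new]
```

With greedy verification the output is guaranteed to be identical to what the target model would produce alone; the draft only changes how fast you get there.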
Training a model to reason in the continuous latent space, based on Meta's Coconut. If it all works, I will apply it to the MiniCPM-o SVD-LR. The endgame is a multimodal, adaptive, and efficient foundational on-device AI model.
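Coconut's core move: during latent "thought" steps, the model's last hidden state is fed straight back as the next input embedding instead of being decoded to a token and re-embedded. A toy pure-Python sketch of that loop, with a made-up one-layer stand-in for the model (everything here is illustrative, not Meta's code):

```python
import math
import random

random.seed(0)
d = 4  # toy hidden dimension
W = [[random.gauss(0, 0.4) for _ in range(d)] for _ in range(d)]
vocab = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]  # 5 token embeddings

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def layer(h, x):
    # Stand-in for a transformer pass: mix state and input, squash.
    return [math.tanh(a) for a in matvec(W, [hi + xi for hi, xi in zip(h, x)])]

h = [0.0] * d
# Standard decoding would project h to logits, pick a token, and feed
# that token's embedding back in. Coconut-style latent reasoning skips
# the token round-trip: the hidden state itself is the next input.
for _ in range(3):        # three continuous "thoughts"
    h = layer(h, h)
# Only at the end do we project back to the vocabulary.
logits = [sum(e_i * h_i for e_i, h_i in zip(e, h)) for e in vocab]
next_token = max(range(len(logits)), key=logits.__getitem__)
```

The appeal for on-device use is that latent steps avoid committing to a discrete token at every reasoning step, so the chain of thought stays compact and differentiable.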