DeepSeek V4, llama.cpp Q4_K_M Support, and an Ollama Ryzen APU Guide Boost Local LLM Inference
Today's Highlights
New benchmarks showcase DeepSeek V4 Flash's exceptional token generation speed with MTP self-speculation and W4A16+FP8 quantization. Additionally, llama.cpp gains Q4_K_M support for DeepSeek V4 Pro, complemented by a practical Ollama setup guide for Ryzen APUs.
DeepSeek-V4-Flash Achieves 85 tok/s at 524k Context with MTP Self-Speculation (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t9em98/deepseekv4flash_w4a16fp8_with_mtp_selfspeculation/
This report details impressive local inference performance for the DeepSeek-V4-Flash model, achieving a remarkable 85.52 tokens/second at a massive 524k context window. For single-stream inference at 128k context, speeds reached approximately 111 tokens/second. These results were obtained using a combination of W4A16+FP8 quantization and MTP (Multi-Token Prediction) self-speculation, leveraging the processing power of two RTX PRO 6000 Max-Q GPUs.
The integration of advanced quantization, particularly pasta-paul's DeepSeek-V4-Flash-W4A16-FP8 quant, significantly reduces the model's memory footprint while maintaining high inference quality. MTP self-speculation further boosts throughput by using the model's multi-token-prediction head to draft several future tokens, which the main model then verifies in a single batched pass, so accepted drafts cost far less than generating each token sequentially. This benchmark highlights the potential for running large, high-performance models locally with extended context windows, pushing the boundaries of what's achievable on dedicated consumer-grade or prosumer hardware for local AI enthusiasts.
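To make the speculate-then-verify loop concrete, here is a minimal Python sketch of the idea behind MTP self-speculation. The function names and toy stand-ins are illustrative assumptions, not the actual DeepSeek-V4 or inference-engine APIs: a cheap draft head proposes a few tokens, and the full model accepts the longest prefix it agrees with.

```python
# Minimal sketch of self-speculative decoding (the idea behind MTP
# self-speculation). All model objects here are toy stand-ins.
from typing import Callable, List

def speculative_step(
    draft_next_tokens: Callable[[List[int], int], List[int]],
    verify_tokens: Callable[[List[int], List[int]], List[int]],
    context: List[int],
    num_draft: int = 4,
) -> List[int]:
    """Run one speculate-then-verify step and return the accepted tokens."""
    # 1. The cheap MTP head drafts `num_draft` candidate tokens.
    draft = draft_next_tokens(context, num_draft)
    # 2. The full model scores all drafted positions in one batched pass and
    #    returns the longest accepted prefix.
    return verify_tokens(context, draft)

# Toy stand-ins so the sketch runs end to end.
def toy_draft(context: List[int], n: int) -> List[int]:
    return [context[-1] + i + 1 for i in range(n)]

def toy_verify(context: List[int], draft: List[int]) -> List[int]:
    # Pretend the full model agrees with the first three drafted tokens.
    return draft[:3]

if __name__ == "__main__":
    print(speculative_step(toy_draft, toy_verify, [1, 2, 3]))  # -> [4, 5, 6]
```

When most drafted tokens are accepted, several output tokens are produced per full-model forward pass, which is where the reported throughput gains come from.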
Comment: Achieving over 85 tok/s at half-million context is incredible; MTP self-speculation combined with FP8 quantization is clearly a game-changer for high-throughput local inference, especially with models like DeepSeek-V4-Flash.
Detailed review and guide from my testing of local ollama setup with DeepSeek models (Ryzen APU's only) (r/Ollama)
Source: https://reddit.com/r/ollama/comments/1t94txm/detailed_review_and_guide_from_my_testing_of/
A comprehensive guide and review has been published detailing a local Ollama setup specifically tailored for DeepSeek models running on Ryzen APUs. This resource addresses a common challenge for users with integrated graphics, offering a step-by-step methodology to configure Ollama effectively. The guide focuses on optimizing performance and stability, providing practical advice for users looking to leverage their Ryzen APU's capabilities for local AI inference without dedicated GPUs.
The detailed walkthrough, available on GitHub (https://github.com/solutionstack/linux-ollama-stack-apu/blob/main/README.md), covers installation, model deployment, and performance tuning, making it an invaluable asset for self-hosters and hobbyists. It highlights how to extract maximum efficiency from a consumer-grade system, enabling users to run sophisticated models like DeepSeek locally. This initiative significantly lowers the barrier to entry for local AI, particularly for a segment of the user base often overlooked by GPU-centric guides, emphasizing the versatility and accessibility of open-weight models with tools like Ollama.
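Once a setup like the one in the guide is running, the local Ollama server can be queried over its standard HTTP API. Below is a small Python sketch using Ollama's /api/generate endpoint; the "deepseek-v4" model tag is an assumption for illustration, so substitute whatever tag you actually pulled.

```python
# Query a locally running Ollama server via its standard /api/generate API.
# The "deepseek-v4" model tag is a placeholder assumption.
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "deepseek-v4") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ollama_generate("Explain what an APU is in one sentence."))
```

This keeps the APU-side configuration entirely in Ollama while client code stays hardware-agnostic.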
Comment: Finally a solid, detailed guide for Ollama and DeepSeek on Ryzen APUs! This is exactly what many of us without dedicated GPUs need to get started with local inference.
I have DeepSeek V4 Pro at home (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1t94ito/i_have_deepseek_v4_pro_at_home/
A user has successfully implemented and shared details on running DeepSeek V4 Pro locally, specifically highlighting the use of a slightly modified llama.cpp CUDA repository. This modification adds crucial Q4_K_M conversion support, a popular quantization technique that balances model size reduction with inference quality, making larger models feasible on consumer-grade hardware. The underlying llama.cpp-deepseek-v4-flash-cuda repository (linked in the original post) provides the necessary framework for leveraging CUDA acceleration with DeepSeek V4.
This development is particularly significant for the llama.cpp community, as it expands its compatibility with new, high-performing open-weight models like DeepSeek V4 Pro and introduces specific quantization capabilities. Users interested in replicating this setup can explore the linked CUDA repo, which offers the tools to convert and run DeepSeek V4 with optimized performance on NVIDIA GPUs. It underscores the continuous efforts within the open-source community to adapt and improve local inference tools for the latest LLMs.
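For readers unfamiliar with what a "Q4_K_M" quant actually stores, the following is a deliberately simplified Python sketch of block-wise 4-bit quantization. Real Q4_K_M uses 256-element super-blocks with per-sub-block scales and minimums; this toy version keeps one scale and minimum per 32-value block, just to show why 4-bit indices plus a few block-level floats recover the weights with modest error.

```python
# Simplified block-wise 4-bit quantization sketch (not the real Q4_K_M layout).
from typing import List, Tuple
import random

BLOCK = 32  # values per block

def quantize_block(values: List[float]) -> Tuple[float, float, List[int]]:
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0            # 4 bits -> 16 levels
    q = [round((v - lo) / scale) for v in values]
    return lo, scale, q                       # stored per block

def dequantize_block(lo: float, scale: float, q: List[int]) -> List[float]:
    return [lo + scale * x for x in q]

if __name__ == "__main__":
    weights = [random.uniform(-1, 1) for _ in range(BLOCK)]
    lo, scale, q = quantize_block(weights)
    restored = dequantize_block(lo, scale, q)
    err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs error after 4-bit round trip: {err:.4f}")
```

The actual conversion is done with llama.cpp's own tooling from the linked repository; the sketch only illustrates the size/quality trade-off the summary describes.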
Comment: Great to see DeepSeek V4 Pro getting llama.cpp CUDA support with Q4_K_M quantization! This is a practical win for running a powerful model on local GPUs with good efficiency.