A user on r/LocalLLaMA reported on May 12 that an Optane local LLM desktop build ran Moonshot’s Kimi K2.5 at about 4 tokens per second using discontinued Intel Optane Persistent Memory, a 12GB RTX 3060, and llama.cpp.
The hardware, model family, and software path are all documentable from official sources. The 4 tokens per second figure is not vendor-confirmed — it comes from the builder’s own report.
Intel Optane Persistent Memory powers a local trillion-parameter run
The builder said the system ran Kimi K2.5 locally and described it as a 1 trillion parameter model. VentureBeat reported that Moonshot did not publicly disclose K2.5’s parameter count, but said the model is based on the earlier Kimi K2 lineage, which it described as a 1 trillion total parameter mixture-of-experts model with 32 billion activated parameters.
Moonshot’s official Hugging Face page for Kimi-K2.5 confirms the model is real and provides deployment guidance, but the model card does not make the same desktop performance claim. It says K2.5 is available through Moonshot’s API and recommends running it with vLLM, SGLang, or KTransformers.
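For contrast with the hobbyist build, the officially recommended path looks like a standard serving-engine launch. The sketch below is illustrative only: the model id comes from Moonshot's Hugging Face page, but the parallelism flag is our assumption, since a trillion-parameter MoE is normally served across multiple data-center GPUs rather than a single 12GB card.

```bash
# Moonshot's recommended serving path, sketched with vLLM. Only the model
# id is from the model page; --tensor-parallel-size 8 is an illustrative
# multi-GPU setting, not a documented requirement.
vllm serve moonshotai/Kimi-K2.5 --tensor-parallel-size 8
```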
The reporting split is therefore simple: the Optane local LLM setup and the K2.5 model family are supported by public documentation, while the exact “~4 tok/s” figure is a user-reported benchmark.
The build pairs a 12GB RTX 3060 with 768GB of Optane and DDR4
According to the Reddit post, the machine used an Intel Xeon Gold 6246, an RTX 3060 12GB, 6×32GB DDR4 ECC RDIMMs for 192GB of DRAM, and 6×128GB Intel Optane DCPMM modules for 768GB of Optane capacity.
Intel’s product page for the Intel Optane Persistent Memory 100 Series 128GB Module identifies those sticks as persistent memory modules and lists their status as Discontinued. Intel also lists the 128GB module launch date as Q2 2019.
Intel’s 200 Series product brief describes Optane PMem as a way to provide very large memory pools, including up to 6TB total memory per socket. That brief covers a newer generation than the one in the reported build, but it matches the basic role the builder described: adding a lot more addressable memory than a typical desktop would have.
The Reddit post said the machine used Optane in Memory Mode. In that mode, the PMem appears to the system as RAM while conventional DRAM acts as a cache layer in front of it.
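For readers with similar hardware, Memory Mode is normally provisioned with Intel's ipmctl utility before the OS can see the PMem as RAM. The following is a generic sketch of that step; the builder did not publish their exact setup commands.

```bash
# Provision 100% of installed Optane PMem capacity in Memory Mode, so the
# OS presents it as system RAM with the DRAM acting as a cache in front.
# Generic ipmctl usage, not taken from the Reddit post.
sudo ipmctl create -goal MemoryMode=100

# The new memory allocation goal is applied on the next reboot.
sudo reboot
```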
The model and software stack are both real, but the speed claim is user-reported
The most interesting part of the Optane local LLM report is that the software knobs named by the builder are real llama.cpp options, and they line up with the placement strategy described.
The official llama.cpp completion README documents:
- --override-tensor for overriding tensor buffer type
- --cpu-moe or -cmoe to keep Mixture of Experts weights on the CPU
- --gpu-layers or -ngl with exact counts, auto, or all
The builder said attention weights, the dense layer, shared experts, and routing components fit on the 12GB RTX 3060 using --override-tensor, while the sparse expert weights stayed largely in PMem/DRAM and were processed from there as needed. They also said they got workable results using -ngl auto and -cmoe without manually placing every tensor.
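Putting those pieces together, commands along the following lines would express the placement the builder described. This is a sketch rather than the builder's published invocation: the GGUF filename is a placeholder, and the tensor-name pattern assumes the usual GGUF naming for MoE expert tensors.

```bash
# Sketch 1: the simple route the builder said worked. Let llama.cpp pick
# GPU layer counts and keep all MoE expert weights in CPU (PMem/DRAM) buffers.
./llama-cli -m kimi-k2.5-q4.gguf -ngl auto -cmoe -p "Hello"

# Sketch 2: manual placement. Offload layers to the GPU but pin the sparse
# expert tensors to CPU buffers with a name-pattern override. The pattern
# assumes expert tensors are named like "blk.N.ffn_*_exps.weight".
./llama-cli -m kimi-k2.5-q4.gguf \
  --gpu-layers all \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  -p "Hello"
```

Either way, llama.cpp streams the expert weights from system memory as tokens are generated, which is the role the 768GB Optane pool plays in this build.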
That description matches an MoE setup where only a subset of weights is active for each token. If you want the broader background on related inference tricks, NovaKnown has also covered speculative decoding, though that is a separate technique from the one reported here.
What official sources confirm about Optane local LLM components
Intel confirms that the memory modules in question are real Optane PMem hardware and that the product line is discontinued. Moonshot confirms that Kimi K2.5 is an officially published model. llama.cpp confirms that the command-line flags named in the post exist and are meant for the kinds of tensor placement the builder described.
What official sources do not confirm is the exact benchmark outcome for this Optane local LLM machine. Neither Intel, Moonshot, nor the llama.cpp docs publish a result showing Kimi K2.5 on a desktop with 768GB of Optane and a 12GB GPU at ~4 tok/s.
That gap matters mostly because it tells you what is sourced where. The hardware status is from Intel. The model and quantization guidance are from Moonshot. The inference controls are from llama.cpp. The throughput figure is from a user test shared publicly on Reddit.
Moonshot’s model page also says K2.5 uses the same native int4 quantization method as Kimi-K2-Thinking, which fits with the builder’s account of running a quantized version locally. NovaKnown recently covered the next model in the family in Kimi K2.6.
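A rough capacity check (our arithmetic, not a figure from any of the sources) shows why that quantization matters for this build: at int4, a 1-trillion-parameter model needs roughly half a terabyte for its weights, which fits in the 768GB Optane pool with room left for DRAM caching.

$$10^{12}\ \text{params} \times 0.5\ \tfrac{\text{bytes}}{\text{param}} = 5 \times 10^{11}\ \text{bytes} \approx 500\ \text{GB}$$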
Key Takeaways
- On May 12, a Reddit user reported an Optane local LLM desktop build running Kimi K2.5 at about 4 tokens per second.
- The reported hardware used 768GB of Intel Optane Persistent Memory, 192GB of DDR4 ECC, and an RTX 3060 12GB.
- Intel’s product page confirms the 128GB Optane PMem 100 Series module is a real persistent memory module and is discontinued.
- Official llama.cpp docs confirm the --override-tensor, --cpu-moe, and --gpu-layers options cited in the build report.
- Moonshot confirms Kimi K2.5 and its quantization approach, but the reported desktop throughput remains a user-reported result, not a vendor benchmark.
Further Reading
- Intel Optane Persistent Memory 100 Series 128GB Module — Intel’s product page listing the 128GB PMem module and its discontinued status.
- Intel Optane Persistent Memory 200 Series Brief — Intel’s overview of Optane PMem as a large-capacity memory tier.
- llama.cpp Completion README — Official documentation for --override-tensor, --cpu-moe, and --gpu-layers.
- moonshotai/Kimi-K2.5 Model Page — Moonshot’s official model page with deployment and quantization notes.
- VentureBeat on Kimi K2.5 — Reporting on K2.5 and the earlier K2 model’s 1 trillion parameter lineage.
Originally published on novaknown.com