DEV Community

Jung Hyun, Nam
Speech, search, and Stable Diffusion — calling HuggingFace from C#

TL;DR. I wrote DotNetPy, a small C# library that calls into CPython via the C API.
Version 0.6.0 ships three working samples — semantic search with sentence-transformers, speech-to-text with Whisper, and text-to-image with Stable Diffusion Turbo — and a verification matrix that runs them under classic CPython 3.13 and the new free-threaded builds (3.13t / 3.14t).
The whole thing is Native AOT-compatible, and the per-call isolation that PEP 703 needs is exposed as a one-liner: Python.CreateIsolated().
Code: https://github.com/rkttu/dotnetpy.
NuGet: dotnet add package DotNetPy.


The problem

Every few months I run into the same pattern: I need a HuggingFace model — Whisper for transcripts, a sentence-transformer for retrieval, sometimes Stable Diffusion — and I'm working in C#.

The usual escape hatches all have downsides:

  • Convert to ONNX. Works for many vision/encoder models. Doesn't work for newer architectures, doesn't work for diffusion pipelines without a lot of effort, and the conversion itself is a separate project.
  • Stand up a Python micro-service. Now you've got two processes, two deployment stories, and a network hop in your hot path.
  • Call an external API. Costs money, requires internet, and your data leaves the box.
  • Use pythonnet or CSnakes. Solid choices. But pythonnet doesn't currently support Native AOT, and CSnakes pushes you into a Source Generator workflow. Neither has a public story for the free-threaded CPython builds yet.

I wanted a thinner option: write Python inline as a string from C#, pass arrays in, get JSON-shaped results back, and have the whole thing AOT-compile to a single binary. That's what DotNetPy is. The three samples below all run end-to-end on a Windows 11 laptop with no GPU.

Sample 1 — Semantic search with sentence-transformers

The first sample encodes a small corpus, embeds a query, and returns the top-K most similar sentences as a DotNetPyValue — a JSON-document wrapper that gets you back into the .NET world with GetString(), GetInt32(), GetDouble(), and path-based property access.

using DotNetPy;
using DotNetPy.Uv;

using var project = PythonProject.CreateBuilder()
    .WithProjectName("dotnetpy-ml-embeddings")
    .WithPythonVersion("==3.12.*")
    .AddDependencies(
        "sentence-transformers==2.7.0",
        "transformers==4.40.2",
        "torch>=2.2,<2.5")
    .Build();

await project.InitializeAsync();
var executor = project.GetExecutor();

executor.Execute(@"
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
");

var corpus = new[]
{
    "Python is a popular programming language for data science.",
    "C# and .NET are great for building enterprise applications.",
    "Rust offers memory safety without garbage collection.",
    "Pizza is delicious with various toppings.",
    // …
};
var query = "Tell me about programming languages";

using var hits = executor.ExecuteAndCapture(@"
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]
sims = corpus_emb @ query_emb
top_idx = np.argsort(-sims)[:3]
result = [
    {'rank': int(rank + 1), 'score': float(sims[i]), 'text': corpus[int(i)]}
    for rank, i in enumerate(top_idx)
]
", new Dictionary<string, object?> { { "corpus", corpus }, { "query", query } });

foreach (var hit in hits!.RootElement.EnumerateArray())
{
    Console.WriteLine($"  {hit.GetProperty("rank").GetInt32()}. " +
                      $"[{hit.GetProperty("score").GetDouble():F3}] " +
                      $"{hit.GetProperty("text").GetString()}");
}

Actual output:

  1. [0.578] Python is a popular programming language for data science.
  2. [0.370] C# and .NET are great for building enterprise applications.
  3. [0.203] Rust offers memory safety without garbage collection.

The interesting bit is the boundary: corpus is a .NET string[] and query is a .NET string, both arriving in Python as native lists/strings. The scored results come back as a JSON document that the .NET side reads with the usual JsonElement API.
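The ranking step inside that ExecuteAndCapture block can be sketched standalone. This is a dependency-free rewrite (plain lists instead of NumPy arrays, toy 2-D vectors instead of the real 384-dim MiniLM embeddings): because the embeddings are normalized, a dot product is cosine similarity, and sorting by descending score gives the top-K hits.

```python
# Dependency-free sketch of the top-K ranking step above.
# Toy 2-D vectors stand in for real sentence embeddings; all rows
# are assumed normalized, so dot product == cosine similarity.
def top_k(corpus_emb, query_emb, k=3):
    sims = [sum(a * b for a, b in zip(row, query_emb)) for row in corpus_emb]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
    return [{"rank": r + 1, "score": sims[i], "index": i}
            for r, i in enumerate(order)]

corpus_emb = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]   # already normalized
hits = top_k(corpus_emb, [1.0, 0.0], k=2)
# hits[0] is the exact match (score 1.0); hits[1] is the 0.8-similar row
```

The real sample gets the same effect from `corpus_emb @ query_emb` plus `np.argsort(-sims)`; the point is that the output is already JSON-shaped before it crosses the boundary.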

Sample 2 — Speech-to-text with Whisper

Same shape, different modality. Drop a .wav/.flac file in, get back text plus chunk-level timestamps. The audio bytes never cross the .NET ↔ Python boundary — Python opens the file directly, the boundary only carries the structured transcript.

var executor = project.GetExecutor();
executor.Execute(@"
from transformers import pipeline
import torch
asr = pipeline(
    'automatic-speech-recognition',
    model='openai/whisper-base.en',
    chunk_length_s=30,
    return_timestamps=True,
    torch_dtype=torch.float32,
)
");

using var transcript = executor.ExecuteAndCapture(@"
out = asr(audio_path)
chunks = [
    {'start': float(c['timestamp'][0]), 'end': float(c['timestamp'][1]),
     'text': c['text'].strip()}
    for c in out.get('chunks', [])
    if c['timestamp'][0] is not None and c['timestamp'][1] is not None
]
result = {'text': out['text'].strip(), 'chunks': chunks}
", new Dictionary<string, object?> { { "audio_path", audioPath } });

Console.WriteLine($"\"{transcript!.GetString("text")}\"");
foreach (var c in transcript.RootElement.GetProperty("chunks").EnumerateArray())
    Console.WriteLine($"  [{c.GetProperty("start").GetDouble():F2}s → " +
                      $"{c.GetProperty("end").GetDouble():F2}s] " +
                      $"{c.GetProperty("text").GetString()}");
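The chunk-shaping step above can be exercised against a mocked pipeline output. The guard in the comprehension matters: Whisper can emit a None end-timestamp for a trailing partial chunk, and the filter drops those rows so every start/end survives the float() conversion. (The mock dict here is illustrative, not a captured pipeline output.)

```python
# Mocked ASR pipeline output: one complete chunk, one with a None
# end-timestamp, as Whisper can produce for a trailing fragment.
out = {
    "text": " Ask not what your country can do for you. ",
    "chunks": [
        {"timestamp": (0.0, 11.0),
         "text": " Ask not what your country can do for you."},
        {"timestamp": (11.0, None), "text": " trailing fragment"},
    ],
}

# Same shaping logic as the sample: keep only fully-timestamped chunks.
chunks = [
    {"start": float(c["timestamp"][0]), "end": float(c["timestamp"][1]),
     "text": c["text"].strip()}
    for c in out.get("chunks", [])
    if c["timestamp"][0] is not None and c["timestamp"][1] is not None
]
result = {"text": out["text"].strip(), "chunks": chunks}
# result carries one chunk; the None-ended fragment is dropped
```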

Run against a public-domain JFK clip that ships with the sample:

"And so my fellow Americans, ask not what your country can do for you,
 ask what you can do for your country."

  [0.00s → 11.00s] And so my fellow Americans, ask not what your country can
                   do for you, ask what you can do for your country.

whisper-base.en is 290 MB; transcribing the 11-second clip takes about 7 seconds on CPU on my machine. Subsequent runs reuse the cached model and the venv — the only first-run cost is the initial download.

Sample 3 — Text-to-image with Stable Diffusion Turbo

stabilityai/sd-turbo is a one-step diffusion model. On CPU you get a 512×512 image in ~30 seconds; on a recent GPU you get one in ~2 seconds. The .NET side never sees the image bytes — Python writes the PNG to disk and hands back metadata.

executor.Execute(@"
import torch
from diffusers import AutoPipelineForText2Image
pipe = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/sd-turbo',
    torch_dtype=torch.float32,
    safety_checker=None, requires_safety_checker=False,
)
pipe.set_progress_bar_config(disable=True)
");

using var meta = executor.ExecuteAndCapture(@"
import time, os
t0 = time.time()
img = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
elapsed = time.time() - t0
out_path = os.path.join(out_dir, 'generated.png')
img.save(out_path)
result = {
    'path': out_path,
    'width': img.size[0],
    'height': img.size[1],
    'size_bytes': os.path.getsize(out_path),
    'elapsed_seconds': elapsed,
}
", new Dictionary<string, object?>
{
    { "prompt", "a serene mountain lake at sunset, oil painting style" },
    { "out_dir", outDir },
});

Console.WriteLine($"  Saved:   {meta!.GetString("path")}");
Console.WriteLine($"  Size:    {meta.GetInt32("width")}×{meta.GetInt32("height")} px, " +
                  $"{meta.GetInt32("size_bytes"):N0} bytes");
Console.WriteLine($"  Inference: {meta.GetDouble("elapsed_seconds"):F2}s");

Output:

  Saved:   .../samples/ml-image-gen/output/generated.png
  Size:    512×512 px, 434,242 bytes
  Inference: 31.19s

This is the pattern I want to highlight: only structured data crosses the boundary. The PNG bytes (~400 KB), the embedding matrices, the float32 tensors — they all stay Python-side. The .NET side sees prompts going in and small JSON objects coming out.
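The contract can be sketched in a few lines of plain Python: the heavy artifact is written to disk on the Python side, and the only thing serialized for the .NET caller is a small metadata dict. (The 400 KB byte string here is a stand-in for real PNG bytes.)

```python
import json
import os
import tempfile

# Stand-in for ~400 KB of PNG bytes produced by the pipeline.
payload = bytes(400_000)

# Heavy artifact stays Python-side, on disk.
out_path = os.path.join(tempfile.mkdtemp(), "generated.png")
with open(out_path, "wb") as f:
    f.write(payload)

# Only this small JSON object crosses the boundary to .NET.
result = {"path": out_path, "size_bytes": os.path.getsize(out_path)}
crossing = json.dumps(result)   # a couple hundred bytes, not 400 KB
```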

Install + first run

The library is just a NuGet package:

dotnet add package DotNetPy --version 0.6.0

If you want to follow the samples literally, samples/ in the repo has the three above plus a native-aot consumer that drives the AOT-published native DLL through its C exports — the path for embedding DotNetPy into C/C++/Rust hosts.

The ML samples use uv to provision Python + the HuggingFace stack declaratively from C#:

using var project = PythonProject.CreateBuilder()
    .WithProjectName("my-app")
    .WithPythonVersion("==3.12.*")
    .AddDependencies("transformers==4.40.2", "torch>=2.2,<2.5")
    .Build();

await project.InitializeAsync();

That's enough to get a working executor. No separate Python install, no manual venv juggling.

The part I actually care about: PEP 703 free-threaded Python

The interesting cliff for an interop library lands in 2025–26. CPython 3.13 introduced free-threaded builds (the t suffix: python3.13t). The GIL goes away, and concurrent threads can really run Python code in parallel for the first time. This is fantastic for ML serving — you want multiple inference workers sharing one process — and it breaks a lot of implicit invariants in libraries written against the classic GIL.

Specifically, pythonnet has been working through this. PR #2721 catalogues five categories of work needed:

  1. Refcount layout changes (ob_refcnt is now a split structure)
  2. Concurrent type/object cache races
  3. Reflection.Emit thread safety
  4. GCHandle slot ownership atomicity
  5. Finalizer / Py_Finalize race

For DotNetPy 0.6.0, I used pythonnet's PR as the audit lens. Four of those five categories don't apply to DotNetPy by design — it doesn't bridge .NET and Python type systems, doesn't subclass Python types from the CLR, doesn't use Reflection.Emit, doesn't expose GCHandle slots to Python, and doesn't call Py_Finalize. The fifth (finalizer / shutdown) was mitigated with an explicit PyGILState_Ensure guard around Py_DecRef in SafeHandle.ReleaseHandle.

What did surface from the audit, and got fixed in 0.6.0:

  • Internal scratch names in shared __main__ globals. Every helper variable (_json_result, _is_valid, …) is now minted per-call via Interlocked.Increment, so two concurrent callers don't race on the same slot.
  • Evaluate leaking a shared result global. Same fix — per-call unique sink, cleaned up in finally.
  • The two __main__-globals fixes interact in a subtle way: even after they shipped, the existing user-variable injection (the variables: parameter on Execute / ExecuteAndCapture) still wrote into shared __main__ globals. Two concurrent callers using the same user name would still collide.

The fix for that — and the most user-visible 0.6.0 addition — is a factory:

using var iso = Python.CreateIsolated();
iso.Execute("import json");
iso.Execute("data = {'k': 1}");   // only this executor sees `data`

CreateIsolated() produces an executor that owns its own Python dict, pre-populated with __builtins__. Each isolated executor coexists with the shared singleton and with other isolated executors; nothing leaks between them.
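At the CPython level, "owns its own Python dict, pre-populated with __builtins__" boils down to evaluating code against a private globals mapping. This pure-Python sketch (my own illustration, not DotNetPy's internals) shows why names assigned in one namespace never appear in another:

```python
import builtins

# Each "isolated executor" amounts to a fresh globals dict that only
# shares the builtins module — nothing else.
def make_isolated_globals():
    return {"__builtins__": builtins}

ns_a = make_isolated_globals()
ns_b = make_isolated_globals()

exec("import json", ns_a)
exec("data = {'k': 1}", ns_a)   # only ns_a sees `data`

assert ns_a["data"] == {"k": 1}
assert "data" not in ns_b       # no leakage between namespaces
```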

That makes the concurrent ML pattern obvious:

Parallel.For(0, Environment.ProcessorCount, threadId =>
{
    using var iso = Python.CreateIsolated();
    iso.Execute("import torch; from transformers import pipeline");
    iso.Execute(@"
asr = pipeline('automatic-speech-recognition',
               model='openai/whisper-base.en')
");
    using var r = iso.ExecuteAndCapture(@"
out = asr(audio_path)
result = {'text': out['text']}
", new Dictionary<string, object?> { { "audio_path", path } });

    Console.WriteLine(r?.GetString("text"));
});

On a free-threaded CPython build, that loop runs truly in parallel — every worker has its own asr pipeline and its own Python namespace. On the classic GIL build the same code is correct but serializes at the interpreter (and you'd hit the same wall regardless of which interop library you used).
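If you want the Python side to report which variant it's running under, a small probe works on any 3.x interpreter. Two signals are relevant: the Py_GIL_DISABLED build-config flag marks a free-threaded (t-suffix) build, and on 3.13+ sys._is_gil_enabled() reports whether the GIL is currently active (a free-threaded build can re-enable it at runtime).

```python
import sys
import sysconfig

# Py_GIL_DISABLED is set in the build config of free-threaded builds;
# it's None/0 on classic GIL builds.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() exists on 3.13+; fall back to True (GIL on)
# for older interpreters.
gil_active = (sys._is_gil_enabled()
              if hasattr(sys, "_is_gil_enabled") else True)

print(f"free-threaded build: {free_threaded_build}, GIL active: {gil_active}")
```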

I verified the matrix on three builds:

Python build                         | Unit tests   | Native AOT consumer
CPython 3.13 (GIL, auto-discovered)  | 209 / 1 / 0  | 8 / 8 ✅
CPython 3.13.13t (free-threaded)     | 205 / 5 / 0  | 8 / 8 ✅
CPython 3.14.4t (free-threaded)      | 205 / 5 / 0  | 8 / 8 ✅

The full audit lives in docs/FREETHREADED-AUDIT.md. It's deliberately a public document — when I claim "verified", you can read what that means.

Caveats, honestly

A few things to be straight about:

  • DotNetPy is 0.6.0. Experimental, not production-stable yet. Lots of patterns are still being worked out.
  • The Python ML stack itself isn't fully free-threaded yet. Torch's free-threaded support is in active migration. NumPy 2.1+ supports PEP 703. transformers and diffusers work, but their underlying C extensions are mixed. Until the upstream stack catches up, you'll get correctness under free-threaded Python from DotNetPy's interop layer, but Python-side ML performance may still serialize through library locks.
  • Native AOT publishing requires the platform C toolchain. On Windows that means the Visual Studio C++ build tools; on Linux you need clang/lld. Same constraint as any AOT'd .NET app.
  • JSON marshalling is the data plane. Every result variable is serialized in Python and deserialized in .NET via System.Text.Json. This is a deliberate trade-off for Native AOT compatibility. For workloads where this dominates (very large result objects), batch results into a single capture call and only return the small structured summary.
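That last trade-off is easy to demonstrate: reduce the large object Python-side and serialize only the summary. This sketch (illustrative numbers, not from the library) compares the two payloads:

```python
import json

# A 100x100 float matrix standing in for a large intermediate result.
matrix = [[float((i * j) % 7) for j in range(100)] for i in range(100)]

# Reduce Python-side; return only the few numbers the caller needs.
flat = [v for row in matrix for v in row]
result = {
    "rows": len(matrix),
    "cols": len(matrix[0]),
    "mean": sum(flat) / len(flat),
    "max": max(flat),
}

summary_json = json.dumps(result)   # tens of bytes across the boundary
full_json = json.dumps(matrix)      # tens of kilobytes if returned whole
```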

Where to go from here

If you've been wondering "how do I run a current HuggingFace model from C#" — I hope this is a useful answer. Comments, issues, and PRs welcome.
