Background
I needed high-quality instruction datasets for fine-tuning local LLMs, but commercial options were prohibitively expensive ($...
The memory leak problem you solved with explicit cleanup is the same class of issue we hit running n8n automation workflows over 72+ hour windows: agent frameworks that reuse object instances across many tasks tend to accumulate state references that the GC never collects, because the orchestrator holds a live reference to the agent object. Your solution (recreate agent instances per batch) is the right one; the alternative is a subprocess-per-task model, which adds overhead but gives you a clean slate from the OS.

One non-obvious risk with the Critic-reject loop at scale: if your Critic is too conservative, it creates a feedback signal that biases the Producer toward safe, generic outputs over time, because those pass more reliably. You may want to track your accept rate per topic category and tune the Critic threshold separately for domains where generality is actually fine vs. domains where precision matters.
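For anyone who wants the shape of that recreate-per-batch fix, here is a minimal sketch; `ProducerAgent`, `run_batch`, and the batch size are placeholder names and values, not the post's actual code:

```python
import gc

class ProducerAgent:
    """Stand-in for a framework agent; real frameworks keep history/memory on the instance."""
    def __init__(self):
        self.history = []            # per-instance state that grows while the instance lives

    def run(self, task: str) -> str:
        self.history.append(task)    # stays referenced as long as the orchestrator holds the agent
        return f"entry for: {task}"

def run_batch(tasks: list[str]) -> list[str]:
    agent = ProducerAgent()          # fresh instance, so no state carried over from earlier batches
    results = [agent.run(t) for t in tasks]
    del agent                        # drop the only live reference to the accumulated state
    gc.collect()                     # now the GC can actually reclaim it
    return results

tasks = [f"topic {i}" for i in range(1000)]
dataset = []
for start in range(0, len(tasks), 50):      # recreate the agent every 50 tasks
    dataset.extend(run_batch(tasks[start:start + 50]))
```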
Exactly! That's right, and that's how we've corrected it...
1,065 entries over 72 hours is a useful data point for anyone planning CrewAI + local model runs. The bit I'd love to see quantified: what percentage passed your quality filter? In my experience, unsupervised generation with small local models hits a long tail of near-duplicate entries past ~200 unless you add semantic dedup in the loop. A trick that helped me: embed each generated entry as you go and reject anything with cosine similarity >0.9 to existing entries. Kills the duplicate spiral and keeps the distribution wider.
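Roughly what that in-loop dedup check looks like; sentence-transformers and the MiniLM model are just one example choice here, not anything from the post:

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # any embedder works; this is one option

model = SentenceTransformer("all-MiniLM-L6-v2")
kept_entries: list[str] = []
kept_embeddings: list[np.ndarray] = []

def accept(entry: str, threshold: float = 0.9) -> bool:
    """Reject anything with cosine similarity above the threshold to an already-accepted entry."""
    emb = model.encode(entry, normalize_embeddings=True)   # unit vectors, so dot product == cosine
    if kept_embeddings and float(np.max(np.stack(kept_embeddings) @ emb)) > threshold:
        return False
    kept_entries.append(entry)
    kept_embeddings.append(emb)
    return True
```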
Thank you so much!! I'm taking note of your tip...
The 72-hour continuous run is the real flex here. One thing I kept hitting with similar long-running agent setups is quality drift after the first 24h. Curious whether you saw variance in the dataset entries as the run extended.
Hi! I've continued running the system and the data quality has remained consistent over time...
Good data point — consistent quality past 24h usually means your state management is solid. The drift pattern I've seen kicks in when context accumulates without pruning. What's your checkpoint frequency looking like?
72 hours of local multi-agent generation is a solid stress test. One pattern that tends to emerge in long autonomous runs: output diversity collapses over time because agents start reinforcing each other's patterns. Each agent's output becomes training signal for the next, creating a subtle feedback loop. A practical mitigation is rotating the "seed context" every N iterations — swap in different system prompts, vary temperature between 0.7-1.2 across agents, or inject deliberate constraint changes ("now generate examples that contradict the previous batch"). Keeps the distribution from narrowing too fast.
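A minimal sketch of that rotation idea; the prompts and the 50-iteration interval are placeholder values, and only the 0.7-1.2 temperature spread comes from the comment above:

```python
import random

SEED_PROMPTS = [
    "Write concise, factual instruction/response pairs.",
    "Write instruction/response pairs built around unusual edge cases.",
    "Write examples that contradict the style of the previous batch.",
]
ROTATE_EVERY = 50   # iterations between seed-context swaps

def agent_config(iteration: int) -> dict:
    """Swap the system prompt every ROTATE_EVERY iterations and jitter temperature per call."""
    return {
        "system_prompt": SEED_PROMPTS[(iteration // ROTATE_EVERY) % len(SEED_PROMPTS)],
        "temperature": round(random.uniform(0.7, 1.2), 2),   # per-agent variance
    }

for i in (0, 49, 50, 120):
    print(i, agent_config(i))
```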
The three-agent Curator-Evaluator-Generator pipeline is a solid architectural choice — separating selection from generation gives you a natural quality gate before storage (rough sketch of that gate after this comment).
For the 72-hour continuous run, how did you handle context window saturation? As the conversation history grows, do you reset the agent states periodically, or does Ollama handle long-context reasoning effectively at that scale?
Also curious about the evaluation criteria — did you find that the Generator improved in quality over time within a single run, or was each cycle essentially independent?
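A minimal sketch of that selection-before-storage gate, with placeholder generate/evaluate/curate functions standing in for the actual agents:

```python
def generate(topic: str) -> str:
    """Placeholder for the Generator call (e.g. a local model via Ollama)."""
    return f"instruction/response pair about {topic}"

def evaluate(entry: str) -> float:
    """Placeholder Evaluator: return a quality score in [0, 1]."""
    return 0.8 if len(entry) > 20 else 0.2

def curate(entry: str, score: float, dataset: list[str], min_score: float = 0.7) -> None:
    """The gate: only entries the Evaluator scores highly ever reach storage."""
    if score >= min_score:
        dataset.append(entry)

dataset: list[str] = []
for topic in ["regex basics", "sql joins", "unit testing"]:
    entry = generate(topic)
    curate(entry, evaluate(entry), dataset)
```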
72-hour autonomous run is impressive. I've been doing something similar with a persistent agent — the key insight is checkpointing. If your pipeline dies at hour 48, you don't want to restart from zero. Did you build in resume logic, or did it just run clean the whole way through?
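The shape of that resume logic, roughly; the file name, the stop condition (reusing the 1,065 entry count as a stand-in target), and the placeholder generation step are assumptions, not the post's implementation:

```python
import json
import os

CHECKPOINT = "dataset.jsonl"    # hypothetical path
TARGET = 1065                   # stand-in stop condition

def resume_count() -> int:
    """Entries already written by a previous run; 0 on a fresh start."""
    if not os.path.exists(CHECKPOINT):
        return 0
    with open(CHECKPOINT) as f:
        return sum(1 for _ in f)

def append_entry(entry: dict) -> None:
    """Append one JSON line per entry, so a crash loses at most the entry in flight."""
    with open(CHECKPOINT, "a") as f:
        f.write(json.dumps(entry) + "\n")

for i in range(resume_count(), TARGET):
    append_entry({"id": i, "instruction": f"placeholder entry {i}"})   # real pipeline call goes here
```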
The "autonomous while I sleep" framing stuck with me. My own stack has a paper-trading loop running 24/7 and the cost math is wild once you stop paying per call. Different domain, same thesis.
How did you decide when the 72-hour run was "done"? Entry count, time cap, or some quality signal from the Critic telling you to stop?