I stopped writing git commit messages on March 1, 2026.
It sounds lazy. Maybe it is. But I was tired of the cognitive load. You know the feeling. You fix an obscure race condition in the WebSocket handler at 11 PM. Your brain is fried. You type git commit -m "fix stuff" because you just want to go to sleep.
Three months later, you look at the history and have no idea what "stuff" means.
We have all been there. I decided to test if the new generation of local LLMs could do this better than my exhausted self. I didn't use a cloud API. I used a quantized 7B parameter model running locally on my M3 MacBook Pro. Privacy matters, and I didn't want my proprietary code leaking to some training set.
The goal was simple. Automate the commit message generation based on the git diff. Measure the quality. Track the time saved. See if it actually helped or just created more noise.
Here is what happened over 30 days and 412 commits.
The Setup: Local First, Always
I didn't want a complex pipeline. If the tool added more than 5 seconds to my workflow, I would quit.
I used a simple helper script written in Python. It calls llama.cpp to run Qwen2.5-Coder-7B-Instruct. This model is small enough to run fast on consumer hardware but smart enough to understand context.
The prompt was strict. No fluff. No "Here is the commit message." Just the message.
```python
import subprocess
import sys


def get_diff():
    # Only look at staged changes
    return subprocess.check_output(["git", "diff", "--staged"]).decode("utf-8")


def generate_message(diff):
    # Prompt engineering is key here: strict format, no preamble
    prompt = f"""
You are a senior developer. Write a conventional commit message for this diff.
Format: <type>(<scope>): <description>
Keep description under 50 chars.
Do not add any other text.
Diff:
{diff[:2000]}
"""
    # Call the local LLM via the llama.cpp CLI.
    # --no-display-prompt keeps the echoed prompt out of stdout,
    # so stdout contains only the generated message.
    result = subprocess.run(
        ["./llama-cli", "-m", "./models/qwen2.5.q4_k_m.gguf",
         "-p", prompt, "-n", "50", "--no-display-prompt"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    diff = get_diff()
    if len(diff) > 10:
        print(generate_message(diff))
    else:
        # Nothing staged, nothing to do
        sys.exit(0)
```
I aliased this to git ai-commit. The workflow became: git add ., then git ai-commit. The script would output the message, I would review it, and then manually paste it into the final commit command.
I did not let it auto-commit. That felt too dangerous. I wanted to remain the editor, not just the passenger.
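If you want to replicate the alias, a one-liner works. The path and filename below are placeholders; point them at wherever you keep the script:

```bash
# Hypothetical location; adjust to your own copy of the script.
git config --global alias.ai-commit '!python3 ~/scripts/ai_commit.py'
```

The `!` prefix tells git to run the alias as a shell command instead of a git subcommand.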
The First Week: Pure Chaos
The first ten commits were terrible.
The model hallucinated features I hadn't built. It claimed I refactored the database schema when I only changed a variable name in a CSS file. It loved using the word "optimize" for everything.
On March 3, I spent 20 minutes editing a single commit message because the AI insisted I had "enhanced user experience" by fixing a typo in a JSON key.
I almost scrapped the experiment on day four. The friction was higher than just writing the messages myself. I felt like I was babysitting a junior dev who lied about their work.
But first I tweaked the temperature, dropping it from 0.7 to 0.2. I also added few-shot examples to the prompt, showing the model exactly what good Conventional Commits look like:
```
feat(auth): add jwt refresh token rotation
fix(ui): prevent button overflow on mobile
```
Suddenly, it clicked. The model stopped being creative and started being boring. Boring is good for git logs.
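For reference, here is roughly what the revised generate_message looked like. The few-shot wording is reconstructed, not a verbatim copy of my prompt, and it assumes the same script as above; --temp is the llama-cli flag that sets sampling temperature.

```python
import subprocess

# Reconstructed few-shot block; the two examples are the ones shown above.
FEW_SHOT = """Examples of good commit messages:
feat(auth): add jwt refresh token rotation
fix(ui): prevent button overflow on mobile
"""

def generate_message(diff):
    prompt = f"""
You are a senior developer. Write a conventional commit message for this diff.
Format: <type>(<scope>): <description>
Keep description under 50 chars.
Do not add any other text.
{FEW_SHOT}
Diff:
{diff[:2000]}
"""
    result = subprocess.run(
        ["./llama-cli", "-m", "./models/qwen2.5.q4_k_m.gguf",
         "--temp", "0.2",  # down from 0.7: less creative, more consistent
         "-p", prompt, "-n", "50", "--no-display-prompt"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

The low temperature is the real lever here; the few-shot examples mostly stop the model from inventing its own format.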
The Data: Time vs Accuracy
By week three, I had settled into a rhythm. I tracked every commit. I noted whether I accepted the AI's suggestion as-is, edited it, or rewrote it completely.
Here is the breakdown of the 412 commits generated during March 2026.
| Category | Count | Percentage | Avg Edit Time (secs) |
|---|---|---|---|
| Accepted As-Is | 298 | 72.3% | 0 |
| Minor Edits | 84 | 20.4% | 12 |
| Total Rewrite | 30 | 7.3% | 45 |
| Failed/Empty | 0 | 0% | N/A |
The numbers surprised me. I expected to rewrite half of them. Instead, nearly three-quarters were usable without changes.
The "Minor Edits" category was usually just trimming verbosity. The AI loved to say "Update the function to handle null values correctly." I would shorten it to "fix: handle null values in parser."
The "Total Rewrite" cases happened mostly with large diffs. If I changed more than five files, the context window got muddy. The model would pick one change and ignore the others. For those, I reverted to manual writing.
💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.