Simran Shaikh

Posted on May 13

I Built an Offline AI Compliance Auditor for Indian Factories Using Gemma 4 — Here's Everything I Learned

#gemma4 #googleai #localai #machinelearning

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I am a final-year Computer Science student in Vadodara, India, currently doing an AI/ML internship at a manufacturing company. Before Gemma 4, every AI project I built had the same dirty secret: the actually interesting part happened on someone else's server.

My casting defect inspection system? ResNet-50 runs locally, but the moment I wanted a natural language explanation of the defect — cloud. The WhatsApp order management dashboard I built for kirana stores? Logic local, language model — cloud. Every project had this dependency I could never quite shake, because no open model was good enough at the things that mattered: reasoning, vision, regional languages, and running on real hardware.

Then Gemma 4 landed. I spent three weeks building AuditMind — a fully offline compliance auditor for Indian SME manufacturers that interviews factory managers in Hindi, reads scanned documents using vision, checks findings against Indian law, and generates a bilingual audit report. Zero cloud. Zero API keys. Zero data leaving the factory.

This article is everything I learned: how to set it up, which model to pick, how the technical capabilities actually work, and why I think this changes something real about who gets to use AI.

Part 1: The Model Family — Which Gemma 4 Is Right for You?

Before you write a single line of code, you need to pick the right model. Gemma 4 is not one model — it is four models across three fundamentally different architectures. Getting this wrong wastes hours.

┌─────────────────────────────────────────────────────────────────────────┐
│                        GEMMA 4 MODEL FAMILY                            │
├──────────────┬────────────┬──────────────┬──────────────────────────── ┤
│   Model      │  Params    │   VRAM       │  Best for                   │
├──────────────┼────────────┼──────────────┼─────────────────────────────┤
│  E2B (edge)  │  2B eff.   │  ~1.5 GB     │  Raspberry Pi, Android,     │
│              │            │              │  iPhone, browser (WebGPU)   │
├──────────────┼────────────┼──────────────┼─────────────────────────────┤
│  E4B (edge)  │  4B eff.   │  ~3.3 GB     │  Laptop CPU, Chrome tab,    │
│              │            │              │  offline mobile apps        │
├──────────────┼────────────┼──────────────┼─────────────────────────────┤
│  31B Dense   │  31B       │  ~20 GB (Q4) │  Best quality, fine-tuning  │
│              │            │              │  single H100 (bfloat16)     │
├──────────────┼────────────┼──────────────┼─────────────────────────────┤
│  26B MoE     │  26B total │  ~15 GB (Q4) │  High-throughput server,    │
│   ↳ 4B active│  4B/pass   │              │  agentic pipelines, speed   │
└──────────────┴────────────┴──────────────┴─────────────────────────────┘

The Decision Framework

Use E2B / E4B if:

Your app runs on a phone, Raspberry Pi, or in a browser
You have zero internet (field deployments, remote areas)
You need audio or video processing on-device — these are the only open models that do this at this size
Battery life and latency matter more than peak quality

Use 31B Dense if:

You are fine-tuning for a domain-specific task (the dense architecture trains more cleanly with LoRA)
You need the highest possible quality per inference
You have a single large GPU (80GB H100 runs it unquantized)

Use 26B MoE if:

You are building a production service or agentic pipeline
You need speed — the MoE activates only 4B parameters per forward pass, so it is dramatically faster than a 31B dense model while delivering near-equivalent quality
You want the best quality/cost ratio for high-throughput workloads

For my AuditMind project, I chose the 26B MoE because it needs to run 4 agents sequentially — interviewer, document reader, compliance checker, report writer — and the MoE's inference speed means the whole pipeline completes in minutes, not half an hour. The 256K context window on the MoE also means I can feed the entire audit context (all interview answers + all document extractions) into the report writer in a single call.

Part 2: Local Setup — From Zero to Running in 10 Minutes

The fastest path to running Gemma 4 locally is Ollama. Here is the exact process.

Step 1: Install Ollama

# Linux / Mac
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download the installer from ollama.com
# Then add to PATH and restart terminal

Verify it works:

ollama --version
# ollama version 0.x.x

Step 2: Pull Your Model

# Pick ONE based on your hardware:

ollama pull gemma4:2b        # Raspberry Pi, very low RAM — ~1.5 GB download
ollama pull gemma4:4b        # Laptop, 8 GB RAM — ~3.3 GB download  ← start here
ollama pull gemma4:27b-moe   # Server, 16+ GB VRAM — ~15 GB download
ollama pull gemma4:31b       # Maximum quality, 20+ GB VRAM

Tip for Indian developers: If your internet is slow, pull gemma4:4b first. It's genuinely impressive for its size and you can always upgrade later. The quality jump from 4B to MoE is real but the 4B handles most tasks well.

Step 3: First Run — Text Generation

# Interactive chat
ollama run gemma4:4b

# Single prompt
ollama run gemma4:4b "Explain EPF compliance for Indian manufacturers in 3 bullet points"

Step 4: Python Integration

# pip install requests
import requests

def ask_gemma(prompt: str, model: str = "gemma4:4b") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.2, "num_predict": 1024}
        },
        timeout=120
    )
    response.raise_for_status()
    return response.json()["response"].strip()

# Test it
print(ask_gemma("What is the penalty for late GST filing in India?"))

Step 5: Vision (Multimodal) — This is Where It Gets Interesting

import base64

def ask_gemma_vision(prompt: str, image_path: str, model: str = "gemma4:4b") -> str:
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "images": [img_b64],
            "stream": False,
            "options": {"temperature": 0.1}
        },
        timeout=120
    )
    return response.json()["response"].strip()

# Read a scanned GST invoice
result = ask_gemma_vision(
    prompt=(
        "This is an Indian GST invoice. Extract the following as JSON: "
        "invoice_number, date, supplier_gstin, recipient_gstin, "
        "hsn_codes (list), taxable_value, cgst, sgst, igst, total_amount. "
        "Return ONLY valid JSON."
    ),
    image_path="invoice.jpg"
)
print(result)
# {"invoice_number": "INV-2024-1042", "supplier_gstin": "24AABCU9603R1ZX", ...}

I use this exact pattern in AuditMind's Doc Reader agent. It processes scanned wage registers, factory licences, fire NOCs — documents that are often handwritten, printed in multiple fonts, and sometimes in Hindi — and extracts structured JSON reliably.

Part 3: Deep Technical Breakdown — The Capabilities That Actually Matter

Capability 1: Multimodal Input — Not Just Images

Most "multimodal" models mean images. Gemma 4 means something broader.

                    GEMMA 4 MULTIMODAL INPUT STACK

  ┌──────────────────────────────────────────────────────┐
  │  INPUT MODALITIES                                    │
  │                                                      │
  │   📝 Text  ──────────────────────┐                  │
  │   🖼️  Image (variable res.) ──────┤                  │
  │   🎬 Video frames ───────────────┼──► Gemma 4 Core  │
  │   🎙️  Audio* ─────────────────────┤                  │
  │                                  │                  │
  │  * E2B and E4B natively          │                  │
  │    Larger models via tools       │                  │
  └──────────────────────────────────────────────────────┘

Why variable-resolution images matter: Most vision models resize everything to 224×224 or 448×448. When you resize a scanned A4 invoice to 224px, you lose the text. Gemma 4 processes images at their native resolution, which is why it can actually read handwritten text, printed tables, and small-font documents.

I tested this directly: I scanned a handwritten factory wage register (messy, in Hindi, small text) and fed it to both Gemma 4 and a fixed-resolution vision model. Gemma 4 extracted all the numbers correctly. The fixed-res model hallucinated half of them.

Audio on E2B/E4B: This is the capability that no other open model at this size has. The edge models can process audio input natively. For AuditMind, this means the interviewer agent could theoretically receive a worker's voice note and process it directly — without a separate Whisper STT step. I still use Whisper because I need more control over Indian language dialects, but the native audio capability means you can build a complete voice pipeline with just Gemma 4.

Capability 2: 128K / 256K Context Window

HOW MUCH FITS IN 256K TOKENS?
────────────────────────────────────────────────────────
A typical novel (80,000 words)          →  ~107K tokens
A full year of GST returns              →  ~40K tokens
An entire Python codebase (medium)      →  ~60K tokens
50 research papers (abstracts only)     →  ~25K tokens
A complete SME audit (all documents)    →  ~80K tokens  ← fits in one call
────────────────────────────────────────────────────────

For my use case, 256K context means I can run this agent call in the Report Writer:

# This is the actual Report Writer agent call in AuditMind.
# We feed EVERYTHING into one call — Gemma 4 synthesizes it all.

def generate_report(ctx: AuditContext, gemma: GemmaClient) -> str:

    # Build the full context string
    interview_text = "\n".join(
        f"Q: {qa['q']}\nA: {qa['a']}" 
        for qa in ctx.interview_qa
    )

    documents_text = "\n".join(
        f"[{doc.doc_type.upper()}]\n{doc.extracted_text}"
        for doc in ctx.documents
    )

    findings_text = "\n".join(
        f"- {f.rule_id}: {f.status.upper()} | {f.severity} severity | "
        f"Penalty: {f.penalty_estimate}\n  Evidence: {f.evidence}"
        for f in ctx.findings
    )

    prompt = f"""
You are a senior compliance auditor preparing an official audit report.

COMPANY: {ctx.company_name}
AUDIT TYPE: {ctx.audit_type}

=== INTERVIEW RESPONSES ===
{interview_text}

=== DOCUMENTS REVIEWED ===
{documents_text}

=== COMPLIANCE FINDINGS ===
{findings_text}

Generate a complete bilingual audit report (English headings, Hindi explanations).
Include: Executive Summary, Finding-by-Finding Analysis, Penalty Estimates in INR,
Priority Recommendations, and a Compliance Score (0-100).
"""

    # With 256K context, this entire prompt fits — no chunking, no summarization.
    # The model sees the complete picture and generates a coherent report.
    return gemma.ask(prompt, max_tokens=4096)

Without a large context window, I would have needed to run multiple summarization passes and lose detail. With 256K, the model sees the complete audit picture and the report is genuinely better.

Capability 3: Reasoning Mode

Gemma 4 supports configurable thinking mode — internal chain-of-thought reasoning before the final answer.

WITHOUT REASONING MODE:
  Prompt → [immediate answer]

WITH REASONING MODE:
  Prompt → [think: let me consider...] → [think: actually...] → [Final Answer]

In practice, this matters for compliance checking. When I ask Gemma 4 to assess whether a company's EPF practices are compliant, I want it to reason through the evidence before giving a verdict. Without thinking mode, it sometimes gives confident wrong answers. With thinking mode, it shows its work and the compliance classifications are significantly more accurate.

To enable it in Ollama:

# Enable reasoning mode via system prompt
# (Gemma 4 thinking mode activation varies by implementation — 
#  check the official Ollama docs for your version)

system_prompt = """
You are a compliance expert. Before giving any compliance verdict,
think through the evidence carefully. Consider edge cases and ambiguities.
Only then state your final classification.
"""

result = ask_gemma(
    prompt="Is this EPF practice compliant? [evidence here]",
    system=system_prompt  # add system param to your ask_gemma function
)

Capability 4: 140+ Languages — What This Actually Means

I want to be specific about this because "140 language support" sounds like marketing until you see what it means in practice.

Most multilingual models are trained primarily on English and have surface-level support for other languages — they can translate, but they cannot reason well in those languages. Gemma 4 is different because it was trained with genuine multilingual depth.

I tested this by asking compliance questions in Hindi, Gujarati, and Marathi without any English in the prompt. The responses were grammatically correct, contextually appropriate, and the reasoning quality was comparable to English. This is not translation — it is thinking in the language.

For the 600 million Indians who use AI in a non-English language, and for the billions of non-English speakers globally, this is the difference between AI being useful and AI being inaccessible.

Part 4: Integrating Gemma 4 into a Real Project — AuditMind

Here is the actual multi-agent architecture I built:

                        AUDITM IND — ON-PREMISE PIPELINE

  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
  │ Voice Input │   │  Documents  │   │ Text Query  │
  │ (Hindi/Gu.) │   │ (Scans/PDF) │   │             │
  └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
         │                 │                  │
         └─────────────────┼──────────────────┘
                           ▼
              ┌─────────────────────────┐
              │       ORCHESTRATOR      │
              │  Gemma 4 26B MoE        │
              │  Routes · State mgmt    │
              └────────────┬────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼                ▼
  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
  │ Interviewer  │ │  Doc Reader  │ │  Compliance  │ │Report Writer │
  │ Agent        │ │  Agent       │ │  Agent       │ │ Agent        │
  │ Whisper STT  │ │ Gemma Vision │ │ ChromaDB RAG │ │ 256K context │
  │ gTTS output  │ │ JSON extract │ │ Rule match   │ │ PDF + voice  │
  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │     AUDIT REPORT        │
              │ PDF · Voice · Findings  │
              └─────────────────────────┘

  ████ ENTIRE SYSTEM RUNS OFFLINE. ZERO DATA LEAVES THE FACTORY. ████

The full compliance checker — the part that queries ChromaDB for relevant Indian laws and asks Gemma 4 to classify compliance — looks like this:

# agents/compliance.py

from tools.vector_db import search_rules
from agents.base import BaseAgent, AuditContext, ComplianceFinding

CLASSIFICATION_SYSTEM = """
You are a senior Indian compliance auditor with expertise in GST law,
the Factories Act 1948, the Minimum Wages Act, EPF Act, and ESIC Act.

Given evidence from a factory audit, classify the compliance status as:
- compliant: fully meets the legal requirement
- violation: clearly does not meet the requirement
- partial: partially meets the requirement  
- not_checked: insufficient evidence to determine

Always respond in JSON: {"status": "...", "evidence_summary": "...", "recommendation": "..."}
"""

class ComplianceAgent(BaseAgent):
    name = "compliance"

    def run(self, ctx: AuditContext) -> AuditContext:
        # Build evidence from interview + documents
        evidence = self._build_evidence_summary(ctx)

        # Get all relevant rules via semantic search (RAG)
        all_findings = []
        for category in ["gst", "labour", "safety"]:
            rules = search_rules(
                query=f"{ctx.company_name} {category} compliance status",
                category=category,
                n=6
            )

            for rule in rules:
                # Ask Gemma 4 to classify this specific rule
                result = self.gemma.ask_json(
                    prompt=(
                        f"RULE: {rule['text']}\n\n"
                        f"EVIDENCE FROM AUDIT:\n{evidence}\n\n"
                        f"Classify compliance with this rule."
                    ),
                    system=CLASSIFICATION_SYSTEM
                )

                finding = ComplianceFinding(
                    rule_id=rule["id"],
                    rule_name=rule["text"].split(":")[0],
                    category=category,
                    status=result.get("status", "not_checked"),
                    evidence=result.get("evidence_summary", ""),
                    severity=rule["metadata"]["severity"],
                    penalty_estimate=rule["metadata"]["penalty"],
                    recommendation=result.get("recommendation", ""),
                )
                all_findings.append(finding)

        ctx.findings = all_findings
        ctx.overall_risk = self._calculate_overall_risk(all_findings)
        return ctx

This works. I ran it on mock audit data for a 50-person Vadodara-based casting manufacturer and it correctly identified 8 compliance issues including a missing fire NOC, underpayment of EPF contributions, and invoices without HSN codes — all from scanned documents and voice interview responses.

Part 5: Fine-Tuning Gemma 4 for a Specific Domain

The base Gemma 4 models are excellent, but if you want the model to deeply understand a specific domain — say, Indian GST regulations, medical diagnosis, or legal contracts — fine-tuning is the path.

LoRA Fine-Tuning with Unsloth (Recommended)

Unsloth makes LoRA fine-tuning 2–5x faster than vanilla HuggingFace and uses ~40% less memory.

pip install unsloth transformers trl datasets

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# Step 1: Load Gemma 4 with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-4b-it",  # use 4b for fine-tuning on consumer GPU
    max_seq_length=4096,
    dtype=None,          # auto-detect
    load_in_4bit=True,   # QLoRA — fits in 12 GB VRAM
)

# Step 2: Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank — higher = more capacity but more memory
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Step 3: Prepare your dataset
# Format: {"instruction": "...", "input": "...", "output": "..."}
train_data = [
    {
        "instruction": "Classify the GST compliance status of this invoice.",
        "input": "Invoice without HSN code. Supplier GSTIN present. No IRN.",
        "output": "VIOLATION: Missing HSN code (mandatory under GST Rule 46). "
                  "Missing IRN (mandatory for turnover >5 crore). "
                  "Penalty: ₹25,000 per incorrect invoice."
    },
    # ... hundreds more examples
]

def format_for_gemma(sample):
    return {
        "text": (
            f"<start_of_turn>user\n"
            f"{sample['instruction']}\n\n{sample['input']}"
            f"<end_of_turn>\n"
            f"<start_of_turn>model\n"
            f"{sample['output']}"
            f"<end_of_turn>"
        )
    }

dataset = Dataset.from_list(train_data).map(format_for_gemma)

# Step 4: Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        output_dir="./gemma4-gst-expert",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()

# Step 5: Save and convert to Ollama GGUF
model.save_pretrained_gguf("gemma4-gst-expert", tokenizer, quantization_method="q4_k_m")

After training, you can convert to GGUF and run it directly in Ollama:

# Create Modelfile
echo 'FROM ./gemma4-gst-expert-Q4_K_M.gguf' > Modelfile
echo 'SYSTEM "You are a GST compliance expert for Indian SMEs."' >> Modelfile
ollama create gemma4-gst-expert -f Modelfile
ollama run gemma4-gst-expert

Part 6: What Does This Actually Mean?

I want to step back from the technical details for a moment, because I think what Gemma 4 represents is genuinely important and I have been thinking about it a lot.

I am an AI developer in India. The tools I build are not for Silicon Valley startups. They are for kirana store owners who manage inventory on WhatsApp, for factory managers who need compliance help but cannot afford consultants, for ASHA health workers in rural areas who have basic Android phones and no internet.

Every time I have tried to build something genuinely useful for these users, I have hit the same wall: the AI that is good enough to help them requires cloud connectivity, which they often do not have. It requires English, which they may not speak fluently. It costs API money that my users definitely do not have.

Gemma 4 does not solve all of this. But it moves the boundary.

The E2B model running on a Raspberry Pi with native audio input and 140-language support is not a benchmark number. It is a doctor's office in rural Rajasthan that can get a medical pre-assessment without sending patient data to a foreign cloud. It is a construction worker in Mumbai who can photograph their wage slip and hear their rights read back to them in Marathi. It is a small manufacturer in Surat who can run a compliance audit without hiring a consultant or trusting their financial documents to someone else's server.

The combination of properties that makes this possible — offline-capable, multilingual, multimodal, genuinely open (Apache 2.0) — is not a coincidence. It reflects a design philosophy that says edge devices and non-English speakers are first-class citizens of the AI ecosystem.

I have spent three weeks building with Gemma 4. The capabilities are real. The performance at small model sizes is genuinely surprising. The multilingual reasoning quality in Hindi and Gujarati is better than I expected.

There is still a gap between what Gemma 4 can do and what the large proprietary models can do. The gap is real and it matters for some tasks. But for the applications that matter most to the users I am building for — private, offline, regional language, real hardware — Gemma 4 is the first open model that makes those applications actually possible.

That is worth paying attention to.

Quick Reference: Setting Up AuditMind

If you want to run the project I have been describing:

# 1. Clone / set up project
git clone https://github.com/SimranShaikh20/auditm-ind
cd auditm-ind

# 2. Install dependencies
pip install -r requirements.txt

# 3. Pull Gemma 4
ollama pull gemma4:27b-moe   # or gemma4:4b for laptops

# 4. Create a documents/ folder and add your scanned files
mkdir documents
# Copy your invoices, wage registers, licences to documents/

# 5. Launch
streamlit run app.py

The system will interview you, read your documents, check compliance against 20+ Indian laws, and produce a PDF audit report — all without an internet connection.

Summary

Topic	Key Takeaway
Which model	E2B/E4B for offline/edge, MoE for speed+quality, 31B for fine-tuning
Local setup	Ollama pull + run, 10 minutes, no cloud
Multimodal	Variable-res images, native audio on E-series, not a bolt-on
Context	256K tokens = entire audit/codebase/year of documents in one call
Reasoning	Thinking mode dramatically improves multi-step compliance/logic tasks
Languages	Genuine Hindi/Gujarati/Tamil reasoning, not just translation
Fine-tuning	LoRA on 4B fits in 12 GB VRAM with Unsloth in hours
License	Apache 2.0 — commercially usable, modifiable, no permission needed

Resources

📦 Gemma 4 on Hugging Face
📊 Gemma 4 on Kaggle
🖥️ Ollama — easiest local setup
🎓 Google AI Studio — free API, no download
⚡ Unsloth — fast LoRA fine-tuning
📝 Official Gemma docs

All code in this article is from my AuditMind project, tested on a Windows 11 machine with a single RTX 4070. The compliance rules described are for educational purposes — consult a professional for actual legal advice.

If this was useful, I would genuinely love to hear what you are building with Gemma 4 — especially if you are working on problems outside the English-speaking world.

DEV Community