I'm building PathForge AI — a career guidance platform for Indian students. The pitch is simple: AI-powered counselling for students who can't afford a human counsellor. The engineering problem underneath is not simple at all.
When Gemma 4 dropped in April 2026, I had a decision to make. The family ships four models. I had two obvious candidates:
- E4B (~4.5B effective params): runs locally on a mid-range phone, free, completely private
- 31B Dense: server-side via API, costs real money per query, much slower
The conventional wisdom was clear: small model for quick tasks, big model for complex reasoning, route intelligently. Done.
Except I didn't trust the conventional wisdom. So I ran 50 real queries through both models — actual queries from PathForge AI's private beta — and measured everything: output quality, schema compliance, latency, and cost per query.
The results were not what I expected.
The Setup
What I was testing: Career guidance queries from real Indian students (anonymised). Not clean test prompts. Messy, code-switched, emotionally loaded, often under-specified — exactly the way real users type.
Query categories (50 total):
| Category | Count | Example |
|---|---|---|
| Simple eligibility check | 15 | "Can I apply for NSP if family income is 2.8L?" |
| Single-path career question | 15 | "PCB student, 78%, interested in AI field, what options?" |
| Multi-constraint planning | 12 | "JEE rank 52000, budget 4L/year, prefer Karnataka, open to abroad if full scholarship" |
| Ambiguous / emotional | 8 | "parents want CA but I want game dev, marks average, what should I do honestly" |
Scoring rubric (blind, three evaluators, averaged):
| Dimension | Max |
|---|---|
| Constraint compliance — did it actually honour all stated constraints? | 3 |
| Schema fidelity — valid parseable JSON matching our output spec? | 2 |
| Practical accuracy — is the career/institution advice actually correct? | 3 |
| Tone — does it read like a counsellor, not a Wikipedia article? | 2 |
| Total | 10 |
Infrastructure:
- E4B: Q4_K_M quantised GGUF, llama.cpp, laptop (16GB RAM, no dedicated GPU). Simulating a real developer machine serving requests.
- 31B Dense: Gemma 4 31B endpoint via Gemini API. Server-side, billed per token. Both serving paths are sketched in code below.
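For concreteness, here is roughly how the two paths were wired up. A minimal sketch, not the production code: model paths and the API model id are placeholders, and it assumes llama-cpp-python for the local side and the google-generativeai client for the API side.

```python
# Local path: E4B GGUF served through llama-cpp-python (llama.cpp bindings).
from llama_cpp import Llama

e4b = Llama(
    model_path="models/gemma-4-e4b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
)

def ask_e4b(prompt: str) -> str:
    out = e4b.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature: we want parseable, repeatable JSON
    )
    return out["choices"][0]["message"]["content"]

# API path: 31B Dense via the Gemini API client.
import google.generativeai as genai

genai.configure(api_key="...")  # read from env in practice
dense_31b = genai.GenerativeModel("gemma-4-31b")  # placeholder model id

def ask_31b(prompt: str) -> str:
    return dense_31b.generate_content(prompt).text
```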
The Numbers
Headline results first. Details follow.
Overall Score (out of 10)
| Category | E4B | 31B Dense | Winner |
|---|---|---|---|
| Simple eligibility | 8.7 | 8.4 | E4B |
| Single-path career | 7.2 | 8.9 | 31B |
| Multi-constraint planning | 5.1 | 9.1 | 31B |
| Ambiguous / emotional | 8.1 | 7.6 | E4B |
| Overall average | 7.3 | 8.5 | 31B |
Latency
| Model | Median | P95 |
|---|---|---|
| E4B (local, no GPU) | 3.1s | 6.8s |
| 31B Dense (API) | 9.4s | 17.2s |
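For reproducibility: latency here is wall-clock time per full response, and a P95 estimated from 50 samples is noisy, so treat it as indicative rather than precise. A sketch of the measurement, reusing the hypothetical ask_* wrappers from the serving sketch above:

```python
import statistics
import time

def time_queries(ask, queries):
    """Wall-clock latency per full response; returns (median, p95) in seconds."""
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        ask(q)
        latencies.append(time.perf_counter() - t0)
    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile cut point
    return p50, p95
```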
Cost per Query
| Model | Cost |
|---|---|
| E4B local | ₹0 |
| 31B via Gemini API | ~₹0.13 (~$0.0015 USD) |
At projected 50,000 queries/month, running 31B for everything = ₹6,500/month. E4B for everything = essentially ₹0.
Sounds like an obvious choice. Here's why it isn't.
Category 1: Simple Eligibility — E4B Wins
Expected result. Give both models a bounded factual question and the smaller one handles it fine. What I didn't expect was how E4B won.
Query:
"my family income is 2.8 lakhs, i'm in 11th, can i get NSP scholarship? SC category"
E4B output (9/10):
```json
{
  "eligible": true,
  "scheme": "NSP Post-Matric Scholarship",
  "income_cutoff_met": true,
  "category_benefit": "SC category qualifies for higher scholarship amount",
  "next_step": "Register on scholarships.gov.in after Class 12 results",
  "caution": "Student is in Class 11 — Post-Matric applies from Class 12 onwards. Apply in first month of Class 12 admission.",
  "confidence": "high"
}
```
That caution flag — catching that the student is currently in Class 11 so the application timing is wrong — wasn't prompted for. E4B inferred it from the grade level stated in the query. Proactive, correct, and actually useful.
31B output (8/10): Correct, but added an unrequested "general_advice" field outside our schema, requiring stripping in post-processing. Small thing. At 50,000 queries/month it's not small.
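The stripping itself is cheap to automate. A minimal sketch of the validation layer, assuming Pydantic v2 and a hypothetical EligibilityResult schema mirroring the fields above; extra="ignore" silently drops unrequested fields like that "general_advice", and a failed parse is what triggers the retry/escalation in the routing logic later in this post.

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class EligibilityResult(BaseModel):
    # extra="ignore" drops fields outside the spec, e.g. 31B's "general_advice".
    model_config = ConfigDict(extra="ignore")
    eligible: bool
    scheme: str
    income_cutoff_met: bool
    category_benefit: str
    next_step: str
    caution: str | None = None
    confidence: str

def parse_eligibility(raw: str) -> EligibilityResult | None:
    try:
        return EligibilityResult.model_validate_json(raw)
    except ValidationError:
        return None  # caller decides: retry locally or escalate to 31B
```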
Takeaway: E4B is faster, free, and more schema-compliant on simple queries. Route these locally.
Category 2: Single-Path Career Questions — 31B Wins Clearly
Query:
"PCB student, 78% in boards, interested in AI and biotech both, family income 6L, can spend upto 8L for full course, which way should I go?"
E4B output (6/10):
Identified Bioinformatics correctly as the overlap zone. Listed three institution options. One problem: it recommended a private university in Pune at ₹3.2L/year — which is ₹12.8L for four years. Over the stated ₹8L total budget. E4B either didn't do the multiplication or lost the budget constraint between the reasoning step and the institution selection step.
31B output (9/10):
```json
{
  "recommended_path": "B.Sc. Bioinformatics → M.Tech. Computational Biology",
  "rationale": "Sits at the intersection of PCB background and AI interest. Growing field with government push under BioE3 Policy 2023.",
  "institutions": [
    {
      "name": "University of Hyderabad",
      "type": "Central University",
      "annual_fee": 18000,
      "total_cost_estimate": 72000,
      "budget_check": "PASS — well within ₹8L total",
      "entrance": "CUET-UG"
    },
    {
      "name": "Amrita Vishwa Vidyapeetham",
      "type": "Private (NAAC A++)",
      "annual_fee": 185000,
      "total_cost_estimate": 740000,
      "budget_check": "PASS — within ₹8L total",
      "entrance": "AEEE / Direct admission"
    }
  ],
  "scholarship_flags": [
    "CSIR-UGC JRF eligible post-graduation",
    "DST INSPIRE — check if boards percentage qualifies for top 1% state cutoff"
  ],
  "confidence": "high"
}
```
The 31B computed total cost (annual fee × 4), checked it against the budget, and labelled each result PASS/FAIL without being asked. It also flagged DST INSPIRE proactively — exactly the counselling behaviour that makes the difference between a generic AI answer and a useful one.
The pattern: E4B loses the thread of a constraint when it has to maintain it across multiple reasoning steps inside a single output. 31B doesn't.
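That pattern suggests an obvious guardrail: never trust the model's arithmetic on a hard constraint. A minimal sketch, using the field names from the 31B output above; it recomputes totals deterministically and overwrites whatever PASS/FAIL the model emitted.

```python
def enforce_budget(institutions: list[dict], budget_total: int, years: int = 4) -> list[dict]:
    """Recompute total cost and the budget verdict outside the model."""
    for inst in institutions:
        total = inst["annual_fee"] * years
        inst["total_cost_estimate"] = total
        inst["budget_check"] = "PASS" if total <= budget_total else "FAIL"
    # Drop anything recommended over budget, E4B's ₹12.8L Pune suggestion included.
    return [inst for inst in institutions if inst["budget_check"] == "PASS"]
```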
Category 3: Multi-Constraint Planning — Biggest Gap (E4B: 5.1 vs 31B: 9.1)
Query:
"JEE mains rank around 52000, family income 3.8 lakhs, want to stay in south india preferably karnataka, open to abroad only if full scholarship no loan, interested in CS or ECE, parents want govt job security, I want startup eventually"
Six simultaneous constraints, including one internal tension (parents vs student). This is representative of a real complex session.
E4B output (4/10):
Listed NIT Surathkal as a viable option. NIT Surathkal's CS closing rank in 2025 was approximately 8,500; the student's rank is 52,000. That's not close. The model produced a confident recommendation that was wrong on the one number that determines whether the advice works at all.
It also listed two German universities as abroad options (correct — near-zero tuition) but added: "loan may be required for living costs" — directly contradicting the "no loan" constraint stated in the query.
A student acting on E4B's output for this query would apply to a college they have near-zero chance of getting into, and might take a loan they said they couldn't take.
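The same lesson as the budget check applies here, with higher stakes. Closing ranks are lookup-able facts and shouldn't come from model weights at all. A minimal sketch of a deterministic filter; the cutoff table is hypothetical and would be sourced from JoSAA counselling records in practice.

```python
# Hypothetical cutoff table; in practice, loaded from JoSAA counselling records.
CLOSING_RANKS = {
    "NIT Surathkal CS": 8500,
    "NIT Goa CS": 58000,
    "NIT Puducherry ECE": 62000,
}

def plausible_colleges(student_rank: int, recommended: list[str], margin: float = 1.1) -> list[str]:
    """Keep only options whose last closing rank is at or beyond the student's
    rank (with a small margin for year-to-year drift). Colleges missing from
    the table are dropped rather than trusted."""
    return [
        name for name in recommended
        if name in CLOSING_RANKS and student_rank <= CLOSING_RANKS[name] * margin
    ]
```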
31B output (9/10):
```json
{
  "jee_rank_analysis": {
    "rank": 52000,
    "nit_flagship_verdict": "UNLIKELY — NIT Surathkal/Trichy/Warangal CS/ECE cutoffs are 5,000–15,000. Do not apply without checking current year cutoffs.",
    "viable_nits": [
      "NIT Goa CS (~45,000–58,000)",
      "NIT Puducherry ECE (~50,000–62,000)"
    ],
    "state_alternative": "Karnataka KCET is entirely separate from JEE — strong option with equivalent state rank. Check KCET rank via conversion tool."
  },
  "abroad_assessment": {
    "constraint": "Full scholarship, zero loan — hard limit",
    "germany": "DAAD scholarships exist and cover living costs — viable but competitive. Include only if student is top academic performer overall.",
    "excluded_with_reason": {
      "USA": "Full cost-of-attendance scholarship near-impossible at this academic profile",
      "UK": "Tuition fees too high to cover without loan given income level"
    }
  },
  "parent_student_tension": {
    "observation": "Government security preference and startup aspiration are reconcilable — not opposing paths.",
    "framing": "DRDO, C-DAC, and ISRO R&D tracks increasingly resemble startup environments. Bengaluru government R&D → startup transition is a well-documented career arc. CS + ECE builds infrastructure for both.",
    "recommended_narrative": "Frame to parents as government R&D with startup optionality, not startup instead of stability."
  }
}
```
The tension-handling section got a unanimous "this is exactly right" from all three evaluators. It didn't pick the student's side or the parents' side. It reframed the conflict as a phased path. That's what a good counsellor does. E4B didn't attempt to address the tension at all.
Category 4: Ambiguous / Emotional Queries — E4B Wins Unexpectedly
This was the genuine surprise.
Query:
"parents want CA but I want game dev, marks are average, what should I do honestly"
I expected 31B to win here because nuance requires capacity. It didn't.
E4B output (8/10):
Short. Direct. No hedging. Acknowledged the conflict in one sentence, gave a concrete middle-ground (BBA + game dev certification track), named two Indian studios that hire from non-CS backgrounds (Nodding Heads, Rockstar India Pune), and closed with:
"Both paths are real. The question is which regret you can live with more."
31B output (7/10):
Structurally excellent. Balanced. Full of caveats. Longer. One evaluator wrote: "technically correct, emotionally inert."
This is a real pattern: large dense models over-optimise for completeness and under-optimise for voice on short emotional queries. E4B, with its smaller output budget, was forced to be direct. The directness worked. For a stressed 17-year-old reading this at midnight, "technically correct, emotionally inert" is a failure mode that matters.
The Routing Logic I'm Now Running
QUERY ROUTING — PathForge AI v2

```
if query_type == "eligibility_check":
    → E4B local            # Fast, free, more schema-compliant
elif query_type in ["emotional", "ambiguous"]:
    → E4B local            # Brevity is a feature, not a limitation
elif query_type == "single_path" and constraint_count <= 3:
    → E4B local            # Handles 80% correctly; retry on parse error → 31B
elif query_type == "multi_constraint" or constraint_count > 3:
    → 31B Dense via API    # ₹0.13/query, worth it
elif query_type == "final_plan_generation":
    → 31B Dense + 128K context
    # Full profile + institution corpus + scholarship ruleset loaded in one pass
    # No RAG, no retrieval miss, full coherence
```
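For anyone who wants the routing as actual code rather than pseudocode: a direct, runnable translation, reusing the hypothetical ask_e4b / ask_31b wrappers from the serving sketch and a parse_and_validate step in the spirit of the Pydantic example (returning None on failure).

```python
from enum import Enum

class Target(Enum):
    E4B_LOCAL = "e4b_local"
    DENSE_31B = "31b_api"

def route(query_type: str, constraint_count: int) -> Target:
    """Direct translation of the v2 routing table above."""
    if query_type == "eligibility_check":
        return Target.E4B_LOCAL
    if query_type in ("emotional", "ambiguous"):
        return Target.E4B_LOCAL
    if query_type == "single_path" and constraint_count <= 3:
        return Target.E4B_LOCAL  # escalates on parse error, see answer()
    return Target.DENSE_31B      # multi_constraint / final_plan_generation

def answer(query: str, query_type: str, constraint_count: int):
    if route(query_type, constraint_count) is Target.E4B_LOCAL:
        result = parse_and_validate(ask_e4b(query))
        if result is not None:
            return result
        # Parse/validation failure: escalate once rather than retry locally forever.
    return parse_and_validate(ask_31b(query))
```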
At our projected query mix (roughly 60% eligibility/emotional, 40% planning), this routing brings API cost from ₹6,500/month down to ~₹1,800/month. The API share lands below the raw 40% because single-path queries, part of that planning bucket, only escalate to 31B on a parse failure; the effective share works out to roughly 28% of volume, about 14,000 API calls at ₹0.13 each. That's a 72% reduction, with no quality drop on complex queries and a genuine quality improvement on emotional ones.
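One detail both the pseudocode and the sketch above gloss over: where constraint_count comes from. A crude keyword heuristic can be enough here, since a misroute costs at most one ₹0.13 API call. The patterns below are illustrative, not exhaustive; a real list would be tuned on logged queries.

```python
import re

# Illustrative constraint signals for Indian student career queries.
CONSTRAINT_PATTERNS = [
    r"\b\d+(?:\.\d+)?\s*(?:l|lakh|lakhs)\b",   # budget / income figures
    r"\brank\b",                               # entrance-exam rank
    r"\b(?:karnataka|south india|abroad)\b",   # location preference
    r"\b(?:scholarship|no loan)\b",            # funding constraint
    r"\b(?:govt|government|startup)\b",        # career-type preference
]

def count_constraints(query: str) -> int:
    q = query.lower()
    return sum(bool(re.search(p, q)) for p in CONSTRAINT_PATTERNS)
```

On the multi-constraint query from Category 3, this counts five signals, comfortably over the routing threshold of three.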
The Three Things I Didn't Expect
1. E4B's schema compliance was better than 31B's on simple queries. The 31B over-explains easy questions — like a person who writes three paragraphs when one sentence was asked for. At 50,000 queries/month, extra fields in the output are a post-processing tax.
2. E4B handles Hinglish far better than benchmarks suggest. Queries like "maths mein weak hoon but PCB strong, AI side jaana hai" were processed correctly without preprocessing. Standard English benchmarks tell you nothing about this. Test with your actual users' actual language.
3. The quality gap between E4B and 31B disappears — and reverses — on emotional queries. This is the finding I'd most want another developer building for real users to know. Don't assume bigger = better for the queries where tone matters most.
What I'd Tell Another Developer
Don't benchmark with clean prompts.
In my case, real queries were Hinglish, emotionally loaded, under-specified, and carried six simultaneous constraints in a single run-on sentence. Clean-prompt benchmarks would have told me to use 31B for everything. Real queries told me E4B is better for 60% of my volume.
The Gemma 4 family isn't a ladder where you climb as high as hardware allows. It's a toolkit. The routing decision is the engineering. And if you're building for a market where ₹0.13 per query actually matters — where the difference between ₹1,800/month and ₹6,500/month determines whether a student platform is financially viable at all — that routing decision is the whole business.
Reproducibility
The anonymised 50-query test set (categories labelled, personal details stripped) and the scoring rubric are available on request. Drop a comment if you're building career or education AI for Indian or emerging-market users — happy to share. Real benchmark data from production-adjacent queries is rare enough in this space that it's worth pooling.
What was the biggest gap between benchmark performance and real user query performance in your Gemma 4 work? Comments below — the interesting stuff lives in that gap.