I'm building PathForge AI — a career guidance platform for Indian students. The pitch is simple: AI-powered counselling for students who can't afford a human counsellor. The engineering problem underneath is not simple at all.
When Gemma 4 dropped in April 2026, I had a decision to make. The family ships four models. I had two obvious candidates:
- E4B (~4.5B effective params): runs locally on a mid-range phone, free, completely private
- 31B Dense: server-side via API, costs real money per query, much slower
The conventional wisdom was clear: small model for quick tasks, big model for complex reasoning, route intelligently. Done.
Except I didn't trust the conventional wisdom. So I ran 50 real queries through both models — actual queries from PathForge AI's private beta — and measured everything: output quality, schema compliance, latency, and cost per query.
The results were not what I expected.
The Setup
What I was testing: Career guidance queries from real Indian students (anonymised). Not clean test prompts. Messy, code-switched, emotionally loaded, often under-specified — exactly the way real users type.
Query categories (50 total):
| Category | Count | Example |
|---|---|---|
| Simple eligibility check | 15 | "Can I apply for NSP if family income is 2.8L?" |
| Single-path career question | 15 | "PCB student, 78%, interested in AI field, what options?" |
| Multi-constraint planning | 12 | "JEE rank 52000, budget 4L/year, prefer Karnataka, open to abroad if full scholarship" |
| Ambiguous / emotional | 8 | "parents want CA but I want game dev, marks average, what should I do honestly" |
Scoring rubric (blind, three evaluators, averaged):
| Dimension | Max |
|---|---|
| Constraint compliance — did it actually honour all stated constraints? | 3 |
| Schema fidelity — valid parseable JSON matching our output spec? | 2 |
| Practical accuracy — is the career/institution advice actually correct? | 3 |
| Tone — does it read like a counsellor, not a Wikipedia article? | 2 |
| Total | 10 |
Infrastructure:
- E4B: Q4_K_M quantised GGUF, llama.cpp, laptop (16GB RAM, no dedicated GPU). Simulating a real developer machine serving requests.
- 31B Dense: Gemma 4 31B endpoint via Gemini API. Server-side, billed per token. Both serving paths are sketched in code below.
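For concreteness, here is roughly how the two paths were wired up. A minimal sketch, not the production code: model paths and the API model id are placeholders, and it assumes llama-cpp-python for the local side and the google-generativeai client for the API side.

```python
# Local path: E4B GGUF served through llama-cpp-python (llama.cpp bindings).
from llama_cpp import Llama

e4b = Llama(
    model_path="models/gemma-4-e4b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
)

def ask_e4b(prompt: str) -> str:
    out = e4b.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature: we want parseable, repeatable JSON
    )
    return out["choices"][0]["message"]["content"]

# API path: 31B Dense via the Gemini API client.
import google.generativeai as genai

genai.configure(api_key="...")  # read from env in practice
dense_31b = genai.GenerativeModel("gemma-4-31b")  # placeholder model id

def ask_31b(prompt: str) -> str:
    return dense_31b.generate_content(prompt).text
```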
The Numbers
Headline results first. Details follow.
Overall Score (out of 10)
| Category | E4B | 31B Dense | Winner |
|---|---|---|---|
| Simple eligibility | 8.7 | 8.4 | E4B |
| Single-path career | 7.2 | 8.9 | 31B |
| Multi-constraint planning | 5.1 | 9.1 | 31B |
| Ambiguous / emotional | 8.1 | 7.6 | E4B |
| Overall average | 7.3 | 8.5 | 31B |
Latency
| Model | Median | P95 |
|---|---|---|
| E4B (local, no GPU) | 3.1s | 6.8s |
| 31B Dense (API) | 9.4s | 17.2s |
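For reproducibility: latency here is wall-clock time per full response, and a P95 estimated from 50 samples is noisy, so treat it as indicative rather than precise. A sketch of the measurement, reusing the hypothetical ask_* wrappers from the serving sketch above:

```python
import statistics
import time

def time_queries(ask, queries):
    """Wall-clock latency per full response; returns (median, p95) in seconds."""
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        ask(q)
        latencies.append(time.perf_counter() - t0)
    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile cut point
    return p50, p95
```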
Cost per Query
| Model | Cost |
|---|---|
| E4B local | ₹0 |
| 31B via Gemini API | ~₹0.13 (~$0.0015 USD) |
At projected 50,000 queries/month, running 31B for everything = ₹6,500/month. E4B for everything = essentially ₹0.
Sounds like an obvious choice. Here's why it isn't.
Category 1: Simple Eligibility — E4B Wins
Expected result. Give both models a bounded factual question and the smaller one handles it fine. What I didn't expect was how E4B won.
Query:
"my family income is 2.8 lakhs, i'm in 11th, can i get NSP scholarship? SC category"
E4B output (9/10):
```json
{
  "eligible": true,
  "scheme": "NSP Post-Matric Scholarship",
  "income_cutoff_met": true,
  "category_benefit": "SC category qualifies for higher scholarship amount",
  "next_step": "Register on scholarships.gov.in after Class 12 results",
  "caution": "Student is in Class 11 — Post-Matric applies from Class 12 onwards. Apply in first month of Class 12 admission.",
  "confidence": "high"
}
```
That caution flag — catching that the student is currently in Class 11 so the application timing is wrong — wasn't prompted for. E4B inferred it from the grade level stated in the query. Proactive, correct, and actually useful.
31B output (8/10): Correct, but added an unrequested "general_advice" field outside our schema, requiring stripping in post-processing. Small thing. At 50,000 queries/month it's not small.
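The stripping itself is cheap to automate. A minimal sketch of the validation layer, assuming Pydantic v2 and a hypothetical EligibilityResult schema mirroring the fields above; extra="ignore" silently drops unrequested fields like that "general_advice", and a failed parse is what triggers the retry/escalation in the routing logic later in this post.

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class EligibilityResult(BaseModel):
    # extra="ignore" drops fields outside the spec, e.g. 31B's "general_advice".
    model_config = ConfigDict(extra="ignore")
    eligible: bool
    scheme: str
    income_cutoff_met: bool
    category_benefit: str
    next_step: str
    caution: str | None = None
    confidence: str

def parse_eligibility(raw: str) -> EligibilityResult | None:
    try:
        return EligibilityResult.model_validate_json(raw)
    except ValidationError:
        return None  # caller decides: retry locally or escalate to 31B
```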
Takeaway: E4B is faster, free, and more schema-compliant on simple queries. Route these locally.
Category 2: Single-Path Career Questions — 31B Wins Clearly
Query:
"PCB student, 78% in boards, interested in AI and biotech both, family income 6L, can spend upto 8L for full course, which way should I go?"
E4B output (6/10):
Identified Bioinformatics correctly as the overlap zone. Listed three institution options. One problem: it recommended a private university in Pune at ₹3.2L/year — which is ₹12.8L for four years. Over the stated ₹8L total budget. E4B either didn't do the multiplication or lost the budget constraint between the reasoning step and the institution selection step.
31B output (9/10):
```json
{
  "recommended_path": "B.Sc. Bioinformatics → M.Tech. Computational Biology",
  "rationale": "Sits at the intersection of PCB background and AI interest. Growing field with government push under BioE3 Policy 2023.",
  "institutions": [
    {
      "name": "University of Hyderabad",
      "type": "Central University",
      "annual_fee": 18000,
      "total_cost_estimate": 72000,
      "budget_check": "PASS — well within ₹8L total",
      "entrance": "CUET-UG"
    },
    {
      "name": "Amrita Vishwa Vidyapeetham",
      "type": "Private (NAAC A++)",
      "annual_fee": 185000,
      "total_cost_estimate": 740000,
      "budget_check": "PASS — within ₹8L total",
      "entrance": "AEEE / Direct admission"
    }
  ],
  "scholarship_flags": [
    "CSIR-UGC JRF eligible post-graduation",
    "DST INSPIRE — check if boards percentage qualifies for top 1% state cutoff"
  ],
  "confidence": "high"
}
```
The 31B computed total cost (annual fee × 4), checked it against the budget, and labelled each result PASS/FAIL without being asked. It also flagged DST INSPIRE proactively — exactly the counselling behaviour that makes the difference between a generic AI answer and a useful one.
The pattern: E4B loses the thread of a constraint when it has to maintain it across multiple reasoning steps inside a single output. 31B doesn't.
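That pattern suggests an obvious guardrail: never trust the model's arithmetic on a hard constraint. A minimal sketch, using the field names from the 31B output above; it recomputes totals deterministically and overwrites whatever PASS/FAIL the model emitted.

```python
def enforce_budget(institutions: list[dict], budget_total: int, years: int = 4) -> list[dict]:
    """Recompute total cost and the budget verdict outside the model."""
    for inst in institutions:
        total = inst["annual_fee"] * years
        inst["total_cost_estimate"] = total
        inst["budget_check"] = "PASS" if total <= budget_total else "FAIL"
    # Drop anything recommended over budget, E4B's ₹12.8L Pune suggestion included.
    return [inst for inst in institutions if inst["budget_check"] == "PASS"]
```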
Category 3: Multi-Constraint Planning — Biggest Gap (E4B: 5.1 vs 31B: 9.1)
Query:
"JEE mains rank around 52000, family income 3.8 lakhs, want to stay in south india preferably karnataka, open to abroad only if full scholarship no loan, interested in CS or ECE, parents want govt job security, I want startup eventually"
Six simultaneous constraints, including one internal tension (parents vs student). This is representative of a real complex session.
E4B output (4/10):
Listed NIT Surathkal as a viable option. NIT Surathkal's CS closing rank in 2025 was approximately 8,500; the student's rank is 52,000. That's not close. The model produced a confident recommendation that was wrong on the one number that determines whether the advice works at all.
It also listed two German universities as abroad options (correct — near-zero tuition) but added: "loan may be required for living costs" — directly contradicting the "no loan" constraint stated in the query.
A student acting on E4B's output for this query would apply to a college they have near-zero chance of getting into, and might take a loan they said they couldn't take.
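The same lesson as the budget check applies here, with higher stakes. Closing ranks are lookup-able facts and shouldn't come from model weights at all. A minimal sketch of a deterministic filter; the cutoff table is hypothetical and would be sourced from JoSAA counselling records in practice.

```python
# Hypothetical cutoff table; in practice, loaded from JoSAA counselling records.
CLOSING_RANKS = {
    "NIT Surathkal CS": 8500,
    "NIT Goa CS": 58000,
    "NIT Puducherry ECE": 62000,
}

def plausible_colleges(student_rank: int, recommended: list[str], margin: float = 1.1) -> list[str]:
    """Keep only options whose last closing rank is at or beyond the student's
    rank (with a small margin for year-to-year drift). Colleges missing from
    the table are dropped rather than trusted."""
    return [
        name for name in recommended
        if name in CLOSING_RANKS and student_rank <= CLOSING_RANKS[name] * margin
    ]
```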
31B output (9/10):
```json
{
  "jee_rank_analysis": {
    "rank": 52000,
    "nit_flagship_verdict": "UNLIKELY — NIT Surathkal/Trichy/Warangal CS/ECE cutoffs are 5,000–15,000. Do not apply without checking current year cutoffs.",
    "viable_nits": [
      "NIT Goa CS (~45,000–58,000)",
      "NIT Puducherry ECE (~50,000–62,000)"
    ],
    "state_alternative": "Karnataka KCET is entirely separate from JEE — strong option with equivalent state rank. Check KCET rank via conversion tool."
  },
  "abroad_assessment": {
    "constraint": "Full scholarship, zero loan — hard limit",
    "germany": "DAAD scholarships exist and cover living costs — viable but competitive. Include only if student is top academic performer overall.",
    "excluded_with_reason": {
      "USA": "Full cost-of-attendance scholarship near-impossible at this academic profile",
      "UK": "Tuition fees too high to cover without loan given income level"
    }
  },
  "parent_student_tension": {
    "observation": "Government security preference and startup aspiration are reconcilable — not opposing paths.",
    "framing": "DRDO, C-DAC, and ISRO R&D tracks increasingly resemble startup environments. Bengaluru government R&D → startup transition is a well-documented career arc. CS + ECE builds infrastructure for both.",
    "recommended_narrative": "Frame to parents as government R&D with startup optionality, not startup instead of stability."
  }
}
```
The tension-handling section got a unanimous "this is exactly right" from all three evaluators. It didn't pick the student's side or the parents' side. It reframed the conflict as a phased path. That's what a good counsellor does. E4B didn't attempt to address the tension at all.
Category 4: Ambiguous / Emotional Queries — E4B Wins Unexpectedly
This was the genuine surprise.
Query:
"parents want CA but I want game dev, marks are average, what should I do honestly"
I expected 31B to win here because nuance requires capacity. It didn't.
E4B output (8/10):
Short. Direct. No hedging. Acknowledged the conflict in one sentence, gave a concrete middle-ground (BBA + game dev certification track), named two Indian studios that hire from non-CS backgrounds (Nodding Heads, Rockstar India Pune), and closed with:
"Both paths are real. The question is which regret you can live with more."
31B output (7/10):
Structurally excellent. Balanced. Full of caveats. Longer. One evaluator wrote: "technically correct, emotionally inert."
This is a real pattern: large dense models over-optimise for completeness and under-optimise for voice on short emotional queries. E4B, with its smaller output budget, was forced to be direct. The directness worked. For a stressed 17-year-old reading this at midnight, "technically correct, emotionally inert" is a failure mode that matters.
The Routing Logic I'm Now Running
QUERY ROUTING — PathForge AI v2

```
if query_type == "eligibility_check":
    → E4B local            # Fast, free, more schema-compliant
elif query_type in ["emotional", "ambiguous"]:
    → E4B local            # Brevity is a feature, not a limitation
elif query_type == "single_path" and constraint_count <= 3:
    → E4B local            # Handles 80% correctly; retry on parse error → 31B
elif query_type == "multi_constraint" or constraint_count > 3:
    → 31B Dense via API    # ₹0.13/query, worth it
elif query_type == "final_plan_generation":
    → 31B Dense + 128K context
    # Full profile + institution corpus + scholarship ruleset loaded in one pass
    # No RAG, no retrieval miss, full coherence
```
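For anyone who wants the routing as actual code rather than pseudocode: a direct, runnable translation, reusing the hypothetical ask_e4b / ask_31b wrappers from the serving sketch and a parse_and_validate step in the spirit of the Pydantic example (returning None on failure).

```python
from enum import Enum

class Target(Enum):
    E4B_LOCAL = "e4b_local"
    DENSE_31B = "31b_api"

def route(query_type: str, constraint_count: int) -> Target:
    """Direct translation of the v2 routing table above."""
    if query_type == "eligibility_check":
        return Target.E4B_LOCAL
    if query_type in ("emotional", "ambiguous"):
        return Target.E4B_LOCAL
    if query_type == "single_path" and constraint_count <= 3:
        return Target.E4B_LOCAL  # escalates on parse error, see answer()
    return Target.DENSE_31B      # multi_constraint / final_plan_generation

def answer(query: str, query_type: str, constraint_count: int):
    if route(query_type, constraint_count) is Target.E4B_LOCAL:
        result = parse_and_validate(ask_e4b(query))
        if result is not None:
            return result
        # Parse/validation failure: escalate once rather than retry locally forever.
    return parse_and_validate(ask_31b(query))
```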
At our projected query mix (roughly 60% eligibility/emotional, 40% planning), this routing brings API cost from ₹6,500/month down to ~₹1,800/month. The API share lands below the raw 40% because single-path queries, part of that planning bucket, only escalate to 31B on a parse failure; the effective share works out to roughly 28% of volume, about 14,000 API calls at ₹0.13 each. That's a 72% reduction, with no quality drop on complex queries and a genuine quality improvement on emotional ones.
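One detail both the pseudocode and the sketch above gloss over: where constraint_count comes from. A crude keyword heuristic can be enough here, since a misroute costs at most one ₹0.13 API call. The patterns below are illustrative, not exhaustive; a real list would be tuned on logged queries.

```python
import re

# Illustrative constraint signals for Indian student career queries.
CONSTRAINT_PATTERNS = [
    r"\b\d+(?:\.\d+)?\s*(?:l|lakh|lakhs)\b",   # budget / income figures
    r"\brank\b",                               # entrance-exam rank
    r"\b(?:karnataka|south india|abroad)\b",   # location preference
    r"\b(?:scholarship|no loan)\b",            # funding constraint
    r"\b(?:govt|government|startup)\b",        # career-type preference
]

def count_constraints(query: str) -> int:
    q = query.lower()
    return sum(bool(re.search(p, q)) for p in CONSTRAINT_PATTERNS)
```

On the multi-constraint query from Category 3, this counts five signals, comfortably over the routing threshold of three.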
The Three Things I Didn't Expect
1. E4B's schema compliance was better than 31B's on simple queries. The 31B over-explains easy questions — like a person who writes three paragraphs when one sentence was asked for. At 50,000 queries/month, extra fields in the output are a post-processing tax.
2. E4B handles Hinglish far better than benchmarks suggest. Queries like "maths mein weak hoon but PCB strong, AI side jaana hai" were processed correctly without preprocessing. Standard English benchmarks tell you nothing about this. Test with your actual users' actual language.
3. The quality gap between E4B and 31B disappears — and reverses — on emotional queries. This is the finding I'd most want another developer building for real users to know. Don't assume bigger = better for the queries where tone matters most.
What I'd Tell Another Developer
Don't benchmark with clean prompts.
In my case, real queries were Hinglish, emotionally loaded, under-specified, and carried six simultaneous constraints in a single run-on sentence. Clean-prompt benchmarks would have told me to use 31B for everything. Real queries told me E4B is better for 60% of my volume.
The Gemma 4 family isn't a ladder where you climb as high as hardware allows. It's a toolkit. The routing decision is the engineering. And if you're building for a market where ₹0.13 per query actually matters — where the difference between ₹1,800/month and ₹6,500/month determines whether a student platform is financially viable at all — that routing decision is the whole business.
Reproducibility
The anonymised 50-query test set (categories labelled, personal details stripped) and the scoring rubric are available on request. Drop a comment if you're building career or education AI for Indian or emerging-market users — happy to share. Real benchmark data from production-adjacent queries is rare enough in this space that it's worth pooling.
What was the biggest gap between benchmark performance and real user query performance in your Gemma 4 work? Comments below — the interesting stuff lives in that gap.