sameer zubair

WhiteboardIQ: From Blurry Whiteboard Photo to Structured Action Items with Gemma 4 E4B

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

WhiteboardIQ — snap a photo of any whiteboard and get back a clean, structured list of action items, owners, deadlines, and priorities in seconds.

Every team has been there: 45 minutes of productive planning, three whiteboards full of tasks and names, then someone takes a blurry phone photo and that's "the notes." Two days later nobody remembers who owned what.

WhiteboardIQ fixes that. It reads the whiteboard image with Gemma 4's native vision and returns:

  • Action items with owner, deadline, and priority (inferred from visual cues — circles = High, boxes = Medium, plain text = Low)
  • 🏛️ Decisions made during the session
  • Open questions and blockers
  • 📋 Full verbatim transcription of the whiteboard
  • 📝 2–3 sentence executive summary
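
For a sense of what comes back, here is roughly what the extracted JSON looks like (field names follow the schema described in the SKILL.md section below; the values are made up for illustration):

```json
{
  "action_items": [
    {
      "task": "DB migration",
      "owner": "John",
      "deadline": "Friday",
      "priority": "High",
      "notes": "circled on the board"
    }
  ],
  "decisions": ["Ship v1 with the current auth flow"],
  "questions": ["Who owns the staging environment?"],
  "meeting_context": "Sprint planning",
  "summary": "Five tasks across three owners; two open questions remain."
}
```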

Demo


Code
🔗 Web app + backend: github.com/samirzubair/GEMMA4
🔗 Edge Gallery skill: samirzubair.github.io/GEMMA4/SKILL.md
Project structure
```
whiteboardiq/
├── backend/
│   ├── main.py           # FastAPI — POST /extract, serves frontend
│   ├── model.py          # Gemma 4 via Ollama REST API (no SDK needed)
│   └── formatter.py      # JSON → Markdown / CSV
└── frontend/
    ├── index.html        # Drag-and-drop upload UI
    ├── style.css         # Dark-mode design system
    └── app.js            # Fetch, render, copy, download

whiteboardiq-skill/       # Google AI Edge Gallery skill
├── SKILL.md              # Skill instructions for Gemma 4
├── scripts/
│   └── index.html        # run_js entry point
└── assets/
    └── webview.html      # Renders action items card in-app
```
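
For context, the Markdown side of formatter.py boils down to flattening that JSON into a task table. A minimal sketch, assuming the schema shown earlier (the real module may format things differently):

```python
def to_markdown(data: dict) -> str:
    """Turn the extracted JSON into a Markdown action-item table."""
    lines = ["| Task | Owner | Deadline | Priority |", "|---|---|---|---|"]
    for item in data.get("action_items", []):
        lines.append(
            f"| {item.get('task', '')} | {item.get('owner', '')} "
            f"| {item.get('deadline', '')} | {item.get('priority', '')} |"
        )
    return "\n".join(lines)
```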
The Gemma integration — no SDK, just Ollama REST
```python
import base64
import json
import urllib.request

# EXTRACTION_PROMPT and parse_json are defined elsewhere in model.py.
def extract_from_image_bytes(image_bytes: bytes, mime_type="image/jpeg") -> dict:
    payload = {
        "model": "gemma4:e4b",
        "prompt": EXTRACTION_PROMPT,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,
        "options": {"temperature": 0.2, "num_predict": 4096},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return parse_json(json.loads(resp.read())["response"])
```
temperature: 0.2 keeps extraction grounded — higher values caused the model to hallucinate owners or deadlines not on the board.
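
The EXTRACTION_PROMPT itself is the other half of the story. A condensed sketch of its shape, reconstructed from the rules described in the prompt-engineering section below (the exact wording in model.py differs):

```python
# Sketch only: the production prompt in model.py is longer and more specific.
EXTRACTION_PROMPT = """You are reading a photo of a meeting whiteboard.
Extract EVERY action item, including implicit ones
(e.g. "John -> DB migration" is a task owned by John).
Infer priority from visual cues: circled/starred/underlined = High,
boxed = Medium, plain text = Low.
A name written directly beside a task is that task's owner.
Return ONLY valid JSON with keys: action_items (task, owner, deadline,
priority, notes), decisions, questions, meeting_context, summary."""
```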

How I Used Gemma 4

Native multimodal vision. Gemma 4 handles image + text in a single inference call. No separate OCR pipeline, no two-model stitching. The whiteboard photo goes in as a base64 blob alongside the structured prompt, and JSON comes out.
Visual reasoning, not just OCR. Raw OCR gives you text. Gemma 4 understands context. It sees that a circled word is higher priority than plain text. It infers that a name written beside a task is the owner. It recognises that an arrow between two items implies a dependency. That's the difference between a transcription and an action list.
Speed that feels real-time. E4B at Q4_K_M quantization runs in ~8 seconds on a MacBook for a typical whiteboard photo. The 27B Dense model gives marginally better handwriting recognition on very messy boards — but for a live demo and real-world enterprise use, E4B hits the sweet spot of accuracy vs. latency.
Privacy — the killer feature for enterprise. Meeting content is sensitive. With Gemma 4 running locally via Ollama, whiteboard photos never leave the machine. No closed API, no data retention policy to worry about. The entire app runs without an internet connection.
128K context window. Not used in the MVP, but the obvious next step: pass all whiteboard photos from a multi-hour session in one call and get unified, deduplicated action items across all boards. Only possible because of the large context.
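
Since Ollama's /api/generate accepts a list of base64 images per request, that extension is a small change to the call above. A rough sketch, where extract_from_session and SESSION_PROMPT are hypothetical names, not part of the current code:

```python
import base64
import json
import urllib.request

def extract_from_session(photos: list[bytes]) -> dict:
    """Send every whiteboard photo from one session in a single Gemma 4 call."""
    payload = {
        "model": "gemma4:e4b",
        # SESSION_PROMPT (hypothetical) would ask for one deduplicated
        # action list across all boards instead of one list per photo.
        "prompt": SESSION_PROMPT,
        "images": [base64.b64encode(p).decode() for p in photos],
        "stream": False,
        "options": {"temperature": 0.2, "num_predict": 4096},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return parse_json(json.loads(resp.read())["response"])
```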
The prompt engineering
The extraction prompt has three key rules that made the difference:

  • Extract EVERY action item, even implicit ones (e.g., "John → DB migration" → task for John)
  • Infer priority from visual cues: circled/starred/underlined = High, boxed = Medium, plain = Low
  • A name written directly beside a task = that person is the owner

Without the implicit task rule, ~40% of action items were missed. Without the visual cue rule, all priorities came back as Medium. Gemma 4's instruction-following is strong enough to respect these rules reliably across very different whiteboard styles and handwriting quality.

Edge Gallery skill

The skill uses Gemma 4's agent mode via the run_js tool:
  • User sends whiteboard photo in Edge Gallery chat
  • Gemma reads the image with native vision
  • Gemma calls run_js with structured JSON (action items, decisions, questions)
  • scripts/index.html passes data to assets/webview.html via URL params
  • A dark-mode card renders inline in the chat with priority badges and owner chips

Instructions (from SKILL.md)

Call the run_js tool using index.html and a JSON string for data with:

  • action_items: Array with task, owner, deadline, priority, notes
  • decisions: Array of strings
  • questions: Array of strings
  • meeting_context, summary

The skill is live at https://samirzubair.github.io/GEMMA4/SKILL.md — installable in any Edge Gallery instance in seconds.





