The Problem
Every SOC analyst and MSP team I've talked to has the same complaint:
"We get 200 alerts a day. Maybe 10 are real. But someone has to check all 200."
That's alert fatigue. And it's not a small problem — the average analyst spends 3-5 hours daily on manual triage. Most of that time is wasted on false positives.
I decided to build something to fix this. Two weeks later, I had a working MVP. Here's exactly how I built it.
The Architecture
The system has 4 main components:
```
Alert Input (Defender / SentinelOne / generic JSON)
        ↓
Alert Normalizer
        ↓
LangGraph Triage Agent
 ├── Enrich Node (VirusTotal + MITRE ATT&CK)
 ├── Analyze Node (LLM risk scoring)
 └── Human-in-the-Loop Node (critical alerts)
        ↓
Output (Risk Score + Slack + Audit Log)
```
Step 1: Alert Normalizer
The first challenge: every security tool outputs alerts in a different format. Defender looks different from SentinelOne, which looks different from a generic SIEM.
I built a normalizer that takes any alert format and converts it to a single internal structure:
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedAlert:
    alert_id: str
    source: str                      # defender / sentinelone / generic
    severity: str                    # Low / Medium / High / Critical
    title: str
    timestamp: str
    mitre_technique: Optional[str]
    hostname: Optional[str]
    username: Optional[str]
    source_ip: Optional[str]
    raw: dict                        # Original alert, kept for the audit trail
```
This means the rest of the system doesn't care where the alert came from. It always works with the same format.
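To make that concrete, here's a minimal sketch of one source-specific normalizer. The Defender field names (`alertId`, `createdDateTime`, `networkConnections`) are illustrative placeholders, not the real Defender schema, and the `NormalizedAlert` copy is trimmed so the sketch runs standalone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedAlert:  # trimmed copy of the dataclass above
    alert_id: str
    source: str
    severity: str
    title: str
    timestamp: str
    source_ip: Optional[str]
    raw: dict

def normalize_defender(raw: dict) -> NormalizedAlert:
    """Map a (hypothetical) Defender payload onto the internal structure."""
    return NormalizedAlert(
        alert_id=raw.get("alertId", ""),
        source="defender",
        severity=raw.get("severity", "Low").capitalize(),
        title=raw.get("title", "Untitled alert"),
        timestamp=raw.get("createdDateTime", ""),
        source_ip=(raw.get("networkConnections") or [{}])[0].get("sourceAddress"),
        raw=raw,  # always keep the original payload for the audit log
    )
```

Each source gets one of these small mapping functions; everything downstream only ever sees `NormalizedAlert`.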
Step 2: LangGraph State Machine
I used LangGraph to build the agent as a state machine. Each step in the triage process is a separate node:
```python
from typing import Optional, TypedDict

class TriageState(TypedDict):
    alert: dict
    enrichment: Optional[dict]
    risk_score: Optional[int]
    risk_level: Optional[str]
    explanation: Optional[str]
    recommendation: Optional[str]
    needs_human: Optional[bool]
    error: Optional[str]
```
The graph flows like this:
enrich → analyze → [human_review if score >= 70] → format_output
Why LangGraph instead of a simple chain? Because real triage isn't linear. You need conditional routing — a Critical alert should follow a different path than a Low one. LangGraph makes this explicit and debuggable.
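That conditional routing boils down to one small function: given the state after the analyze node, return the name of the next node. Here's a sketch (the threshold of 70 matches the flow above; the node names are the ones from the diagram):

```python
from typing import Optional, TypedDict

class TriageState(TypedDict, total=False):
    risk_score: Optional[int]
    error: Optional[str]

def route_after_analyze(state: TriageState) -> str:
    """Decide which node runs after 'analyze'."""
    if state.get("error"):
        return "format_output"   # fail safe: still emit an audit record
    if (state.get("risk_score") or 0) >= 70:
        return "human_review"    # high-risk path: pause for a human
    return "format_output"       # low-risk path: fully automated

# In LangGraph, this function is registered as a conditional edge via
# graph.add_conditional_edges("analyze", route_after_analyze, ...).
```

Because the routing is a plain function, you can unit-test the branching logic without running the graph at all.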
Step 3: Enrichment Tools
Before the LLM sees the alert, two tools run automatically:
VirusTotal IP Lookup
```python
import requests

def check_ip(ip: str) -> IPReputation:
    url = f"https://www.virustotal.com/api/v3/ip_addresses/{ip}"
    headers = {"x-apikey": api_key}
    response = requests.get(url, headers=headers, timeout=10)
    # Parses the response into malicious_votes, country, as_owner, is_known_bad
    ...
```
Why this matters: An alert marked "Low severity" came in for SSH login attempts. The source IP had 4 malicious votes on VirusTotal. The system automatically escalated it to High. Without enrichment, that alert would have been ignored.
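The escalation itself is simple, deterministic logic layered on top of the enrichment. Here's a sketch of the rule; the vote thresholds are an illustrative choice on my part, not a VirusTotal standard:

```python
# Never downgrade an alert: take the higher of the original severity and
# the floor implied by the VirusTotal malicious-vote count.
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]

def escalate_severity(current: str, malicious_votes: int) -> str:
    floor = "Low"
    if malicious_votes >= 10:
        floor = "Critical"
    elif malicious_votes >= 3:
        floor = "High"
    elif malicious_votes >= 1:
        floor = "Medium"
    return max(current, floor, key=SEVERITY_ORDER.index)
```

With thresholds like these, that "Low severity" SSH alert with 4 malicious votes lands at High before the LLM even sees it.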
MITRE ATT&CK Context
Instead of hitting an API for every request, I built a local database of the most common techniques:
```python
MITRE_DB = {
    "T1059.001": MitreTechnique(
        "T1059.001", "PowerShell", "Execution",
        "Adversaries use PowerShell to execute commands, often with encoded payloads...",
        "high",
    ),
    "T1486": MitreTechnique(
        "T1486", "Data Encrypted for Impact (Ransomware)", "Impact",
        "Adversary encrypts data to disrupt availability...",
        "high",
    ),
}
```
This context goes directly into the LLM prompt — giving the model real knowledge about what each technique means and how dangerous it is.
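Folding that context into the prompt is a simple lookup-and-format step. A sketch, with a `MitreTechnique` dataclass that mirrors the structure above (the exact field names and prompt wording here are my illustration):

```python
from dataclasses import dataclass

@dataclass
class MitreTechnique:
    technique_id: str
    name: str
    tactic: str
    description: str
    typical_risk: str

MITRE_DB = {
    "T1059.001": MitreTechnique(
        "T1059.001", "PowerShell", "Execution",
        "Adversaries use PowerShell to execute commands...", "high",
    ),
}

def mitre_context(technique_id: str) -> str:
    """Render local MITRE context as a prompt fragment for the LLM."""
    t = MITRE_DB.get(technique_id)
    if t is None:
        return "No local MITRE ATT&CK context available for this technique."
    return (f"MITRE ATT&CK {t.technique_id} ({t.name}, tactic: {t.tactic}, "
            f"typical risk: {t.typical_risk}): {t.description}")
```

An unknown technique degrades gracefully to a "no context" line instead of breaking the prompt.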
Step 4: The LLM Analysis
The Triage Agent sends the enriched alert to Groq (Llama 3.3 70B) with a structured prompt that returns JSON:
```json
{
  "risk_score": 95,
  "risk_level": "Critical",
  "explanation": "The source IP is flagged as MALICIOUS by 17 VirusTotal engines...",
  "recommendation": "Block IP immediately and isolate the device.",
  "needs_human": true
}
```
Key design decision: temperature 0.1. Security analysis needs consistent, repeatable scoring, not creativity.
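Even at temperature 0.1, the model's output is still untrusted text, so the response gets parsed defensively before anything acts on it. A sketch of that validation step (key names match the JSON above; the hard floor on `needs_human` is a deliberate belt-and-suspenders choice):

```python
import json

REQUIRED_KEYS = {"risk_score", "risk_level", "explanation",
                 "recommendation", "needs_human"}

def parse_llm_verdict(raw_text: str) -> dict:
    """Parse and sanity-check the model's JSON verdict."""
    verdict = json.loads(raw_text)
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"LLM response missing keys: {sorted(missing)}")
    score = int(verdict["risk_score"])
    if not 0 <= score <= 100:
        raise ValueError(f"risk_score out of range: {score}")
    # Any score >= 70 forces human review, regardless of what the model said.
    verdict["needs_human"] = bool(verdict["needs_human"]) or score >= 70
    return verdict
```

If the model ever returns malformed JSON, the error lands in the `error` field of the state and the alert still flows to `format_output` for the audit log.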
Step 5: Human-in-the-Loop
For any alert with risk score >= 70, the system sends a Slack notification and waits for human approval. AI assists — humans decide on critical actions.
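The gate itself is a few lines: check the threshold, build a message, post to the webhook. A sketch using the standard Slack incoming-webhook payload shape (the message text and function names are my illustration):

```python
import json
import urllib.request

def build_slack_payload(alert_title: str, risk_score: int, explanation: str) -> dict:
    """Standard Slack incoming-webhook body: a single 'text' field."""
    return {
        "text": (f":rotating_light: *Human review required* (score {risk_score}/100)\n"
                 f"*{alert_title}*\n{explanation}")
    }

def notify_if_needed(webhook_url: str, alert_title: str,
                     risk_score: int, explanation: str) -> bool:
    """Post to Slack only for alerts at or above the review threshold."""
    if risk_score < 70:
        return False  # below threshold: no human needed, no notification
    payload = build_slack_payload(alert_title, risk_score, explanation)
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
    return True
```

The "waits for approval" part lives in the LangGraph human-review node; this function only handles getting the right alerts in front of a person.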
Step 6: REST API with FastAPI
```python
@router.post("/triage", response_model=TriageResponse)
def triage_alert(alert_request: AlertRequest):
    normalized = normalize_alert(alert_request.model_dump(exclude_none=True))
    result = run_triage(normalized)
    return TriageResponse(...)
```
Microsoft Defender can now send a webhook to POST /triage and get back a full analysis in ~3 seconds.
Real Results
Running 6 sample alerts through the system:
- A "Low severity" SSH alert was escalated to High because VirusTotal flagged the source IP (4 malicious votes)
- A data exfiltration alert scored 95/100 Critical — destination IP had 17 VirusTotal votes, known Tor exit node used for C2
Tech Stack
- Python 3.12 + LangGraph + FastAPI
- Groq (Llama 3.3 70B) — free tier
- VirusTotal API — free tier (500 req/day)
- Slack Webhooks — notifications
Total cost for MVP: $0
Key Lessons
- Enrich before you analyze — LLM without real threat intel is just guessing
- LangGraph over simple chains — conditional routing requires a proper state machine
- Human-in-the-Loop is not optional — never automate critical security decisions
- Start with the data — understanding real alerts before coding saved hours
Currently looking for MSP and SOC teams for a free 2-week pilot.
If your team deals with alert fatigue — comment below or DM me.
