Sesank Munukutla (Naga)

Posted on Feb 13

Event-Driven EC2 Isolation in AWS: Building a Minimal Cloud SOAR Without Buying One

#aws #cloudsecurity #devsecops #incidentresponse

Detection without response is operational noise.

GuardDuty alerts are valuable — but if a human has to read, decide, and manually isolate an instance, your blast radius window is still open.

I wanted high-confidence findings to trigger automatic containment.

So I built a minimal AWS-native SOAR pipeline.

No third-party tooling.

No overengineering.

Just deterministic, event-driven response.

🎯 Objective

Build an automated containment workflow that:

Responds only to high-severity GuardDuty findings
Automatically isolates compromised EC2 instances
Preserves forensic access
Avoids recursive execution
Is observable and debuggable

All event-driven. No polling. No manual trigger.

🏗 Architecture Overview

GuardDuty Finding
↓
EventBridge Rule (severity >= 7)
↓
Lambda Function (Isolation Logic)
↓
Modify EC2 Security Group → Quarantine SG
↓
SNS Notification (Visibility Layer)

Minimal. Deterministic. Cheap.
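The last hop in the flow above, the SNS notification, can be sketched as a small helper. This is an illustration rather than the exact code behind the pipeline: the function name `notify_containment` and the message fields are hypothetical, and `sns_client` is assumed to be a boto3 SNS client passed in by the caller.

```python
import json


def notify_containment(sns_client, topic_arn, instance_id, finding_type, severity):
    """Publish a short containment notice to an SNS topic.

    sns_client is assumed to be a boto3 SNS client, e.g. boto3.client("sns").
    The message layout is illustrative, not a fixed schema.
    """
    message = {
        "action": "ec2_isolated",
        "instance_id": instance_id,
        "finding_type": finding_type,
        "severity": severity,
    }
    sns_client.publish(
        TopicArn=topic_arn,
        # SNS limits subjects to 100 characters, so truncate defensively.
        Subject=f"EC2 isolated: {instance_id}"[:100],
        Message=json.dumps(message, sort_keys=True),
    )
    return message
```

Passing the client in keeps the helper testable without AWS credentials.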

Filtering at the Event Layer (Not Inside Lambda)

Instead of checking severity inside the Lambda function, I filtered directly in EventBridge.

Why this matters:

Reduces unnecessary Lambda invocations
Makes response criteria explicit
Improves audit clarity
Lowers operational cost

Example event pattern:

{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [ { "numeric": [">=", 7] } ]
  }
}

Only high-severity findings (GuardDuty severity 7 and above) trigger automation.

Everything else remains visible — but not auto-remediated.
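For anyone wiring this up from code instead of the console, here is a minimal sketch of registering such a rule with boto3. The helper names `build_event_pattern` and `register_rule`, the target ID, and the rule name in the usage note are hypothetical. The pattern includes a `source` filter so that only events emitted by GuardDuty match.

```python
import json


def build_event_pattern(min_severity=7):
    """Return an EventBridge pattern for GuardDuty findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def register_rule(events_client, rule_name, lambda_arn, min_severity=7):
    """Create (or update) the rule and point it at the isolation Lambda.

    events_client is assumed to be boto3.client("events").
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_event_pattern(min_severity)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "isolation-lambda" is an arbitrary target ID chosen for this sketch.
        Targets=[{"Id": "isolation-lambda", "Arn": lambda_arn}],
    )
```

Usage would look like `register_rule(boto3.client("events"), "guardduty-high-severity", lambda_arn)`. The function also needs a resource-based permission that lets `events.amazonaws.com` invoke it, which this sketch leaves out.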

Quarantine Security Group Design

Containment is not termination.

Terminating an instance destroys forensic evidence.

My quarantine security group:

❌ No outbound internet
❌ No inbound from public IP ranges
✅ Allow only SOC bastion IP
✅ Allow forensic collection host
✅ Optional: allow a monitoring endpoint (VPC Flow Logs are captured at the network interface and need no security group rule)

The goal is isolation with controlled investigation access.
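A quarantine group along these lines could be created with boto3 roughly as follows. This is a sketch under assumptions: the group name, the SSH-only ingress, and the parameters `bastion_cidr` and `forensic_cidr` are illustrative choices, not the exact rules used here. The step that is easy to miss is that a new VPC security group starts with an allow-all egress rule, which has to be revoked explicitly.

```python
def create_quarantine_sg(ec2_client, vpc_id, bastion_cidr, forensic_cidr, ssh_port=22):
    """Create a security group with no egress and SSH ingress from two trusted CIDRs.

    ec2_client is assumed to be boto3.client("ec2"). The CIDRs and port are
    placeholders for whatever the SOC actually uses.
    """
    group_id = ec2_client.create_security_group(
        GroupName="quarantine",
        Description="Isolation group for compromised instances",
        VpcId=vpc_id,
    )["GroupId"]

    # New VPC security groups allow all outbound traffic by default; remove that rule.
    ec2_client.revoke_security_group_egress(
        GroupId=group_id,
        IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
    )

    # Inbound access only from the investigation hosts.
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": ssh_port,
            "ToPort": ssh_port,
            "IpRanges": [
                {"CidrIp": bastion_cidr, "Description": "SOC bastion"},
                {"CidrIp": forensic_cidr, "Description": "Forensic collection host"},
            ],
        }],
    )
    return group_id
```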

Isolation Logic (Lambda Example)

Core logic:

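import boto3

# Created once per execution environment so warm invocations reuse the client.
ec2 = boto3.client('ec2')

def isolate_instance(instance_id, quarantine_sg_id):
    """Replace every security group on the instance with the quarantine group."""
    # Groups overwrites the existing list; it does not append to it.
    # For instances with more than one network interface, AWS recommends
    # calling modify_network_interface_attribute per interface instead.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[quarantine_sg_id]
    )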

Additional safeguards added:

Check instance state before modification
Tag instance Quarantined=true
Exit if already isolated
Log original security groups for rollback

Containment must be idempotent.
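The four safeguards listed above can be combined into one function. The following is a minimal sketch, not the production code: the function name, the returned status strings, and the `OriginalSecurityGroups` tag used to store rollback data are all hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client.

```python
import json

QUARANTINE_TAG = "Quarantined"               # tag key from the safeguards list
ORIGINAL_SGS_TAG = "OriginalSecurityGroups"  # hypothetical tag key for rollback data


def isolate_instance_safely(ec2_client, instance_id, quarantine_sg_id):
    """Move an instance into the quarantine group and return a status string."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    # Safeguard 1: check instance state; skip instances that are going away.
    state = instance["State"]["Name"]
    if state in ("shutting-down", "terminated"):
        return f"skipped:{state}"

    # Safeguard 2: exit if already isolated (tag present or groups already swapped).
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    current = [sg["GroupId"] for sg in instance.get("SecurityGroups", [])]
    if tags.get(QUARANTINE_TAG) == "true" or current == [quarantine_sg_id]:
        return "skipped:already-isolated"

    # Safeguard 3: log the original groups before changing anything.
    print(json.dumps({"event": "isolating", "instance_id": instance_id,
                      "original_sgs": current}))

    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])

    # Safeguard 4: tag after the swap so the tag only marks contained instances.
    # Group IDs are joined with ":" to stay inside the tag character set that
    # is accepted across AWS services.
    ec2_client.create_tags(
        Resources=[instance_id],
        Tags=[
            {"Key": QUARANTINE_TAG, "Value": "true"},
            {"Key": ORIGINAL_SGS_TAG, "Value": ":".join(current)},
        ],
    )
    return "isolated"
```

Two invocations that run at the same moment can both pass the tag check. Because the swap always writes the same single group, the second write is harmless, but the rollback tag would then need a guard before being overwritten.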

Idempotency: Preventing Recursive Triggers

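When Lambda modifies security groups, the API call is recorded in CloudTrail, and GuardDuty can send updated findings for the same instance.

Without safeguards, any rule that matches those events re-invokes the function, and you risk infinite loops.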

Mitigation:

Tag check before modification
Structured event filtering
Explicit function logging
DLQ configured for failure cases

Automation that can repeat blindly is dangerous.
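Structured event filtering can also be enforced inside the function, as a second line of defence behind the EventBridge rule. A minimal sketch, assuming the standard GuardDuty finding layout in which EC2 findings carry `detail.resource.instanceDetails.instanceId`; the function name `extract_instance_id` is hypothetical.

```python
def extract_instance_id(event):
    """Return the EC2 instance ID from a GuardDuty finding event, or None.

    Anything that is not a GuardDuty finding about an EC2 instance is
    rejected, so a stray event cannot trigger containment.
    """
    if event.get("source") != "aws.guardduty":
        return None
    if event.get("detail-type") != "GuardDuty Finding":
        return None
    resource = event.get("detail", {}).get("resource", {})
    if resource.get("resourceType") != "Instance":
        # Findings about access keys, S3 buckets, etc. have nothing to isolate.
        return None
    return resource.get("instanceDetails", {}).get("instanceId")
```

A handler would call this first and return early on `None`, logging the rejected event instead of raising.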

Failure Modes I Modeled

Automation amplifies mistakes.

I explicitly accounted for:

IAM permission drift
Partial security group modification
Concurrent findings on same instance
Cross-region GuardDuty setup
High-volume alert bursts

Mitigations:

Dead Letter Queue
Lambda concurrency limits
CloudWatch error metrics + alarms
Explicit structured logs (JSON format)
Permission boundary controls

Automation without observability becomes silent failure.
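Several of these mitigations are one boto3 call each. The sketch below shows one possible setup under stated assumptions: the helper names, the alarm name, the concurrency cap of 5, and the one-minute alarm window are illustrative values, not the settings used in this pipeline. `lambda_client` and `cloudwatch_client` are assumed to be boto3 clients for Lambda and CloudWatch.

```python
import json
import time


def log_json(event_name, **fields):
    """Print one JSON object per line so log queries can filter on fields."""
    record = {"event": event_name, "ts": int(time.time()), **fields}
    print(json.dumps(record, sort_keys=True, default=str))
    return record


def configure_guardrails(lambda_client, cloudwatch_client, function_name,
                         dlq_arn, alarm_topic_arn, max_concurrency=5):
    """Attach a DLQ, cap concurrency, and alarm on any function error."""
    # Failed asynchronous invocations land in the queue or topic at dlq_arn.
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": dlq_arn},
    )
    # Reserved concurrency bounds how many isolations can run during a burst.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrency,
    )
    # Alarm as soon as the function reports one error within a minute.
    cloudwatch_client.put_metric_alarm(
        AlarmName=f"{function_name}-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[alarm_topic_arn],
    )
```

Inside the handler, `log_json("isolated", instance_id=instance_id)` would produce a single-line JSON record for each action.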

Impact

This reduced:

MTTR from minutes to seconds
Human triage fatigue
Decision bottlenecks
Inconsistent containment actions

But the real improvement was consistency.

Humans improvise during incidents.
Code executes predictably.

Trade-Offs & Risks

Auto-isolating compute is not trivial.

You must consider:

False positives at high severity
Production-critical workloads
Stateful applications
Lateral movement that already happened before isolation
Multi-account architecture

Severity threshold tuning took longer than writing the Lambda function.

That surprised me.

Lessons Learned

Detection maturity does not equal response maturity.
Event-driven architecture scales better than polling-based remediation.
Idempotency is mandatory.
Multi-account containment becomes architecture work.
Automation exposes operational blind spots you didn’t know existed.

Next Iterations

If I evolve this into a more mature Cloud SOAR pattern:

Step Functions for multi-stage workflows
Automated EBS snapshot before isolation
Memory capture integration
Slack/Jira enrichment with context
Cross-account orchestration via AWS Organizations
GuardDuty central delegated admin integration

At that point, it becomes a response framework — not a script.
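As an illustration of the snapshot-before-isolation step, here is a minimal sketch that is not yet part of the pipeline. The function name and the tag keys `InstanceId` and `FindingId` are hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client.

```python
def snapshot_instance_volumes(ec2_client, instance_id, finding_id):
    """Start a snapshot of every EBS volume attached to the instance.

    Returns the snapshot IDs. create_snapshot returns while the snapshot is
    still pending, so callers that need completed snapshots must wait.
    """
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    snapshot_ids = []
    for mapping in instance.get("BlockDeviceMappings", []):
        ebs = mapping.get("Ebs")
        if not ebs:
            continue
        response = ec2_client.create_snapshot(
            VolumeId=ebs["VolumeId"],
            Description=f"Forensic snapshot of {instance_id} ({mapping['DeviceName']})",
            # Tags tie each snapshot back to the instance and the finding.
            TagSpecifications=[{
                "ResourceType": "snapshot",
                "Tags": [
                    {"Key": "InstanceId", "Value": instance_id},
                    {"Key": "FindingId", "Value": finding_id},
                ],
            }],
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

In a Step Functions workflow this would be its own state, with isolation running once the snapshot calls have been accepted.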

Final Thought

You don’t need a commercial SOAR platform to start automating response.

Start with:

Deterministic triggers
Guardrails
Observability
Explicit blast radius control

If detection isn’t wired to action, it’s just telemetry.
