DEV Community

Sesank Munukutla (Naga)

Event-Driven EC2 Isolation in AWS: Building a Minimal Cloud SOAR Without Buying One

Detection without response is operational noise.

GuardDuty alerts are valuable, but if a human has to read each one, decide, and manually isolate the instance, the window in which the blast radius can grow stays open.

I wanted high-confidence findings to trigger automatic containment.

So I built a minimal AWS-native SOAR pipeline.

No third-party tooling.

No overengineering.

Just deterministic, event-driven response.


🎯 Objective

Build an automated containment workflow that:

  • Responds only to high-severity GuardDuty findings
  • Automatically isolates compromised EC2 instances
  • Preserves forensic access
  • Avoids recursive execution
  • Is observable and debuggable

All event-driven. No polling. No manual trigger.


🏗 Architecture Overview

GuardDuty Finding
        ↓
EventBridge Rule (severity >= 7)
        ↓
Lambda Function (Isolation Logic)
        ↓
Modify EC2 Security Group → Quarantine SG
        ↓
SNS Notification (Visibility Layer)

Minimal. Deterministic. Cheap.
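
The last step of the flow, the SNS notification, could look roughly like the sketch below. This is not the post's actual code: `build_notification`, `notify`, and the chosen message fields are hypothetical, and `sns_client` is assumed to be a `boto3.client("sns")` passed in by the caller.

```python
import json


def build_notification(finding, instance_id, action):
    """Build an SNS subject and JSON body from a GuardDuty finding's detail."""
    # SNS subjects are limited to 100 characters, so truncate defensively.
    subject = f"[GuardDuty] {action}: {instance_id}"[:100]
    body = {
        "instance_id": instance_id,
        "action": action,
        "finding_type": finding.get("type"),
        "severity": finding.get("severity"),
        "finding_id": finding.get("id"),
    }
    return subject, json.dumps(body, indent=2)


def notify(sns_client, topic_arn, finding, instance_id, action):
    """Publish the containment result; sns_client is assumed to be boto3.client("sns")."""
    subject, message = build_notification(finding, instance_id, action)
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)
```

Passing the client in, rather than creating it inside the function, keeps the message-building logic testable without AWS credentials.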


Filtering at the Event Layer (Not Inside Lambda)

Instead of checking severity inside the Lambda function, I filtered directly in EventBridge.

Why this matters:

  • Reduces unnecessary Lambda invocations
  • Makes response criteria explicit
  • Improves audit clarity
  • Lowers operational cost

Example event pattern:

{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [ { "numeric": [">=", 7] } ]
  }
}

Only findings with a severity of 7 or higher (GuardDuty's High range and above) trigger automation.

Everything else remains visible, but is not auto-remediated.
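
If the rule is created from code rather than the console, a minimal sketch could look like this. The function names, the target `Id`, and the rule name are hypothetical; `events_client` is assumed to be a `boto3.client("events")`. The `source` key narrows the match to events that GuardDuty itself emits.

```python
import json


def build_guardduty_pattern(min_severity=7):
    """Return an EventBridge pattern for GuardDuty findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def create_isolation_rule(events_client, rule_name, lambda_arn, min_severity=7):
    """Create or update the rule and point it at the isolation Lambda.

    events_client is assumed to be boto3.client("events").
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_guardduty_pattern(min_severity)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "isolation-lambda" is an arbitrary, hypothetical target ID.
        Targets=[{"Id": "isolation-lambda", "Arn": lambda_arn}],
    )
```

The Lambda function also needs a resource-based permission that lets EventBridge invoke it; that step is not shown here.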

Quarantine Security Group Design

Containment is not termination.

Terminating an instance destroys forensic evidence.

My quarantine security group:

  • ❌ No outbound internet

  • ❌ No inbound from public IP ranges

  • ✅ Allow only SOC bastion IP

  • ✅ Allow forensic collection host

  • ✅ Optional: allow a monitoring endpoint (VPC Flow Logs need no rule; they are captured at the network interface regardless of security groups)

The goal is isolation with controlled investigation access.
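
One way to create such a group is sketched below, under several assumptions: the group name, the CIDR parameters, and the single SSH port are hypothetical, the VPC is IPv4-only, and `ec2_client` is a `boto3.client("ec2")`. It goes slightly further than "no outbound internet" by removing all egress; because security groups are stateful, replies to the allowed inbound connections still get through.

```python
def quarantine_ingress_rules(bastion_cidr, forensic_cidr, port=22):
    """Build IpPermissions that admit only the investigation hosts on one TCP port."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [
                {"CidrIp": bastion_cidr, "Description": "SOC bastion"},
                {"CidrIp": forensic_cidr, "Description": "Forensic collection host"},
            ],
        }
    ]


def create_quarantine_sg(ec2_client, vpc_id, bastion_cidr, forensic_cidr):
    """Create the quarantine group; ec2_client is assumed to be boto3.client("ec2")."""
    group_id = ec2_client.create_security_group(
        GroupName="quarantine",  # hypothetical name
        Description="Isolation group for compromised instances",
        VpcId=vpc_id,
    )["GroupId"]
    # A new group starts with an allow-all IPv4 egress rule; remove it.
    ec2_client.revoke_security_group_egress(
        GroupId=group_id,
        IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
    )
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=quarantine_ingress_rules(bastion_cidr, forensic_cidr),
    )
    return group_id
```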

Isolation Logic (Lambda Example)

Core logic:

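import boto3

ec2 = boto3.client('ec2')


def isolate_instance(instance_id, quarantine_sg_id):
    """Replace all of the instance's security groups with the quarantine group."""
    # Groups overwrites the existing list, so this one call both detaches the
    # original groups and attaches the quarantine group.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[quarantine_sg_id]
    )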

Additional safeguards added:

  • Check instance state before modification

  • Tag instance Quarantined=true

  • Exit if already isolated

  • Log original security groups for rollback

Containment must be idempotent.
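
The safeguards above might be combined along these lines. This is a sketch, not the post's actual function: `plan_isolation` and `isolate_instance_safely` are hypothetical names, the log fields are illustrative, and `ec2_client` is assumed to be a `boto3.client("ec2")`.

```python
import json

QUARANTINE_TAG = "Quarantined"  # tag key from the safeguards list


def plan_isolation(instance, quarantine_sg_id):
    """Decide what to do with one instance record from describe_instances.

    Returns ("skip", reason) or ("isolate", original_group_ids).
    """
    state = instance["State"]["Name"]
    if state in ("shutting-down", "terminated"):
        return ("skip", f"instance is {state}")
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    groups = [g["GroupId"] for g in instance.get("SecurityGroups", [])]
    if tags.get(QUARANTINE_TAG) == "true" or groups == [quarantine_sg_id]:
        return ("skip", "already isolated")
    return ("isolate", groups)


def isolate_instance_safely(ec2_client, instance_id, quarantine_sg_id):
    """Idempotent isolation; ec2_client is assumed to be boto3.client("ec2")."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    action, detail = plan_isolation(instance, quarantine_sg_id)
    if action == "skip":
        print(json.dumps({"event": "skipped", "instance_id": instance_id, "reason": detail}))
        return action
    # Log the original groups before changing anything so rollback is possible.
    print(json.dumps({"event": "isolating", "instance_id": instance_id,
                      "original_groups": detail}))
    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    ec2_client.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": QUARANTINE_TAG, "Value": "true"}],
    )
    return action
```

Keeping the decision in a pure function makes the "exit if already isolated" path easy to unit test without touching EC2.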

Idempotency: Preventing Recursive Triggers

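When the Lambda function modifies security groups, CloudTrail records the API calls, and those records can themselves surface as EventBridge events.

Without safeguards, a rule that matches them, or a repeat finding for the same instance, can re-invoke the function in a loop.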

Mitigation:

  • Tag check before modification

  • Structured event filtering

  • Explicit function logging

  • DLQ configured for failure cases

Automation that can repeat blindly is dangerous.
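
As a second line of defense behind the EventBridge pattern, the handler itself can refuse anything that is not a GuardDuty finding about an EC2 instance. The sketch below assumes the standard GuardDuty finding layout (`detail.resource.instanceDetails.instanceId`); the function names and return values are hypothetical.

```python
def extract_instance_id(event):
    """Return the EC2 instance ID from a GuardDuty finding event, or None."""
    if event.get("source") != "aws.guardduty":
        return None  # e.g. a CloudTrail-derived event caused by our own API calls
    if event.get("detail-type") != "GuardDuty Finding":
        return None
    resource = event.get("detail", {}).get("resource", {})
    if resource.get("resourceType") != "Instance":
        return None  # findings about IAM keys, S3 buckets, etc. are not ours to isolate
    return resource.get("instanceDetails", {}).get("instanceId")


def lambda_handler(event, context):
    """Entry point: ignore anything that is not an instance finding."""
    instance_id = extract_instance_id(event)
    if instance_id is None:
        return {"status": "ignored"}
    # The tag check and the isolation call would go here.
    return {"status": "accepted", "instance_id": instance_id}
```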

Failure Modes I Modeled

Automation amplifies mistakes.

I explicitly accounted for:

  • IAM permission drift

  • Partial security group modification

  • Concurrent findings on same instance

  • Cross-region GuardDuty setup

  • High-volume alert bursts

Mitigations:

  • Dead Letter Queue

  • Lambda concurrency limits

  • CloudWatch error metrics + alarms

  • Explicit structured logs (JSON format)

  • Permission boundary controls

Automation without observability becomes silent failure.
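
Two of these mitigations, the concurrency limit and the error alarm, can be applied from code. The sketch below is an assumption-laden illustration: the alarm name, the one-minute period, the default limit of 5, and the function names are hypothetical, and the clients are assumed to be `boto3.client("lambda")` and `boto3.client("cloudwatch")`.

```python
def error_alarm_params(function_name, sns_topic_arn):
    """Parameters for a CloudWatch alarm that fires on any Lambda error."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No invocations means no data points; do not treat that as a failure.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def apply_guardrails(lambda_client, cloudwatch_client, function_name,
                     sns_topic_arn, max_concurrency=5):
    """Cap concurrency and alarm on errors for the isolation function."""
    # Reserved concurrency bounds how many isolations can run during an alert burst.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrency,
    )
    cloudwatch_client.put_metric_alarm(**error_alarm_params(function_name, sns_topic_arn))
```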

Impact

This reduced:

  • MTTR from minutes to seconds

  • Human triage fatigue

  • Decision bottlenecks

  • Inconsistent containment actions

But the real improvement was consistency.

Humans improvise during incidents.
Code executes predictably.

Trade-Offs & Risks

Auto-isolating compute is not trivial.

You must consider:

  • False positives at high severity

  • Production-critical workloads

  • Stateful applications

  • Lateral movement that happened before isolation

  • Multi-account architecture

Severity threshold tuning took longer than writing the Lambda function.

That surprised me.

Lessons Learned

  1. Detection maturity does not equal response maturity.

  2. Event-driven architecture scales better than polling-based remediation.

  3. Idempotency is mandatory.

  4. Multi-account containment becomes architecture work.

  5. Automation exposes operational blind spots you didn’t know existed.

Next Iterations

If I evolve this into a more mature Cloud SOAR pattern:

  • Step Functions for multi-stage workflows

  • Automated EBS snapshot before isolation

  • Memory capture integration

  • Slack/Jira enrichment with context

  • Cross-account orchestration via AWS Organizations

  • GuardDuty central delegated admin integration

At that point, it becomes a response framework — not a script.
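
As an illustration of one of these items, taking EBS snapshots before isolation could start from something like the sketch below. It has not been built as part of this pipeline; the function names, the `SourceInstance` tag, and the description text are hypothetical, and `ec2_client` is assumed to be a `boto3.client("ec2")`.

```python
def ebs_volume_ids(instance):
    """List the EBS volume IDs attached to one describe_instances record."""
    return [
        mapping["Ebs"]["VolumeId"]
        for mapping in instance.get("BlockDeviceMappings", [])
        if "Ebs" in mapping
    ]


def snapshot_instance_volumes(ec2_client, instance_id):
    """Start a snapshot of every attached EBS volume and return the snapshot IDs."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    snapshot_ids = []
    for volume_id in ebs_volume_ids(instance):
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"Forensic snapshot of {volume_id} from {instance_id}",
            TagSpecifications=[{
                "ResourceType": "snapshot",
                "Tags": [{"Key": "SourceInstance", "Value": instance_id}],
            }],
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

`create_snapshot` returns as soon as the snapshot is started, so the isolation step does not have to wait for it to complete.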

Final Thought

You don’t need a commercial SOAR platform to start automating response.

Start with:

  • Deterministic triggers

  • Guardrails

  • Observability

  • Explicit blast radius control

If detection isn’t wired to action, it’s just telemetry.
