Telemetry-Driven Network Automation: From Metrics to Action

  • Collect and Normalize: Build a Single Source of Network Telemetry Truth
  • From Signals to Decisions: Design Alerting, Policies, and Risk Models
  • Implement Closed-Loop Automation: Safe Automated Remediation
  • Scale and Control Costs: Telemetry Pipelines, Storage, and Tradeoffs
  • Practical Application: Playbooks, Checklists, and Example Code

Network telemetry is the nervous system for modern networks; collecting counters without turning them into decisions simply creates noise and cost. You need a streaming telemetry backbone, a normalized model layer, and a decision plane that turns observability into action — fast, auditable, and safe.

The friction you feel is familiar: hundreds of device-specific counters, multiple flow protocols, alert storms, long MTTR, and hand-operated remediation that either takes too long or causes collateral damage. Teams waste cycles stitching vendor formats together and end up making conservative change decisions or reverting to risky manual fixes when a high-severity alert arrives. Observability without a consistent data model and decision logic delivers neither confidence nor speed. The best practice is to treat telemetry as data you can operate on — not as a notification stream to be archived.

Collect and Normalize: Build a Single Source of Network Telemetry Truth

You must collect from diverse sources — counter metrics, flow streams, and model-driven state — and convert them into a consistent schema before analytics or automation can consume them at scale.

  • Sources you will encounter

    • Model-driven streaming (gNMI/OpenConfig): Push-oriented, rich state and config; ideal for operational telemetry and device state. gNMI defines the subscription semantics and OpenConfig provides standardized schemas, so you don’t have to parse vendor CLI output.
    • Flow records (IPFIX/NetFlow): Flow-level records for top-talkers and traffic engineering; useful for DDoS detection, capacity planning, and application-level analytics. IPFIX is the standards-based flow export format.
    • Packet-sampling (sFlow): Low-cost, high-speed statistical sampling useful for aggregate traffic patterns and DDoS detection at wire speed.
    • Traditional SNMP / syslog: Still valuable for basic counters and alarms; useful where streaming agents are not available.
  • Normalize with an explicit model

    • Adopt OpenConfig / YANG where possible so telemetry streams share node names, paths, and semantics across vendors. Use gNMI subscriptions to stream the OpenConfig sensor paths you care about. That makes downstream rule writing (and automation) stable across platforms.
    • Use an intermediate collector/adapter (examples: gnmic, pygnmi, the Telegraf gNMI plugin, OpenTelemetry Collector) to translate native device payloads into normalized metrics, JSON events, or Prometheus metrics. These tools let you perform early transformations (drop, rename, aggregate) at ingestion time so you never store every device counter verbatim. A minimal gnmic configuration sketch follows this list.
  • On-device and edge preprocessing

    • Push aggregation and on-change subscriptions to devices where the hardware supports them (dial-out telemetry or ON_CHANGE subscriptions). That reduces network and collector load and keeps high-resolution telemetry only for the signals that change. Most modern NOSes support dial-out streaming with configurable sensor paths and ON_CHANGE modes; consult vendor guides for platform specifics.
    • Use the collector to apply sampling, rollups, and label normalization. For Prometheus-style consumers, convert complex state into numeric gauges or counters that Prometheus understands; for analytics clusters, convert telemetry into structured events.
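
As a sketch of this ingestion layer, the gnmic configuration below subscribes to OpenConfig interface counters and exposes them as Prometheus metrics. The target address, credentials, sample interval, and listen port are illustrative placeholders; check the gnmic documentation for the options your version supports.

# gnmic.yaml: subscribe to interface counters, export to Prometheus (sketch)
targets:
  "10.0.0.10:57400":           # placeholder device address
    username: admin
    password: REDACTED
    skip-verify: true          # lab only; use proper TLS in production
subscriptions:
  if-counters:
    paths:
      - /interfaces/interface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 10s
outputs:
  prom:
    type: prometheus
    listen: ":9804"

Run it with gnmic --config gnmic.yaml subscribe; the exporter then becomes a normal Prometheus scrape target.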

Important: Normalize early — the costs of chasing dozens of ad-hoc device-specific metrics balloon as pipelines and dashboards multiply. Instrument once at ingestion and use consistent labels downstream.

From Signals to Decisions: Design Alerting, Policies, and Risk Models

Telemetry becomes useful when it reliably drives decisions — not when it triggers pages ad infinitum.

  • Architect a decision plane, not just alerts

    • Separate detection (signal processing) from decision (policy). Detection produces candidate incidents (anomalies, threshold breaches). Decision applies context: maintenance windows, SLO impact, recent configuration changes, and change freeze policies. Tie detection outputs to a risk score before remediation is allowed. This avoids reflex automation on noisy signals.
    • Encode policies as machine-readable rules: severity labels, remediation tags, and allowed actions. Keep runbook links and remediation playbook identifiers in alert annotations so the decision engine can select the correct workflow.
  • Practical alert design (what works)

    • Use multi-window detection: short-window spikes + medium-window sustained thresholds + baseline/anomaly checks. An alert keyed to a short spike alone tends to flap, while one keyed only to a long sustained window stays silent until damage is done; combine both tests in your rules (a sketch follows this list). Prometheus-style alerting supports "for" durations and grouped rules that reduce noise.
    • Control cardinality: do not create labels with high-cardinality values unless you will query on them. Cardinality explosions kill query performance and memory in Prometheus-style systems. Apply relabeling, label value bucketing, or drop high-cardinality labels at ingestion.
  • Example of policy attributes (kept as labels/annotations)

    • severity, remediation: auto, remediation: human, maintenance_window_allowed, service_slo_impact, rollback_playbook_id.
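
To make the multi-window pattern concrete, here is a minimal rule sketch that fires only when a short-window spike and a sustained medium-window violation agree. Metric names and thresholds are illustrative, not taken from a real deployment.

groups:
- name: network.multiwindow
  rules:
  - alert: InterfaceErrorsSustained
    expr: >
      rate(interface_input_errors_total[2m]) > 10
        and
      rate(interface_input_errors_total[20m]) > 5
    for: 5m
    labels:
      severity: warning
      remediation: human
    annotations:
      summary: "Sustained input errors on {{ $labels.interface }} ({{ $labels.device }})"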

Implement Closed-Loop Automation: Safe Automated Remediation

Closed-loop automation takes a detect → decide → act → verify → audit path and makes it repeatable, observable, and reversible.

  • The canonical closed-loop sequence

    1. Detect using streaming telemetry and analytics.
    2. Score the incident (risk + SLO impact + change context).
    3. Decide: abort, human-in-loop, or auto-remediate (with throttles).
    4. Act: call the automation engine (Ansible, Nornir, NAPALM, or a gNMI client) through an orchestrator that enforces idempotency and transactional semantics.
    5. Verify: read back the same telemetry that triggered the action to confirm remediation.
    6. Rollback automatically on failed verification or escalate to human ops.
    7. Audit: store the telemetry + action + verification as an immutable run record.
  • Safety-first implementation patterns

    • Use canaries and scope limits. If a rule would act on multiple devices, require progressive application (canary on one device, validate, then scale out).
    • Require multi-signal confirmation for disruptive actions (e.g., combine interface error counters + packet drops + syslog entries before shutting a link).
    • Keep playbooks idempotent and include dry-run and check modes in your automation. Use NETCONF/gNMI transactional semantics where available.
    • Add time fences: perform auto-remediations only outside strict change freezes or within approved maintenance windows.
  • Example architecture choices for action execution

    • Use Alertmanager webhook → orchestration service (a small HTTP microservice or Kubernetes Job) → automation executor (Ansible, AWX/Tower, Nornir, or direct pygnmi calls). Prometheus Alertmanager supports webhook receivers natively, and a receiver can launch Kubernetes Jobs or trigger Ansible runs.
  • Minimal, practical remediation example

    • Use telemetry to detect a sustained interface error spike.
    • Decision layer verifies no maintenance window and that multiple telemetry signals agree.
    • Orchestrator runs a pre-validated playbook that (1) disables the flapping spanning-tree feature or (2) briefly bounces the port (with canary and rollback). Always verify via the same telemetry stream before marking the incident resolved; a conceptual verify-and-rollback loop follows this list.
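
To make the verify-and-rollback steps concrete, here is a conceptual Python sketch. The helper functions are hypothetical stubs standing in for your automation executor and telemetry query layer, and the threshold and settle time are illustrative.

import time

ERROR_RATE_OK = 1.0    # illustrative errors/sec threshold (hypothetical)
VERIFY_WAIT_S = 120    # settle time before reading telemetry back

def run_playbook(name: str, **extra_vars) -> None:
    """Hypothetical stub: trigger Ansible/AWX/Nornir with extra vars."""

def read_error_rate(device: str, interface: str) -> float:
    """Hypothetical stub: query the same metric that raised the alert."""
    return 0.0

def remediate_with_verification(device: str, interface: str) -> bool:
    # Act: run the pre-validated, idempotent playbook.
    run_playbook('remediate_interface.yml', device=device, interface=interface)
    time.sleep(VERIFY_WAIT_S)  # let counters settle before verifying
    # Verify: read back the same telemetry signal that triggered the action.
    if read_error_rate(device, interface) < ERROR_RATE_OK:
        return True   # verified; record the run in the audit trail
    # Rollback on failed verification, then escalate to a human.
    run_playbook('rollback_interface.yml', device=device, interface=interface)
    return False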

Scale and Control Costs: Telemetry Pipelines, Storage, and Tradeoffs

Scaling telemetry is not just a technical problem; it’s a financial one. The three levers you control are resolution, cardinality, and retention.

| Choice | Typical behavior | Cost/scale note |
| --- | --- | --- |
| High-frequency, high-cardinality metrics in Prometheus TSDB | Excellent realtime alerting and dashboards | Memory and CPU scale with active series; cardinality is the dominant cost. |
| Push + long-term storage (Thanos/Cortex) | Remote-write into a cluster that stores into object storage with downsampling | Enables long retention and global queries, but needs receive/ingest and compaction components; use for capacity planning and postmortems. |
| Kafka/message bus as buffer | Durable decoupling between collectors and processors | Good for large, variable ingest; useful when many downstream consumers (analytics, security, automation). |
| Flow/sFlow collectors | Low-latency traffic visibility with sampling | Low resource use on devices, but sample rate affects accuracy; use for DDoS detection and top-talkers. |
  • Cardinality is the primary scaling risk

    • Each unique label combination becomes a time series in Prometheus-style systems; uncontrolled cardinality leads to memory exhaustion and slow queries. Use relabeling, bucketing, and label whitelists at ingestion to control active series (a relabeling sketch follows this list).
    • Consider tiering: keep high-resolution recent metrics in local Prometheus for 7–30 days, and remote-write to Thanos/Cortex for long-term storage with downsampling and longer retention to reduce cost.
  • Pipeline patterns that buy scale

    • Gateway Collectors / OTel Gateways: run collectors as gateways and do sampling, filtering, and routing there so that backends only see what they need. The OpenTelemetry Collector supports pipelines that receive, process, and export multiple telemetry types.
    • Message bus (Kafka) between collectors and processors when ingestion bursts are large or you have many consumers — it decouples the system and provides back-pressure handling and replayability.
    • Adaptive metrics: track which metrics are actually used for alerts/dashboards and automatically reduce retention or lower resolution for unused series. This is becoming a standard approach to cost control.
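
As a sketch of ingestion-time cardinality control, the metric_relabel_configs below drop a hypothetical high-cardinality label and keep only the interface metrics you actually alert on. Metric and label names are illustrative.

scrape_configs:
- job_name: gnmi_collectors
  static_configs:
  - targets: ['collector-1:9804']   # placeholder collector address
  metric_relabel_configs:
  # Drop a hypothetical per-flow label that would explode series counts.
  - regex: flow_id
    action: labeldrop
  # Keep only the interface metrics used by alerts and dashboards.
  - source_labels: [__name__]
    regex: 'interface_(input|output)_(errors|discards)_total'
    action: keep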

Practical Application: Playbooks, Checklists, and Example Code

This section gives concrete steps, safety checklists, and compact examples to get a working observability-driven automation flow running in weeks — not quarters.

Checklist — minimum viable observability-driven automation

  • Inventory devices and available telemetry (gNMI/OpenConfig, SNMP, NetFlow/IPFIX, sFlow).
  • Map each operational concern (errors, utilization, BGP flaps, packet drops) to a telemetry signal and an SLO or threshold.
  • Select a normalization layer (OpenConfig/gNMI where available; OTel Collector or gnmic for transform).
  • Implement detection rules and classify alerts by actionable tag (auto, human, investigate).
  • Build a decision engine that checks maintenance windows, recent changes, and SLO impact before permitting remediation.
  • Create idempotent automation playbooks and test them in a sandbox. Add automated rollback and verification steps.
  • Add audit trails: log who/what triggered a run, the telemetry that caused it, and the verification metrics post-action.

Step-by-step protocol (short)

  1. Enable gNMI streaming for target sensor paths and route to your collector (or configure gnmic/telegraf to subscribe). Use OpenConfig paths for vendor-neutral naming.
  2. In the collector, apply processors:
    • normalization (rename paths → stable metric names)
    • deduplication
    • relabeling (drop or bucket risky labels)
    • aggregation/downsample for long-term storage.
  3. Send time-series metrics to Prometheus for realtime alerting, and remote-write to a Thanos/Cortex cluster for retention and analytics (a remote_write sketch appears with the examples below).
  4. Implement PromQL rules that emit alerts with annotations carrying remediation and playbook_id.
  5. Configure Alertmanager to route alerts to a webhook that hits your orchestrator. Use a webhook receiver that can instantiate a Kubernetes Job or call AWX/Tower.
  6. Orchestrator validates policy gates (no maintenance window, risk acceptable) and either queues a human review or triggers automation agents (Ansible / pygnmi).
  7. Automation performs remediation, then the orchestrator reads back telemetry to confirm success. On failed verification, automatically run rollback or escalate to on-call.
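
Example — Prometheus remote_write to long-term storage (YAML)

A minimal sketch for step 3, assuming a Thanos Receive (or Cortex/Mimir) endpoint; the URL and the relabel rule are illustrative placeholders.

remote_write:
- url: "http://thanos-receive.example.local:19291/api/v1/receive"   # placeholder endpoint
  queue_config:
    max_samples_per_send: 5000
    batch_send_deadline: 5s
  write_relabel_configs:
  # Optionally drop noisy series before they leave the box.
  - source_labels: [__name__]
    regex: 'debug_.*'
    action: drop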

Example — Prometheus rule (YAML)

groups:
- name: network.rules
  rules:
  - alert: InterfaceHighErrorRate
    expr: >
      increase(interface_input_errors_total{job="gnmi_collectors"}[5m]) > 1000
    for: 5m
    labels:
      severity: critical
      remediation: 'auto-shutdown'
    annotations:
      summary: "Interface {{ $labels.interface }} on {{ $labels.device }} exceeded error threshold"
      runbook: "https://runbooks.example.com/interface-errors"

(Use conservative "for" windows and multi-signal checks in the decision layer to avoid acting on transient spikes.)

Example — Alertmanager webhook receiver (snippet)

receivers:
- name: automation-webhook
  webhook_configs:
  - url: 'https://orchestrator.company.local/api/v1/alerts'
    send_resolved: true

Alertmanager sends structured JSON to an orchestrator which applies policy checks (maintenance windows, recent config changes) before running a remediation.
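
For reference, an abridged sketch of that payload (Alertmanager's JSON webhook format, version 4); device and interface values are illustrative.

{
  "version": "4",
  "status": "firing",
  "receiver": "automation-webhook",
  "groupLabels": { "alertname": "InterfaceHighErrorRate" },
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InterfaceHighErrorRate",
        "device": "edge-sw-01",
        "interface": "Ethernet1",
        "severity": "critical",
        "remediation": "auto-shutdown"
      },
      "annotations": { "runbook": "https://runbooks.example.com/interface-errors" },
      "startsAt": "2025-01-01T00:00:00Z"
    }
  ]
}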

Example — minimal orchestration webhook (conceptual, Python)

# conceptual excerpt: validate inputs, apply policy gates, then trigger playbook
from flask import Flask, request
import subprocess
import threading

app = Flask(__name__)

@app.route('/api/v1/alerts', methods=['POST'])
def webhook():
    payload = request.get_json(silent=True) or {}
    for a in payload.get('alerts', []):
        labels = a.get('labels', {})
        device = labels.get('device')
        interface = labels.get('interface')
        # Policy gate: only auto-run when the rule opted in and the labels
        # we need are present; everything else is left for a human.
        if labels.get('remediation') != 'auto-shutdown' or not (device and interface):
            continue
        # Fire-and-forget for illustration only; the orchestrator must still
        # enforce maintenance windows, risk scoring, and rate limits.
        threading.Thread(target=subprocess.call, args=([
            'ansible-playbook', 'remediate_interface.yml',
            '--extra-vars', f"device={device} interface={interface}"
        ],), daemon=True).start()
    return '', 202

Prefer job queues and asynchronous execution; never block the webhook handler.
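
A slightly safer pattern, as a sketch: the handler only enqueues work, and a single worker drains the queue, giving you a natural serialization and throttle point. This is illustrative; a production orchestrator would use a real job system (AWX, Celery, or Kubernetes Jobs).

import queue
import subprocess
import threading

# Bounded queue provides backpressure when remediations pile up.
jobs: queue.Queue = queue.Queue(maxsize=100)

def worker() -> None:
    # Single consumer serializes remediations, one at a time.
    while True:
        device, interface = jobs.get()
        subprocess.call([
            'ansible-playbook', 'remediate_interface.yml',
            '--extra-vars', f"device={device} interface={interface}"
        ])
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# In the webhook handler, replace the Thread(...) call with:
#     jobs.put_nowait((device, interface))  # raises queue.Full when saturated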

Example — using pygnmi to set a simple config (conceptual)

from pygnmi.client import gNMIclient

# Administratively disable an interface via its OpenConfig config container.
target = ('10.0.0.10', 57400)
with gNMIclient(target=target, username='admin', password='REDACTED',
                insecure=True) as gc:  # lab only; use TLS in production
    update = [(
        '/interfaces/interface[name=Ethernet1]/config',
        {'enabled': False}
    )]
    resp = gc.set(update=update)
    print(resp)

Use pygnmi for direct, model-driven changes where the device supports gNMI and the change is part of your tested playbook.
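
A matching read-back, as a sketch: after any set, confirm the operational state via the corresponding OpenConfig state path (same placeholder target and client options as above).

from pygnmi.client import gNMIclient

target = ('10.0.0.10', 57400)
with gNMIclient(target=target, username='admin', password='REDACTED',
                insecure=True) as gc:
    # Verify against operational state, not just the applied config.
    resp = gc.get(path=['/interfaces/interface[name=Ethernet1]/state/oper-status'],
                  encoding='json')
    print(resp)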

Safety callout: Always include verification steps that use the same telemetry path that detected the problem. Automations must be reversible and logged; never assume a single telemetry signal is the sole truth.

Sources:
gNMI specification (OpenConfig) - Defines the gNMI protocol and subscription semantics used for model-driven streaming telemetry and configuration.

Prometheus Alerting & Configuration - Prometheus/Alertmanager rule and webhook formats, best practices for alert routing and receivers.

RFC 7011 — IP Flow Information Export (IPFIX) - Standards document describing flow export format for NetFlow/IPFIX telemetry.

Junos Telemetry Interface (JTI) — Juniper Networks - Vendor guidance on streaming telemetry modes and data models (gNMI, gRPC, UDP).

Thanos Receive / Architecture - Long-term storage options for Prometheus via remote-write, downsampling, and scaling considerations.

Grafana Labs — Observability Survey & State of Observability (2025) - Industry survey findings on Prometheus/OpenTelemetry adoption, alert fatigue, and cost control priorities.

OpenTelemetry Collector (Documentation) - Collector architecture for receiving, processing, and exporting telemetry; patterns for scaling pipelines.

Cardinality Control — Prometheus best practices (Compile N Run) - Practical guidance on why and how to reduce metric cardinality.

Ansible network NETCONF & netconf_config module docs - How to use Ansible network modules for device configuration and NETCONF connections.

Confluent — Monitoring and Observability for Kafka Clusters - Using Kafka as a durable buffer for telemetry pipelines and patterns for monitoring Kafka itself.

pygnmi — Python gNMI client (GitHub / PyPI) - Python client for gNMI get, set, and subscribe RPCs; useful for model-driven remediation.

NetFlow vs sFlow — Kentik Blog - Comparison of flow telemetry formats and their scalability/accuracy tradeoffs.

OpenConfig data models (OpenConfig project) - The OpenConfig YANG model library and schema documentation for consistent telemetry names.

alertmanager-webhook-receiver (example GitHub) - Example of a webhook receiver that converts Alertmanager webhooks into jobs (pattern for automation orchestration).
