Monitoring Tools: Grafana vs Datadog vs New Relic
Introduction
Effective monitoring is the difference between discovering incidents through user complaints and catching them proactively through dashboards and alerts. The three dominant platforms in the observability space (Grafana, Datadog, and New Relic) each take a distinct approach to metrics, logging, tracing, and alerting. This article provides a technical comparison to guide your selection.
Dashboarding Capabilities
Grafana
Grafana excels at visualization with support for dozens of data sources:
{
  "dashboard": {
    "title": "Production Overview",
    "panels": [
      {
        "title": "HTTP Request Rate",
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }]
      },
      {
        "title": "Service Latency (p99)",
        "type": "stat",
        "datasource": "Tempo",
        "targets": [{
          "query": "{ name = \"HTTP GET\" } | quantile_over_time(duration, 0.99) by (resource.service.name)"
        }]
      },
      {
        "title": "Error Budget",
        "type": "gauge",
        "datasource": "Prometheus",
        "targets": [{
          "expr": "(1 - (sum(rate(http_requests_total{status=~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) * 100"
        }],
        "thresholds": {
          "steps": [
            {"value": null, "color": "red"},
            {"value": 99.9, "color": "yellow"},
            {"value": 99.99, "color": "green"}
          ]
        }
      }
    ]
  }
}
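Dashboards like this are usually checked into version control and pushed through Grafana's HTTP API rather than edited by hand. Here is a minimal sketch, assuming the JSON above is saved as production-overview.json and a service-account token is available in GRAFANA_TOKEN (both placeholder names):
import json
import os

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")

# Load the dashboard JSON shown above (placeholder filename)
with open("production-overview.json") as f:
    payload = json.load(f)

# The payload already has the {"dashboard": {...}} shape the API expects
payload["overwrite"] = True  # update in place on re-runs

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
    json=payload,
)
resp.raise_for_status()
print("Provisioned:", resp.json()["url"])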
Datadog
Datadog provides a more opinionated dashboarding experience with integrated template variables:
{
  "title": "Service Overview",
  "template_variables": [
    {"name": "env", "prefix": "env", "default": "production"}
  ],
  "widgets": [{
    "definition": {
      "type": "timeseries",
      "requests": [{
        "q": "avg:http.requests{service:payment,$env} by {endpoint}.as_rate()",
        "display_type": "line",
        "style": {"palette": "warm"}
      }],
      "yaxis": {"scale": "linear", "min": "auto"}
    }
  }]
}
New Relic
New Relic uses NRQL, a SQL-like query language for dashboards:
-- NRQL query
SELECT percentile(duration, 99) AS 'p99'
FROM Transaction
WHERE appName = 'Payment Service'
TIMESERIES auto
SINCE 1 hour ago
-- Error rate query
SELECT count(*) AS 'errors'
FROM TransactionError
WHERE appName = 'Payment Service'
FACET error.message
LIMIT 10
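The same NRQL can also be run programmatically through NerdGraph, New Relic's GraphQL API, which is handy for wiring query results into scripts or CI checks. A minimal sketch, assuming a user API key in NR_API_KEY and an account ID in NR_ACCOUNT_ID (placeholder names):
import os

import requests

NERDGRAPH = "https://api.newrelic.com/graphql"

# GraphQL wrapper around the p99 NRQL query shown above
gql = """
query ($accountId: Int!, $nrql: Nrql!) {
  actor { account(id: $accountId) { nrql(query: $nrql) { results } } }
}
"""

nrql = (
    "SELECT percentile(duration, 99) AS 'p99' "
    "FROM Transaction WHERE appName = 'Payment Service' "
    "TIMESERIES auto SINCE 1 hour ago"
)

resp = requests.post(
    NERDGRAPH,
    headers={"API-Key": os.environ["NR_API_KEY"]},
    json={
        "query": gql,
        "variables": {"accountId": int(os.environ["NR_ACCOUNT_ID"]), "nrql": nrql},
    },
)
resp.raise_for_status()
print(resp.json()["data"]["actor"]["account"]["nrql"]["results"])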
Alerting Configuration
Grafana Alerting
# Grafana-managed alert rule (file-provisioning format)
apiVersion: 1
groups:
  - name: payment-alerts
    folder: Payments
    interval: 1m
    rules:
      - title: HighErrorRate
        condition: C  # the ref below that decides whether the alert fires
        for: 5m
        annotations:
          summary: "Error rate above threshold for Payment Service"
          runbook_url: "https://runbooks.internal/payment-high-errors"
        labels:
          severity: critical
          team: platform
        data:
          # A: raw PromQL ratio of 5xx requests to all requests
          - refId: A
            datasourceUid: prometheus
            model:
              expr: |
                sum(rate(http_requests_total{service="payment", status=~"5.."}[5m]))
                  / sum(rate(http_requests_total{service="payment"}[5m]))
          # B: reduce the series to a single number
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              reducer: last
              expression: A
          # C: fire when the error ratio exceeds 5%
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
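Grafana evaluates the data refs as a pipeline: A runs the PromQL query, B reduces the resulting series to a single number, and C applies the threshold. The ref named in condition (here C) is the one that actually determines the alert state.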
Datadog Monitors
# Datadog monitor via API
monitor:
  name: "[Payment] High Latency Alert"
  type: "metric alert"
  query: "avg(last_5m):p99:trace.servlet.request.duration{service:payment} > 1"
  message: |
    {{#is_alert}}
    Payment service p99 latency is {{value}}s (threshold: 1s)
    @slack-alerts
    {{/is_alert}}
  options:
    thresholds:
      critical: 1.0
      warning: 0.5
    notify_no_data: true
    evaluation_delay: 60
    new_group_delay: 300
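Monitors are usually managed as code as well. Here is a sketch using the legacy but widely deployed datadog Python client (the newer datadog-api-client package talks to the same endpoint); DD_API_KEY and DD_APP_KEY are placeholders for your credentials:
import os

from datadog import api, initialize

initialize(
    api_key=os.environ["DD_API_KEY"],
    app_key=os.environ["DD_APP_KEY"],
)

# Create the monitor defined above; returns the monitor as a dict
monitor = api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):p99:trace.servlet.request.duration{service:payment} > 1",
    name="[Payment] High Latency Alert",
    message="{{#is_alert}} Payment p99 latency is {{value}}s @slack-alerts {{/is_alert}}",
    options={
        "thresholds": {"critical": 1.0, "warning": 0.5},
        "notify_no_data": True,
        "evaluation_delay": 60,
        "new_group_delay": 300,
    },
)
print("Created monitor", monitor["id"])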
APM and Distributed Tracing
Datadog APM
from ddtrace import patch_all, tracer

# Auto-instrument supported libraries (Flask, requests, psycopg2, ...)
patch_all()

# Custom instrumentation: wrap the function in its own span
@tracer.wrap(service="payment-service")
def process_payment(order_id, amount):
    # Nested span around the external gateway call
    with tracer.trace("payment.charge") as span:
        span.set_tag("order_id", order_id)
        span.set_metric("amount", amount)
        result = gateway.charge(amount)  # gateway: your payment client
        span.set_tag("transaction_id", result.id)
        return result
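If you only need library-level spans, you can skip the decorator entirely and launch the service under the agent wrapper, e.g. ddtrace-run python app.py, which applies the same auto-instrumentation as patch_all() without code changes.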
New Relic APM
import newrelic.agent

# Custom transaction for work that happens outside a web request
@newrelic.agent.background_task()
def process_refund(transaction_id):
    # Named segment within the transaction for the gateway call
    with newrelic.agent.FunctionTrace(name="refund.gateway_reversal"):
        newrelic.agent.add_custom_attribute("transaction_id", transaction_id)
        result = gateway.refund(transaction_id)  # gateway: your payment client
    return result
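As with Datadog, the agent can also attach without code changes: point NEW_RELIC_CONFIG_FILE at your newrelic.ini and launch the process with newrelic-admin run-program python worker.py.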