Prometheus and Grafana are the backbone of modern observability stacks. Whether you're targeting a Senior DevOps or SRE role, this guide covers everything from fundamentals to production-grade operations — written for engineers operating at the 6+ years of experience level.
Chapter 1: Foundations & Philosophy
The Observability Triad
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars:
- Metrics — Numeric time-series data. Cheapest to store, best for dashboards and alerting. Prometheus specializes here.
- Logs — Timestamped records of events. High cardinality, expensive but rich in detail. (ELK, Loki)
- Traces — Request-level journey across distributed systems. (Jaeger, Tempo, Zipkin)
Senior Engineer Rule: Metrics catch WHAT is broken. Logs tell you WHY. Traces show WHERE.
Monitoring vs Observability
- Monitoring = watching predefined things you already know to watch.
- Observability = being able to ask questions you didn't know you'd need to ask.
- Legacy monitoring (Nagios, Zabbix): push-based, check-based, host-centric.
- Prometheus: pull-based, metrics-centric, service discovery-native, cloud-native first.
Why Prometheus won:
- Born inside SoundCloud (2012), donated to CNCF (2016)
- Second CNCF graduated project after Kubernetes
- Native integration with the cloud-native ecosystem
- PromQL is expressive and composable
- Federation and remote write enable scale-out
The Four Golden Signals (SRE Bible)
From Google's SRE Book — the minimum you must monitor for any service:
| Signal | Description | Example PromQL |
|---|---|---|
| Latency | How long requests take | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
| Traffic | How much demand (RPS) | `rate(http_requests_total[5m])` |
| Errors | Rate of failed requests | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Saturation | How "full" your service is | `1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))` |
RED Method & USE Method
RED Method (for microservices — Tom Wilkie, Grafana Labs):
- Rate — requests per second
- Errors — error rate
- Duration — latency distribution
USE Method (for infrastructure — Brendan Gregg):
- Utilization — % time resource is busy
- Saturation — extra work queued
- Errors — error events count
Rule of thumb: Use RED for services, USE for hosts/infrastructure.
Chapter 2: Prometheus Architecture
Core Components
Prometheus Server — The brain. Scrapes, stores, evaluates rules.
- Scrape engine: HTTP GET /metrics on targets
- TSDB: time-series database (local disk, not distributed)
- Rule evaluator: recording & alerting rules
Pushgateway — For short-lived jobs (batch, crons) that can't be scraped. Push metrics here; Prometheus scrapes the gateway. Do not use as a general intermediary.
Alertmanager — Receives alerts, deduplicates, groups, routes, and silences. Routes to PagerDuty, Slack, OpsGenie, email, etc.
Critical Exporters (each is just another scrape target — see the sketch below):
- `node_exporter` → OS metrics (CPU, memory, disk, network)
- `kube-state-metrics` → Kubernetes object states
- `blackbox_exporter` → probe HTTP/DNS/ICMP externally
- `mysqld_exporter` / `postgres_exporter` → database metrics
- `kafka_exporter` → consumer group lag
- `redis_exporter` → Redis memory, hit rate, replication
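A minimal static scrape config for node_exporter, as a sketch (the hostname is illustrative; 9100 is the exporter's default port):

```yaml
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['node-exporter:9100']   # node_exporter's default port
```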
Pull vs Push Model
Prometheus pulls (scrapes) metrics from targets.
Advantages of pull:
- Prometheus controls scrape interval — prevents metric floods
- Easy to detect if a target is down (scrape fails)
- Simpler firewall rules (Prometheus initiates)
When pull doesn't work:
- Short-lived batch jobs → use Pushgateway
- Network segmented environments → use Grafana Agent / Agent Mode
- Massive scale (100k+ targets) → use federation or Thanos
TSDB Internals
- Blocks: immutable 2-hour blocks on disk
- Head block: the most recent ~2h held in memory (chunks mmap'd to disk)
- Compaction: blocks merged into larger blocks over time
- Retention config: `--storage.tsdb.retention.time=30d`
⚠️ CRITICAL — Cardinality: High-cardinality labels (`user_id`, `request_id`, `email`) WILL kill Prometheus. Labels must have bounded, predictable value sets. This is the #1 production issue.
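As a guard rail, Prometheus can fail a scrape that exceeds per-target limits. A sketch (the limit values are illustrative; `label_limit` needs a reasonably recent Prometheus release):

```yaml
scrape_configs:
  - job_name: 'api'
    sample_limit: 50000   # treat the whole scrape as failed if the target exposes more samples
    label_limit: 30       # maximum number of labels allowed per sample
```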
Service Discovery
Static configs are for demos. Production uses service discovery:
- `kubernetes_sd_configs` — pods, services, endpoints, nodes, ingresses
- `ec2_sd_configs` — AWS EC2 instances
- `consul_sd_configs` — HashiCorp Consul
- `file_sd_configs` — custom SD via JSON/YAML files
relabel_configs — Transform discovered metadata into labels BEFORE scraping.
metric_relabel_configs — Transform metrics AFTER scraping (drop, rename, filter).
```yaml
# Drop a high-cardinality label after scraping
# (labeldrop matches label NAMES via regex, not source_labels)
metric_relabel_configs:
  - regex: request_id
    action: labeldrop
```
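For the before-scraping side, relabel_configs typically maps service-discovery metadata into target labels. A minimal sketch for Kubernetes pods (the `prometheus.io/scrape` annotation is a common convention, not a built-in):

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # promote discovery metadata to a regular label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```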
Chapter 3: PromQL — The Query Language
Data Types
| Type | Description | Example |
|---|---|---|
| Instant vector | Single sample per series at evaluation time | `http_requests_total{job="api"}` |
| Range vector | Range of samples over a time window | `http_requests_total[5m]` |
| Scalar | Single float | `1.5` |
Label matchers:
- `=` exact match, `!=` not equal
- `=~` regex match: `{status=~"5.."}`
- `!~` regex not match: `{env!~"dev|staging"}`
Functions You Must Know
```promql
# Per-second rate of a counter (use this, not the raw instant value)
rate(http_requests_total[5m])

# p99 latency from histograms
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Total increase over a window
increase(http_requests_total[1h])

# Aggregation
sum by (job) (rate(http_requests_total[5m]))
topk(5, rate(http_requests_total[5m]))

# Will the disk fill within 4 hours?
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0

# Alert when a metric is missing entirely
absent(up{job="critical-service"})
```
Binary Operators & Vector Matching
```promql
# Error ratio by method
sum by (method) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (method) (rate(http_requests_total[5m]))

# Many-to-one matching
method_code:http_errors:rate5m / on(method) group_left method:http_requests:rate5m
```
Recording Rules
Pre-compute expensive queries and save as new metrics. Makes dashboards fast.
```yaml
groups:
  - name: http_metrics
    interval: 1m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```
Naming convention: `level:metric:operation` (e.g., `job:http_requests:rate5m`)
Chapter 4: Alerting & Alertmanager
Writing Effective Alerting Rules
```yaml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes."
          runbook_url: "https://wiki.company.com/runbooks/high-error-rate"
          dashboard_url: "https://grafana.company.com/d/api-overview"
```
- `for:` — Must be pending for this duration before firing (reduces flapping)
- `labels` — Used for routing in Alertmanager
- `annotations` — Human-readable context; NOT used for routing
- Alert states: inactive → pending → firing
Alertmanager Configuration
```yaml
route:
  receiver: 'slack-general'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['job', 'instance']
```
SLO-Based Alerting — Multi-Window Multi-Burn-Rate
This is senior-level alerting. Forget simple threshold alerts.
- SLO — Service Level Objective. E.g., 99.9% of requests succeed over 30 days.
- Error budget — 0.1% of 30 days ≈ 43.2 minutes allowed.
- Burn rate — How fast you're consuming error budget. Rate 1 = exactly on budget.
| Window | Burn Rate | Alert Type |
|---|---|---|
| 1h + 5m | 14x | Critical — page immediately |
| 6h + 30m | 6x | Critical — page |
| 1d + 2h | 3x | Warning — ticket |
| 3d + 6h | 1x | Info — FYI |
```promql
# 14x burn rate on a 0.1% error budget
(
  rate(http_requests_total{status=~"5.."}[1h])
  /
  rate(http_requests_total[1h])
) > 14 * 0.001
```
Tools: Sloth, Pyrra, OpenSLO. Every serious SRE team uses SLO-based alerting. Know this cold.
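A sketch of the fastest-burn rule as an alerting rule, pairing the 1h window with the 5m window so the alert also stops quickly once the burn stops (metric names are illustrative):

```yaml
groups:
  - name: slo_burn_rate
    rules:
      - alert: ErrorBudgetBurnCritical
        # both the long (1h) and short (5m) windows must exceed a 14x burn rate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14 * 0.001)
        labels:
          severity: critical
```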
Deadman's Switch Pattern
Alert when a metric STOPS being reported:
```promql
absent(batch_job_last_success_timestamp)
or
time() - batch_job_last_success_timestamp > 3600
```
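The same pattern wrapped as an alerting rule, as a sketch (the metric name is whatever your batch job exports on success):

```yaml
groups:
  - name: deadman
    rules:
      - alert: BatchJobNotReporting
        # fires if the success timestamp is absent or older than one hour
        expr: |
          absent(batch_job_last_success_timestamp)
          or
          time() - batch_job_last_success_timestamp > 3600
        for: 15m
        labels:
          severity: warning
```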
Chapter 5: Scaling Prometheus
Federation
Single Prometheus handles ~1M samples/sec, ~10M active series. Beyond that:
```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'job:http_requests:rate5m'
    static_configs:
      - targets:
        - 'prometheus-cluster1:9090'
        - 'prometheus-cluster2:9090'
```
Only federate aggregated recording rules, not raw metrics.
Thanos Architecture
Thanos extends Prometheus with global query view and long-term storage.
| Component | Purpose |
|---|---|
| Sidecar | Runs next to Prometheus; uploads 2h blocks to S3/GCS/Azure |
| Store Gateway | Serves data from object storage |
| Querier | Merges + deduplicates results from all sources |
| Compactor | Compacts, downsamples, enforces retention |
| Ruler | Evaluates rules against Thanos data |
| Receive | Remote write ingestion endpoint |
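The Sidecar, Store Gateway, and Compactor all point at the same object-storage config file. A minimal sketch for S3 (bucket and endpoint are illustrative):

```yaml
# objstore.yml, passed via --objstore.config-file
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
```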
Alternatives: Grafana Mimir (recommended), Cortex, VictoriaMetrics
Prometheus Agent Mode
Lightweight scraper — no local storage, no PromQL. Perfect for edge environments.
```bash
prometheus --enable-feature=agent
```
```yaml
remote_write:
  - url: "https://mimir.company.com/api/v1/push"
    queue_config:
      max_shards: 30
      max_samples_per_send: 2000
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "unnecessary_metric_.*"
        action: drop
```
kube-prometheus-stack (Kubernetes)
```bash
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack
```
Includes: Prometheus Operator, Prometheus HA pair, Alertmanager HA, Grafana, node_exporter, kube-state-metrics, pre-built alerting rules.
Prometheus Operator CRDs:
- `ServiceMonitor` — Auto-discover services via label selectors (see the sketch below)
- `PodMonitor` — Scrape pods directly
- `PrometheusRule` — Rules as Kubernetes objects
- `AlertmanagerConfig` — Routing as Kubernetes objects
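A minimal ServiceMonitor sketch (names, labels, and ports are illustrative; the selector must match your Service's labels, and the release label must match the Operator's serviceMonitorSelector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  labels:
    release: kube-prometheus-stack   # so the Operator picks it up
spec:
  selector:
    matchLabels:
      app: api                       # must match the target Service's labels
  endpoints:
    - port: http-metrics             # named port on the Service
      path: /metrics
      interval: 30s
```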
Chapter 6: Exporters & Instrumentation
The 4 Metric Types
```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter — only goes up (requests, errors, bytes)
requests_total = Counter('http_requests_total', 'Total requests',
                         ['method', 'status'])
requests_total.labels(method='GET', status='200').inc()

# Gauge — can go up or down (queue size, active connections)
queue_size = Gauge('queue_size', 'Queue depth', ['queue_name'])
queue_size.labels(queue_name='jobs').set(42)

# Histogram — latency, request size (enables percentiles)
request_latency = Histogram('request_duration_seconds', 'Latency',
                            ['endpoint'],
                            buckets=[.005, .01, .025, .05, .1, .25, .5, 1])
with request_latency.labels(endpoint='/api').time():
    do_work()
```
⚠️ Histogram vs Summary: Prefer Histogram in distributed systems. Histograms can be aggregated across instances; Summaries cannot.
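That aggregation works because bucket counters can be summed by `le` across instances before the quantile is computed. A recording-rule sketch using the histogram from the example above:

```yaml
groups:
  - name: latency
    rules:
      # p99 across every instance of the service, not per instance
      - record: job:request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(request_duration_seconds_bucket[5m]))
          )
```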
TLS Certificate Expiry Alert
```promql
# Alert 14 days before the cert expires
probe_ssl_earliest_cert_expiry - time() < 86400 * 14
```
Never let a cert expire in production. Add this to every HTTPS endpoint.
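`probe_ssl_earliest_cert_expiry` comes from blackbox_exporter. A minimal scrape-config sketch (the probed URL and exporter address are illustrative):

```yaml
scrape_configs:
  - job_name: 'blackbox-https'
    metrics_path: /probe
    params:
      module: [http_2xx]                       # probe module defined in blackbox.yml
    static_configs:
      - targets: ['https://api.company.com']
    relabel_configs:
      # the probed URL becomes a parameter; the exporter itself is scraped
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115    # blackbox_exporter's default port
```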
Chapter 7 & 8: Grafana — Foundations to Advanced
Panel Types Reference
| Panel | Best For |
|---|---|
| Time series | Any metric over time (default choice) |
| Stat | Single current value — KPIs, NOC screens |
| Gauge | Utilization % (CPU, memory, disk) |
| Heatmap | Latency distribution over time |
| Table | Top-N, alert states, comparison tables |
| Logs | Log lines alongside metrics (Loki) |
| Node Graph | Service topology / service map |
3-Tier Dashboard Hierarchy
| Tier | Audience | Content |
|---|---|---|
| Tier 1: Fleet Overview | Managers, NOC | All services status, global golden signals |
| Tier 2: Service Detail | Engineers | Full 4 golden signals, resources, queues |
| Tier 3: Debugging | Incident response | Detailed metrics, logs, traces, ad-hoc |
Dashboard as Code
```hcl
# Terraform: dashboard JSON kept in version control
resource "grafana_dashboard" "api_overview" {
  config_json = file("dashboards/api-overview.json")
  folder      = grafana_folder.production.id
}
```
```yaml
# Grafana provisioning: datasources as code
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false
```
The LGTM Stack
| Letter | Component | Purpose |
|---|---|---|
| L | Loki | Log aggregation. LogQL. Promtail agent. |
| G | Grafana | Visualization for the entire stack |
| T | Tempo | Distributed tracing. OpenTelemetry compatible. |
| M | Mimir | Horizontally scalable Prometheus. Multi-tenant. |
OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation — one SDK for traces, metrics, and logs. Senior engineers must know OTel. It's replacing vendor-specific SDKs.
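A minimal sketch of that single-pipeline idea, assuming an OpenTelemetry Collector build that includes the Prometheus receiver and remote-write exporter (endpoints are illustrative):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'app'
          static_configs:
            - targets: ['app:8080']
exporters:
  prometheusremotewrite:
    endpoint: "https://mimir.company.com/api/v1/push"
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```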
Chapter 9: Production Operations
Security Hardening
```yaml
# Prometheus web config (web.yml): TLS + basic auth
tls_server_config:
  cert_file: /etc/prometheus/certs/tls.crt
  key_file: /etc/prometheus/certs/tls.key
basic_auth_users:
  prometheus: $2b$12$hashed_password_bcrypt
```
Grafana security checklist:
- SSO via OIDC/SAML (Google, Okta, Azure AD)
- RBAC: Viewer, Editor, Admin roles per folder
- Service accounts for API automation
- Audit log for compliance
- Network policies to restrict access
Capacity Planning
| Samples/sec | RAM (head block) | Notes |
|---|---|---|
| 100k | ~1 GB | Small deployment |
| 500k | ~5 GB | Medium deployment |
| 1M | ~10 GB | Large — consider sharding |
| 10M+ | 100+ GB | Use Thanos/Mimir sharding |
```promql
# Detect cardinality explosion — top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))
```
HA & Disaster Recovery
Run 2 identical Prometheus instances + Thanos Sidecar. Querier deduplicates. No data loss if one goes down.
```yaml
# Prometheus pointing at a 3-node Alertmanager gossip cluster
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0:9093
            - alertmanager-1:9093
            - alertmanager-2:9093
```
Incident Response — 10-Step Flow
1. Alert triggers → Alertmanager routes to PagerDuty/Slack
2. Acknowledge → stops repeat notifications
3. Open Grafana → navigate to the alert's dashboard link
4. Identify the timeframe → when did it start? Correlate with deployments?
5. Check golden signals → which of the 4 is affected?
6. Drill down → use Explore for ad-hoc PromQL queries
7. Correlate → check logs in Loki, traces in Tempo
8. Mitigate → rollback, scale, circuit break, redirect traffic
9. Resolve → silence the alert if it is still noisy after the fix
10. Post-mortem → timeline, root cause, action items
```promql
# What changed recently?
changes(kube_deployment_spec_replicas[30m]) > 0

# Memory leak detection: will memory exceed the limit within the hour?
predict_linear(container_memory_working_set_bytes[1h], 3600)
  > container_spec_memory_limit_bytes
```
Chapter 10: Career Prep — Interview Topics
PromQL Questions You Will Be Asked
- Explain the difference between `rate()` and `irate()`
- How does `histogram_quantile()` work? What are bucket boundaries?
- Write PromQL for p99 latency across all instances of a service
- What is vector matching? When would you use `group_left`?
- How do recording rules improve dashboard performance?
Architecture Questions
- Why does Prometheus use a pull model? What are the tradeoffs?
- How would you handle 10M active time series?
- Explain Thanos architecture. How does deduplication work?
- What is cardinality and why does it matter?
- How do you do HA for Prometheus? For Alertmanager?
Real-World War Stories
The Cardinality Bomb: Developer adds `user_id` as a label. 1M users = 1M new series. Prometheus OOMs. Solution: cardinality limits, label governance, pre-deploy PromQL review.
The Counter Reset Problem: Pod restarts → counter resets. `rate()` handles resets automatically — always use `rate()` on counters, never raw instant values.
The Pushgateway Anti-Pattern: Stale metrics persist forever in Pushgateway. Solution: add last-push timestamp metric, set up deletion via admin API.
The Alert Deduplication Failure: Two HA Prometheus fire the same alert, both page. Solution: Alertmanager cluster with gossip deduplication.
The Full Observability Ecosystem
| Category | Tools |
|---|---|
| Metrics Collection | Prometheus, Grafana Agent, OpenTelemetry Collector |
| Long-term Storage | Thanos, Grafana Mimir, Cortex, VictoriaMetrics, AWS AMP |
| Logs | Grafana Loki, ELK Stack, Splunk |
| Traces | Jaeger, Grafana Tempo, Zipkin, AWS X-Ray |
| Alerting | Alertmanager, Grafana Unified Alerting, PagerDuty, OpsGenie |
| SLO Management | Sloth, Pyrra, OpenSLO, Grafana SLO |
| Visualization | Grafana (dominant), Kibana, Honeycomb |
90-Day Learning Roadmap
Days 1–15: Core Prometheus — install locally, query node metrics, write alerting rules, set up Slack integration.
Days 16–30: PromQL mastery — complete the PromLabs PromQL quiz, write histogram queries, instrument a Python/Go app with all 4 metric types.
Days 31–50: Kubernetes integration — deploy kube-prometheus-stack, write ServiceMonitors and PrometheusRules, explore kube-state-metrics.
Days 51–70: Grafana deep dive — build 3-tier dashboard hierarchy, dashboard-as-code with Terraform, set up SSO, explore Loki.
Days 71–90: Scale & production — deploy Thanos, implement SLO-based alerting with Sloth, practice incident response, build a public GitHub portfolio.
This curriculum covers everything expected of a Senior DevOps/SRE engineer with 6+ years of experience. Go build. Go break things. Go learn.