
우병수

Posted on • Originally published at techdigestor.com

Your Linux Kernel Got CVE'd: Here's How I Actually Handle Patch Management in Production

TL;DR: The Slack message came in at 6:47 AM on a Tuesday. CVE-2024-1086 — a use-after-free in the netfilter subsystem that gives local users a path to root.


What's in this article

  1. The Situation That Made Me Build a Real Patching Workflow
  2. Step 1: Know What You're Actually Running Before You Patch Anything
  3. Step 2: Scanning Your Fleet for Vulnerable Kernels
  4. Step 3: Live Patching Without a Reboot — What Actually Works
  5. Step 4: Unattended Upgrades — Set It Up Correctly or Don't Set It Up
  6. Step 5: Coordinating Reboots Across a Fleet Without Waking Up at 3am
  7. My Actual Patching Runbook (The Short Version)
  8. Kernel Mitigation Flags You Should Actually Know About

The Situation That Made Me Build a Real Patching Workflow

The Slack message came in at 6:47 AM on a Tuesday. CVE-2024-1086 — a use-after-free in the netfilter subsystem that gives local users a path to root. My servers were running kernel 6.1.x. None of them were patched. I had 40 production boxes spread across bare metal and EC2, a mixed Ubuntu 22.04 and RHEL 9 fleet, and exactly zero enterprise support contracts to call. That morning taught me the difference between having a patching process and actually being able to patch 40 servers before the weekend without breaking anything.

CVE-2024-1086 is particularly nasty because the exploit code went public fast, and the netfilter subsystem is loaded on basically every Linux server that touches networking — which is all of them. The CVSS score was a 7.8, but the real-world exploitability bumped it into "drop everything" territory. The thing that caught me off guard wasn't the vulnerability itself — it was discovering that my "process" was a wiki page that hadn't been updated since 2022 and a cron job that ran apt upgrade nightly but never forced a reboot. Kernels don't patch themselves into memory. The package updates were happening. The actual running kernel was frozen in time.

Here's what this guide actually covers, in order of how I'd do it again if I had to start from scratch:

  • Scanning your fleet fast — figuring out which boxes are exposed before you start touching anything
  • Live patching with kpatch and kernel-livepatch — buying yourself time on boxes you can't reboot right now
  • Scheduled reboots that don't wreck your weekend — staggered, monitored, with rollback plans
  • The stuff vendor docs skip — like what happens when your EC2 instance comes back on a different kernel than you expect, or why RHEL 9's dnf update kernel and Ubuntu's unattended-upgrades behave very differently under pressure

My environment was deliberately unglamorous: Ubuntu 22.04 LTS on bare metal servers in a colo, RHEL 9 on a mix of physical hosts and EC2 instances, no Red Hat Enterprise support subscription beyond the base OS repos, and no AWS Enterprise Support. This matters because every guide I found that morning assumed either a homogeneous fleet or an enterprise contract that gives you access to extended live-patching services. I had neither. What I did have was ssh, ansible, and about three hours before my on-call rotation ended.

# First thing I ran — checking the actual running kernel across all hosts
# Not what's installed. What's *running*.
ansible all -i inventory.ini -m command -a "uname -r" 2>/dev/null | \
  grep -E "6\.1\.[0-9]+" | \
  awk '{print $1}' | sort -u

# Raw ansible output (before the grep/awk filter) looked like this:
# web-01 | CHANGED | rc=0 >>
# 6.1.69-1+deb12u1
# db-03 | CHANGED | rc=0 >>
# 6.1.55-ubuntu1

That one command is where I realized the actual scale of the problem. Installed kernel packages were mostly current. Running kernels were scattered across half a dozen patch levels because we'd never formalized reboot schedules. The gap between "kernel package updated" and "server running the updated kernel" was months on some boxes. That's the gap this entire workflow is designed to close — and close in a way that doesn't require everyone to cancel their Friday evening plans.

Step 1: Know What You're Actually Running Before You Patch Anything

Run These Commands First — Your Config Management Is Probably Lying to You

The most embarrassing mistake I see is people running patch playbooks against servers where the "current kernel" in their CMDB doesn't match what's actually booted. Before you touch anything, run both of these and compare:

# What's actually running right now
uname -r
# 5.10.0-21-amd64

# Full build info — shows GCC version, build timestamp, distro patch level
cat /proc/version
# Linux version 5.10.0-21-amd64 (debian-kernel@lists.debian.org)
# (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2)
# #1 SMP Debian 5.10.162-1 (2023-01-21)

uname -r tells you what the kernel is right now. /proc/version tells you when it was compiled and with what toolchain — useful when a CVE targets a specific compiler-level behavior or backport window. Ansible's ansible_kernel fact, Chef's node['kernel']['release'], whatever your config management exports — that data is collected at run time, often cached, and I've seen it be hours or days stale on long-running boxes.

Installed vs. Running: They Are Frequently Not the Same

On RHEL/CentOS/Rocky, you can have three kernel versions installed and still be booting the oldest one:

# RHEL / Rocky / AlmaLinux
rpm -q kernel
# kernel-5.14.0-162.6.1.el9_1.x86_64
# kernel-5.14.0-284.11.1.el9_2.x86_64
# kernel-5.14.0-362.8.1.el9_3.x86_64

# Then check what grub actually boots by default
grubby --default-kernel
# /boot/vmlinuz-5.14.0-162.6.1.el9_1.x86_64  ← that's the oldest one
# Debian / Ubuntu
dpkg -l 'linux-image-*' | grep ^ii
# ii  linux-image-5.15.0-88-generic   5.15.0-88.98  amd64
# ii  linux-image-5.15.0-91-generic   5.15.0-91.101 amd64

# And the running kernel again — notice the mismatch
uname -r
# 5.15.0-88-generic

That second installed kernel on Ubuntu, 5.15.0-91, contains the fix for CVE-2023-32233 (Netfilter use-after-free). Your patch report says "patched." The server is still vulnerable. This is the gap that gets people.
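
A quick way to spot that gap on a Debian/Ubuntu box, as a rough one-liner sketch (it assumes the standard linux-image-<version> package naming):

# Newest installed kernel image vs. what's actually booted
latest=$(dpkg -l 'linux-image-[0-9]*' | awk '/^ii/ {print $2}' | sed 's/^linux-image-//' | sort -V | tail -1)
[ "$latest" = "$(uname -r)" ] && echo "running newest installed kernel" \
  || echo "REBOOT PENDING: running $(uname -r), newest installed $latest"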

Why the Grub Default Doesn't Update Automatically (And When It Does)

On Debian-based systems, update-grub does set the newest kernel as default automatically — but only if GRUB_DEFAULT=0 in /etc/default/grub. The moment someone changes that to a specific menu entry string or index number to lock a kernel version, auto-promotion stops working silently. On RHEL systems, grubby handles this, and the behavior differs between RHEL 8 and RHEL 9.

# Check your grub config — this is the line that breaks everything
grep GRUB_DEFAULT /etc/default/grub
# GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-88-generic"
# ↑ hardcoded string — new kernels will NEVER boot until this changes

# Fix: reset to automatic newest-first behavior
sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=0/' /etc/default/grub
update-grub

# On RHEL, make the newest kernel the default explicitly
grubby --set-default /boot/vmlinuz-$(rpm -q kernel --last | head -1 | awk '{print $1}' | sed 's/kernel-//')

I've seen this exact hardcoded-string problem on production boxes that went through a "stability lockdown" six months prior and nobody documented that the grub default was pinned. The ops team patched faithfully every cycle, the RPMs were installed, the CVE scanner came back green, and the box ran a vulnerable kernel for eight months. Audit your grub config the same way you audit your firewall rules — it's not set-and-forget.
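
The fleet-wide version of that audit is a one-liner against the same Ansible inventory from earlier; the ubuntu and rhel group names here are placeholders for however you split your hosts:

# Debian/Ubuntu hosts: flag anything with a pinned grub default
ansible ubuntu -i inventory.ini -m command -a "grep ^GRUB_DEFAULT /etc/default/grub"

# RHEL-family hosts: show which kernel grub will actually boot
ansible rhel -i inventory.ini -m command -a "grubby --default-kernel" --become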

Build a Pre-Patch Snapshot Script You Actually Run

Rather than checking these manually each time, I keep a small script that runs before any kernel patching job and dumps a snapshot to a file:

#!/usr/bin/env bash
# pre-patch-snapshot.sh — run BEFORE touching anything
OUTFILE="/var/log/kernel-snapshot-$(date +%Y%m%d-%H%M%S).txt"

{
  echo "=== RUNNING KERNEL ==="
  uname -r
  cat /proc/version

  echo -e "\n=== INSTALLED KERNELS ==="
  if command -v rpm &>/dev/null; then
    rpm -q kernel --last
    echo "GRUB default: $(grubby --default-kernel)"
  else
    dpkg -l 'linux-image-*' | grep ^ii
    grep GRUB_DEFAULT /etc/default/grub
  fi

  echo -e "\n=== KERNEL CMDLINE (what actually booted with) ==="
  cat /proc/cmdline

  echo -e "\n=== LOADED MODULES ==="
  lsmod | wc -l
  echo "full module list in: /proc/modules"
} | tee "$OUTFILE"

echo "Snapshot saved to $OUTFILE"

That /proc/cmdline line catches a specific gotcha: servers with pti=off (or nopti) or mitigations=off passed at boot because someone was benchmarking and forgot to revert. You can have a fully patched kernel with Meltdown mitigations explicitly disabled at the boot prompt. Your vulnerability scanner won't catch that — it checks kernel version, not boot flags.
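
A fleet-wide sweep for those flags is worth folding into the same pre-patch routine; an ad-hoc Ansible pass like this works (the || true keeps hosts with clean cmdlines from reporting as failed):

# Flag any host booted with mitigations disabled
ansible all -i inventory.ini -m shell \
  -a "grep -Eo 'mitigations=off|nopti|pti=off|nospectre_v2' /proc/cmdline || true"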

Step 2: Scanning Your Fleet for Vulnerable Kernels

The thing that caught me off guard the first time I audited a fleet was how many servers were running kernels two or three minor versions behind — not because anyone was ignoring updates, but because automated patching had silently failed on a subset of hosts months earlier. Scanning is how you find out you have a problem before an attacker does.

OpenSCAP + OVAL: The Closest Thing to Authoritative

Red Hat and Ubuntu's security teams publish OVAL (Open Vulnerability and Assessment Language) definition files that map CVEs directly to package versions. OpenSCAP consumes these and gives you a structured report you can actually act on. For Ubuntu 22.04 Jammy, grab the OVAL feed and run:

# Download the current OVAL definitions from Ubuntu's security team
wget https://security-metadata.canonical.com/oval/com.ubuntu.jammy.usn.oval.xml.bz2
bunzip2 com.ubuntu.jammy.usn.oval.xml.bz2

# Run the scan — results.xml is machine-parseable, report.html is human-readable
oscap oval eval \
  --results scan-results.xml \
  --report report.html \
  com.ubuntu.jammy.usn.oval.xml

The HTML report will list every USN (Ubuntu Security Notice) that applies to installed packages on that host, including kernel packages. The OVAL definitions are updated multiple times per week by the Ubuntu security team, so freshness isn't usually the problem — the problem is that the feed only knows what's been published to the USN tracker. A zero-day or a CVE that landed upstream but hasn't been backported to a distro package yet won't appear here. That's a real blind spot.
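
For a quick triage before opening the HTML report: the results XML marks each evaluated definition with a result attribute, so a rough count of hits is one grep away (treat it as triage, not reporting; the exact layout varies between oscap versions):

# Rough count of definitions that evaluated as vulnerable on this host
grep -c 'result="true"' scan-results.xml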

Trivy for Container Images

If you're running containerized workloads, the container image itself carries a syscall surface that matters. An old libc or a kernel-adjacent library inside the image can expose you to vulnerabilities even if the host kernel is patched. Trivy makes this fast:

# Vuln-only scan, skips secret/config scanning for speed
trivy image --scanners vuln ubuntu:22.04

What Trivy does well: it pulls from multiple advisory databases (GitHub Advisory, NVD, OS-specific advisories), handles layered image scanning without you doing anything special, and is fast enough to drop into a CI pipeline without making builds painful. Where it falls short for kernel-specific CVEs is context — it'll flag a CVE against a kernel-related package but won't tell you whether the vulnerability is exploitable given your actual running kernel version vs. the image's userspace. For that nuance you still need the OVAL approach on the host.

Grype as a CI-Friendly Alternative

Grype from Anchore is worth knowing about specifically because its cold scan performance is noticeably better than Trivy's when you haven't pre-warmed a local DB cache. In practice on a mid-size CI fleet, the first Trivy run after a DB update can take 30-45 seconds just on DB fetch; Grype tends to be leaner there. The trade-off is that Grype's kernel CVE context is weaker — it's better at application-layer dependency vulnerabilities than kernel or kernel-module specific issues. Use it for scanning your app containers in CI, not as your primary host kernel audit tool.

# Install grype and scan the same base image
curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
grype ubuntu:22.04 --only-fixed

The --only-fixed flag is the one you actually want in CI — it suppresses unfixed CVEs that you can't do anything about, so your pipeline doesn't get noisy with known-but-unpatched upstream issues.
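
In CI I gate on severity rather than just printing the table; grype's --fail-on flag turns the scan into a pass/fail step (the image name is a placeholder):

# Fail the pipeline if any fixable High or Critical CVE is present
grype your-registry/your-app:latest --only-fixed --fail-on high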

The Manual Cross-Reference You Can't Skip

None of these tools are a substitute for a quick manual check on anything that looks critical. OVAL definitions lag behind upstream publication, Trivy's DB update cadence varies, and Grype may simply not have the kernel advisory context you need. Before you act on a kernel CVE finding — especially if you're considering an emergency out-of-band patch — do this:

# Pull the actual changelog for your running kernel package
apt-get changelog linux-image-$(uname -r) | grep -A5 CVE-2024-XXXXX

This tells you whether your distro has already backported a fix into your current kernel package even if the version number didn't change in a way the scanners expect. Ubuntu in particular backports security fixes into the same package version and just bumps the build revision — scanners that rely purely on version comparison will miss this and report false positives. I've burned time chasing CVEs that were already patched this way. The changelog is ground truth.
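
The RHEL-family equivalent of that ground-truth check is the RPM changelog, since Red Hat backports the same way:

# Did my *running* kernel package already get the backported fix?
rpm -q --changelog kernel-$(uname -r) | grep -B2 -A2 CVE-2024-XXXXX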

Step 3: Live Patching Without a Reboot — What Actually Works

The thing most people don't realize until they've been running Canonical Livepatch for a month: the status output tells you which CVEs are patched, not just a generic "you're good" confirmation. That's actually useful. Run canonical-livepatch status --verbose and you'll see CVE IDs with their patch state — covered, not-covered, or applied. This matters when your security team asks "are we protected against CVE-2024-XXXX" and you need a real answer, not a guess.

Setup on Ubuntu is fast. Install the snap, grab your token from ubuntu.com/security/livepatch (free for up to 5 machines via the Ubuntu Pro free tier, paid tiers after that), and you're running in under two minutes:

# Install the snap and enable with your token
snap install canonical-livepatch
canonical-livepatch enable <your-token-here>

# Check what's actually covered
canonical-livepatch status --verbose

# Sample output you'll see:
# kernel: 5.15.0-91-generic
# fully-patched: true
# CVE-2024-1086: applied
# CVE-2023-6931: applied

On RHEL/CentOS, kpatch is the equivalent. The kpatch-dnf plugin is the part most tutorials skip — it automatically pulls the right patch set for your currently running kernel, not just whatever's latest. That distinction matters because your running kernel and installed kernel can diverge after a partial update.

# Install kpatch and the dnf plugin
dnf install kpatch kpatch-dnf

# Subscribe the host to live patches for its installed kernels
dnf kpatch auto

# Pull the patch package that matches the *running* kernel
dnf install "kpatch-patch = $(uname -r)"

# Verify what's loaded
kpatch list

# Expected output:
# Loaded patch modules:
# kpatch_5_14_0_362_8_1_CVE_2023_6931 [enabled]

Here's the hard truth nobody puts in their blog post: live patching cannot replace reboots for a significant class of vulnerabilities. Spectre/Meltdown variants, anything touching the scheduler, memory subsystem changes, and patches that require modifying data structures in-flight — all of these require a real reboot. The kernel live patching infrastructure in Linux (the klp_* framework since kernel 4.0) works by redirecting function pointers at runtime, so it's fundamentally limited to function-level patches. If the fix requires changing a struct layout or altering how memory is allocated during boot, there's no live patch pathway. Your SLA for "zero downtime" doesn't override physics.
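
Whichever vendor tool delivers the patch, the kernel's own view of what's live-patched lives in sysfs, which gives you one check that works for both kpatch and Livepatch:

# Patches the running kernel currently knows about
ls /sys/kernel/livepatch/

# 1 = active, 0 = loaded but not applied
cat /sys/kernel/livepatch/*/enabled

# transition = 1 means the patch is still being applied (or is stuck)
cat /sys/kernel/livepatch/*/transition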

If you're on SUSE SLES, kGraft is their equivalent and ships with the enterprise subscription. I won't spend more time on it here — if you're not on SLES, it's irrelevant to your stack. The architecture is similar to kpatch but with SUSE's own patch delivery pipeline through their customer center portal.

When live patching silently fails — and it does, particularly when the patch module can't load because a function is currently on the call stack — your first move is:

# Check for any livepatch or kpatch errors in the kernel ring buffer
dmesg | grep -i 'livepatch\|kpatch'

# What a failed patch load looks like:
# livepatch: pre_patch_callback failed for object 'vmlinux'
# kpatch: patch module failed to load: -16

# Also check systemd journal for kpatch service failures
journalctl -u kpatch --since "1 hour ago"

The -16 error code from kpatch is EBUSY — the function you're trying to patch is currently executing somewhere. kpatch will retry, but it won't tell you loudly when it gives up. If kpatch list shows a module as [disabled] rather than [enabled], the patch loaded but couldn't be applied — you need the reboot path. Don't assume silence means success.

Step 4: Unattended Upgrades — Set It Up Correctly or Don't Set It Up

The most common mistake I see with unattended upgrades isn't skipping the setup — it's setting it up wrong and assuming you're covered. Both Ubuntu and RHEL ship with defaults that look functional but silently don't do what you need for kernel patches specifically. The tool runs, the logs look clean, and your kernel is still three months old.

Ubuntu: The Config File That Actually Matters

Install is one line, but the install isn't the work:

apt install unattended-upgrades
dpkg-reconfigure --priority=low unattended-upgrades

The real config lives in /etc/apt/apt.conf.d/50unattended-upgrades. Most of it is noise. The stanza that controls what actually gets pulled is Allowed-Origins, and the one entry you cannot skip is the -security line:

Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}";
    "${distro_id}:${distro_codename}-security";
    // This one is what catches kernel patches:
    "${distro_id}ESMApps:${distro_codename}-apps-security";
    "${distro_id}ESM:${distro_codename}-infra-security";
};

// Don't leave this empty: pick a maintenance window
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";

// Mail yourself when something actually happens
Unattended-Upgrade::Mail "ops@yourcompany.com";
Unattended-Upgrade::MailReport "on-change";

The ${distro_id} and ${distro_codename} expand at runtime — on Ubuntu 24.04 LTS that becomes Ubuntu:noble. The -security suffix is what determines whether kernel updates get applied. If that line is commented out (it is by default in some Ubuntu versions), you'll get package updates but zero kernel patches. I've seen this exact config on production servers that the team was convinced were fully patched.

The reboot time setting is not optional if you run kernel updates. A kernel update without a reboot does nothing — the running kernel is unchanged until the system restarts. 03:00 in local system time is usually safe for most workloads, but check your timezone with timedatectl first and make sure your monitoring doesn't alert on the expected downtime window.
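
Two quick sanity checks before trusting that 03:00 window: confirm what the box thinks local time is, and confirm the systemd timers behind unattended-upgrades are actually scheduled:

# What timezone will 03:00 actually be in?
timedatectl | grep "Time zone"

# The timers that trigger the download and upgrade runs
systemctl list-timers apt-daily.timer apt-daily-upgrade.timer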

RHEL / CentOS Stream / Rocky: dnf-automatic and the Reboot Trap

On RHEL 9 (and derivatives), install and enable like this:

dnf install dnf-automatic
systemctl enable --now dnf-automatic-install.timer

# Verify the timer is actually active
systemctl list-timers dnf-automatic*

Then open /etc/dnf/automatic.conf. The defaults will burn you:

[commands]
# Change this from "default" to "security" to limit scope
upgrade_type = security

# THIS is what most guides skip: the default is "no" (download only)
apply_updates = yes

# Default on RHEL 9 is "never". Change it.
reboot = when-needed
reboot_command = "shutdown -r +5 'Rebooting for security updates'"

[emitters]
emit_via = email

[email]
email_from = dnf-automatic@yourhost.example.com
email_to = ops@yourcompany.com
email_host = localhost

The reboot = never default on RHEL 9 is the exact failure mode I mentioned at the top. dnf-automatic downloads and installs the kernel package, journalctl shows success, your kernel rpm is updated — and the running kernel in uname -r is still the old one. The new kernel only activates on next reboot. With reboot = never, that reboot never comes automatically. I've audited systems sitting six months behind on running kernel version because of this.

Verifying It's Actually Running

Don't trust the setup — check the evidence. On Ubuntu:

# Should show recent entries with packages applied
ls -lth /var/log/unattended-upgrades/

# Tail the actual log
tail -50 /var/log/unattended-upgrades/unattended-upgrades.log

# Force a dry run to confirm config parses correctly
unattended-upgrade --dry-run --debug 2>&1 | grep -E "(Allowed|fetch|upgrade)"

On RHEL:

# Check recent timer execution
journalctl -u dnf-automatic-install --since "7 days ago"

# Confirm the running kernel matches what's installed
uname -r
rpm -q kernel --last | head -3

# If these don't match, a reboot is pending

If uname -r and the latest installed kernel rpm don't match, you have a patching process that isn't completing its job. That's the whole audit loop: check the logs to confirm execution, then cross-reference the running kernel against installed packages. Both checks together tell you whether your automation is actually delivering security coverage or just looking busy in the logs.
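
I ended up wrapping that comparison in a tiny script that monitoring can call on RHEL-family hosts; exit code 1 means a reboot is pending:

#!/usr/bin/env bash
# reboot-pending-check.sh — non-zero exit if the newest installed kernel isn't running
latest=$(rpm -q kernel --last | head -1 | awk '{print $1}' | sed 's/^kernel-//')
running=$(uname -r)
if [ "$latest" != "$running" ]; then
  echo "REBOOT PENDING: running $running, newest installed $latest"
  exit 1
fi
echo "OK: running newest installed kernel ($running)"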

Step 5: Coordinating Reboots Across a Fleet Without Waking Up at 3am

The thing that caught me off guard early on was how often a kernel patch doesn't actually require a full reboot — but you find out the hard way when you blindly reboot everything and your database replica falls behind by 45 minutes. needrestart solves this. After patching, run it and it tells you specifically what's stale: running processes using old library versions, services that need a restart vs. the full kernel that needs a reboot. The difference matters enormously on a fleet of 40 nodes.

# Install it first if it's not already there
apt install needrestart -y

# The flag that's actually useful in automation — lists outdated libs,
# no interactive prompts, exits cleanly for scripting
needrestart -r l

# Sample output when only services need restart (no reboot required):
# NEEDRESTART-SVC: nginx.service
# NEEDRESTART-SVC: postgresql.service
# No kernel update pending.

# vs. when you DO need a reboot:
# NEEDRESTART-KSTA: 3
# Running kernel: 6.1.0-18-amd64
# Expected kernel: 6.1.0-21-amd64

In Ansible, I pipe needrestart -r l -b (machine-readable output) into a register and only schedule the reboot task when the kernel status code is 3 (update pending). This alone cut our unnecessary reboots by roughly half. The -b flag gives you parseable key-value lines instead of the colored interactive UI — critical when you're running this headlessly across 60 hosts.
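
For reference, here's roughly what that check looks like as tasks. The needrestart_kernel_pending fact used in the play below is my own naming, not something needrestart exports:

- name: Run needrestart in batch mode
  ansible.builtin.command: needrestart -r l -b
  register: needrestart_out
  changed_when: false

- name: Flag hosts with a pending kernel version upgrade (KSTA 3)
  ansible.builtin.set_fact:
    needrestart_kernel_pending: "{{ 'NEEDRESTART-KSTA: 3' in needrestart_out.stdout }}"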

For the actual rolling reboot, serial: 1 is your best friend and your worst enemy simultaneously. It's safe but slow. I run serial: "20%" for most of our stateless app servers and drop to serial: 1 only for database nodes. Here's the pattern I actually use in production:

---
- name: Rolling kernel reboot after patch
  hosts: web
  serial: 1
  gather_facts: true

  tasks:
    - name: Reboot if kernel update pending
      ansible.builtin.reboot:
        reboot_timeout: 300
        post_reboot_delay: 15
      when: needrestart_kernel_pending | bool  # set earlier via needrestart check

    - name: Wait for SSH to come back
      ansible.builtin.wait_for_connection:
        delay: 10
        timeout: 180

    - name: Gather facts post-reboot (refreshes ansible_kernel)
      ansible.builtin.setup:
        gather_subset: ["min"]

    - name: Check running kernel matches newest installed package
      ansible.builtin.shell: |
        # Newest installed kernel image version (e.g. 5.15.0-91-generic)
        latest=$(dpkg -l 'linux-image-[0-9]*' | awk '/^ii/ {print $2}' | sed 's/^linux-image-//' | sort -V | tail -1)
        running=$(uname -r)
        [ "$latest" = "$running" ] && echo "OK" || echo "MISMATCH"
      register: kernel_check
      failed_when: "'MISMATCH' in kernel_check.stdout"
      changed_when: false

The staggered reboot pattern for primary/replica setups is where most guides skip the important bit. You can't just iterate over all DB nodes — you need the primary to be healthy before replicas start rebooting. I use group ordering plus a conditional to enforce sequence:

- name: Reboot primary before replicas
  hosts: db
  serial: 1

  tasks:
    - name: Reboot primary first, replicas only after
      ansible.builtin.reboot:
        reboot_timeout: 300
      # Only runs in the order Ansible iterates the group.
      # Primaries are manually placed first in inventory [db] group.
      # This guard is a safety net for when someone edits inventory carelessly:
      # replicas are skipped until the primary has recorded a completed reboot.
      when: >
        inventory_hostname == groups['db'][0] or
        hostvars[groups['db'][0]].primary_reboot_done | default(false)

    - name: Record that the primary finished rebooting
      ansible.builtin.set_fact:
        primary_reboot_done: true
      when: inventory_hostname == groups['db'][0]

Honestly, the cleaner approach is separate plays — one for hosts: db_primary, one for hosts: db_replica — rather than trying to get clever with conditionals inside a single play. Ansible's sequential play execution gives you the ordering guarantee for free.
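
That version looks like this, assuming the inventory is split into db_primary and db_replica groups (the group names are mine):

- name: Reboot the primary first
  hosts: db_primary
  serial: 1
  tasks:
    - name: Reboot
      ansible.builtin.reboot:
        reboot_timeout: 300

- name: Then the replicas, one at a time
  hosts: db_replica
  serial: 1
  tasks:
    - name: Reboot
      ansible.builtin.reboot:
        reboot_timeout: 300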

AWS Systems Manager Patch Manager is worth using specifically for two things: the maintenance window scheduler (so you're not writing cron jobs that call Ansible at 2am) and the patch baseline compliance reporting in the console. What it doesn't save you from is the actual reboot coordination logic — SSM will happily reboot all your EC2 instances simultaneously if you don't configure concurrency limits in your Run Command or Automation document. Set MaxConcurrency to something sane like 20% and MaxErrors to 10%. The patch baseline itself is genuinely useful: you define "auto-approve Critical CVEs after 7 days, High after 14 days" and it enforces that across your fleet without you maintaining a spreadsheet. The compliance dashboard is mostly checkbox theater for auditors, but the baseline logic is real operational value.
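
If you drive it from the CLI rather than the console, the concurrency limits go directly on the call; roughly like this, with the tag targeting swapped for whatever you actually use:

aws ssm send-command \
  --document-name "AWS-RunPatchBaseline" \
  --targets "Key=tag:PatchGroup,Values=prod-web" \
  --parameters '{"Operation":["Install"]}' \
  --max-concurrency "20%" \
  --max-errors "10%"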

My Actual Patching Runbook (The Short Version)

The thing that surprised me most after formalizing this process: the hardest part isn't the patching itself — it's the visibility. I spent months flying blind because I had no quick way to see which kernel was running on which node across a mixed fleet of bare-metal and cloud VMs. Everything below is the result of iterating past that pain.

Monday morning: OVAL scan via cron. I run OpenSCAP against a downloaded OVAL feed for RHEL/Rocky or the Ubuntu USN feed depending on the distro. The cron fires at 6am and pipes results into a Slack webhook. The channel is #sec-vuln-feed and only the team watches it — no noise bots, no auto-close spam. The command that does the actual work:

# Rocky 8 example — pull current OVAL and scan
wget -q https://access.redhat.com/security/data/oval/v2/RHEL8/rhel-8.oval.xml.bz2 \
  -O /tmp/rhel8-oval.xml.bz2 && bzip2 -d /tmp/rhel8-oval.xml.bz2

oscap oval eval \
  --report /var/reports/oval-$(date +%F).html \
  --results /var/reports/oval-$(date +%F).xml \
  /tmp/rhel8-oval.xml

# Extract CVSS scores from the results XML and POST a summary to Slack
python3 /opt/scripts/oval_to_slack.py /var/reports/oval-$(date +%F).xml

Tuesday: triage and ticketing. Anything scoring CVSS 7.0 or above gets a Jira ticket opened same day with a Thursday patch target. I don't manually read every CVE description — the oval_to_slack.py script already filters by severity and links directly to the NVD entry. Scores below 7.0 go into a rolling backlog I review every other Monday. The one exception: anything touching the kernel network stack or a privilege escalation regardless of CVSS score jumps the queue automatically. CVSS is a guide, not a judge.

Thursday 2am patch window. The Ansible playbook does a rolling restart with a configurable serial value — I run 2 nodes at a time in a 12-node cluster. Between each node, it hits a health check endpoint and waits up to 90 seconds before continuing. If the health check fails, the play stops entirely — no silent cascading failure.

# roles/kernel_patch/tasks/main.yml (Rocky/RHEL variant)
- name: Apply all security updates
  ansible.builtin.dnf:
    name: "*"
    security: true
    state: latest
  register: dnf_result

- name: Reboot if kernel updated
  ansible.builtin.reboot:
    reboot_timeout: 300
  when: dnf_result.changed and 'kernel' in dnf_result.results | join

- name: Health check post-reboot
  ansible.builtin.uri:
    url: "http://{{ inventory_hostname }}:8080/health"
    status_code: 200
  register: health_check
  until: health_check.status == 200
  retries: 6
  delay: 15

Post-patch verification. After the playbook finishes, a second job runs uname -r across every host and compares the output against the expected kernel version I store in the inventory as a host variable (expected_kernel). Any mismatch pages the on-call. This caught a case where a node silently failed to reboot because of a misconfigured GRUB_DEFAULT — the playbook reported success, the kernel was still old. Without the comparison step I wouldn't have known until the next scan.

# inventory/hosts.yml snippet
webservers:
  hosts:
    web01.prod:
      expected_kernel: "5.14.0-427.22.1.el9_4.x86_64"
    web02.prod:
      expected_kernel: "5.14.0-427.22.1.el9_4.x86_64"
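
The comparison job itself is short. A minimal sketch using that expected_kernel host variable:

- name: Verify running kernel matches expectation
  hosts: all
  gather_facts: true
  tasks:
    - name: Fail loudly on any mismatch
      ansible.builtin.assert:
        that: ansible_kernel == expected_kernel
        fail_msg: "{{ inventory_hostname }}: running {{ ansible_kernel }}, expected {{ expected_kernel }}"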

The thing I wish I'd set up in month one instead of month eight: exposing kernel version as a Prometheus metric via node_exporter. It's already collected by default as node_uname_info — you just need to pull it into a dashboard. One PromQL query gives you instant fleet-wide visibility:

# Show all unique kernel versions currently running across fleet
count by (release) (node_uname_info)

# Alert if any node is behind expected kernel after patch window
(time() - node_boot_time_seconds) < 3600
  and on(instance) node_uname_info{release!="5.14.0-427.22.1.el9_4.x86_64"}

That second alert fires if a node booted recently but isn't on the expected kernel — meaning it rebooted but didn't actually update. Before I had this, the only way to check was SSH'ing into nodes manually or running ad-hoc Ansible commands. A Grafana panel with a kernel version variable filter replaced all of that. Total setup time once node_exporter is already deployed: about 20 minutes.
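
Wrapped into an actual alerting rule, that second query looks roughly like this; the hardcoded release string is the piece you'd template from wherever you track the expected version:

groups:
  - name: kernel-patching
    rules:
      - alert: RebootedOnWrongKernel
        expr: |
          (time() - node_boot_time_seconds) < 3600
            and on(instance) node_uname_info{release!="5.14.0-427.22.1.el9_4.x86_64"}
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} rebooted but is not on the expected kernel"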

Kernel Mitigation Flags You Should Actually Know About

The single most useful command I show junior sysadmins who think their server is "secure" is this:

# Run this on any Linux box you manage
cat /sys/devices/system/cpu/vulnerabilities/*

# You'll see output like:
# Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
# Mitigation: Retpolines, IBPB: conditional, IBRS_FW, PBRSB-eIBRS: SW sequence
# Vulnerable: Processor vulnerable
# Mitigation: Clear CPU buffers; SMT vulnerable

That word "Vulnerable" in the output is not a warning — it's a statement of fact. I've done this on cloud instances from major providers and found at least one unmitigated vector more often than I'd like to admit. The file list covers Spectre v1/v2, Meltdown, MDS (Microarchitectural Data Sampling), L1TF, Retbleed, and a handful of others depending on your kernel version. Each file maps to a specific hardware-level attack class. If a file says "Not affected," that means your CPU microarchitecture genuinely isn't vulnerable — AMD chips dodge several Intel-specific issues. If it says "Mitigation: ..." — the kernel is doing work to protect you. If it says "Vulnerable" — something is off.

The mitigations=off boot parameter disables essentially all of these in one shot. There's exactly one situation where I consider this legitimate: a bare-metal benchmark box running single-tenant workloads where you need to isolate raw hardware performance from kernel overhead. I've used it when profiling syscall-heavy code to separate "is this slow because of my code or because of Spectre mitigations?" That's it. If you find mitigations=off on a database server, a shared host, or anything touching the public internet, that's not a configuration choice — it's negligence. Someone traded your users' security for a benchmark number they probably never validated.

KPTI (Kernel Page Table Isolation), IBRS (Indirect Branch Restricted Speculation), and IBPB (Indirect Branch Predictor Barrier) each have real, measurable costs. KPTI causes a TLB flush on every user-kernel boundary crossing, which means syscall-heavy workloads — think Redis, PostgreSQL with lots of short queries, or anything doing frequent read()/write() calls — take a visible hit. On older Xeon E5 chips without PCID support, I measured 15–30% throughput drops in Redis benchmarks. On newer chips with PCID (most things post-Skylake), it's closer to 2–5%. IBRS adds overhead on every privilege level change. IBPB is the nuclear option — it flushes the branch predictor entirely, and you only want it on context switches between untrusted processes. You won't feel any of this on a web server handling 50 req/s. You'll absolutely feel it running 100K IOPS or a syscall-per-request workload at scale.
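
Whether you land in the 2-5% bucket or the 15-30% one mostly comes down to PCID support, which takes one command to check:

# PCID present means the KPTI TLB-flush cost stays in the low single digits
grep -qw pcid /proc/cpuinfo && echo "PCID supported" || echo "no PCID: expect a bigger KPTI hit"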

Check your actual boot parameters right now:

cat /proc/cmdline

# A clean production box looks like:
# BOOT_IMAGE=/vmlinuz-6.1.0-21-amd64 root=UUID=... ro quiet

# A box someone "optimized" looks like:
# BOOT_IMAGE=... nospectre_v2 nopti mitigations=off spectre_v2=off

If you see nospectre_v2, nopti, or individual mitigation disablement flags on a production machine, that's a conversation you need to have immediately — and document the outcome. Sometimes there's a reason (a legacy app with a known, isolated environment). More often, someone read a "speed up Linux" blog post from 2019 and applied it wholesale. The flags persist through reboots via /etc/default/grub or the bootloader config, so check those too:

grep -i "GRUB_CMDLINE" /etc/default/grub
# Then look for GRUB_CMDLINE_LINUX or GRUB_CMDLINE_LINUX_DEFAULT

AppArmor and SELinux are worth addressing here directly because I see them either completely ignored or oversold. They do not fix a kernel vulnerability. If an attacker can trigger a Spectre gadget or exploit a use-after-free in the kernel, MAC policies are not going to save you — the exploit runs before your policy engine gets a say. What they do buy you is containment time. If a service gets compromised while you're waiting for a kernel patch to land in your distro's stable repo (which can take days to weeks for less critical CVEs), AppArmor profiles that restrict ptrace, cap filesystem access, and deny raw socket creation meaningfully limit the blast radius. I run both AppArmor in enforce mode on Ubuntu/Debian systems and SELinux in enforcing mode on RHEL-family boxes. The setup time is front-loaded, and the ongoing cost is occasional "why is this broken" debugging — but it's a real layer, not theater.
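
The only ongoing check I do here is making sure enforcement hasn't been quietly flipped off during some earlier debugging session:

# Ubuntu/Debian: summary should show profiles in enforce mode
sudo aa-status | head -5

# RHEL-family: should print "Enforcing", not "Permissive"
getenforce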

The Tools I Actually Use and What I Dropped

The spreadsheet lasted exactly two months. I had a tab for each server, columns for CVE IDs, patch dates, and "who's responsible" — and by week six it was already lying to me. Servers had been rebooted without the sheet being updated, someone had manually applied a fix and not logged it, and two hosts showed "patched" because I'd copied a row down by accident. That's the thing about manual tracking: it degrades silently. You don't know it's wrong until something breaks.

Here's what's actually running in my stack right now:

  • OpenSCAP — I run oscap xccdf eval against CIS profiles monthly. The HTML reports are ugly but the data is precise. Pair it with scap-security-guide and you get RHEL/Ubuntu profiles without writing your own XCCDF content.
  • Canonical Livepatch — If you're on Ubuntu and haven't turned this on, you're leaving live kernel patching on the table for free (up to 5 machines on the free tier, $75/year per machine on Ubuntu Pro beyond that). It doesn't cover every CVE, but for the critical memory-corruption and privilege-escalation class of bugs, it buys you days of real patch window without a reboot.
  • needrestart — Underrated. After every apt upgrade, it tells you which processes are still running against old library versions and whether the kernel needs a reboot. I pipe its output into a Slack webhook so nobody can claim they didn't know a restart was needed.
  • Ansible — Orchestration, not just ad-hoc patching. I maintain a patch_kernel.yml playbook that handles the full cycle: pre-checks, the actual upgrade, a conditional reboot if needrestart flags it, and a post-check that confirms the booted kernel version matches what was installed.
  • node_exporter — The node_uname_info metric carries the full kernel version string. Feed it into a Grafana dashboard with a variable for expected kernel version per distro, and you instantly see which hosts are lagging. No custom tooling, just a PromQL query.
# PromQL to find hosts running outdated kernels
# assumes you tag expected version in a config map or recording rule
count by (instance, release) (
  node_uname_info{release!~"5.15.0-112-generic"}
)

Qualys VMDR is genuinely powerful — the asset correlation and CVSS-based prioritization are solid features. But at ~$300+ per asset per year for a fleet of 20-30 servers, you're paying for a dedicated vuln management workflow that requires someone to actually sit in it every week. We didn't have that person. The tool kept generating reports that nobody actioned because nobody owned the process. That's not a Qualys problem, it's a team-size problem, but the cost made walking away easy.

Trivy I kept, but only in CI. It's excellent for scanning container images and catching kernel-level CVEs in base images before they hit production — trivy image --ignore-unfixed ubuntu:22.04 gives you clean, actionable output in under 30 seconds. What it doesn't do well is host-level runtime tracking. It can't tell you whether a running kernel has had a specific CVE patched versus just updated, and it has no concept of Livepatch-applied fixes. Use it at the image build stage, not as a replacement for host scanning.

For smaller teams managing infra alongside product work — where nobody has a dedicated security ops role — the toolchain above covers the practical gap. But if you're also looking at consolidating monitoring, alerting, and some of this security surface into fewer panes of glass, it's worth reading through the Essential SaaS Tools for Small Business in 2026 guide. Some of those tools overlap more with infra monitoring than you'd expect from the name.

When to Pick What: Matching Strategy to Your Situation

The thing that surprises most people is how much setup complexity they don't actually need. I've watched sysadmins stand up Ansible playbooks, configure AWX, and set up custom OVAL pipelines for a 6-node fleet that runs a staging environment. That's overkill by a factor of three. Match the tooling to the real problem or you'll spend more time maintaining the patch infrastructure than the servers.

Small Fleet (Under 10 Nodes) on Ubuntu: Just Use What Ships

Canonical Livepatch free tier covers up to 5 machines, and Ubuntu Pro gives you 5 free personal machines or up to 50 for community contributors — check your eligibility at ubuntu.com/pro before paying for anything. Pair that with unattended-upgrades configured to actually apply security updates (not just download them), and you're done. Here's the config that matters:

# /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
    "${distro_id}ESMApps:${distro_codename}-apps-security";  // Ubuntu Pro
    "${distro_id}ESMI:${distro_codename}-infra-security";    // Ubuntu Pro kernel patches
};

// Actually reboot when the kernel needs it: this is off by default, which is insane
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";

Enable Livepatch so you're not rebooting every time a kernel CVE drops:

sudo pro attach YOUR_TOKEN
sudo pro enable livepatch
# Verify what's actually patched:
canonical-livepatch status --verbose

That's the whole strategy. A cron job that emails you the unattended-upgrades log weekly is more than enough visibility at this scale.
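
That cron job, for reference, is about as basic as it sounds; this version assumes a working local MTA and the mailutils mail command on the host:

# /etc/cron.d/patch-log-mail
0 8 * * 1 root tail -n 200 /var/log/unattended-upgrades/unattended-upgrades.log | mail -s "$(hostname): weekly patch log" ops@yourcompany.com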

Medium Fleet (10–100 Nodes), Mixed OS: Ansible + OVAL Scanning on a Schedule

This is where the real work lives. Mixed OS fleets — some Ubuntu 22.04, some Rocky 9, maybe a few RHEL 8 stragglers — mean you can't rely on one distro's tooling. I run OVAL scanning on a schedule to figure out what's actually vulnerable before I touch anything, then let Ansible handle the remediation. The scanning command that gives you actionable output:

# Pull the current OVAL feed for Ubuntu 22.04
wget https://security.ubuntu.com/oval/com.ubuntu.jammy.cve.oval.xml.bz2
bunzip2 com.ubuntu.jammy.cve.oval.xml.bz2

# Run the scan — this produces a machine-readable report
oscap oval eval \
  --report oval-report-$(hostname)-$(date +%F).html \
  --results oval-results-$(hostname)-$(date +%F).xml \
  com.ubuntu.jammy.cve.oval.xml

Then needrestart tells you which running services are using stale library versions after a patch run — install it on every node and wire it into your Ansible playbook's post-tasks so it runs automatically. The gotcha with needrestart is that in batch/non-interactive mode it won't actually restart services unless you pass -r a. Add needrestart -r a -q to your post-patch task and you won't get blindsided by services still holding the old shared libraries in memory.
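
Wired into the playbook, that's one task in post_tasks; a sketch (the -q just keeps needrestart's output from flooding the Ansible log):

  post_tasks:
    - name: Restart services still holding old shared libraries
      ansible.builtin.command: needrestart -r a -q
      become: true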

RHEL with an Active Subscription: Use Insights — You're Already Paying For It

RHEL Insights is genuinely good and most RHEL shops I've talked to either don't know it's included or assume it's another Red Hat product that requires a separate sales call. It's not. If you have a RHEL subscription, you have Insights. Register a system and you'll immediately get vulnerability advisories ranked by severity with specific CVE numbers, drift detection, and patch recommendations that account for your actual installed package versions — not generic OVAL output.

# Register a RHEL system with Insights (takes 2 minutes)
subscription-manager register --username=YOUR_RHN_USER --auto-attach
insights-client --register

# Check what it found on this host
insights-client --check-results

The Insights web console at console.redhat.com aggregates this across all your registered systems. You get a prioritized vulnerability list, remediations as downloadable Ansible playbooks, and patch timestamps. I switched a client's RHEL fleet from a hand-rolled Ansible + OVAL setup to Insights and the time spent per patching cycle dropped significantly — not because Insights is magic, but because the remediation playbooks it generates actually account for package dependencies correctly, which ours sometimes didn't.

High-Availability, No Maintenance Windows: Live Patching Is Non-Negotiable (With a Catch)

If you're running something where "we'll patch it Saturday at 2am" isn't an option — financial transaction processing, healthcare systems, anything with an SLA that makes reboots genuinely painful — live patching is the tool. On RHEL, that's kpatch. On Ubuntu, Livepatch. On vanilla kernels, you can wire up kpatch-build yourself if you enjoy suffering.

# RHEL: check what kpatch modules are loaded right now
kpatch list

# See which CVEs the installed kpatch-patch package covers: the RPM changelog is the source of truth
rpm -q --changelog "$(rpm -qa 'kpatch-patch*' | head -1)" | grep -i cve

The catch that nobody puts in the marketing: live patching does not cover every CVE. High CVSS memory corruption bugs? Usually yes. Anything requiring a data structure change in the kernel? Often no, because you can't safely patch that without a full restart. I've seen teams assume live patching means they never reboot, then discover during a quarterly audit that 30% of the CVEs from the past 6 months weren't covered. Check canonical-livepatch status --verbose or your kpatch changelog against the CVE list every quarter. You will find gaps.

Regulated Environments (PCI-DSS, SOC 2): Automate the Evidence From Day One

The difference between a smooth audit and a painful one isn't whether you patched — it's whether you can prove you patched, when you patched, what kernel version is running, and when you were notified of the vulnerability. Auditors want timestamps, not your word. Build the evidence pipeline before your first audit, not the week before.

# Log kernel version + patch timestamp to a central store on every patch run
PATCH_LOG="/var/log/patch-evidence/$(date +%F)-$(hostname).log"
mkdir -p /var/log/patch-evidence

echo "=== Patch Run: $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$PATCH_LOG"
echo "Hostname: $(hostname -f)" >> "$PATCH_LOG"
echo "Kernel: $(uname -r)" >> "$PATCH_LOG"
echo "Packages updated:" >> "$PATCH_LOG"
dnf history info last >> "$PATCH_LOG"   # RHEL
# or: grep "upgraded" /var/log/dpkg.log | tail -50 >> "$PATCH_LOG"  # Debian/Ubuntu

# Generate an oscap compliance report for the host
oscap xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_pci-dss \
  --report /var/log/patch-evidence/oscap-$(hostname)-$(date +%F).html \
  /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml

Ship those files to an S3 bucket or any append-only storage immediately after the run — don't leave them on the host being audited. The control you're proving to PCI QSAs is that you applied critical patches within 30 days (Requirement 6.3.3) and have a process. Timestamps in logs that you can pull on demand are that proof. If you're collecting this retroactively before an audit, you're already losing — the tooling should be running from the first week the environment exists.
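
The shipping step itself is one command at the end of the patch run; the bucket name here is a placeholder for whatever append-only store you use:

# Push today's evidence off-host right after the run
aws s3 cp /var/log/patch-evidence/ \
  "s3://your-audit-evidence-bucket/$(hostname -f)/" \
  --recursive --exclude "*" --include "$(date +%F)*"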




