DEV Community

67. DBSCAN: Clustering That Handles Messy Data

Akhilesh

In the last post, K-Means failed on crescent-shaped data. It cut across the natural curves instead of following them. You also had to tell it K upfront. And one outlier could drag a centroid completely off course.

DBSCAN fixes all three problems.

It finds clusters based on density, not distance to a centroid. It discovers K automatically. It labels outliers explicitly instead of forcing them into a cluster.

Different idea. Different use cases. Worth knowing.


What You'll Learn Here

  • How density-based clustering works
  • What eps and min_samples actually control
  • Core points, border points, and noise points explained
  • How to tune DBSCAN parameters properly
  • When DBSCAN wins and when K-Means is still better
  • Anomaly detection with DBSCAN
  • Full working code

The Core Idea: Density

K-Means asks: what's the nearest centroid?

DBSCAN asks: how many neighbors does this point have within radius epsilon?

If a point has at least min_samples neighbors within distance eps, it's a core point. Core points form the dense heart of a cluster.

Points that are within eps of a core point but don't have enough neighbors themselves are border points. They're on the edge of a cluster.

Points that are not within eps of any core point are noise. DBSCAN labels them -1. They don't belong to any cluster.

Core point:   has >= min_samples neighbors within eps
Border point: within eps of a core point, but < min_samples neighbors itself
Noise:        not within eps of any core point

Two clusters are distinct if there's no chain of core points connecting them.
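The core-point test is just a neighbor count within radius eps. Here is a minimal brute-force sketch of the three rules above (illustrative only: scikit-learn uses spatial indexes instead of an O(n²) distance matrix):

```python
import numpy as np

def classify_points(X, eps=0.5, min_samples=3):
    """Naive O(n^2) classification of points as core, border, or noise."""
    # Pairwise Euclidean distances
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point's neighborhood includes itself, matching scikit-learn's convention
    neighbor_counts = (dist <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_samples
    # Border: not core, but within eps of some core point
    near_core = (dist[:, is_core] <= eps).any(axis=1)
    is_border = ~is_core & near_core
    is_noise = ~is_core & ~is_border
    return is_core, is_border, is_noise

# Tiny example: a dense clump of four points plus one far-away point
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]])
core, border, noise = classify_points(X, eps=0.3, min_samples=4)
print(core.sum(), border.sum(), noise.sum())  # 4 core, 0 border, 1 noise
```

The four clumped points each count four neighbors (themselves included) and become core; the isolated point is within eps of nothing and becomes noise.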


Visualizing the Three Point Types

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Data that K-Means can't handle
X_moons, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Run DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X_moons)

# Identify point types
core_mask   = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True
border_mask = (~core_mask) & (labels != -1)
noise_mask  = labels == -1

print(f"Core points:   {core_mask.sum()}")
print(f"Border points: {border_mask.sum()}")
print(f"Noise points:  {noise_mask.sum()}")
print(f"Clusters found: {len(set(labels)) - (1 if -1 in labels else 0)}")

Output:

Core points:   250
Border points: 45
Noise points:  5
Clusters found: 2

# Plot each type differently
plt.figure(figsize=(8, 5))

# Core points colored by cluster
plt.scatter(X_moons[core_mask, 0],   X_moons[core_mask, 1],
            c=labels[core_mask], cmap='bwr', s=40, alpha=0.8, label='Core')

# Border points same color, smaller
plt.scatter(X_moons[border_mask, 0], X_moons[border_mask, 1],
            c=labels[border_mask], cmap='bwr', s=15, alpha=0.5, label='Border')

# Noise in black
plt.scatter(X_moons[noise_mask, 0],  X_moons[noise_mask, 1],
            c='black', s=60, marker='x', linewidths=2, label='Noise')

plt.title('DBSCAN: Core, Border, and Noise Points')
plt.legend()
plt.savefig('dbscan_point_types.png', dpi=100)
plt.show()

Comparing DBSCAN to K-Means on Shapes K-Means Can't Handle

from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

fig, axes = plt.subplots(2, 3, figsize=(15, 9))
datasets = [
    ('Moons',   make_moons(n_samples=300,   noise=0.08, random_state=42)),
    ('Circles', make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=42)),
]

# Add a blob dataset with outliers
np.random.seed(42)
from sklearn.datasets import make_blobs
X_blob, _ = make_blobs(n_samples=280, centers=3, random_state=42)
X_outliers = np.random.uniform(-10, 10, (20, 2))
X_noise_data = np.vstack([X_blob, X_outliers])
datasets.append(('Blobs + Outliers', (X_noise_data, None)))

dbscan_params = [
    {'eps': 0.2,  'min_samples': 5},
    {'eps': 0.3,  'min_samples': 5},
    {'eps': 1.5,  'min_samples': 5},
]

for col, ((name, (X_d, _)), params) in enumerate(zip(datasets, dbscan_params)):

    # K-Means
    km = KMeans(n_clusters=2 if col < 2 else 3, random_state=42, n_init=10)
    km_labels = km.fit_predict(X_d)
    axes[0, col].scatter(X_d[:, 0], X_d[:, 1], c=km_labels, cmap='tab10', s=20, alpha=0.7)
    axes[0, col].set_title(f'K-Means on {name}')

    # DBSCAN
    db_d = DBSCAN(**params)
    db_labels = db_d.fit_predict(X_d)
    n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
    axes[1, col].scatter(X_d[:, 0], X_d[:, 1], c=db_labels, cmap='tab10', s=20, alpha=0.7)
    axes[1, col].set_title(f'DBSCAN on {name} ({n_clusters} clusters)')

plt.tight_layout()
plt.savefig('dbscan_vs_kmeans.png', dpi=100)
plt.show()

DBSCAN correctly follows the crescent and circle shapes. K-Means cuts them up with straight boundaries.


The Two Parameters: eps and min_samples

These are the only two knobs in DBSCAN. Getting them right is the main challenge.

eps (epsilon): the radius of the neighborhood around each point. If you can reach another point within eps distance, they're neighbors.

  • Too small: most points become noise. Clusters fragment.
  • Too large: everything merges into one big cluster.

min_samples: minimum number of neighbors within eps to be a core point.

  • Too small (like 2): noisy points accidentally become core points.
  • Too large: real cluster members get labeled as noise.

from sklearn.preprocessing import StandardScaler

# Always scale before DBSCAN
X_s = StandardScaler().fit_transform(X_moons)

print(f"{'eps':<8} {'min_s':<8} {'Clusters':<12} {'Noise %'}")
print("-" * 42)

for eps in [0.1, 0.2, 0.3, 0.5, 1.0]:
    for min_s in [3, 5, 10]:
        db_test = DBSCAN(eps=eps, min_samples=min_s)
        lbl = db_test.fit_predict(X_s)
        n_clusters = len(set(lbl)) - (1 if -1 in lbl else 0)
        noise_pct  = (lbl == -1).mean() * 100
        print(f"{eps:<8} {min_s:<8} {n_clusters:<12} {noise_pct:.1f}%")
    print()

Output:

eps      min_s    Clusters     Noise %
------------------------------------------
0.1      3        8            7.7%
0.1      5        5            19.0%
0.1      10       2            42.3%

0.2      3        2            0.3%
0.2      5        2            1.7%    <- good
0.2      10       2            11.3%

0.3      3        2            0.0%
0.3      5        2            0.0%
0.3      10       2            0.0%

0.5      3        1            0.0%
0.5      5        1            0.0%    <- everything merged
...

At eps=0.2, min_samples=5 the algorithm finds 2 clusters with minimal noise. That's the sweet spot for this data.


The K-Distance Graph: Finding eps Systematically

You shouldn't guess eps. There's a principled way to find a good starting value.

For each point, calculate the distance to its K-th nearest neighbor (where K = min_samples). Sort those distances and plot them. Look for the "knee" in the curve.

from sklearn.neighbors import NearestNeighbors

min_samples = 5

# Fit nearest neighbors
nbrs = NearestNeighbors(n_neighbors=min_samples)
nbrs.fit(X_s)

# Get distances to min_samples-th nearest neighbor
distances, _ = nbrs.kneighbors(X_s)
kth_distances = distances[:, -1]  # distance to k-th neighbor
kth_distances_sorted = np.sort(kth_distances)[::-1]

plt.figure(figsize=(8, 5))
plt.plot(kth_distances_sorted, color='blue', linewidth=2)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'Distance to {min_samples}-th nearest neighbor')
plt.title('K-Distance Graph: Look for the Knee')
plt.grid(True, alpha=0.3)
plt.savefig('k_distance_graph.png', dpi=100)
plt.show()

# The knee is where the curve bends sharply
# Points above the knee → noise
# Points below the knee → cluster members
print("Look for the knee in the plot to choose eps")
print(f"Suggested eps range: {kth_distances_sorted[int(len(kth_distances_sorted)*0.1)]:.3f}"
      f" to {kth_distances_sorted[int(len(kth_distances_sorted)*0.05)]:.3f}")

The knee in the K-distance graph is your suggested eps. Choose eps below that value and too many points get labeled as noise. Go well above it and clusters start merging into one.
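Eyeballing the knee works, but you can also estimate it programmatically. One simple heuristic (an illustrative sketch, not the only option) picks the point on the sorted curve farthest from the straight line through its endpoints:

```python
import numpy as np

def knee_point(sorted_dists):
    """Index of the point farthest from the line joining the curve's endpoints."""
    n = len(sorted_dists)
    x = np.arange(n)
    y = np.asarray(sorted_dists, dtype=float)
    x1, y1, x2, y2 = 0, y[0], n - 1, y[-1]
    # Perpendicular distance of each point to the endpoint-to-endpoint line
    d = np.abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
    d /= np.hypot(y2 - y1, x2 - x1)
    return int(np.argmax(d))

# Synthetic descending curve: sharp drop at the start, then a flat tail,
# mimicking a k-distance plot sorted from largest to smallest
curve = np.concatenate([np.linspace(3.0, 1.0, 10), np.full(90, 1.0)])
idx = knee_point(curve)
print(f"Knee at index {idx}, suggested eps {curve[idx]:.2f}")
```

Applied to kth_distances_sorted from the snippet above, curve[idx] gives a starting eps that you then refine by hand.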


Anomaly Detection With DBSCAN

DBSCAN's noise label (-1) is naturally useful for anomaly detection. Points that don't belong to any cluster are outliers.

import pandas as pd

# Simulate transaction data with some fraudulent transactions
np.random.seed(42)
n_normal  = 500
n_fraud   = 20

normal_transactions = pd.DataFrame({
    'amount':   np.random.normal(50, 15, n_normal),
    'hour':     np.random.randint(8, 20, n_normal),     # normal business hours
    'items':    np.random.randint(1, 10, n_normal),
})

fraud_transactions = pd.DataFrame({
    'amount':  np.random.normal(800, 100, n_fraud),     # unusually large amounts
    'hour':    np.random.randint(0, 5, n_fraud),         # odd hours
    'items':   np.random.randint(20, 50, n_fraud),       # many items
})

all_data = pd.concat([normal_transactions, fraud_transactions], ignore_index=True)
true_fraud = np.array([0]*n_normal + [1]*n_fraud)

# Scale and cluster
from sklearn.preprocessing import StandardScaler
X_trans = StandardScaler().fit_transform(all_data)

db_fraud = DBSCAN(eps=0.5, min_samples=10)
labels_fraud = db_fraud.fit_predict(X_trans)

# Points labeled -1 are anomalies
predicted_fraud = (labels_fraud == -1).astype(int)

from sklearn.metrics import classification_report
print("Anomaly Detection Results:")
print(classification_report(true_fraud, predicted_fraud,
                              target_names=['normal', 'fraud']))

The fraud transactions cluster away from normal ones. DBSCAN labels them as noise. That noise label becomes your fraud flag.
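One wrinkle worth knowing: DBSCAN has no predict() method, so it can't score new transactions directly. A common workaround (sketched here on made-up stand-in data, not the pipeline above) flags a new point as an anomaly when it isn't within eps of any fitted core sample:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, size=(300, 2))   # stand-in for scaled transactions

eps, min_samples = 0.5, 10
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_train)
core_points = X_train[db.core_sample_indices_]

# A new point is "noise" if its nearest core sample is farther than eps
nn = NearestNeighbors(n_neighbors=1).fit(core_points)

def is_anomaly(X_new):
    dist, _ = nn.kneighbors(np.atleast_2d(X_new))
    return dist[:, 0] > eps

print(is_anomaly([[0.0, 0.0]]))   # near the dense blob -> not an anomaly
print(is_anomaly([[8.0, 8.0]]))   # far from everything -> anomaly
```

This mirrors DBSCAN's own rule: a point joins a cluster only by being within eps of a core point, so anything farther away would have been labeled -1 during training too.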


DBSCAN on Real Data: Iris

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score

iris = load_iris()
X_iris = StandardScaler().fit_transform(iris.data)
y_iris = iris.target

# Try different parameters
print(f"{'eps':<8} {'min_s':<8} {'K':<6} {'Noise':<8} {'ARI':<8} {'Sil'}")
print("-" * 50)

for eps in [0.5, 0.8, 1.0, 1.2, 1.5]:
    for min_s in [3, 5, 8]:
        db_iris = DBSCAN(eps=eps, min_samples=min_s)
        lbl = db_iris.fit_predict(X_iris)
        n_k    = len(set(lbl)) - (1 if -1 in lbl else 0)
        n_noise = (lbl == -1).sum()

        if n_k >= 2:  # silhouette needs at least 2 clusters
            sil = silhouette_score(X_iris, lbl)
            ari = adjusted_rand_score(y_iris, lbl)
            print(f"{eps:<8} {min_s:<8} {n_k:<6} {n_noise:<8} {ari:.3f}    {sil:.3f}")

DBSCAN vs K-Means: When to Use Which

Use DBSCAN when:
- You don't know K upfront
- Clusters have irregular shapes (crescents, rings, blobs of any form)
- You want outlier detection built in
- Clusters have very different sizes (though plain DBSCAN still struggles when densities vary widely)
- Data has meaningful noise that shouldn't be forced into a cluster

Use K-Means when:
- Clusters are roughly spherical/convex
- You know approximately how many clusters you want
- Dataset is very large (DBSCAN is slower on millions of points)
- You need fast, predictable results
- All data should be assigned to a cluster (no noise points)

# Quick runtime comparison on larger data
import time
from sklearn.datasets import make_blobs

X_large, _ = make_blobs(n_samples=50000, centers=5, random_state=42)
X_large_s  = StandardScaler().fit_transform(X_large)

# K-Means
start = time.time()
KMeans(n_clusters=5, random_state=42, n_init=3).fit(X_large_s)
print(f"K-Means (50k points): {time.time()-start:.2f}s")

# DBSCAN
start = time.time()
DBSCAN(eps=0.3, min_samples=10).fit(X_large_s)
print(f"DBSCAN  (50k points): {time.time()-start:.2f}s")

Output:

K-Means (50k points): 0.38s
DBSCAN  (50k points): 2.14s

K-Means is significantly faster on large datasets. For very large data, consider HDBSCAN or Approximate Nearest Neighbor versions of DBSCAN.


Complete Workflow With Best Practices

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

def run_dbscan(X, min_samples=5, plot=True):

    # Step 1: Scale
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Step 2: Compute the k-distances used by the eps heuristic
    nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
    distances, _ = nbrs.kneighbors(X_scaled)

    # Step 3: Fit DBSCAN with a suggested eps.
    # Heuristic: the 95th percentile of the k-distances sits near the knee
    # when the noise fraction is small. Treat it as a starting point, not final.
    suggested_eps = np.percentile(distances[:, -1], 95)
    print(f"Suggested eps: {suggested_eps:.3f}")

    db = DBSCAN(eps=suggested_eps, min_samples=min_samples)
    labels = db.fit_predict(X_scaled)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise    = (labels == -1).sum()

    print(f"Clusters found: {n_clusters}")
    print(f"Noise points:   {n_noise} ({n_noise/len(X)*100:.1f}%)")

    if n_clusters >= 2:
        sil = silhouette_score(X_scaled, labels)
        print(f"Silhouette score: {sil:.3f}")

    if plot and X.shape[1] == 2:
        plt.figure(figsize=(7, 5))
        unique_labels = set(labels)
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_labels)))

        for label, color in zip(sorted(unique_labels), colors):
            if label == -1:
                plt.scatter(X[labels == label, 0], X[labels == label, 1],
                            c='black', s=60, marker='x', label='Noise')
            else:
                plt.scatter(X[labels == label, 0], X[labels == label, 1],
                            color=color, s=30, alpha=0.7, label=f'Cluster {label}')

        plt.title(f'DBSCAN Result: {n_clusters} clusters')
        plt.legend()
        plt.savefig('dbscan_result.png', dpi=100)
        plt.show()

    return labels

X_demo, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
labels_demo = run_dbscan(X_demo, min_samples=5)

Quick Cheat Sheet

| Task | Code |
|------|------|
| Basic DBSCAN | `DBSCAN(eps=0.5, min_samples=5).fit_predict(X)` |
| Always scale first | `StandardScaler().fit_transform(X)` |
| Find eps | K-distance graph: plot sorted distances to k-th neighbor |
| Get noise points | `labels == -1` |
| Get core points | `db.core_sample_indices_` |
| Count clusters | `len(set(labels)) - (1 if -1 in labels else 0)` |
| Evaluate quality | `silhouette_score(X, labels)` |
| Compare to truth | `adjusted_rand_score(true_labels, labels)` |

Practice Challenges

Level 1:
Run both DBSCAN and K-Means on make_circles(noise=0.05). Plot the results side by side. Which one correctly identifies the two circles?

Level 2:
On the iris dataset, use the k-distance graph to find a good eps. Then tune min_samples between 3 and 15. Report the ARI and silhouette score for each combination. Which parameters give the closest match to the true species?

Level 3:
Generate a dataset with 3 clusters of very different sizes (100, 500, 1000 points) plus 50 outliers. Run both K-Means and DBSCAN. Compare how many outliers each one correctly identifies. Does DBSCAN handle the size difference well?


Next up, Post 68: PCA: Shrinking Data Without Losing Information. Too many features slow everything down and cause the curse of dimensionality. PCA finds the directions of maximum variance and lets you keep 95% of the information in far fewer dimensions.
