Akhilesh

Posted on

68. PCA: Shrinking Data Without Losing Information

You have 100 features. Most of them are correlated. Training is slow. Visualization is impossible. Distance-based models like KNN degrade badly (the curse of dimensionality).

PCA is the tool that handles this. It takes your 100 features and finds 10 new features that capture 95% of the original information. Training gets faster, visualization becomes possible, and your models often get better too.

It's one of those techniques you'll use constantly once you understand it.


What You'll Learn Here

  • What PCA actually does in plain terms
  • What principal components and explained variance are
  • How to decide how many components to keep
  • PCA for visualization of high-dimensional data
  • PCA as a preprocessing step before ML models
  • What PCA can't do and when to skip it

The Core Idea: Find the Directions of Spread

Imagine you have data in 2D. A cloud of points that stretches more in one direction than another.

PCA finds the direction with the most spread (variance). That's the first principal component. Then it finds the direction with the second most spread that's perpendicular to the first. That's the second principal component.

If most of the spread is along PC1 and PC2, you can project your data onto just those two directions and keep most of the information. The other directions had little variance, meaning they contributed little signal.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Create correlated 2D data
np.random.seed(42)
mean   = [0, 0]
cov    = [[3, 2], [2, 2]]   # correlated features
X_2d   = np.random.multivariate_normal(mean, cov, 300)

# Fit PCA
pca_2d = PCA(n_components=2)
pca_2d.fit(X_2d)

pc1 = pca_2d.components_[0]
pc2 = pca_2d.components_[1]

plt.figure(figsize=(7, 5))
plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.4, color='steelblue', s=25)

# Draw the principal components as arrows
origin = X_2d.mean(axis=0)
scale  = 2
plt.arrow(*origin, *(scale * pc1), head_width=0.15, head_length=0.1,
          color='red',    label='PC1 (most variance)')
plt.arrow(*origin, *(scale * pc2), head_width=0.15, head_length=0.1,
          color='orange', label='PC2 (2nd most variance)')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Principal Components Show Directions of Maximum Variance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('pca_directions.png', dpi=100)
plt.show()

print(f"PC1 direction: {pc1.round(3)}")
print(f"PC2 direction: {pc2.round(3)}")
print(f"Variance explained by PC1: {pca_2d.explained_variance_ratio_[0]:.1%}")
print(f"Variance explained by PC2: {pca_2d.explained_variance_ratio_[1]:.1%}")

Output:

PC1 direction: [0.847 0.532]
PC2 direction: [-0.532  0.847]
Variance explained by PC1: 88.2%
Variance explained by PC2: 11.8%

PC1 captures 88% of the variance. If you only keep PC1, you keep 88% of the information. PC2 adds another 12%. Together they explain 100% because the original data was 2D.
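
To make that concrete, here's a small sketch that keeps only PC1 (it reuses X_2d from the block above, so that name is assumed to be in scope): project onto one direction, map back to 2D, and measure what was lost.

# Keep only PC1: project the 2D points onto a single direction and back
pca_1d = PCA(n_components=1)
X_1d   = pca_1d.fit_transform(X_2d)          # shape (300, 1)
X_back = pca_1d.inverse_transform(X_1d)      # back in 2D, flattened onto the PC1 line

print(f"Compressed shape:   {X_1d.shape}")
print(f"Variance kept:      {pca_1d.explained_variance_ratio_[0]:.1%}")
print(f"Reconstruction MSE: {np.mean((X_2d - X_back) ** 2):.3f}")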

In practice you start with 100+ dimensions and find that the first 10-20 components explain 95%+ of the variance.


PCA on Real Data: Digits Dataset

The digits dataset has 64 features (8x8 pixel images). Let's compress it.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data    # 1797 samples, 64 features
y = digits.target

print(f"Original shape: {X.shape}")

# Fit PCA to find all components
pca_full = PCA()
pca_full.fit(X)

# Explained variance ratio
evr = pca_full.explained_variance_ratio_
cumulative_evr = np.cumsum(evr)

# How many components to reach 95% variance?
n_95 = np.argmax(cumulative_evr >= 0.95) + 1
n_99 = np.argmax(cumulative_evr >= 0.99) + 1

print(f"Components for 95% variance: {n_95}")
print(f"Components for 99% variance: {n_99}")

# Plot the explained variance
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.bar(range(1, 21), evr[:20], color='steelblue')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Variance per Component (first 20)')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(evr)+1), cumulative_evr, color='blue', linewidth=2)
plt.axhline(0.95, color='red',    linestyle='--', label='95%')
plt.axhline(0.99, color='orange', linestyle='--', label='99%')
plt.axvline(n_95, color='red',    linestyle=':',  alpha=0.7)
plt.axvline(n_99, color='orange', linestyle=':',  alpha=0.7)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Explained Variance')
plt.legend()

plt.tight_layout()
plt.savefig('pca_explained_variance.png', dpi=100)
plt.show()

Output:

Original shape: (1797, 64)
Components for 95% variance: 29
Components for 99% variance: 41

You can represent 64-dimensional digit images in just 29 dimensions and keep 95% of the information. That's a 55% reduction.
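
To actually apply that compression (a quick sketch; assumes X from the block above is still in scope):

pca_29 = PCA(n_components=29)
X_29   = pca_29.fit_transform(X)

print(f"Compressed shape: {X_29.shape}")                               # (1797, 29)
print(f"Variance kept:    {pca_29.explained_variance_ratio_.sum():.1%}")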


How to Decide How Many Components to Keep

Three strategies:

Strategy 1: Cumulative explained variance threshold
Keep enough components to explain 95% (or 99%) of the variance. Most common.

# Keep 95% of variance
pca_95 = PCA(n_components=0.95)   # pass float between 0 and 1
pca_95.fit(X)
print(f"Components kept: {pca_95.n_components_}")

Strategy 2: Fixed number
When you know what you want (e.g., 2 for visualization, 50 for a pipeline).

pca_50 = PCA(n_components=50)
pca_50.fit(X)

Strategy 3: The elbow in the scree plot
Plot variance per component. Pick the point where adding more components gives diminishing returns.

plt.figure(figsize=(8, 4))
plt.plot(range(1, 21), evr[:20], marker='o', color='blue', linewidth=2)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot: Look for the Elbow')
plt.grid(True, alpha=0.3)
plt.savefig('scree_plot.png', dpi=100)
plt.show()

The elbow (where the curve flattens) suggests how many components carry most of the signal. Components past the elbow tend to add about as much noise as signal.
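
If you want a programmatic starting point rather than eyeballing the plot, one rough heuristic (a sketch, not a standard rule) is to look for the point of maximum curvature in the scree values, using evr from the digits example above:

# Rough elbow heuristic: the component where the curve bends hardest
# (largest second difference of the explained-variance values)
curvature = np.diff(evr[:20], 2)             # second differences
elbow     = int(np.argmax(curvature)) + 2    # +2 maps the index back to a component number
print(f"Heuristic elbow at component {elbow}")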


PCA for Visualization

The most common use: compress any high-dimensional data to 2D so you can see it.

from sklearn.preprocessing import StandardScaler

# Scale first (always before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compress to 2D
pca_2d_digits = PCA(n_components=2, random_state=42)
X_2d_digits   = pca_2d_digits.fit_transform(X_scaled)

print(f"Original: {X_scaled.shape}")
print(f"After PCA: {X_2d_digits.shape}")
print(f"Variance explained: {pca_2d_digits.explained_variance_ratio_.sum():.1%}")

# Plot colored by digit class
plt.figure(figsize=(9, 7))
scatter = plt.scatter(
    X_2d_digits[:, 0], X_2d_digits[:, 1],
    c=y, cmap='tab10', s=15, alpha=0.7
)
plt.colorbar(scatter, label='Digit')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Digits Dataset in 2D via PCA')
plt.savefig('pca_digits_2d.png', dpi=100)
plt.show()

Even in 2D, you can see clusters of similar digits. 0s cluster together. 1s cluster together. Some digits (4, 7, 9) overlap because they look similar.

Try 3D for even more separation:

from mpl_toolkits.mplot3d import Axes3D

pca_3d_digits = PCA(n_components=3, random_state=42)
X_3d_digits   = pca_3d_digits.fit_transform(X_scaled)

fig = plt.figure(figsize=(9, 7))
ax  = fig.add_subplot(111, projection='3d')
sc  = ax.scatter(
    X_3d_digits[:, 0], X_3d_digits[:, 1], X_3d_digits[:, 2],
    c=y, cmap='tab10', s=10, alpha=0.6
)
plt.colorbar(sc, label='Digit')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.title('Digits in 3D via PCA')
plt.savefig('pca_digits_3d.png', dpi=100)
plt.show()

PCA as a Preprocessing Step

PCA can improve downstream model performance on high-dimensional data. It removes low-variance noise dimensions and speeds up training.

Always: scale first, then PCA, then model. Use a Pipeline so no leakage happens.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# Compare: with and without PCA
pipeline_no_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(max_iter=1000, random_state=42))
])

pipeline_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=0.95, random_state=42)),
    ('model',  LogisticRegression(max_iter=1000, random_state=42))
])

scores_no_pca = cross_val_score(pipeline_no_pca, X, y, cv=5)
scores_pca    = cross_val_score(pipeline_pca,    X, y, cv=5)

print(f"Without PCA: {scores_no_pca.mean():.3f} +/- {scores_no_pca.std():.3f}")
print(f"With PCA:    {scores_pca.mean():.3f} +/- {scores_pca.std():.3f}")

Output:

Without PCA: 0.952 +/- 0.010
With PCA:    0.921 +/- 0.013

On this dataset, PCA slightly reduces accuracy. That's common with clean, well-structured data. PCA helps more on noisy datasets with many redundant features.
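
If you want to test that claim yourself, here's a sketch: pad the digits with junk features (200 columns of pure noise, a made-up setup for illustration) and rerun both pipelines from above. The outcome depends on the noise structure, so treat it as an experiment, not a guaranteed win for PCA.

rng = np.random.default_rng(42)
X_padded = np.hstack([X, rng.normal(size=(X.shape[0], 200))])   # 64 real + 200 noise features

scores_no_pca_noisy = cross_val_score(pipeline_no_pca, X_padded, y, cv=5)
scores_pca_noisy    = cross_val_score(pipeline_pca,    X_padded, y, cv=5)

print(f"Padded, without PCA: {scores_no_pca_noisy.mean():.3f}")
print(f"Padded, with PCA:    {scores_pca_noisy.mean():.3f}")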


Reconstructing Data From Components

PCA is reversible. You can compress data and then approximately reconstruct it. The reconstruction error tells you how much information was lost.

from sklearn.datasets import load_digits

digits = load_digits()
X_orig = digits.data

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_orig)

# Compress and reconstruct with different numbers of components
fig, axes = plt.subplots(3, 6, figsize=(14, 7))
sample_idx = 0  # show first digit

# Original (blank out the unused axes in the first row)
axes[0, 0].imshow(X_orig[sample_idx].reshape(8, 8), cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')
for j in range(1, 6):
    axes[0, j].axis('off')

for i, n_comp in enumerate([1, 2, 5, 10, 20, 29]):
    pca_r = PCA(n_components=n_comp, random_state=42)
    X_comp = pca_r.fit_transform(X_scaled)
    X_recon_scaled = pca_r.inverse_transform(X_comp)
    X_recon = scaler.inverse_transform(X_recon_scaled)

    variance_kept = pca_r.explained_variance_ratio_.sum()

    axes[1, i].imshow(X_recon[sample_idx].reshape(8, 8), cmap='gray')
    axes[1, i].set_title(f'n={n_comp}\n({variance_kept:.0%})')
    axes[1, i].axis('off')

    # Reconstruction error
    mse = np.mean((X_orig[sample_idx] - X_recon[sample_idx]) ** 2)
    axes[2, i].bar(['Error'], [mse], color='steelblue')
    axes[2, i].set_ylim(0, 50)
    axes[2, i].set_title(f'MSE={mse:.1f}')

plt.tight_layout()
plt.savefig('pca_reconstruction.png', dpi=100)
plt.show()

With 1 component you get a blurry mess. With 29 components (95% variance) the digit is clearly recognizable. With more components the reconstruction gets closer to the original.

This visualization is a great way to feel what "explaining 95% of variance" actually means.


What PCA Assumes and When It Fails

PCA is powerful but has real limitations.

PCA assumes linear relationships.
It finds linear combinations of features. If the important structure in your data is non-linear (like a spiral or a manifold), PCA will miss it. Use t-SNE or UMAP for non-linear visualization.

PCA is unsupervised.
It finds directions of maximum variance, not directions that best separate classes. Sometimes the most variance in data has nothing to do with what you're trying to predict.

PCA components are not interpretable.
Each component is a linear combination of all the original features. You can't easily say "PC1 means house size"; it's something like 0.23 * size + 0.18 * age - 0.14 * distance + ... That's the price you pay for compression.
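
You can still inspect the weights (loadings) if you want to see what a component is built from. A small sketch using the 2-component digits PCA from earlier (pca_2d_digits assumed to be in scope):

# The five original features (pixels) that weigh most heavily in PC1
loadings_pc1 = pca_2d_digits.components_[0]
top_pixels   = np.argsort(np.abs(loadings_pc1))[::-1][:5]
for idx in top_pixels:
    print(f"pixel {idx}: weight {loadings_pc1[idx]:+.3f}")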

PCA requires scaling.
If one feature has values in the millions and another between 0 and 1, PCA will find directions that mostly capture the variance of the large-scale feature. Always apply a StandardScaler before PCA.

# Proof that scaling matters
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA

housing = fetch_california_housing()
X_h = housing.data

# Without scaling
pca_unscaled = PCA(n_components=2)
pca_unscaled.fit(X_h)
print("Without scaling - variance explained by PC1:")
print(f"  {pca_unscaled.explained_variance_ratio_[0]:.1%}  <- first feature dominates")

# With scaling
from sklearn.preprocessing import StandardScaler
X_h_scaled = StandardScaler().fit_transform(X_h)
pca_scaled = PCA(n_components=2)
pca_scaled.fit(X_h_scaled)
print("With scaling - variance explained by PC1:")
print(f"  {pca_scaled.explained_variance_ratio_[0]:.1%}  <- more balanced")

Output:

Without scaling - variance explained by PC1:
  99.9%  <- first feature dominates

With scaling - variance explained by PC1:
  34.8%  <- more balanced

Without scaling, a single feature with large numeric values dominates PC1, which then accounts for 99.9% of the variance. That tells you almost nothing useful.


PCA for Noise Reduction

Another use: reduce noise in data by keeping only the top components and discarding the noisy ones.

# Add noise to digits data and see if PCA helps denoise
X_noisy = X + np.random.normal(0, 4, X.shape)

# Scale: fit the scaler on the clean data, then apply the same scaling to both
scaler_n  = StandardScaler().fit(X)
X_clean_s = scaler_n.transform(X)
X_noisy_s = scaler_n.transform(X_noisy)

# Compress and reconstruct to denoise
pca_denoise = PCA(n_components=29, random_state=42)
pca_denoise.fit(X_clean_s)

X_noisy_comp  = pca_denoise.transform(X_noisy_s)
X_denoised_s  = pca_denoise.inverse_transform(X_noisy_comp)
X_denoised    = scaler_n.inverse_transform(X_denoised_s)

# Compare original, noisy, denoised
fig, axes = plt.subplots(3, 5, figsize=(12, 7))
for i in range(5):
    axes[0, i].imshow(X[i].reshape(8, 8),          cmap='gray')
    axes[1, i].imshow(X_noisy[i].reshape(8, 8),    cmap='gray')
    axes[2, i].imshow(X_denoised[i].reshape(8, 8), cmap='gray')

# Label the rows (axis('off') would hide the labels, so only drop the ticks)
for r, label in enumerate(['Original', 'Noisy', 'Denoised']):
    axes[r, 0].set_ylabel(label)
for ax in axes.flat:
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle('PCA for Noise Reduction')
plt.tight_layout()
plt.savefig('pca_denoising.png', dpi=100)
plt.show()

The denoised images are much cleaner than the noisy ones. PCA projected the noisy data into a lower-dimensional clean space, then reconstructed it. The noise, which spreads across many components with low variance, gets discarded.
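
To put numbers on it (a quick check, using the arrays from the block above):

mse_noisy    = np.mean((X - X_noisy) ** 2)
mse_denoised = np.mean((X - X_denoised) ** 2)
print(f"MSE noisy vs original:    {mse_noisy:.2f}")
print(f"MSE denoised vs original: {mse_denoised:.2f}")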


Quick Cheat Sheet

Task                   Code
Fit PCA                PCA(n_components=50).fit(X_scaled)
Keep 95% variance      PCA(n_components=0.95)
Transform data         pca.transform(X)
Fit and transform      pca.fit_transform(X)
Reconstruct            pca.inverse_transform(X_compressed)
Variance explained     pca.explained_variance_ratio_
Cumulative variance    np.cumsum(pca.explained_variance_ratio_)
Components             pca.components_  (shape: n_components x n_features)
Full pipeline          Pipeline([('scaler', StandardScaler()), ('pca', PCA(0.95)), ('model', ...)])

Practice Challenges

Level 1:
Load load_breast_cancer(). Scale it. Apply PCA keeping 95% variance. How many components does that require from the original 30? Train a LogisticRegression before and after PCA. Does accuracy change?

Level 2:
Load any dataset with many features. Plot the scree plot and cumulative explained variance. Find the elbow. Try three different component counts: at the elbow, 50% variance, and 99% variance. Compare classifier performance for each.

Level 3:
Load the digits dataset. Add Gaussian noise (std=5). Apply PCA with 10, 20, and 40 components. For each, reconstruct the images and calculate MSE vs the original clean images. Which component count gives the best denoising? Plot original, noisy, and all three reconstructions side by side.


Next up, Post 69: Feature Engineering: Building Better Inputs. The algorithm matters less than what you feed it. Here's how to create, transform, and select features that actually help your model learn.
