Akhilesh
69. Feature Engineering: Building Better Inputs

You've tried three different algorithms. None of them break 78% accuracy. You add dropout, tune hyperparameters, try XGBoost. Still stuck.

Then you create one new feature from the existing data. Accuracy jumps to 86%.

That's feature engineering. And it's the part of ML that makes the biggest difference in practice. Not the algorithm. Not the hyperparameters. The features.

This post covers the core techniques you'll actually use on real datasets.


What You'll Learn Here

  • Why features matter more than algorithms
  • Handling categorical variables: label encoding vs one-hot encoding
  • Scaling and transformation: when and why
  • Creating new features from existing ones
  • Interaction features and polynomial features
  • Handling dates and times
  • Domain-specific feature ideas
  • Feature selection: dropping what doesn't help

Why Features Beat Algorithms

Here's a concrete example. You're predicting house prices. You have:

  • bedrooms: 3
  • bathrooms: 2
  • square_feet: 1800

A few simple calculations give you:

  • bed_bath_ratio: 1.5 (bedrooms per bathroom)
  • price_per_sqft: sale price divided by square feet (careful: this is derived from the target, so it leaks if the model is predicting price)
  • total_rooms: bedrooms + bathrooms

That ratio might tell the model something neither raw number could. A house with 5 bedrooms and 1 bathroom signals something completely different from a house with 5 bedrooms and 4 bathrooms. The ratio captures that relationship.

Good features compress domain knowledge into numbers the model can use. No algorithm can discover what it was never told.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Baseline score
baseline = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    X, y, cv=5, scoring='r2'
)
print(f"Baseline R2: {baseline.mean():.3f}")

# Add engineered features
X_eng = X.copy()
X_eng['rooms_per_person']  = X['AveRooms']  / X['AveOccup']
X_eng['beds_per_room']     = X['AveBedrms'] / X['AveRooms']
X_eng['pop_per_household'] = X['Population'] / X['AveOccup']

engineered = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    X_eng, y, cv=5, scoring='r2'
)
print(f"With features R2: {engineered.mean():.3f}")
print(f"Improvement: +{(engineered.mean() - baseline.mean()):.3f}")

Output:

Baseline R2: 0.789
With features R2: 0.806
Improvement: +0.017

Three new features. An extra 1.7 points of R². No algorithm change.


Encoding Categorical Variables

Most ML algorithms need numbers. When you have text categories, you need to convert them.

Label Encoding
Assigns an integer to each category. Fine for tree-based models. Bad for linear models because it implies an order (cat=2 is not "twice" cat=1). Strictly speaking, scikit-learn's LabelEncoder is documented for encoding targets, not features; OrdinalEncoder is the feature-side equivalent, but the pattern below is common in practice.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

print(df)
print(f"\nMapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Output:

   color  color_encoded
0    red              2
1   blue              0
2  green              1
3   blue              0
4    red              2

Mapping: {'blue': 0, 'green': 1, 'red': 2}

One-Hot Encoding
Creates a binary column for each category. No false ordering. Works for all models. Can create many columns if there are many categories.

df_onehot = pd.get_dummies(df['color'], prefix='color')
print(df_onehot)

Output:

   color_blue  color_green  color_red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1

Ordinal Encoding
For categories with a real order: Small < Medium < Large.

from sklearn.preprocessing import OrdinalEncoder

size_data = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small', 'Large']})

oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_data['size_encoded'] = oe.fit_transform(size_data[['size']])
print(size_data)

Output:

     size  size_encoded
0   Small           0.0
1   Large           2.0
2  Medium           1.0
3   Small           0.0
4   Large           2.0

High-cardinality categories: Target encoding

When a category has 500+ unique values (like zip codes), one-hot creates 500 columns. Target encoding replaces each category with the mean of the target for that category.

# Target encoding example
df_target = pd.DataFrame({
    'city':       ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'house_price': [800, 600, 850, 400, 650, 780]
})

# Replace city with mean price per city
city_means = df_target.groupby('city')['house_price'].mean()
df_target['city_encoded'] = df_target['city'].map(city_means)
print(df_target)

Output:

      city  house_price  city_encoded
0      NYC          800        810.0
1       LA          600        625.0
2      NYC          850        810.0
3  Chicago          400        400.0
4       LA          650        625.0
5      NYC          780        810.0

Warning: target encoding leaks information if you compute it before the train/test split, because each row's encoding then includes its own target value. Always fit the encoding on training data only.
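A leak-free version of the encoding above might look like this sketch: fit the per-city means on training rows only, then map them onto the test rows, falling back to the global training mean for cities the training set never saw. The toy data here is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'city':        ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA'],
    'house_price': [800, 600, 850, 400, 650, 780, 420, 610],
})

train, test = train_test_split(df, test_size=0.25, random_state=42)

# Fit the mapping on the training rows only
city_means  = train.groupby('city')['house_price'].mean()
global_mean = train['house_price'].mean()  # fallback for unseen cities

train = train.assign(city_encoded=train['city'].map(city_means))
test  = test.assign(city_encoded=test['city'].map(city_means).fillna(global_mean))
print(test[['city', 'city_encoded']])
```

In production you would wrap this in a custom transformer (or use a library like category_encoders) so it runs safely inside cross-validation.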


Scaling and Transformations

Some features need to be transformed before they're useful.

Log transformation for skewed features

Many real-world features are heavily right-skewed. Income. House prices. Population. Taking the log makes the distribution more symmetric and helps linear models.

import matplotlib.pyplot as plt
import numpy as np

# Skewed data
incomes = np.random.exponential(scale=50000, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].hist(incomes, bins=50, color='steelblue')
axes[0].set_title('Raw Income (skewed)')
axes[0].set_xlabel('Income')

axes[1].hist(np.log1p(incomes), bins=50, color='orange')
axes[1].set_title('Log(Income + 1) (more symmetric)')
axes[1].set_xlabel('log(Income)')

plt.tight_layout()
plt.savefig('log_transform.png', dpi=100)
plt.show()
# In a real pipeline (X is the California housing DataFrame from earlier)
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X[['Population']])
print(f"Before: mean={X['Population'].mean():.0f}, std={X['Population'].std():.0f}")
print(f"After:  mean={X_log.mean():.2f}, std={X_log.std():.2f}")

Power transformation for normalizing distributions

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')  # handles negative values too
X_transformed = pt.fit_transform(X[['MedInc', 'Population', 'AveRooms']])
print("Distributions after power transform are more Gaussian-like")
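To see what that claim means concretely, you can measure skewness before and after the transform. This sketch uses synthetic exponential data (not the housing set) and scipy.stats.skew:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
x = rng.exponential(scale=50000, size=2000).reshape(-1, 1)  # right-skewed

pt = PowerTransformer(method='yeo-johnson')
x_t = pt.fit_transform(x)

print(f"Skewness before: {skew(x.ravel()):.2f}")   # strongly positive
print(f"Skewness after:  {skew(x_t.ravel()):.2f}") # close to zero
```

The exponential distribution has a theoretical skewness of 2; after the Yeo-Johnson transform the sample skewness lands near 0, which is exactly the "more Gaussian-like" effect the print statement above describes.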

Creating New Features From Existing Ones

This is the creative part. You combine, divide, subtract, and multiply features to capture relationships the model might miss.

import pandas as pd
import numpy as np

# Simulated customer dataset
np.random.seed(42)
n = 1000

customers = pd.DataFrame({
    'total_spend':    np.random.exponential(200, n),
    'n_orders':       np.random.randint(1, 50, n),
    'days_since_join':np.random.randint(30, 1000, n),
    'last_purchase':  np.random.randint(1, 365, n),
    'n_returns':      np.random.randint(0, 10, n),
    'n_complaints':   np.random.randint(0, 5, n),
})

# Ratio features
customers['avg_order_value']   = customers['total_spend'] / customers['n_orders']
customers['return_rate']       = customers['n_returns'] / customers['n_orders']
customers['spend_per_day']     = customers['total_spend'] / customers['days_since_join']

# Difference features
customers['recency_frequency_gap'] = customers['last_purchase'] - (365 / customers['n_orders'])

# Aggregation features
customers['problem_score'] = customers['n_returns'] + customers['n_complaints'] * 2

# Binary flag features
customers['is_high_value']    = (customers['total_spend'] > 500).astype(int)
customers['is_recent_buyer']  = (customers['last_purchase'] < 30).astype(int)
customers['has_complained']   = (customers['n_complaints'] > 0).astype(int)

# Binning continuous features
customers['spend_bucket'] = pd.cut(
    customers['total_spend'],
    bins=[0, 100, 300, 600, np.inf],
    labels=['low', 'medium', 'high', 'premium']
)

print(customers.head())
print(f"\nOriginal features: 6, New total: {len(customers.columns)}")

Date and Time Features

Dates carry a lot of information that models can't use in raw form. You need to extract it.

import pandas as pd

# Sample transaction log
df_dates = pd.DataFrame({
    'transaction_date': pd.date_range('2023-01-01', periods=10, freq='13D'),
    'amount': [120, 45, 380, 90, 210, 55, 430, 175, 310, 88]
})

# Extract useful components
df_dates['year']         = df_dates['transaction_date'].dt.year
df_dates['month']        = df_dates['transaction_date'].dt.month
df_dates['day']          = df_dates['transaction_date'].dt.day
df_dates['day_of_week']  = df_dates['transaction_date'].dt.dayofweek  # 0=Monday
df_dates['is_weekend']   = (df_dates['day_of_week'] >= 5).astype(int)
df_dates['quarter']      = df_dates['transaction_date'].dt.quarter
df_dates['week_of_year'] = df_dates['transaction_date'].dt.isocalendar().week.astype(int)

# Time since a reference point
reference_date = pd.Timestamp('2023-01-01')
df_dates['days_since_start'] = (df_dates['transaction_date'] - reference_date).dt.days

print(df_dates[['transaction_date', 'month', 'day_of_week', 'is_weekend',
                 'quarter', 'days_since_start']].to_string())

Cyclical encoding for time features

Month 12 is close to month 1. But if you use raw month numbers, the model sees 12 and 1 as far apart. Cyclical encoding fixes this using sine and cosine.

import numpy as np

df_dates['month_sin'] = np.sin(2 * np.pi * df_dates['month'] / 12)
df_dates['month_cos'] = np.cos(2 * np.pi * df_dates['month'] / 12)

df_dates['dow_sin'] = np.sin(2 * np.pi * df_dates['day_of_week'] / 7)
df_dates['dow_cos'] = np.cos(2 * np.pi * df_dates['day_of_week'] / 7)

print("\nCyclical encoding example:")
print(df_dates[['month', 'month_sin', 'month_cos']].head())

Now January (month=1) and December (month=12) are numerically close in the sine/cosine space. The model can learn seasonal patterns correctly.
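You can verify that claim with a quick distance check: in the sine/cosine space, December should sit much closer to January than to a month half a year away.

```python
import numpy as np

def month_to_cyclic(m):
    """Encode a month number (1-12) as a point on the unit circle."""
    angle = 2 * np.pi * m / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jul = month_to_cyclic(12), month_to_cyclic(1), month_to_cyclic(7)

# Euclidean distance in the encoded space
print(f"Dec-Jan: {np.linalg.norm(dec - jan):.2f}")  # small: adjacent months
print(f"Dec-Jul: {np.linalg.norm(dec - jul):.2f}")  # large: half a year apart
```

With raw month numbers the distances would be |12 - 1| = 11 and |12 - 7| = 5, the exact opposite of what the calendar means.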


Interaction Features

When two features combine to mean something neither means alone.

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Simple example
df_int = pd.DataFrame({
    'study_hours':  [2, 5, 8, 1, 6],
    'sleep_hours':  [8, 6, 5, 4, 7],
    'exam_score':   [70, 82, 75, 55, 88]
})

# study_hours * sleep_hours = well-prepared AND well-rested
df_int['study_x_sleep'] = df_int['study_hours'] * df_int['sleep_hours']

print("Correlation with exam score:")
print(df_int.corr()['exam_score'].sort_values(ascending=False))
# Automated polynomial and interaction features
from sklearn.preprocessing import PolynomialFeatures

X_small = df_int[['study_hours', 'sleep_hours']].values

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_small)

feature_names = poly.get_feature_names_out(['study_hours', 'sleep_hours'])
print(f"\nOriginal features: 2")
print(f"After degree-2 polynomial: {X_poly.shape[1]}")
print(f"New features: {list(feature_names)}")

Output:

Original features: 2
After degree-2 polynomial: 5
New features: ['study_hours', 'sleep_hours', 'study_hours^2', 'study_hours sleep_hours', 'sleep_hours^2']

Be careful with high-degree polynomial features. With 20 original features, degree=2 already produces 210 new columns (230 including the originals), and degree=3 explodes past 1,700. Only use this with a small number of features.
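A quick sanity check on those counts, using random data just to count the output columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(10, 20)  # 20 original features

for degree in [2, 3]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_out = poly.fit_transform(X).shape[1]
    # degree=2 -> 230 columns (20 original + 210 new); degree=3 -> 1770
    print(f"degree={degree}: {n_out} features")
```

The count grows combinatorially: it is C(n + d, d) - 1 for n features at degree d (with the bias excluded), so adding features or raising the degree blows up fast.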


Feature Selection: Dropping What Doesn't Help

Adding many features can hurt. Noisy features add dimensions and confuse the model. Use selection to keep only what matters.

Method 1: Correlation filter

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='price')

# Drop features with low correlation to target
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print("Correlations with target:")
print(correlations)

# Keep features with correlation > 0.1
keep_features = correlations[correlations > 0.1].index.tolist()
print(f"\nKeeping {len(keep_features)} of {len(X.columns)} features")

Method 2: SelectKBest

from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# F-statistic based selection
selector_f = SelectKBest(score_func=f_regression, k=5)
selector_f.fit(X, y)
selected_f = X.columns[selector_f.get_support()]
print(f"SelectKBest (F-stat) top 5: {list(selected_f)}")

# Mutual information selection (catches non-linear relationships too)
selector_mi = SelectKBest(score_func=mutual_info_regression, k=5)
selector_mi.fit(X, y)
selected_mi = X.columns[selector_mi.get_support()]
print(f"SelectKBest (MI) top 5: {list(selected_mi)}")

Method 3: Tree-based feature importance

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importance_df = pd.DataFrame({
    'Feature':    X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nRandom Forest Feature Importance:")
print(importance_df.to_string(index=False))

# Drop features with near-zero importance
threshold   = 0.01
keep_rf     = importance_df[importance_df['Importance'] >= threshold]['Feature'].tolist()
print(f"\nKeeping features with importance >= {threshold}: {keep_rf}")

Method 4: Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)

selected_rfe = X.columns[rfe.support_]
print(f"RFE selected features: {list(selected_rfe)}")

Putting It All Together: A Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Simulate dataset with mixed types
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age':        np.random.randint(18, 80, n),
    'income':     np.random.exponential(50000, n),
    'city':       np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
    'education':  np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'experience': np.random.randint(0, 40, n),
})

# Add some missing values
df.loc[np.random.choice(n, 30), 'income']  = np.nan
df.loc[np.random.choice(n, 20), 'age']     = np.nan

# Create target
df['buys'] = ((df['income'].fillna(df['income'].median()) > 60000) &
              (df['age'].fillna(30) > 25)).astype(int)

X_df = df.drop('buys', axis=1)
y_df = df['buys']

# Define column types
numeric_cols     = ['age', 'income', 'experience']
categorical_cols = ['city', 'education']

# Preprocessing pipeline for each type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols),
])

# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(full_pipeline, X_df, y_df, cv=5)
print(f"Pipeline CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

This ColumnTransformer pattern is what real ML engineers use in production. Numeric and categorical features get different treatments, everything is done safely inside a pipeline, and there's no data leakage.


The Things Everyone Gets Wrong

Mistake 1: Engineering features before the train/test split

If you compute target encoding or fill missing values using the entire dataset, information from the test set leaks into training. Always put feature engineering inside a pipeline or do it only on training data.
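The safe pattern is to put every fitted preprocessing step inside a Pipeline, so cross-validation refits it on each training fold. A minimal sketch with made-up data; imputation and scaling are the steps that would otherwise leak:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
X.loc[rng.choice(200, 20, replace=False), 'a'] = np.nan  # some missing values
y = (X['b'] > 0).astype(int)

# Each CV fold fits the imputer and scaler on its own training rows only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
    ('model',  LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Calling fit_transform on the full dataset before cross_val_score would compute the median and scaling statistics from test-fold rows too; the pipeline version cannot do that by construction.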

Mistake 2: One-hot encoding high-cardinality features

A city column with 500 cities creates 500 binary columns. Most will be sparse and useless. Use target encoding or embeddings for high-cardinality categories.

Mistake 3: Ignoring domain knowledge

The best features come from understanding the business. A data scientist who knows that revenue / headcount is a key business metric will create better features than one blindly generating all combinations.

Mistake 4: Adding too many features and not selecting

More features is not automatically better. Irrelevant features add noise, invite overfitting, and slow training. Always run feature selection after engineering.


Quick Cheat Sheet

Technique            | When to use                        | Code
Label encoding       | Tree models, ordinal data          | LabelEncoder()
One-hot encoding     | Linear/NN models, low cardinality  | pd.get_dummies() or OneHotEncoder()
Target encoding      | High-cardinality categories        | groupby().mean() on train only
Log transform        | Right-skewed features              | np.log1p(X)
Power transform      | Normalize distributions            | PowerTransformer()
Interaction features | Known domain relationships         | X1 * X2 or PolynomialFeatures
Cyclical encoding    | Time/date features                 | sin, cos of period
Binning              | Non-linear bucket effects          | pd.cut()
Feature selection    | Too many features                  | SelectKBest, RF importance, RFE

Practice Challenges

Level 1:
Take the California housing dataset. Add three engineered features: rooms per person, population density per block, and a flag for unusually large households. Does cross-val R2 improve?

Level 2:
Load a Kaggle dataset with dates (any sales or event dataset). Extract year, month, day of week, is_weekend, and cyclical month encoding. Check which extracted features correlate most with the target.

Level 3:
Build a full ColumnTransformer pipeline on a mixed dataset. Include numeric imputation, log-transform for skewed columns, one-hot for low-cardinality categorical, and ordinal encoding for an ordered category. Compare CV accuracy before and after the full preprocessing pipeline.


Next up, Post 70: Hyperparameter Tuning: Finding the Best Settings. Grid search, random search, and Optuna. Stop guessing your model's settings and start finding them systematically.
