Akhilesh
69. Feature Engineering: Building Better Inputs

You've tried three different algorithms. None of them break 78% accuracy. You add dropout, tune hyperparameters, try XGBoost. Still stuck.

Then you create one new feature from the existing data. Accuracy jumps to 86%.

That's feature engineering. And it's the part of ML that makes the biggest difference in practice. Not the algorithm. Not the hyperparameters. The features.

This post covers the core techniques you'll actually use on real datasets.


What You'll Learn Here

  • Why features matter more than algorithms
  • Handling categorical variables: label encoding vs one-hot encoding
  • Scaling and transformation: when and why
  • Creating new features from existing ones
  • Interaction features and polynomial features
  • Handling dates and times
  • Domain-specific feature ideas
  • Feature selection: dropping what doesn't help

Why Features Beat Algorithms

Here's a concrete example. You're predicting house prices. You have:

  • bedrooms: 3
  • bathrooms: 2
  • square_feet: 1800

A few simple calculations give you:

  • bed_bath_ratio: 1.5 (bedrooms per bathroom)
  • price_per_sqft: sale price divided by square feet (careful: this is derived from the target, so it leaks if the model is predicting price)
  • total_rooms: bedrooms + bathrooms

That ratio might tell the model something neither raw number could. A house with 5 bedrooms and 1 bathroom signals something completely different from a house with 5 bedrooms and 4 bathrooms. The ratio captures that relationship.

Good features compress domain knowledge into numbers the model can use. No algorithm can discover what it was never told.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Baseline score
baseline = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    X, y, cv=5, scoring='r2'
)
print(f"Baseline R2: {baseline.mean():.3f}")

# Add engineered features
X_eng = X.copy()
X_eng['rooms_per_person']  = X['AveRooms']  / X['AveOccup']
X_eng['beds_per_room']     = X['AveBedrms'] / X['AveRooms']
X_eng['pop_per_household'] = X['Population'] / X['AveOccup']

engineered = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    X_eng, y, cv=5, scoring='r2'
)
print(f"With features R2: {engineered.mean():.3f}")
print(f"Improvement: +{(engineered.mean() - baseline.mean()):.3f}")

Output:

Baseline R2: 0.789
With features R2: 0.806
Improvement: +0.017

Three new features. An extra 1.7 points of R². No algorithm change.


Encoding Categorical Variables

Most ML algorithms need numbers. When you have text categories, you need to convert them.

Label Encoding
Assigns an integer to each category. Fine for tree-based models. Bad for linear models because it implies an order (cat=2 is not "twice" cat=1). Strictly speaking, scikit-learn's LabelEncoder is documented for encoding targets, not features; OrdinalEncoder is the feature-side equivalent, but the pattern below is common in practice.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})

le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

print(df)
print(f"\nMapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Output:

   color  color_encoded
0    red              2
1   blue              0
2  green              1
3   blue              0
4    red              2

Mapping: {'blue': 0, 'green': 1, 'red': 2}

One-Hot Encoding
Creates a binary column for each category. No false ordering. Works for all models. Can create many columns if there are many categories.

df_onehot = pd.get_dummies(df['color'], prefix='color')
print(df_onehot)

Output:

   color_blue  color_green  color_red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1

Ordinal Encoding
For categories with a real order: Small < Medium < Large.

from sklearn.preprocessing import OrdinalEncoder

size_data = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small', 'Large']})

oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_data['size_encoded'] = oe.fit_transform(size_data[['size']])
print(size_data)

Output:

     size  size_encoded
0   Small           0.0
1   Large           2.0
2  Medium           1.0
3   Small           0.0
4   Large           2.0

High-cardinality categories: Target encoding

When a category has 500+ unique values (like zip codes), one-hot creates 500 columns. Target encoding replaces each category with the mean of the target for that category.

# Target encoding example
df_target = pd.DataFrame({
    'city':       ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'house_price': [800, 600, 850, 400, 650, 780]
})

# Replace city with mean price per city
city_means = df_target.groupby('city')['house_price'].mean()
df_target['city_encoded'] = df_target['city'].map(city_means)
print(df_target)

Output:

      city  house_price  city_encoded
0      NYC          800        810.0
1       LA          600        625.0
2      NYC          850        810.0
3  Chicago          400        400.0
4       LA          650        625.0
5      NYC          780        810.0

Warning: target encoding leaks information if you compute it before the train/test split, because each row's encoding then includes its own target value. Always fit the encoding on training data only.
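A leak-free version of the encoding above might look like this sketch: fit the per-city means on training rows only, then map them onto the test rows, falling back to the global training mean for cities the training set never saw. The toy data here is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'city':        ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA'],
    'house_price': [800, 600, 850, 400, 650, 780, 420, 610],
})

train, test = train_test_split(df, test_size=0.25, random_state=42)

# Fit the mapping on the training rows only
city_means  = train.groupby('city')['house_price'].mean()
global_mean = train['house_price'].mean()  # fallback for unseen cities

train = train.assign(city_encoded=train['city'].map(city_means))
test  = test.assign(city_encoded=test['city'].map(city_means).fillna(global_mean))
print(test[['city', 'city_encoded']])
```

In production you would wrap this in a custom transformer (or use a library like category_encoders) so it runs safely inside cross-validation.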


Scaling and Transformations

Some features need to be transformed before they're useful.

Log transformation for skewed features

Many real-world features are heavily right-skewed. Income. House prices. Population. Taking the log makes the distribution more symmetric and helps linear models.

import matplotlib.pyplot as plt
import numpy as np

# Skewed data
incomes = np.random.exponential(scale=50000, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].hist(incomes, bins=50, color='steelblue')
axes[0].set_title('Raw Income (skewed)')
axes[0].set_xlabel('Income')

axes[1].hist(np.log1p(incomes), bins=50, color='orange')
axes[1].set_title('Log(Income + 1) (more symmetric)')
axes[1].set_xlabel('log(Income)')

plt.tight_layout()
plt.savefig('log_transform.png', dpi=100)
plt.show()
# In a real pipeline (X is the California housing DataFrame from earlier)
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X[['Population']])
print(f"Before: mean={X['Population'].mean():.0f}, std={X['Population'].std():.0f}")
print(f"After:  mean={X_log.mean():.2f}, std={X_log.std():.2f}")

Power transformation for normalizing distributions

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')  # handles negative values too
X_transformed = pt.fit_transform(X[['MedInc', 'Population', 'AveRooms']])
print("Distributions after power transform are more Gaussian-like")
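To see what that claim means concretely, you can measure skewness before and after the transform. This sketch uses synthetic exponential data (not the housing set) and scipy.stats.skew:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
x = rng.exponential(scale=50000, size=2000).reshape(-1, 1)  # right-skewed

pt = PowerTransformer(method='yeo-johnson')
x_t = pt.fit_transform(x)

print(f"Skewness before: {skew(x.ravel()):.2f}")   # strongly positive
print(f"Skewness after:  {skew(x_t.ravel()):.2f}") # close to zero
```

The exponential distribution has a theoretical skewness of 2; after the Yeo-Johnson transform the sample skewness lands near 0, which is exactly the "more Gaussian-like" effect the print statement above describes.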

Creating New Features From Existing Ones

This is the creative part. You combine, divide, subtract, and multiply features to capture relationships the model might miss.

import pandas as pd
import numpy as np

# Simulated customer dataset
np.random.seed(42)
n = 1000

customers = pd.DataFrame({
    'total_spend':    np.random.exponential(200, n),
    'n_orders':       np.random.randint(1, 50, n),
    'days_since_join':np.random.randint(30, 1000, n),
    'last_purchase':  np.random.randint(1, 365, n),
    'n_returns':      np.random.randint(0, 10, n),
    'n_complaints':   np.random.randint(0, 5, n),
})

# Ratio features
customers['avg_order_value']   = customers['total_spend'] / customers['n_orders']
customers['return_rate']       = customers['n_returns'] / customers['n_orders']
customers['spend_per_day']     = customers['total_spend'] / customers['days_since_join']

# Difference features
customers['recency_frequency_gap'] = customers['last_purchase'] - (365 / customers['n_orders'])

# Aggregation features
customers['problem_score'] = customers['n_returns'] + customers['n_complaints'] * 2

# Binary flag features
customers['is_high_value']    = (customers['total_spend'] > 500).astype(int)
customers['is_recent_buyer']  = (customers['last_purchase'] < 30).astype(int)
customers['has_complained']   = (customers['n_complaints'] > 0).astype(int)

# Binning continuous features
customers['spend_bucket'] = pd.cut(
    customers['total_spend'],
    bins=[0, 100, 300, 600, np.inf],
    labels=['low', 'medium', 'high', 'premium']
)

print(customers.head())
print(f"\nOriginal features: 6, New total: {len(customers.columns)}")

Date and Time Features

Dates carry a lot of information that models can't use in raw form. You need to extract it.

import pandas as pd

# Sample transaction log
df_dates = pd.DataFrame({
    'transaction_date': pd.date_range('2023-01-01', periods=10, freq='13D'),
    'amount': [120, 45, 380, 90, 210, 55, 430, 175, 310, 88]
})

# Extract useful components
df_dates['year']         = df_dates['transaction_date'].dt.year
df_dates['month']        = df_dates['transaction_date'].dt.month
df_dates['day']          = df_dates['transaction_date'].dt.day
df_dates['day_of_week']  = df_dates['transaction_date'].dt.dayofweek  # 0=Monday
df_dates['is_weekend']   = (df_dates['day_of_week'] >= 5).astype(int)
df_dates['quarter']      = df_dates['transaction_date'].dt.quarter
df_dates['week_of_year'] = df_dates['transaction_date'].dt.isocalendar().week.astype(int)

# Time since a reference point
reference_date = pd.Timestamp('2023-01-01')
df_dates['days_since_start'] = (df_dates['transaction_date'] - reference_date).dt.days

print(df_dates[['transaction_date', 'month', 'day_of_week', 'is_weekend',
                 'quarter', 'days_since_start']].to_string())

Cyclical encoding for time features

Month 12 is close to month 1. But if you use raw month numbers, the model sees 12 and 1 as far apart. Cyclical encoding fixes this using sine and cosine.

import numpy as np

df_dates['month_sin'] = np.sin(2 * np.pi * df_dates['month'] / 12)
df_dates['month_cos'] = np.cos(2 * np.pi * df_dates['month'] / 12)

df_dates['dow_sin'] = np.sin(2 * np.pi * df_dates['day_of_week'] / 7)
df_dates['dow_cos'] = np.cos(2 * np.pi * df_dates['day_of_week'] / 7)

print("\nCyclical encoding example:")
print(df_dates[['month', 'month_sin', 'month_cos']].head())

Now January (month=1) and December (month=12) are numerically close in the sine/cosine space. The model can learn seasonal patterns correctly.
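You can verify that claim with a quick distance check: in the sine/cosine space, December should sit much closer to January than to a month half a year away.

```python
import numpy as np

def month_to_cyclic(m):
    """Encode a month number (1-12) as a point on the unit circle."""
    angle = 2 * np.pi * m / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jul = month_to_cyclic(12), month_to_cyclic(1), month_to_cyclic(7)

# Euclidean distance in the encoded space
print(f"Dec-Jan: {np.linalg.norm(dec - jan):.2f}")  # small: adjacent months
print(f"Dec-Jul: {np.linalg.norm(dec - jul):.2f}")  # large: half a year apart
```

With raw month numbers the distances would be |12 - 1| = 11 and |12 - 7| = 5, the exact opposite of what the calendar means.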


Interaction Features

When two features combine to mean something neither means alone.

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Simple example
df_int = pd.DataFrame({
    'study_hours':  [2, 5, 8, 1, 6],
    'sleep_hours':  [8, 6, 5, 4, 7],
    'exam_score':   [70, 82, 75, 55, 88]
})

# study_hours * sleep_hours = well-prepared AND well-rested
df_int['study_x_sleep'] = df_int['study_hours'] * df_int['sleep_hours']

print("Correlation with exam score:")
print(df_int.corr()['exam_score'].sort_values(ascending=False))
# Automated polynomial and interaction features
from sklearn.preprocessing import PolynomialFeatures

X_small = df_int[['study_hours', 'sleep_hours']].values

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_small)

feature_names = poly.get_feature_names_out(['study_hours', 'sleep_hours'])
print(f"\nOriginal features: 2")
print(f"After degree-2 polynomial: {X_poly.shape[1]}")
print(f"New features: {list(feature_names)}")

Output:

Original features: 2
After degree-2 polynomial: 5
New features: ['study_hours', 'sleep_hours', 'study_hours^2', 'study_hours sleep_hours', 'sleep_hours^2']

Be careful with high-degree polynomial features. With 20 original features, degree=2 already produces 210 new columns (230 including the originals), and degree=3 explodes past 1,700. Only use this with a small number of features.
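A quick sanity check on those counts, using random data just to count the output columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(10, 20)  # 20 original features

for degree in [2, 3]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_out = poly.fit_transform(X).shape[1]
    # degree=2 -> 230 columns (20 original + 210 new); degree=3 -> 1770
    print(f"degree={degree}: {n_out} features")
```

The count grows combinatorially: it is C(n + d, d) - 1 for n features at degree d (with the bias excluded), so adding features or raising the degree blows up fast.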


Feature Selection: Dropping What Doesn't Help

Adding many features can hurt. Noisy features add dimensions and confuse the model. Use selection to keep only what matters.

Method 1: Correlation filter

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='price')

# Drop features with low correlation to target
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print("Correlations with target:")
print(correlations)

# Keep features with correlation > 0.1
keep_features = correlations[correlations > 0.1].index.tolist()
print(f"\nKeeping {len(keep_features)} of {len(X.columns)} features")

Method 2: SelectKBest

from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# F-statistic based selection
selector_f = SelectKBest(score_func=f_regression, k=5)
selector_f.fit(X, y)
selected_f = X.columns[selector_f.get_support()]
print(f"SelectKBest (F-stat) top 5: {list(selected_f)}")

# Mutual information selection (catches non-linear relationships too)
selector_mi = SelectKBest(score_func=mutual_info_regression, k=5)
selector_mi.fit(X, y)
selected_mi = X.columns[selector_mi.get_support()]
print(f"SelectKBest (MI) top 5: {list(selected_mi)}")

Method 3: Tree-based feature importance

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importance_df = pd.DataFrame({
    'Feature':    X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nRandom Forest Feature Importance:")
print(importance_df.to_string(index=False))

# Drop features with near-zero importance
threshold   = 0.01
keep_rf     = importance_df[importance_df['Importance'] >= threshold]['Feature'].tolist()
print(f"\nKeeping features with importance >= {threshold}: {keep_rf}")

Method 4: Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)

selected_rfe = X.columns[rfe.support_]
print(f"RFE selected features: {list(selected_rfe)}")

Putting It All Together: A Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Simulate dataset with mixed types
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'age':        np.random.randint(18, 80, n),
    'income':     np.random.exponential(50000, n),
    'city':       np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
    'education':  np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
    'experience': np.random.randint(0, 40, n),
})

# Add some missing values
df.loc[np.random.choice(n, 30), 'income']  = np.nan
df.loc[np.random.choice(n, 20), 'age']     = np.nan

# Create target
df['buys'] = ((df['income'].fillna(df['income'].median()) > 60000) &
              (df['age'].fillna(30) > 25)).astype(int)

X_df = df.drop('buys', axis=1)
y_df = df['buys']

# Define column types
numeric_cols     = ['age', 'income', 'experience']
categorical_cols = ['city', 'education']

# Preprocessing pipeline for each type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols),
])

# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(full_pipeline, X_df, y_df, cv=5)
print(f"Pipeline CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

This ColumnTransformer pattern is what real ML engineers use in production. Numeric and categorical features get different treatments, everything is done safely inside a pipeline, and there's no data leakage.


The Things Everyone Gets Wrong

Mistake 1: Engineering features before the train/test split

If you compute target encoding or fill missing values using the entire dataset, information from the test set leaks into training. Always put feature engineering inside a pipeline or do it only on training data.
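The safe pattern is to put every fitted preprocessing step inside a Pipeline, so cross-validation refits it on each training fold. A minimal sketch with made-up data; imputation and scaling are the steps that would otherwise leak:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
X.loc[rng.choice(200, 20, replace=False), 'a'] = np.nan  # some missing values
y = (X['b'] > 0).astype(int)

# Each CV fold fits the imputer and scaler on its own training rows only
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
    ('model',  LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Calling fit_transform on the full dataset before cross_val_score would compute the median and scaling statistics from test-fold rows too; the pipeline version cannot do that by construction.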

Mistake 2: One-hot encoding high-cardinality features

A city column with 500 cities creates 500 binary columns. Most will be sparse and useless. Use target encoding or embeddings for high-cardinality categories.

Mistake 3: Ignoring domain knowledge

The best features come from understanding the business. A data scientist who knows that revenue / headcount is a key business metric will create better features than one blindly generating all combinations.

Mistake 4: Adding too many features and not selecting

More features is not automatically better. Irrelevant features add noise, invite overfitting, and slow training. Always run feature selection after engineering.


Quick Cheat Sheet

Technique            | When to use                        | Code
Label encoding       | Tree models, ordinal data          | LabelEncoder()
One-hot encoding     | Linear/NN models, low cardinality  | pd.get_dummies() or OneHotEncoder()
Target encoding      | High-cardinality categories        | groupby().mean() on train only
Log transform        | Right-skewed features              | np.log1p(X)
Power transform      | Normalize distributions            | PowerTransformer()
Interaction features | Known domain relationships         | X1 * X2 or PolynomialFeatures
Cyclical encoding    | Time/date features                 | sin, cos of period
Binning              | Non-linear bucket effects          | pd.cut()
Feature selection    | Too many features                  | SelectKBest, RF importance, RFE

Practice Challenges

Level 1:
Take the California housing dataset. Add three engineered features: rooms per person, population density per block, and a flag for unusually large households. Does cross-val R2 improve?

Level 2:
Load a Kaggle dataset with dates (any sales or event dataset). Extract year, month, day of week, is_weekend, and cyclical month encoding. Check which extracted features correlate most with the target.

Level 3:
Build a full ColumnTransformer pipeline on a mixed dataset. Include numeric imputation, log-transform for skewed columns, one-hot for low-cardinality categorical, and ordinal encoding for an ordered category. Compare CV accuracy before and after the full preprocessing pipeline.


Next up, Post 70: Hyperparameter Tuning: Finding the Best Settings. Grid search, random search, and Optuna. Stop guessing your model's settings and start finding them systematically.
