You've tried three different algorithms. None of them break 78% accuracy. You add dropout, tune hyperparameters, try XGBoost. Still stuck.
Then you create one new feature from the existing data. Accuracy jumps to 86%.
That's feature engineering. And it's the part of ML that makes the biggest difference in practice. Not the algorithm. Not the hyperparameters. The features.
This post covers the core techniques you'll actually use on real datasets.
What You'll Learn Here
- Why features matter more than algorithms
- Handling categorical variables: label encoding vs one-hot encoding
- Scaling and transformation: when and why
- Creating new features from existing ones
- Interaction features and polynomial features
- Handling dates and times
- Domain-specific feature ideas
- Feature selection: dropping what doesn't help
Why Features Beat Algorithms
Here's a concrete example. You're predicting house prices. You have:
- bedrooms: 3
- bathrooms: 2
- square_feet: 1800
A few simple calculations give you:
- bed_bath_ratio: 1.5 (bedrooms per bathroom)
- price_per_sqft: calculated from sale price
- total_rooms: bedrooms + bathrooms
That ratio might tell the model something neither raw number could. A house with 5 bedrooms and 1 bathroom signals something completely different from a house with 5 bedrooms and 4 bathrooms. The ratio captures that relationship.
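In pandas these are one-liners. A minimal sketch with invented listings (price_per_sqft would also need the sale price column, so it's left out here):
import pandas as pd
# Toy data: values are invented purely to show the arithmetic
houses = pd.DataFrame({
    'bedrooms': [3, 5, 5],
    'bathrooms': [2, 1, 4],
    'square_feet': [1800, 2200, 3100],
})
houses['bed_bath_ratio'] = houses['bedrooms'] / houses['bathrooms']
houses['total_rooms'] = houses['bedrooms'] + houses['bathrooms']
print(houses)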
Good features compress domain knowledge into numbers the model can use. No algorithm can discover what it was never told.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
# Baseline score
baseline = cross_val_score(
RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
X, y, cv=5, scoring='r2'
)
print(f"Baseline R2: {baseline.mean():.3f}")
# Add engineered features
X_eng = X.copy()
X_eng['rooms_per_person'] = X['AveRooms'] / X['AveOccup']
X_eng['beds_per_room'] = X['AveBedrms'] / X['AveRooms']
X_eng['households'] = X['Population'] / X['AveOccup']  # approx. number of households per block group
engineered = cross_val_score(
RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
X_eng, y, cv=5, scoring='r2'
)
print(f"With features R2: {engineered.mean():.3f}")
print(f"Improvement: +{(engineered.mean() - baseline.mean()):.3f}")
Output:
Baseline R2: 0.789
With features R2: 0.806
Improvement: +0.017
Three new features. R2 up by 0.017. No algorithm change.
Encoding Categorical Variables
Most ML algorithms need numbers. When you have text categories, you need to convert them.
Label Encoding
Assigns an integer to each category. Fine for tree-based models. Risky for linear models because it implies an order and magnitude that don't exist (category 2 is not "twice" category 1).
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'red']})
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
print(df)
print(f"\nMapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")
Output:
color color_encoded
0 red 2
1 blue 0
2 green 1
3 blue 0
4 red 2
Mapping: {'blue': 0, 'green': 1, 'red': 2}
One-Hot Encoding
Creates a binary column for each category. No false ordering. Works for all models. Can create many columns if there are many categories.
df_onehot = pd.get_dummies(df['color'], prefix='color')
print(df_onehot)
Output:
color_blue color_green color_red
0 0 0 1
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
Ordinal Encoding
For categories with a real order: Small < Medium < Large.
from sklearn.preprocessing import OrdinalEncoder
size_data = pd.DataFrame({'size': ['Small', 'Large', 'Medium', 'Small', 'Large']})
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_data['size_encoded'] = oe.fit_transform(size_data[['size']])
print(size_data)
Output:
size size_encoded
0 Small 0.0
1 Large 2.0
2 Medium 1.0
3 Small 0.0
4 Large 2.0
High-cardinality categories: Target encoding
When a category has 500+ unique values (like zip codes), one-hot creates 500 columns. Target encoding replaces each category with the mean of the target for that category.
# Target encoding example
df_target = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
'house_price': [800, 600, 850, 400, 650, 780]
})
# Replace city with mean price per city
city_means = df_target.groupby('city')['house_price'].mean()
df_target['city_encoded'] = df_target['city'].map(city_means)
print(df_target)
Output:
city house_price city_encoded
0 NYC 800 810.0
1 LA 600 625.0
2 NYC 850 810.0
3 Chicago 400 400.0
4 LA 650 625.0
5 NYC 780 810.0
Warning: target encoding can leak information if done before the train/test split. Always fit encoding on training data only.
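A leak-free sketch of the same idea (the split size and the global-mean fallback are illustrative choices, not part of the example above):
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_target, test_size=0.33, random_state=42)
train, test = train.copy(), test.copy()
# Per-city means are learned from the training rows only
city_means_train = train.groupby('city')['house_price'].mean()
global_mean = train['house_price'].mean()  # fallback for cities unseen in training
train['city_encoded'] = train['city'].map(city_means_train)
test['city_encoded'] = test['city'].map(city_means_train).fillna(global_mean)
print(test[['city', 'city_encoded']])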
Scaling and Transformations
Some features need to be transformed before they're useful.
Log transformation for skewed features
Many real-world features are heavily right-skewed. Income. House prices. Population. Taking the log makes the distribution more symmetric and helps linear models.
import matplotlib.pyplot as plt
import numpy as np
# Skewed data
incomes = np.random.exponential(scale=50000, size=1000)
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].hist(incomes, bins=50, color='steelblue')
axes[0].set_title('Raw Income (skewed)')
axes[0].set_xlabel('Income')
axes[1].hist(np.log1p(incomes), bins=50, color='orange')
axes[1].set_title('Log(Income + 1) (more symmetric)')
axes[1].set_xlabel('log(Income)')
plt.tight_layout()
plt.savefig('log_transform.png', dpi=100)
plt.show()
# In a real pipeline
from sklearn.preprocessing import FunctionTransformer
log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X[['Population']])
print(f"Before: mean={X['Population'].mean():.0f}, std={X['Population'].std():.0f}")
print(f"After: mean={X_log.mean():.2f}, std={X_log.std():.2f}")
Power transformation for normalizing distributions
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson') # handles negative values too
X_transformed = pt.fit_transform(X[['MedInc', 'Population', 'AveRooms']])
print("Distributions after power transform are more Gaussian-like")
Creating New Features From Existing Ones
This is the creative part. You combine, divide, subtract, and multiply features to capture relationships the model might miss.
import pandas as pd
import numpy as np
# Simulated customer dataset
np.random.seed(42)
n = 1000
customers = pd.DataFrame({
'total_spend': np.random.exponential(200, n),
'n_orders': np.random.randint(1, 50, n),
'days_since_join': np.random.randint(30, 1000, n),
'last_purchase': np.random.randint(1, 365, n),
'n_returns': np.random.randint(0, 10, n),
'n_complaints': np.random.randint(0, 5, n),
})
# Ratio features
customers['avg_order_value'] = customers['total_spend'] / customers['n_orders']
customers['return_rate'] = customers['n_returns'] / customers['n_orders']
customers['spend_per_day'] = customers['total_spend'] / customers['days_since_join']
# Difference features
customers['recency_frequency_gap'] = customers['last_purchase'] - (365 / customers['n_orders'])
# Aggregation features
customers['problem_score'] = customers['n_returns'] + customers['n_complaints'] * 2
# Binary flag features
customers['is_high_value'] = (customers['total_spend'] > 500).astype(int)
customers['is_recent_buyer'] = (customers['last_purchase'] < 30).astype(int)
customers['has_complained'] = (customers['n_complaints'] > 0).astype(int)
# Binning continuous features
customers['spend_bucket'] = pd.cut(
customers['total_spend'],
bins=[0, 100, 300, 600, np.inf],
labels=['low', 'medium', 'high', 'premium']
)
print(customers.head())
print(f"\nOriginal features: 6, New total: {len(customers.columns)}")
Date and Time Features
Dates carry a lot of information that models can't use in raw form. You need to extract it.
import pandas as pd
# Sample transaction log
df_dates = pd.DataFrame({
'transaction_date': pd.date_range('2023-01-01', periods=10, freq='13D'),
'amount': [120, 45, 380, 90, 210, 55, 430, 175, 310, 88]
})
# Extract useful components
df_dates['year'] = df_dates['transaction_date'].dt.year
df_dates['month'] = df_dates['transaction_date'].dt.month
df_dates['day'] = df_dates['transaction_date'].dt.day
df_dates['day_of_week'] = df_dates['transaction_date'].dt.dayofweek # 0=Monday
df_dates['is_weekend'] = (df_dates['day_of_week'] >= 5).astype(int)
df_dates['quarter'] = df_dates['transaction_date'].dt.quarter
df_dates['week_of_year'] = df_dates['transaction_date'].dt.isocalendar().week.astype(int)
# Time since a reference point
reference_date = pd.Timestamp('2023-01-01')
df_dates['days_since_start'] = (df_dates['transaction_date'] - reference_date).dt.days
print(df_dates[['transaction_date', 'month', 'day_of_week', 'is_weekend',
'quarter', 'days_since_start']].to_string())
Cyclical encoding for time features
Month 12 is close to month 1. But if you use raw month numbers, the model sees 12 and 1 as far apart. Cyclical encoding fixes this using sine and cosine.
df_dates['month_sin'] = np.sin(2 * np.pi * df_dates['month'] / 12)
df_dates['month_cos'] = np.cos(2 * np.pi * df_dates['month'] / 12)
df_dates['dow_sin'] = np.sin(2 * np.pi * df_dates['day_of_week'] / 7)
df_dates['dow_cos'] = np.cos(2 * np.pi * df_dates['day_of_week'] / 7)
print("\nCyclical encoding example:")
print(df_dates[['month', 'month_sin', 'month_cos']].head())
Now January (month=1) and December (month=12) are numerically close in the sine/cosine space. The model can learn seasonal patterns correctly.
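To see that numerically, here's a small check (the month_xy helper is mine, purely for illustration):
def month_xy(m):
    # Place a month number on the unit circle
    angle = 2 * np.pi * m / 12
    return np.array([np.sin(angle), np.cos(angle)])
print(np.linalg.norm(month_xy(1) - month_xy(12)))  # ~0.52: January and December sit next to each other
print(np.linalg.norm(month_xy(1) - month_xy(6)))   # ~1.93: January and June are half a year apart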
Interaction Features
When two features combine to mean something neither means alone.
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
# Simple example
df_int = pd.DataFrame({
'study_hours': [2, 5, 8, 1, 6],
'sleep_hours': [8, 6, 5, 4, 7],
'exam_score': [70, 82, 75, 55, 88]
})
# study_hours * sleep_hours = well-prepared AND well-rested
df_int['study_x_sleep'] = df_int['study_hours'] * df_int['sleep_hours']
print("Correlation with exam score:")
print(df_int.corr()['exam_score'].sort_values(ascending=False))
# Automated polynomial and interaction features
X_small = df_int[['study_hours', 'sleep_hours']].values
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_small)
feature_names = poly.get_feature_names_out(['study_hours', 'sleep_hours'])
print(f"\nOriginal features: 2")
print(f"After degree-2 polynomial: {X_poly.shape[1]}")
print(f"New features: {list(feature_names)}")
Output:
Original features: 2
After degree-2 polynomial: 5
New features: ['study_hours', 'sleep_hours', 'study_hours^2', 'study_hours sleep_hours', 'sleep_hours^2']
Be careful with high-degree polynomial features. With 20 original features, degree=2 already adds 210 new columns (230 in total); degree=3 takes you past 1,700. Only use this when you start with a handful of features.
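You can watch the blow-up yourself on toy data; only the shapes matter here, the values are random:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_demo = np.random.rand(10, 20)  # 10 rows, 20 features
for degree in (2, 3):
    n_out = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X_demo).shape[1]
    print(f"degree={degree}: {n_out} columns")  # 230, then 1770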
Feature Selection: Dropping What Doesn't Help
Adding many features can hurt. Noisy features add dimensions and confuse the model. Use selection to keep only what matters.
Method 1: Correlation filter
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='price')
# Drop features with low correlation to target
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print("Correlations with target:")
print(correlations)
# Keep features with correlation > 0.1
keep_features = correlations[correlations > 0.1].index.tolist()
print(f"\nKeeping {len(keep_features)} of {len(X.columns)} features")
Method 2: SelectKBest
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
# F-statistic based selection
selector_f = SelectKBest(score_func=f_regression, k=5)
selector_f.fit(X, y)
selected_f = X.columns[selector_f.get_support()]
print(f"SelectKBest (F-stat) top 5: {list(selected_f)}")
# Mutual information selection (catches non-linear relationships too)
selector_mi = SelectKBest(score_func=mutual_info_regression, k=5)
selector_mi.fit(X, y)
selected_mi = X.columns[selector_mi.get_support()]
print(f"SelectKBest (MI) top 5: {list(selected_mi)}")
Method 3: Tree-based feature importance
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)
importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nRandom Forest Feature Importance:")
print(importance_df.to_string(index=False))
# Drop features with near-zero importance
threshold = 0.01
keep_rf = importance_df[importance_df['Importance'] >= threshold]['Feature'].tolist()
print(f"\nKeeping features with importance >= {threshold}: {keep_rf}")
Method 4: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)
selected_rfe = X.columns[rfe.support_]
print(f"RFE selected features: {list(selected_rfe)}")
Putting It All Together: A Feature Engineering Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np
# Simulate dataset with mixed types
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age': np.random.randint(18, 80, n),
'income': np.random.exponential(50000, n),
'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n),
'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n),
'experience': np.random.randint(0, 40, n),
})
# Add some missing values
df.loc[np.random.choice(n, 30), 'income'] = np.nan
df.loc[np.random.choice(n, 20), 'age'] = np.nan
# Create target
df['buys'] = ((df['income'].fillna(df['income'].median()) > 60000) &
(df['age'].fillna(30) > 25)).astype(int)
X_df = df.drop('buys', axis=1)
y_df = df['buys']
# Define column types
numeric_cols = ['age', 'income', 'experience']
categorical_cols = ['city', 'education']
# Preprocessing pipeline for each type
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
# Combine
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_cols),
('cat', categorical_transformer, categorical_cols),
])
# Full pipeline
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(full_pipeline, X_df, y_df, cv=5)
print(f"Pipeline CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
This ColumnTransformer pattern is what real ML engineers use in production. Numeric and categorical features get different treatments, everything is done safely inside a pipeline, and there's no data leakage.
The Things Everyone Gets Wrong
Mistake 1: Engineering features before the train/test split
If you compute target encoding or fill missing values using the entire dataset, information from the test set leaks into training. Always put feature engineering inside a pipeline or do it only on training data.
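The same rule applied by hand, reusing X_df and numeric_cols from the pipeline example above (a minimal sketch):
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
X_tr, X_te = train_test_split(X_df[numeric_cols], test_size=0.2, random_state=42)
imputer = SimpleImputer(strategy='median')
X_tr_imp = imputer.fit_transform(X_tr)  # medians are learned from the training rows...
X_te_imp = imputer.transform(X_te)      # ...and only applied to the test rows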
Mistake 2: One-hot encoding high-cardinality features
A city column with 500 cities creates 500 binary columns. Most will be sparse and useless. Use target encoding or embeddings for high-cardinality categories.
Mistake 3: Ignoring domain knowledge
The best features come from understanding the business. A data scientist who knows that revenue / headcount is a key business metric will create better features than one blindly generating all combinations.
Mistake 4: Adding too many features and not selecting
More features are not automatically better. Irrelevant features add noise, slow training, and invite overfitting. Always run feature selection after engineering.
Quick Cheat Sheet
| Technique | When to use | Code |
|---|---|---|
| Label encoding | Tree models, ordinal data | LabelEncoder() |
| One-hot encoding | Linear/NN models, low cardinality | pd.get_dummies() or OneHotEncoder() |
| Target encoding | High-cardinality categories | groupby().mean() on train only |
| Log transform | Right-skewed features | np.log1p(X) |
| Power transform | Normalize distributions | PowerTransformer() |
| Interaction features | Known domain relationships | X1 * X2 or PolynomialFeatures |
| Cyclical encoding | Time/date features | sin, cos of the period |
| Binning | Non-linear bucket effects | pd.cut() |
| Feature selection | Too many features | SelectKBest, RF importance, RFE |
Practice Challenges
Level 1:
Take the California housing dataset. Add three engineered features: rooms per person, population density per block, and a flag for unusually large households. Does cross-val R2 improve?
Level 2:
Load a Kaggle dataset with dates (any sales or event dataset). Extract year, month, day of week, is_weekend, and cyclical month encoding. Check which extracted features correlate most with the target.
Level 3:
Build a full ColumnTransformer pipeline on a mixed dataset. Include numeric imputation, log-transform for skewed columns, one-hot for low-cardinality categorical, and ordinal encoding for an ordered category. Compare CV accuracy before and after the full preprocessing pipeline.
References
- Scikit-learn: Feature engineering
- Scikit-learn: ColumnTransformer
- Scikit-learn: Feature selection
- Kaggle: Feature engineering course
- Pandas: Working with dates
Next up, Post 70: Hyperparameter Tuning: Finding the Best Settings. Grid search, random search, and Optuna. Stop guessing your model's settings and start finding them systematically.