How to Do Feature Engineering

Updated May 2026
Feature engineering is the process of creating, transforming, and selecting input variables that help a machine learning model make better predictions. It is widely considered one of the most impactful steps in the ML pipeline: good features with a simple model often outperform poor features with a complex model. This guide covers practical techniques for turning raw data into features that algorithms can learn from effectively.

Andrew Ng, one of the most influential ML researchers, has said that coming up with features is difficult, time-consuming, and requires expert knowledge. It is also often the part of a machine learning project with the largest impact on model performance. Kaggle competition winners frequently report that feature engineering mattered more than algorithm choice.

Step 1: Understand Your Data and Domain

Before creating any features, understand what each column represents, what values are possible, and what domain knowledge might be predictive. In a fraud detection problem, knowing that fraudsters often make purchases at unusual hours or in rapid succession tells you to engineer time-based and velocity features.

Run exploratory data analysis: distributions, missing value patterns, correlations between features and the target, and scatter plots of the most promising variables. Look for features with clear relationships to the target and features that seem redundant (highly correlated with each other).
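
A first pass at these checks might look like the sketch below. The dataset and its column names (amount, hour, target) are invented for illustration; real exploratory analysis would also include plots.

```python
import numpy as np
import pandas as pd

# Hypothetical fraud-style dataset; all column names and values are illustrative.
df = pd.DataFrame({
    "amount": [12.0, 250.0, 8.5, 999.0, 40.0, 15.0],
    "hour": [14, 3, 10, 2, 16, np.nan],
    "target": [0, 1, 0, 1, 0, 0],
})

summary = df.describe()            # per-column distribution summary
missing_share = df.isna().mean()   # fraction of missing values per column

# Correlation of each feature with the target, strongest (by magnitude) first.
target_corr = (
    df.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

# Feature-to-feature correlations help spot redundant columns.
feature_corr = df.drop(columns="target").corr()
```

Pearson correlation only captures linear relationships, so a low value here is a prompt to plot the variable, not proof that it is useless.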

Talk to domain experts when possible. A doctor may tell you that the ratio of two lab values is more diagnostic than either value alone. A financial analyst may tell you that quarter-over-quarter revenue change matters more than absolute revenue. These insights translate directly into powerful features that automated methods are unlikely to discover on their own.

Step 2: Create New Features from Existing Data

Date and time features: Extract hour, day of week, month, quarter, year, is_weekend, is_holiday, days_since_event, and time_of_day_bucket from timestamp columns. For a retail model, day of week and proximity to holidays are often among the top predictors.
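
As a minimal sketch with pandas, several of these can be derived from a single timestamp column. The column name timestamp, the reference date, and the bucket edges are assumptions made for the example. is_holiday is omitted because it needs a holiday calendar for the relevant region.

```python
import pandas as pd

# Hypothetical transaction log; the column name "timestamp" is an assumption.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-05 08:30:00",
        "2024-01-06 23:15:00",
        "2024-07-04 14:00:00",
    ])
})

ts = df["timestamp"].dt
df["hour"] = ts.hour
df["day_of_week"] = ts.dayofweek        # Monday=0 ... Sunday=6
df["month"] = ts.month
df["quarter"] = ts.quarter
df["year"] = ts.year
df["is_weekend"] = ts.dayofweek >= 5

# Whole days elapsed since an illustrative reference event.
reference = pd.Timestamp("2024-01-01")
df["days_since_event"] = (df["timestamp"] - reference).dt.days

# Coarse time-of-day bucket; the bin edges are arbitrary choices.
df["time_of_day_bucket"] = pd.cut(
    df["hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,                        # intervals are [start, end)
)
```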

Mathematical combinations: Create ratios (price_per_sqft = price / sqft), differences (years_since_renovation = current_year - renovation_year), products (total_value = quantity * price), and aggregates (total_purchases_last_30_days). These encode relationships between features that the model might not discover on its own, especially with linear models.
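
The ratio, difference, and product examples can be sketched in a few lines of pandas. The data is invented, current_year is pinned to a constant for reproducibility, and treating a zero sqft as missing is one possible convention, not the only one.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    "price": [300_000, 450_000, 120_000],
    "sqft": [1_500, 1_800, 0],          # 0 stands in for a bad record
    "renovation_year": [2010, 2020, 1995],
})
current_year = 2026  # pinned so the example is reproducible

# Ratio: treat zero square footage as missing to avoid dividing by zero.
houses["price_per_sqft"] = houses["price"] / houses["sqft"].replace(0, np.nan)

# Difference: elapsed time is usually more useful than a raw year.
houses["years_since_renovation"] = current_year - houses["renovation_year"]

# Product: total order value from quantity and unit price.
orders = pd.DataFrame({"quantity": [2, 5], "price": [9.99, 3.00]})
orders["total_value"] = orders["quantity"] * orders["price"]
```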

Text features: Extract word count, character count, number of special characters, presence of specific keywords, sentiment scores, and TF-IDF vectors. For email classification, features like number_of_links, contains_urgency_words, and email_length are highly predictive.
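
The simpler count-based features need only the standard library plus pandas. In this sketch the urgency word list, the link pattern, and the helper name text_features are all illustrative assumptions. Sentiment scores and TF-IDF vectors would need additional tooling and are left out.

```python
import re
import pandas as pd

emails = pd.Series([
    "URGENT: verify your account now at http://example.com/login",
    "Lunch on Thursday?",
])

# Illustrative keyword list; a real one would come from domain knowledge.
URGENCY_WORDS = {"urgent", "immediately", "now", "verify"}

def text_features(text: str) -> dict:
    """Compute a handful of simple, hand-crafted features for one email."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "email_length": len(text),
        "word_count": len(text.split()),
        "number_of_links": len(re.findall(r"https?://\S+", text)),
        "num_special_chars": len(re.findall(r"[^A-Za-z0-9\s]", text)),
        "contains_urgency_words": any(w in URGENCY_WORDS for w in words),
    }

features = pd.DataFrame([text_features(t) for t in emails])
```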

Aggregation features: For data with natural groups (customers, stores, products), compute group-level statistics: average purchase amount per customer, standard deviation of daily sales per store, number of unique products per category. These capture behavioral patterns at the group level.
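
With pandas, group-level statistics are typically computed with groupby and then merged back onto the row-level table. The table and column names below are made up for the sketch.

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    "product_id": ["a", "b", "a", "c", "c"],
})

# One row per customer with several summary statistics.
per_customer = purchases.groupby("customer_id").agg(
    avg_purchase=("amount", "mean"),
    std_purchase=("amount", "std"),
    n_unique_products=("product_id", "nunique"),
).reset_index()

# Attach the group-level statistics back onto each purchase row.
purchases = purchases.merge(per_customer, on="customer_id", how="left")
```

When the target is related to the aggregated column, compute these statistics from training data only so that information from held-out rows does not leak into the features.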

Interaction features: Multiply or combine two features to capture their joint effect. In a housing model, bedrooms * bathrooms, or neighborhood_median_income * square_footage, can be more predictive than either feature alone. Be selective, because the number of possible interactions grows quadratically with the number of features.
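
The sketch below builds one hand-picked interaction and then, for contrast, every pairwise product. The helper names all_pairwise_products and n_pairs are hypothetical; n_pairs just evaluates n(n-1)/2 to show how fast the candidate count grows.

```python
from itertools import combinations
import pandas as pd

homes = pd.DataFrame({
    "bedrooms": [3, 4],
    "bathrooms": [2, 3],
    "square_footage": [1500, 2200],
})

# A hand-picked interaction motivated by domain knowledge.
homes["bedrooms_x_bathrooms"] = homes["bedrooms"] * homes["bathrooms"]

def all_pairwise_products(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return one product column per unordered pair of the given columns."""
    out = {}
    for a, b in combinations(columns, 2):
        out[f"{a}_x_{b}"] = frame[a] * frame[b]
    return pd.DataFrame(out, index=frame.index)

base_cols = ["bedrooms", "bathrooms", "square_footage"]
pairs = all_pairwise_products(homes, base_cols)

def n_pairs(n_features: int) -> int:
    """Number of distinct pairwise interactions among n features."""
    return n_features * (n_features - 1) // 2
```

Three base features yield 3 pairs, but n_pairs(100) is 4950, which is why exhaustive generation is rarely a good default.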

Step 3: Transform Features for Model Compatibility

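Scaling: Standardization (subtract the mean, divide by the standard deviation) puts features on a comparable scale. Min-max normalization (rescale to the 0-1 range) is useful when a bounded range is required, but it is sensitive to outliers. SVM, KNN, regularized linear models, and neural networks all benefit from scaling, as does any model trained with gradient descent. Unregularized linear regression makes the same predictions with or without scaling; only the magnitude of its coefficients changes. Tree-based methods do not need scaling, because each split compares a single feature against a threshold, which rescaling does not affect.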
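
A small sketch using scikit-learn's StandardScaler and MinMaxScaler follows. The arrays are toy data. The scalers are fitted on the training rows only and then applied to new rows, so no statistics from held-out data are used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Standardization: learn mean and standard deviation from training data only.
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Min-max scaling: learn min and max from training data only.
mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)  # may fall outside [0, 1] for unseen values
```

Here the second test value (400) exceeds the training maximum (300), so its min-max scaled value is greater than 1. That is expected behavior, not a bug.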

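Encoding categorical variables: One-hot encoding creates a binary column for each category (city_new_york = 0 or 1). It works well for features with few categories but creates too many columns for high-cardinality features. Target encoding replaces each category with the average target value for that category; compute those averages on training data only (ideally out-of-fold), or the encoding leaks the target into the feature. Label encoding assigns integers (low=0, medium=1, high=2) and suits ordinal categories. Applied to unordered categories, it imposes an artificial ordering that can mislead linear and distance-based models.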
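
The three encodings can be sketched in plain pandas. The size column, its ordering, and the fallback to the global mean for unseen categories are assumptions made for this example. The target encoding here is the simplest version: averages learned on training rows and applied to new rows, without smoothing or out-of-fold computation.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["new_york", "boston", "new_york", "chicago"],
    "size": ["small", "large", "medium", "small"],   # hypothetical ordinal column
    "target": [1, 0, 1, 0],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(train["city"], prefix="city", dtype=int)

# Integer encoding for an ordinal feature: the mapping states the order explicitly.
size_order = {"small": 0, "medium": 1, "large": 2}
train["size_encoded"] = train["size"].map(size_order)

# Target encoding: category means learned on training data only.
city_means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

# Apply to new rows; unseen categories fall back to the global mean.
new_rows = pd.DataFrame({"city": ["new_york", "denver"]})
new_rows["city_target_enc"] = new_rows["city"].map(city_means).fillna(global_mean)
```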

Binning: Convert continuous variables into discrete buckets. Age might become age_group (18-25, 26-35, 36-45, etc.). This can help when the relationship between the feature and target is nonlinear and step-wise rather than smooth. Be careful not to lose signal by using too few bins.
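
With pandas, fixed-edge bins can be built with pd.cut and equal-frequency bins with pd.qcut. The edges, labels, and the choice of three quantile bins below are illustrative.

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 40, 45, 70], name="age")

# Fixed edges: intervals are (17, 25], (25, 35], (35, 45], (45, 120].
age_group = pd.cut(
    ages,
    bins=[17, 25, 35, 45, 120],
    labels=["18-25", "26-35", "36-45", "46+"],
)

# Equal-frequency bins: edges are chosen from the data's own quantiles.
age_tercile = pd.qcut(ages, q=3, labels=False)
```

Quantile edges are learned from the data, so in a real pipeline they should be computed on the training set and reused for new rows.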

Log and power transforms: Apply log(x), sqrt(x), or Box-Cox transforms to reduce skewness in features with long tails. Log and Box-Cox require strictly positive values, so use log(1 + x) when zeros are present. Income data, house prices, and population counts are typically right-skewed; a log transform makes their distribution more symmetric, which benefits linear models and distance-based algorithms.
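
The effect on skewness can be checked directly. This sketch uses NumPy only; the income values are invented, and skewness is a small hypothetical helper that computes the standard moment-based coefficient.

```python
import numpy as np

# Invented right-skewed incomes: one very large value creates a long tail.
income = np.array([20_000, 35_000, 50_000, 80_000, 1_000_000], dtype=float)

log_income = np.log1p(income)   # log(1 + x), which is also defined at x = 0
sqrt_income = np.sqrt(income)

def skewness(x) -> float:
    """Moment-based skewness: third central moment over variance ** 1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered**3).mean() / (centered**2).mean() ** 1.5)

raw_skew = skewness(income)
log_skew = skewness(log_income)
```

On this sample the log transform lowers the skewness but does not eliminate it, because a single extreme value still dominates five data points.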

Step 4: Select the Most Informative Features

Correlation analysis: Remove features that are highly correlated with each other (absolute correlation above a threshold such as 0.95) because they provide redundant information and can destabilize coefficient estimates in linear models. From each correlated pair, keep the feature that has the stronger correlation with the target.
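
One way to sketch this filter in pandas is shown below. The function name drop_correlated, the greedy pair-by-pair strategy, and the toy housing data are assumptions; other tie-breaking rules are equally defensible.

```python
import pandas as pd

def drop_correlated(X: pd.DataFrame, y: pd.Series, threshold: float = 0.95) -> pd.DataFrame:
    """Greedily drop one feature from each pair whose |correlation| exceeds threshold.

    Within a pair, the feature less correlated with the target is dropped.
    """
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return X.drop(columns=sorted(to_drop))

# Toy data: sqm is sqft in different units, so the two are perfectly correlated.
X = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0, 3000.0],
    "age": [30.0, 5.0, 20.0, 10.0, 40.0],
})
X["sqm"] = X["sqft"] * 0.0929
y = pd.Series([200.0, 310.0, 390.0, 500.0, 580.0])

reduced = drop_correlated(X, y)
```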

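Feature importance from models: Train a random forest or gradient boosting model and examine its feature importance scores. Features with near-zero importance can usually be removed without losing performance. This is one of the most practical feature selection methods, but impurity-based importances can be inflated for high-cardinality features and split arbitrarily among correlated ones, so confirm the ranking with permutation importance on held-out data.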
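
A sketch with scikit-learn follows, on synthetic data where only the first two of five columns carry signal. It reads the forest's built-in feature_importances_ and, as a cross-check, computes permutation importance on a held-out split. The data-generating rule is an assumption made so the expected ranking is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on columns 0 and 1; columns 2-4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Built-in impurity-based importances (normalized to sum to 1).
impurity_importance = model.feature_importances_

# Permutation importance on held-out data: drop in score when a column is shuffled.
perm = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
perm_importance = perm.importances_mean
```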

Recursive Feature Elimination (RFE): Train a model, remove the least important feature, retrain, repeat. Stop when removing features starts hurting cross-validation performance. This is computationally expensive but thorough.
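
scikit-learn packages this loop, including the cross-validated stopping rule, as RFECV. The sketch below runs it on synthetic data in which only the first two of six columns matter. The estimator choice (logistic regression) and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: only columns 0 and 1 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
signal = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=400)
y = (signal > 0).astype(int)

# Remove one feature per round; keep the subset with the best cross-validated accuracy.
selector = RFECV(
    LogisticRegression(max_iter=1000),
    step=1,
    cv=5,
    scoring="accuracy",
)
selector.fit(X, y)

selected_mask = selector.support_     # True for features that were kept
n_selected = selector.n_features_     # size of the chosen subset
ranking = selector.ranking_           # 1 = kept; larger = eliminated earlier
```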

Mutual information: Measures the statistical dependence between a feature and the target, capturing both linear and nonlinear relationships. Features with near-zero mutual information are likely noise on their own, although they can still matter in combination with other features. This is more general than correlation but more expensive to compute.
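
The difference from correlation shows up clearly on a symmetric nonlinear relationship. In this sketch the target is the square of one feature plus noise, a setup chosen because Pearson correlation is close to zero there. scikit-learn's mutual_info_regression provides the estimate.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_signal = rng.uniform(-1, 1, size=1000)   # drives the target, but not linearly
x_noise = rng.uniform(-1, 1, size=1000)    # unrelated to the target
y = x_signal**2 + 0.05 * rng.normal(size=1000)

X = np.column_stack([x_signal, x_noise])

# Pearson correlation misses the U-shaped dependence.
pearson = np.corrcoef(x_signal, y)[0, 1]

# Mutual information (nearest-neighbor estimate) picks it up.
mi = mutual_info_regression(X, y, random_state=0)
```

For classification targets, mutual_info_classif plays the same role.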

Step 5: Validate That Features Improve Performance

Every new feature should be tested through cross-validation. Add the feature, retrain the model with cross-validation, and compare metrics with and without the feature. If the improvement is not measurable, remove the feature; it adds complexity without value.
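
A minimal with-and-without comparison using scikit-learn's cross_val_score could look like this. The synthetic target is built from the ratio a / b, an assumption that makes the candidate feature genuinely useful to a linear model. The helper cv_r2 is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends on the ratio a / b.
rng = np.random.default_rng(0)
a = rng.uniform(1, 10, size=300)
b = rng.uniform(1, 10, size=300)
y = a / b + 0.05 * rng.normal(size=300)

X_base = np.column_stack([a, b])          # original features
X_new = np.column_stack([a, b, a / b])    # original features plus the candidate

def cv_r2(X, y):
    """Five-fold cross-validated R^2 scores for a plain linear regression."""
    return cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

base_scores = cv_r2(X_base, y)
new_scores = cv_r2(X_new, y)
improvement = new_scores.mean() - base_scores.mean()
```

Comparing the per-fold scores, not just the means, helps judge whether an improvement is consistent or within the fold-to-fold noise.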

Watch for data leakage. If a feature uses information that would not be available at prediction time (values derived from the target, data recorded after the prediction moment, or statistics computed with test rows included), it will appear highly predictive during development but fail in production. Compute all features using only data that would be available at prediction time.
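
Rolling statistics are a common place for this to go wrong. In the sketch below, which uses an invented daily sales table, the first feature includes the current day's value in its window and so leaks the quantity being predicted. The second shifts the series by one day so each row sees only earlier days.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A"] * 4 + ["B"] * 3,
    "day": [1, 2, 3, 4, 1, 2, 3],
    "sales": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0],
}).sort_values(["store", "day"])

by_store = sales.groupby("store")["sales"]

# Leaky: the 3-day window includes the current day's sales.
sales["leaky_3d_mean"] = by_store.transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

# Safe: shift by one day first, so the window covers strictly earlier days.
sales["prev_3d_mean"] = by_store.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
```

The first day of each store has no history, so the safe feature is missing there. How to fill that gap is a modeling decision.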

Track which features you tried, what worked, and what did not. Feature engineering is empirical: many ideas that seem promising fail, and some ideas that seem unlikely succeed. A systematic log prevents you from retrying failed experiments.

Key Takeaway

Feature engineering transforms raw data into variables that encode domain knowledge and make patterns more accessible to algorithms. Good features often matter more than algorithm choice. The process involves understanding the data, creating derived features, transforming them for compatibility with your model, selecting the most informative ones, and validating improvements through cross-validation.