---
name: data-science-feature-engineering
description: "Feature engineering for machine learning: encoding, scaling, transformations, datetime features, text features, and feature selection. Use when preparing data for modeling or improving model performance through better representations."
dependsOn: ["@data-science-eda", "@data-engineering-core"]
---

# Feature Engineering

Use this skill for creating, transforming, and selecting features that improve model performance.

## When to use this skill

- After EDA — convert insights into features
- Model underperforming — need better representations
- Handling different data types (numerical, categorical, text, datetime)
- Reducing dimensionality or selecting most predictive features

## Feature engineering workflow

1. **Numerical features**
   - Scaling (StandardScaler, MinMaxScaler, RobustScaler)
   - Transformations (log, sqrt, Box-Cox for skewness)
   - Binning (equal-width, quantile, custom)
   - Interaction features

2. **Categorical features**
   - One-hot encoding (low cardinality)
   - Target/Mean encoding (high cardinality)
   - Ordinal encoding (ordered categories)
   - Frequency/rare category handling

3. **Datetime features**
   - Extract components (year, month, day, hour, dayofweek)
   - Cyclical encoding (sin/cos for time cycles)
   - Time since/duration features

4. **Text features**
   - TF-IDF, CountVectorizer
   - Embeddings (sentence-transformers)
   - Basic text stats (length, word count)

5. **Feature selection**
   - Filter methods (correlation, mutual information)
   - Wrapper methods (recursive feature elimination)
   - Embedded methods (L1 regularization, tree importance)
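
As a rough illustration of the numerical and datetime steps above, the sketch below applies a log transform, quantile binning, and cyclical hour encoding with NumPy and pandas. The toy data and the column names (`income`, `timestamp`) are assumptions made for the example, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names and values are illustrative assumptions.
df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 1_200_000.0],
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 06:00",
        "2024-01-01 12:00", "2024-01-01 23:00",
    ]),
})

# Log transform compresses the long right tail; log1p also handles zeros.
df["income_log"] = np.log1p(df["income"])

# Quantile binning: two bins holding roughly equal numbers of rows.
df["income_bin"] = pd.qcut(df["income"], q=2, labels=False)

# Cyclical encoding: place the hour on the unit circle so that
# 23:00 ends up close to 00:00 instead of 23 units away.
hour = df["timestamp"].dt.hour
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
```

In a real project these steps would usually live inside a pipeline (for example via `FunctionTransformer`) so that they are applied identically at training and inference time.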

## Quick tool selection

| Task | Default choice | Notes |
|---|---|---|
| Preprocessing pipelines | **sklearn.pipeline + ColumnTransformer** | Reproducible, cross-validation safe |
| Categorical encoding | **category_encoders** | Target, binary, hashing, and other encoders beyond sklearn's built-ins |
| Feature selection | **sklearn.feature_selection** | Mutual info, RFE, SelectFromModel |
| Text embeddings | **sentence-transformers** | Pre-trained semantic embeddings |
| Broad transformer library | **Feature-engine** | sklearn-compatible transformers for imputation, encoding, discretization, and selection |
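
As one possible use of `sklearn.feature_selection` from the table, the sketch below ranks features by mutual information and keeps the top one. The synthetic data (one informative column plus three noise columns) and the choice of `k=1` are assumptions made purely for illustration.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: column 0 determines the label, columns 1-3 are pure noise.
rng = np.random.default_rng(0)
n_samples = 500
signal = rng.normal(size=n_samples)
noise = rng.normal(size=(n_samples, 3))
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Fix random_state so the mutual information estimate is reproducible.
score_func = partial(mutual_info_classif, random_state=0)

# Filter method: score each feature independently and keep the best k.
selector = SelectKBest(score_func=score_func, k=1)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)
```

Filter methods like this are cheap but score each feature in isolation; RFE or `SelectFromModel` may be a better fit when features only matter in combination.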

## Core implementation rules

### 1) Use pipelines to prevent leakage

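```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column names; replace with the columns in your dataset.
numerical_features = ['age', 'income']
categorical_features = ['city']

# Scale numeric columns and one-hot encode categorical columns in one step.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Preprocessing is refit together with the model on whatever data
# pipeline.fit() receives, so held-out data never influences it.
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])
```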

### 2) Fit on train only, transform on all

```python
# Correct: fit_transform on train, transform on test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # Only transform!
```
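
The same rule applies inside cross-validation: passing the whole pipeline to `cross_val_score` refits the preprocessor on each training fold. The following is a minimal sketch on synthetic data; the column names (`age`, `city`) and the model settings are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with one numeric and one categorical column.
rng = np.random.default_rng(0)
n_samples = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, size=n_samples),
    "city": rng.choice(["A", "B", "C"], size=n_samples),
})
y = (X["age"] > 40).astype(int)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold fits the scaler and encoder on its training split only,
# then transforms and scores the held-out split.
scores = cross_val_score(pipeline, X, y, cv=5)
```

Preprocessing the full dataset first and cross-validating only the model would let statistics from the validation folds leak into training.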

### 3) Handle unknown categories

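```python
from sklearn.preprocessing import OneHotEncoder

# Categories unseen during fit are encoded as all zeros at transform time
OneHotEncoder(handle_unknown='ignore')
# OR (scikit-learn >= 1.1): fold categories seen in under 1% of rows into one
# "infrequent" column and map unknown categories to that same column
OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=0.01)
```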
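
To make the `'ignore'` behavior concrete, here is a small sketch in which the test data contains a category that never appeared during fitting. The color values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["blue"]])
test = np.array([["green"], ["purple"]])  # "purple" never appears in train

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are sorted: blue, green, red.
# "green" sets the middle column; "purple" becomes a row of zeros.
encoded = encoder.transform(test).toarray()
```

With the default `handle_unknown='error'`, the same `transform` call would raise instead of returning a zero row.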

### 4) Document your features

Track which features were created, why, and their expected impact.
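
One lightweight way to do this is to keep a small structured log alongside the code. The sketch below is only a suggestion: the `FeatureRecord` class, its fields, and the example entries are hypothetical and not part of any library.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureRecord:
    """Hypothetical record describing one engineered feature."""
    name: str
    source_columns: tuple
    rationale: str
    expected_impact: str


feature_log = [
    FeatureRecord(
        name="hour_sin",
        source_columns=("timestamp",),
        rationale="Cyclical encoding keeps 23:00 adjacent to 00:00.",
        expected_impact="Should help the model capture daily seasonality.",
    ),
    FeatureRecord(
        name="income_log",
        source_columns=("income",),
        rationale="Income is heavily right-skewed.",
        expected_impact="Should stabilize linear model coefficients.",
    ),
]

# Plain dicts are easy to dump to JSON or load into a DataFrame for review.
rows = [asdict(record) for record in feature_log]
```

Reviewing this log after evaluation makes it easier to see which expectations held up and which features can be dropped.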

## Common anti-patterns

- ❌ Fitting preprocessors on full dataset (leakage!)
- ❌ One-hot encoding high-cardinality features (dimension explosion)
- ❌ Skipping feature scaling for distance-based models (k-NN, SVM, k-means)
- ❌ Creating features without domain reasoning
- ❌ Not validating that feature distributions match between train and test
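
For the last point, one simple check is a two-sample Kolmogorov-Smirnov test per numeric feature. The sketch below assumes SciPy is available; the helper name `drifted_features` and the `0.1` threshold on the KS statistic are illustrative assumptions, not recommended defaults.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(train, test, names, threshold=0.1):
    """Return names of columns whose KS statistic exceeds the threshold.

    Hypothetical helper: the threshold is arbitrary and should be tuned.
    """
    flagged = []
    for j, name in enumerate(names):
        result = ks_2samp(train[:, j], test[:, j])
        if result.statistic > threshold:
            flagged.append(name)
    return flagged


# Synthetic data: both columns are standard normal in train,
# but the second column is shifted in test to simulate drift.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 2))
test = rng.normal(size=(2000, 2))
test[:, 1] += 3.0

flagged = drifted_features(train, test, ["stable", "shifted"])
```

A flagged feature is a prompt to investigate (sampling differences, time drift, or a pipeline bug), not proof that the feature must be removed.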

## Progressive disclosure

- `../references/categorical-encoding.md` — Comprehensive encoding guide
- `../references/datetime-features.md` — Time-based feature patterns
- `../references/text-features.md` — NLP feature engineering
- `../references/feature-selection.md` — Selection strategies and implementations

## Related skills

- `@data-science-eda` — Understand data before engineering
- `@data-science-model-evaluation` — Validate feature impact
- `@data-engineering-core` — Data processing fundamentals

## References

- [sklearn Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
- [category_encoders](https://contrib.scikit-learn.org/category_encoders/)
- [Feature-engine Documentation](https://feature-engine.trainindata.com/)
- [Sentence Transformers](https://www.sbert.net/)