
House Price Prediction — v2 Enhanced ML Pipeline

An advanced, modular machine learning pipeline for predicting median house prices in California. Version 2 introduces feature engineering, regularized linear models (Ridge, Lasso), cross-validation, hyperparameter tuning, and a custom Gradient Descent Regressor built from scratch — all organized into reusable source modules.

Preview

[Figure: feature distributions for the California housing dataset]
Feature distributions and relationships used to guide feature engineering in v2 — log transforms, derived ratios, and standardization for Ridge, Lasso, and the custom Gradient Descent Regressor.

Problem & Context

Version 2 aims to:

- Build a fully modular, extensible ML pipeline for structured tabular data.
- Introduce feature engineering and standardization to improve stability and performance.
- Implement and compare multiple linear models: OLS, Ridge, Lasso, and a custom Gradient Descent Regressor.
- Evaluate robustness through 5-fold cross-validation and hyperparameter tuning.
- Establish a reproducible training workflow that cleanly separates preprocessing, training, evaluation, and inference.
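The Ridge/Lasso comparison under 5-fold cross-validation can be sketched as follows. This is a minimal illustration, not the repository's code: synthetic data stands in for the housing features, and the `alpha` values are placeholders rather than the tuned values from the project.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered housing features; the real
# pipeline loads data/raw/ and derives ratio features before this step.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=500)

for name, model in [("Ridge", Ridge(alpha=1.0)), ("Lasso", Lasso(alpha=0.01))]:
    # Standardization inside the pipeline so each CV fold scales
    # on its own training split, avoiding leakage into validation folds.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```

Putting the scaler inside the pipeline (rather than scaling once up front) is what makes the 5-fold scores an honest estimate of generalization.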

[Figure: geographical distribution of California housing prices]
EDA: geographic distribution of median house values across California. Location (latitude/longitude) and ocean proximity drive much of the variation captured later by the model.


Results (Summary)

Custom Gradient Descent Regressor: Converged in ~1500 iterations. Test RMSE: ~74.6K USD. Test R²: ~0.57.

The model explains ~57% of variance in housing prices, captures strong linear trends (e.g., median income → price), and misses nonlinear and interaction effects. Full model comparisons (Ridge, Lasso, CV results) are documented in the repository’s reports/report.md.
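A from-scratch batch gradient descent regressor along these lines might look like the following minimal sketch. The class, parameter names, and convergence check are illustrative, not the actual `src/` implementation; standardized features are assumed, as in the pipeline above.

```python
import numpy as np

class GradientDescentRegressor:
    """Minimal linear regression trained by batch gradient descent on MSE."""

    def __init__(self, lr=0.01, n_iters=1500, tol=1e-8):
        self.lr, self.n_iters, self.tol = lr, n_iters, tol

    def fit(self, X, y):
        n, d = X.shape
        self.w_ = np.zeros(d)
        self.b_ = 0.0
        prev_loss = np.inf
        for _ in range(self.n_iters):
            resid = X @ self.w_ + self.b_ - y      # predictions minus targets
            loss = (resid @ resid) / n             # mean squared error
            if abs(prev_loss - loss) < self.tol:   # early stop on convergence
                break
            prev_loss = loss
            self.w_ -= self.lr * (2 / n) * (X.T @ resid)  # dMSE/dw
            self.b_ -= self.lr * (2 / n) * resid.sum()    # dMSE/db
        return self

    def predict(self, X):
        return X @ self.w_ + self.b_
```

With standardized inputs, a fixed learning rate like this is usually stable; the early-stopping tolerance is what lets training halt around a convergence point rather than always running the full iteration budget.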

Tech Stack

Python 3.11 · pandas · NumPy · matplotlib · scikit-learn · custom Gradient Descent · Jupyter Notebook

Architecture / How It Works

The pipeline runs end-to-end via python3 -m src.train. Data is loaded from data/raw/, preprocessed and feature-engineered dynamically (no persisted processed data), then used to train and evaluate models. The src/ modules handle config, data loading, feature engineering, preprocessing, gradient descent, evaluation, hyperparameter tuning, and model I/O. Notebooks document the full development workflow from exploration through baselines, cross-validation, and pipeline demo.
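The orchestration described above might be sketched like this. The function names and the tiny synthetic frame are illustrative stand-ins for the real `src/` modules and `data/raw/` files; the point is the flow — load, engineer features on the fly, split, then fit preprocessing on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_raw():
    # Stand-in for the data-loading module reading data/raw/;
    # a small synthetic frame keeps this sketch self-contained.
    rng = np.random.default_rng(0)
    return pd.DataFrame({
        "median_income": rng.uniform(1, 10, 200),
        "total_rooms": rng.uniform(500, 5000, 200),
        "households": rng.uniform(100, 1000, 200),
    })

def engineer(df):
    # Derived ratio feature computed dynamically — nothing persisted to disk.
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    return out

def run():
    df = engineer(load_raw())
    y = 40.0 * df["median_income"]  # toy target for the sketch
    X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)
    scaler = StandardScaler().fit(X_train)  # fit on the training split only
    return scaler.transform(X_test)

print(run().shape)  # → (50, 4)
```

Keeping feature engineering as a pure function of the raw frame is what makes the "no persisted processed data" design work: every run regenerates features deterministically from `data/raw/`.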

Key Takeaways

Version 2 demonstrates ML engineering best practices: modularity, reproducibility, and robust evaluation. Feature engineering and regularization improve over the v1 baseline. The custom Gradient Descent implementation reinforces understanding of optimization fundamentals. Future directions include integrated GridSearchCV, nonlinear models (Random Forest, XGBoost), experiment tracking (MLflow, W&B), and production-ready tooling.
