House Price Prediction — v2 Enhanced ML Pipeline

An advanced, modular machine learning pipeline for predicting median house prices in California. Version 2 introduces feature engineering, regularized linear models (Ridge, Lasso), cross-validation, hyperparameter tuning, and a custom Gradient Descent Regressor built from scratch — all organized into reusable source modules.


Preview

Figure: Feature distributions for the California housing dataset. These distributions and relationships guided feature engineering in v2: log transforms, derived ratios, and standardization for Ridge, Lasso, and custom Gradient Descent.
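The three transformations named in the caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's code: the function names (build_features, fit_scaler, standardize), the choice of columns, and the sample values are all assumptions made for the example.

```python
import numpy as np

def build_features(total_rooms, households, population, median_income):
    """Stack two derived ratios and a log-transformed income into one matrix."""
    rooms_per_household = total_rooms / households      # derived ratio
    people_per_household = population / households      # derived ratio
    log_income = np.log1p(median_income)                # compress the right tail
    return np.column_stack([rooms_per_household, people_per_household, log_income])

def fit_scaler(X_train):
    """Learn per-column mean and standard deviation from training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return mean, std

def standardize(X, mean, std):
    """Apply previously learned statistics, e.g. to test or inference rows."""
    return (X - mean) / std

# Four illustrative rows (made-up values, not results from the project).
X_train = build_features(
    total_rooms=np.array([880.0, 7099.0, 1467.0, 1274.0]),
    households=np.array([126.0, 1138.0, 177.0, 219.0]),
    population=np.array([322.0, 2401.0, 496.0, 558.0]),
    median_income=np.array([8.3252, 8.3014, 7.2574, 5.6431]),
)
mean, std = fit_scaler(X_train)
X_train_scaled = standardize(X_train, mean, std)
```

Keeping fit_scaler separate from standardize is what makes the transformation consistent between training and inference: the statistics are learned once on the training split and reused unchanged on new rows.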

Problem & Context

v2 has five goals:

Build a fully modular, extensible ML pipeline for structured tabular data.
Introduce feature engineering and standardization to improve stability and performance.
Implement and compare multiple linear models (OLS, Ridge, Lasso, custom Gradient Descent).
Evaluate robustness through 5-fold cross-validation and hyperparameter tuning.
Establish a reproducible training workflow that cleanly separates preprocessing, training, evaluation, and inference.

Figure: Geographic distribution of median house values across California (EDA). Location (latitude/longitude) and ocean proximity drive much of the variation that the model later captures.

What’s New in v2 (vs v1)

Feature Engineering — Full transformations and scaling (v1 had minimal)
Models — Custom Gradient Descent, OLS, Ridge, Lasso (v1 had OLS only)
Cross-Validation — 5-fold CV for stability analysis (v1 had none)
Pipeline — Modular Python pipeline in src/ (v1 was notebook-only)
Performance — Lower RMSE after feature engineering and regularization
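To make the 5-fold cross-validation item concrete, here is a NumPy-only sketch of the mechanics: shuffle the rows, split them into five folds, fit on four, and score on the fifth. The project itself lists scikit-learn in its stack, so this is not its implementation. The helper names (ridge_fit, kfold_rmse), the closed-form Ridge solver, and the synthetic data are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form Ridge on centered data, so the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def kfold_rmse(X, y, alpha, k=5, seed=0):
    """Return one validation RMSE per fold for a given regularization strength."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = ridge_fit(X[train], y[train], alpha)
        pred = X[val] @ w + b
        scores.append(np.sqrt(np.mean((y[val] - pred) ** 2)))
    return np.array(scores)

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=200)

# alpha=0 reduces to ordinary least squares; larger alpha shrinks the weights.
for alpha in (0.0, 1.0, 1000.0):
    scores = kfold_rmse(X, y, alpha)
    print(f"alpha={alpha:g}: mean RMSE={scores.mean():.3f}, std={scores.std():.3f}")
```

The mean across folds estimates out-of-sample error, and the spread across folds is the stability signal. Looping over a few alpha values, as above, is also the simplest form of a grid search.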

What I Built

Custom modular ML pipeline — Clean separation of preprocessing, training, evaluation, and utilities in src/
Feature engineering & standardization — Consistent transformations across training and inference
Multiple linear models — OLS, Ridge (L2), Lasso (L1), plus custom Gradient Descent Regressor with configurable learning rate, iterations, and convergence tracking
5-fold Cross-Validation — Model stability and variance assessment
Hyperparameter tuning — Grid search for regularized models
Reproducible pipeline — End-to-end training via python3 -m src.train
Five structured notebooks — Exploration, custom GD evaluation, sklearn baseline, CV comparison, pipeline demo
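A gradient descent regressor with a configurable learning rate, iteration cap, and convergence tracking might look roughly like the sketch below. It is written for illustration under stated assumptions and is not the repository's class: the name GradientDescentRegressor, the tol stopping rule, the default values, and the attribute names are all hypothetical.

```python
import numpy as np

class GradientDescentRegressor:
    """Linear regression fitted by batch gradient descent on mean squared error."""

    def __init__(self, learning_rate=0.05, n_iterations=1500, tol=1e-10):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.tol = tol  # stop once the loss changes by less than this

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights_ = np.zeros(n_features)
        self.bias_ = 0.0
        self.loss_history_ = []  # one MSE value per iteration
        for _ in range(self.n_iterations):
            residual = X @ self.weights_ + self.bias_ - y
            loss = float(np.mean(residual ** 2))
            self.loss_history_.append(loss)
            # Gradients of the MSE with respect to weights and bias.
            grad_w = (2.0 / n_samples) * (X.T @ residual)
            grad_b = 2.0 * residual.mean()
            self.weights_ -= self.learning_rate * grad_w
            self.bias_ -= self.learning_rate * grad_b
            if len(self.loss_history_) > 1 and abs(self.loss_history_[-2] - loss) < self.tol:
                break  # converged: the loss has stopped improving
        return self

    def predict(self, X):
        return X @ self.weights_ + self.bias_

# Noise-free synthetic data, so the true parameters are recoverable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # roughly standardized features
y = X @ np.array([2.0, -3.0]) + 5.0

model = GradientDescentRegressor().fit(X, y)
print(len(model.loss_history_), model.weights_, model.bias_)
```

Standardizing the features first matters here: with columns on very different scales, a single learning rate is either too large for some directions or too slow for others. Plotting loss_history_ against the iteration index gives the convergence curve.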

Results (Summary)

Custom Gradient Descent Regressor: Converged in ~1500 iterations. Test RMSE: ~74.6K USD. Test R²: ~0.57.

The model explains ~57% of the variance in housing prices: it captures strong linear trends (e.g., median income → price) but misses nonlinear and interaction effects. Full model comparisons (Ridge, Lasso, CV results) are documented in the repository’s reports/report.md.
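For readers less familiar with the two metrics, the sketch below shows how RMSE and R² are computed from predictions. The function names and the four sample values are made up for illustration and are not the project's results. R² is one minus the ratio of residual to total sum of squares, so a value of about 0.57 means the model removes about 57% of the variance that a predict-the-mean baseline leaves behind.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up prices in thousands of USD.
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_pred = np.array([250.0, 280.0, 390.0, 460.0])

print(rmse(y_true, y_pred))      # about 33.9
print(r2_score(y_true, y_pred))  # about 0.908
```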

Tech Stack

Python 3.11, pandas, NumPy, matplotlib, scikit-learn, custom Gradient Descent, Jupyter Notebook

Architecture / How It Works

The pipeline runs end-to-end via python3 -m src.train. Data is loaded from data/raw/, then preprocessed and feature-engineered in memory on each run (no processed data is persisted), and finally used to train and evaluate models. The src/ modules handle config, data loading, feature engineering, preprocessing, gradient descent, evaluation, hyperparameter tuning, and model I/O. Notebooks document the full development workflow, from exploration through baselines and cross-validation to the pipeline demo.
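The stage separation described above could be wired together roughly as follows. This is a single-file sketch under loud assumptions: every function name, the Config fields, and the synthetic data standing in for data/raw/ are hypothetical, and the real src/ modules may be organized quite differently.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class Config:
    test_fraction: float = 0.2
    seed: int = 42

def load_data(cfg):
    """Stand-in for reading data/raw/: returns synthetic features and target."""
    rng = np.random.default_rng(cfg.seed)
    X = rng.normal(loc=[5.0, 30.0], scale=[2.0, 10.0], size=(500, 2))
    y = 40.0 * X[:, 0] + 1.5 * X[:, 1] + 20.0 + rng.normal(scale=5.0, size=500)
    return X, y

def split(X, y, cfg):
    """Shuffle rows and hold out a test fraction."""
    idx = np.random.default_rng(cfg.seed).permutation(len(y))
    n_test = int(len(y) * cfg.test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

def fit_preprocessor(X_train):
    """Learn scaling statistics from the training split only."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def preprocess(X, stats):
    mean, std = stats
    return (X - mean) / std

def train(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def evaluate(coef, X, y):
    """Test RMSE for a fitted coefficient vector (last entry is the intercept)."""
    pred = X @ coef[:-1] + coef[-1]
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def run(cfg):
    X, y = load_data(cfg)
    X_train, X_test, y_train, y_test = split(X, y, cfg)
    stats = fit_preprocessor(X_train)  # recomputed on each run, never persisted
    coef = train(preprocess(X_train, stats), y_train)
    return {
        "n_train": len(y_train),
        "n_test": len(y_test),
        "test_rmse": evaluate(coef, preprocess(X_test, stats), y_test),
    }

result = run(Config())
print(result)
```

Because each stage is a plain function of its inputs, swapping OLS for Ridge, Lasso, or a custom gradient descent model only touches train, and inference can reuse preprocess with the stored statistics.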

Key Takeaways

Version 2 demonstrates ML engineering best practices: modularity, reproducibility, and robust evaluation. Feature engineering and regularization improve over the v1 baseline. The custom Gradient Descent implementation reinforces understanding of optimization fundamentals. Future directions include integrated GridSearchCV, nonlinear models (Random Forest, XGBoost), experiment tracking (MLflow, W&B), and production-ready tooling.
