House Price Prediction — v1 Baseline Linear Model

A baseline machine learning pipeline for predicting median house prices in California. This first version establishes a clean, reproducible workflow for loading, exploring, and modeling structured tabular data using Linear Regression and the Normal Equation.

GitHub Repo View v2 (Feature Engineering)

Preview

House price v1 baseline — EDA and Linear Regression workflow — V1 baseline workflow: exploratory notebooks (schema, correlation, distributions) and Linear Regression trained with the Normal Equation. This clean baseline set the stage for feature engineering and regularization in v2.

Problem & Context

The goal of this version is to build a clean, reproducible baseline model using Linear Regression, establish a solid workflow for loading, exploring, and modeling structured tabular data, and evaluate baseline performance to guide improvements in future versions. This version uses the Normal Equation (closed-form solution) for Linear Regression as implemented in scikit-learn.

Dataset

The project uses the California Housing dataset (~20,000 samples, 8 numerical + 1 categorical feature) with median_house_value as the target. Features include geographic coordinates (longitude, latitude), housing stock (housing_median_age, total_rooms, total_bedrooms), demographic context (population, households), median_income, and ocean_proximity.

Key insight from EDA: median_income shows the strongest positive correlation with house value.

What I Built

Exploratory Data Analysis — Inspected schema, data types, and null values; visualized distributions; computed correlation matrix; examined categorical feature impact
Baseline feature selection — median_income, housing_median_age, latitude, longitude, ocean_proximity
Preprocessing — ColumnTransformer for numeric & categorical columns; OneHotEncoder for ocean_proximity
Linear Regression — Trained using the Normal Equation via scikit-learn
Evaluation — RMSE and R² score on test set

Results

RMSE: ≈ $73,000 | R²: ≈ 0.6

The model explains ~60% of variance in housing prices. It captures strong linear trends (e.g., income vs. price) but misses nonlinear and interaction effects — which motivated the feature engineering and enhanced pipeline in v2.

Tech Stack

Python 3.11 pandas NumPy matplotlib scikit-learn Jupyter Notebook

Key Takeaways

A clean baseline provides a solid reference point for further experimentation. Median income is the most informative predictor in this dataset. Performance can be improved through feature transformations, nonlinear terms, and regularization — which led to the v2 enhanced pipeline.

← Back to Projects