House Price Prediction — v1 Baseline Linear Model
A baseline machine learning pipeline for predicting median house prices in California. This first version establishes a clean, reproducible workflow for loading, exploring, and modeling structured tabular data using Linear Regression and the Normal Equation.
Preview
Problem & Context
The goal of this version is to build a clean, reproducible baseline model using Linear Regression, establish a solid workflow for loading, exploring, and modeling structured tabular data, and evaluate baseline performance to guide improvements in future versions. This version uses the Normal Equation (closed-form solution) for Linear Regression as implemented in scikit-learn.
Dataset
The project uses the California Housing dataset (~20,000 samples, 8 numerical + 1 categorical feature) with median_house_value as the target. Features include geographic coordinates (longitude, latitude), housing stock (housing_median_age, total_rooms, total_bedrooms), demographic context (population, households), median_income, and ocean_proximity.
Key insight from EDA: median_income shows the strongest positive correlation with house value.
What I Built
- Exploratory Data Analysis — Inspected schema, data types, and null values; visualized distributions; computed correlation matrix; examined categorical feature impact
- Baseline feature selection — median_income, housing_median_age, latitude, longitude, ocean_proximity
- Preprocessing — ColumnTransformer for numeric & categorical columns; OneHotEncoder for ocean_proximity
- Linear Regression — Trained using the Normal Equation via scikit-learn
- Evaluation — RMSE and R² score on test set
Results
RMSE: ≈ $73,000 | R²: ≈ 0.6
The model explains ~60% of variance in housing prices. It captures strong linear trends (e.g., income vs. price) but misses nonlinear and interaction effects — which motivated the feature engineering and enhanced pipeline in v2.
Tech Stack
Key Takeaways
A clean baseline provides a solid reference point for further experimentation. Median income is the most informative predictor in this dataset. Performance can be improved through feature transformations, nonlinear terms, and regularization — which led to the v2 enhanced pipeline.