Graduate Underemployment / Overqualification Prediction

Full machine learning pipeline for predicting overqualification (underemployment) in recruitment using the NGS (National Graduate Survey) structured hiring dataset. Built for the SFU Data Science Student Society ML Hackathon; uses CatBoost with a focus on predictive performance and interpretability (feature importance, optional SHAP).

GitHub Repo

Preview

Exploration heatmap from underemployment prediction EDA — Exploratory analysis: correlation and feature relationships in the NGS hiring dataset, used to inform preprocessing and feature selection for the CatBoost pipeline.

Problem & Context

The goal was to build a robust model that accurately estimates overqualification probability from candidate attributes (education, experience, skills, demographics), work with the NGS dataset and its feature structure (survey codes, missing conventions), and train a CatBoost-based model with validation and leaderboard-oriented iteration. The solution achieved 0.75174 accuracy on the Public leaderboard and 0.70511 on the Private leaderboard, placing close to the top-performing teams.

Model results and evaluation metrics — Model evaluation and interpretability — feature importance and validation metrics from the CatBoost classifier (0.75 public / 0.71 private leaderboard accuracy at the SFU ML Hackathon).

What It Does

Modular ML pipeline — Clean separation of data loading, preprocessing, feature engineering, model training, evaluation, and prediction in src/
NGS-aware preprocessing — Handling of special codes (6, 9, 99) and normalization of mixed-type columns
CatBoost classifier — Native categorical support, early stopping, configurable hyperparameters
Stratified K-fold cross-validation and optional grid search for hyperparameter tuning
Interpretability — Feature importance and optional SHAP integration
Reproducible workflow — python3 -m src.train and python3 -m src.predict for end-to-end training and submission
Five Jupyter notebooks — Exploration, preprocessing, training/tuning, evaluation/interpretability, pipeline demo

Tech Stack

Python CatBoost pandas NumPy scikit-learn SHAP Matplotlib Seaborn Jupyter

Key Takeaways

Structured pipelines and NGS-aware preprocessing were essential for leaderboard performance. CatBoost’s native categorical handling and interpretability (feature importance, SHAP) made it a strong choice for this tabular classification task.

← Back to Projects