
Graduate Underemployment / Overqualification Prediction

A full machine learning pipeline that predicts overqualification (underemployment) in recruitment from the structured National Graduate Survey (NGS) hiring dataset. Built for the SFU Data Science Student Society ML Hackathon, it uses CatBoost and focuses on both predictive performance and interpretability (feature importance, optional SHAP).

Preview

Exploration heatmap from underemployment prediction EDA
Exploratory analysis: correlation and feature relationships in the NGS hiring dataset, used to inform preprocessing and feature selection for the CatBoost pipeline.
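As a rough illustration of this kind of exploration, the sketch below masks survey-style "missing" codes before computing correlations. The sentinel values (96, 99), the column names, and the helper `mask_missing_codes` are all hypothetical; the actual codes would have to come from the NGS codebook, and this is not necessarily how the project's notebook does it.

```python
import pandas as pd

# Hypothetical sentinel codes standing in for "valid skip" / "not stated".
# The real values must be taken from the NGS codebook.
MISSING_CODES = [96, 99]


def mask_missing_codes(df: pd.DataFrame, codes=MISSING_CODES) -> pd.DataFrame:
    """Return a copy of df with sentinel survey codes replaced by NaN."""
    return df.mask(df.isin(codes))


# Toy data with made-up column names (not actual NGS variables).
raw = pd.DataFrame({
    "education_level": [1, 2, 3, 99, 2, 3],
    "years_experience": [0, 2, 5, 1, 96, 4],
    "overqualified": [1, 1, 0, 1, 0, 0],
})

clean = mask_missing_codes(raw)

# Pairwise correlation; rows with NaN are skipped per column pair.
corr = clean.corr()
print(corr.round(2))
```

Leaving sentinel codes in place would treat values like 99 as very large ordinary numbers and distort the correlations, which is why the masking step comes first. The resulting matrix could then be passed to a plotting function such as Seaborn's `heatmap`.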

Problem & Context

The goal was to build a robust model that estimates overqualification probability from candidate attributes (education, experience, skills, demographics). That meant working with the NGS dataset and its feature structure (survey codes, missing-value conventions) and training a CatBoost-based model with validation and leaderboard-oriented iteration. The solution achieved 0.75174 accuracy on the public leaderboard and 0.70511 on the private leaderboard, placing close to the top-performing teams.

Model results and evaluation metrics
Model evaluation and interpretability: feature importance and validation metrics from the CatBoost classifier (0.75 public / 0.71 private leaderboard accuracy at the SFU ML Hackathon).

What It Does

Tech Stack

Python, CatBoost, pandas, NumPy, scikit-learn, SHAP, Matplotlib, Seaborn, Jupyter

Key Takeaways

Structured pipelines and NGS-aware preprocessing were essential for leaderboard performance. CatBoost’s native categorical handling and interpretability (feature importance, SHAP) made it a strong choice for this tabular classification task.
