Data Science Portfolio

ML Validation & Data Quality

Case-based flashcards connecting machine learning concepts to real product, data quality, and QA problems.

MS in Data Science Candidate, Boston University — Expected May 2026

Open to Opportunities

Primary focus: ML QA Engineer · Data Quality Engineer.
Secondary focus: QA Automation Engineer with Python · Data Analyst · Junior Data Scientist.

Learning Path

Learning Path: From ML Concepts to Production QA

I use this page to connect interview preparation with applied data science, model validation, and real product-quality problems.

Beginner

Foundations

Why it matters: These concepts define whether a model is learning real signal or just memorizing noise.

Intermediate

Intermediate ML

Why it matters: Model choice only matters when evaluation is honest, comparable, and tied to risk.

Applied

Model Validation

Why it matters: Validation shows where a model fails, who is affected, and whether it is safe enough to use.

Advanced

Production ML / MLOps

Why it matters: Production systems need monitoring, documentation, and rollback paths after the notebook ends.

Advanced

LLMs and Embeddings

Why it matters: Semantic systems can improve discovery, but similarity is not the same as verified truth.

Interview Prep

Concept → Example → Flashcard → Interview Question

Each topic pairs interview questions with a portfolio example and the QA failure mode I would test before trusting the system.

Regression

Regression Interview Questions

What is the difference between RMSE and MAE?
Why can R² be misleading?
When would you choose Linear Regression over Random Forest?
What would you check before using a regression model in production?

WPH / Portfolio Example

Regression could support translation-demand forecasting or expected time-to-English-publication, but only after the World Publishing Houses dataset has enough historical examples across countries, languages, genres, and publisher types.

QA Angle

A regression model should not be trusted just because it produces a number. I would check baseline comparison, residual patterns, outliers, data drift, and whether errors are larger for specific countries or categories.
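
The RMSE versus MAE question and the outlier check above can be illustrated with a small dependency-free sketch. The numbers are hypothetical publication lags in weeks, invented for illustration and not taken from the World Publishing Houses dataset.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: squaring lets large misses dominate."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical values (weeks of publication lag), for illustration only.
y_true = [10, 12, 11, 13, 50]   # the last record is an outlier
y_pred = [11, 11, 12, 12, 14]   # the model misses the outlier badly

print(f"MAE:  {mae(y_true, y_pred):.2f}")
print(f"RMSE: {rmse(y_true, y_pred):.2f}")
```

Four of the five predictions are off by one week, yet the single large miss pulls RMSE to roughly twice the MAE. A wide gap between the two metrics is a cue to inspect outliers and residual patterns before trusting either number.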

Classification

Classification Interview Questions

What is the difference between precision and recall?
Why is accuracy dangerous with imbalanced classes?
When would you use a confusion matrix?
How do you choose a classification threshold?

WPH / Portfolio Example

World Publishing Houses could classify books into reader buckets such as “Read now in English,” “Coming soon,” or “Not yet in English.” Because the current dataset is still small for supervised ML and reader buckets may be imbalanced, accuracy alone would be misleading.

QA Angle

Before training a classifier, I would check class balance, minority-class count, label definitions, and whether the model simply learns the majority class.
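
A minimal sketch of that pre-training check, using only the standard library. The label counts and the `min_minority_count` threshold are assumptions chosen for illustration; the real cutoff depends on the evaluation design.

```python
from collections import Counter

def class_balance_report(labels, min_minority_count=10):
    """Summarize class balance and flag classes too small to evaluate.

    min_minority_count is a hypothetical threshold, not a standard value.
    """
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority_label": majority_label,
        # Accuracy a model gets by always predicting the majority class.
        "majority_share": majority_count / len(labels),
        "too_small": sorted(
            label for label, n in counts.items() if n < min_minority_count
        ),
    }

# Hypothetical label distribution, for illustration only.
labels = (
    ["Not yet in English"] * 14
    + ["Read now in English"] * 4
    + ["Coming soon"] * 2
)
report = class_balance_report(labels)
print(report)
```

In this made-up sample, always predicting the majority bucket already scores 70% accuracy, and two of the three buckets fall under the threshold. Either finding would be a reason to pause before training.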

Feature Engineering

Feature Engineering Interview Questions

What makes a feature useful?
What is data leakage?
How do you handle categorical variables?
When should you drop a feature?

WPH / Portfolio Example

Possible World Publishing Houses features include original language, country, publisher activity, award status, translator presence, publication lag, verification status, and translation path.

QA Angle

A feature can exist in the schema but still be useless for ML if it has no variation. For example, if almost every translation path is “Direct,” the model cannot learn pivot-translation behavior from that column.
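
One way to sketch the low-variation check is to measure how much of each column is taken up by its most common value. The column names, sample values, and the `max_share` cutoff below are hypothetical.

```python
from collections import Counter

def dominant_share(values):
    """Return the most common non-missing value and its share of rows."""
    present = [v for v in values if v is not None]
    if not present:
        return None, 1.0  # an all-missing column has no usable variation
    value, count = Counter(present).most_common(1)[0]
    return value, count / len(present)

def flag_low_variation(columns, max_share=0.95):
    """Flag columns where one value covers at least max_share of rows."""
    flagged = {}
    for name, values in columns.items():
        value, share = dominant_share(values)
        if share >= max_share:
            flagged[name] = (value, share)
    return flagged

# Hypothetical columns, for illustration only.
columns = {
    "translation_path": ["Direct"] * 19 + ["Pivot"],
    "original_language": ["Swedish"] * 8 + ["Norwegian"] * 6 + ["Danish"] * 6,
}
flagged = flag_low_variation(columns)
print(flagged)
```

Here only `translation_path` is flagged: with a single "Pivot" row, the column carries almost no signal about pivot translation, even though it is present in the schema.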

MLOps

MLOps Interview Questions

What is data drift?
What should be monitored after model release?
What belongs in a model card?
How would you decide whether to roll back a model?

WPH / Portfolio Example

A World Publishing Houses recommendation or rights-signal model would need monitoring because publishing activity changes over time: awards, rights deals, new translations, publisher acquisitions, and seasonal release patterns can all shift the data.

QA Angle

Production ML requires monitoring beyond uptime. I would track feature drift, prediction distribution, source freshness, error patterns, and confidence changes.
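
Feature drift can be tracked in several ways. The sketch below uses the Population Stability Index (PSI) as one option, comparing a reference sample with a newer one. The bin count, the `eps` floor, and the sample data are assumptions, and the 0.25 alert level mentioned in the comment is a commonly quoted rule of thumb rather than a fixed standard.

```python
import math

def psi(reference, current, bins=5, eps=1e-6):
    """Population Stability Index using equal-width bins from the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant reference feature: everything lands in one bin

    def bin_shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) // width)
            idx = min(max(idx, 0), bins - 1)  # clip values outside the range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical feature values, for illustration only.
reference = list(range(100))
shifted = [x + 40 for x in reference]

print(f"PSI, same data:    {psi(reference, reference):.3f}")
print(f"PSI, shifted data: {psi(reference, shifted):.3f}")  # compare with ~0.25
```

The same function can be applied to the model's prediction distribution, which covers the second item in the monitoring list without needing ground-truth labels.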

LLMs and Embeddings

LLMs and Embeddings Interview Questions

What is an embedding?
How is semantic search different from keyword search?
When can embeddings fail?
How would you evaluate retrieval quality?

WPH / Portfolio Example

Embeddings could help World Publishing Houses find similar books, cluster publishers, normalize messy metadata, or match translation records from different sources.

QA Angle

Embedding similarity is not truth. I would test false matches, false misses, multilingual behavior, publisher-name ambiguity, and whether retrieval results are explainable to users.
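
The false-match and false-miss tests can be sketched with cosine similarity and recall@k. The three-dimensional vectors below are hand-made stand-ins rather than output from a real embedding model, and the record IDs are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k):
    """Return the k record IDs most similar to the query vector."""
    ranked = sorted(
        index, key=lambda rid: cosine(query_vec, index[rid]), reverse=True
    )
    return ranked[:k]

def recall_at_k(retrieved, relevant):
    """Share of the truly relevant records that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Toy vectors, for illustration only.
index = {
    "pub_a_en": [0.9, 0.1, 0.0],     # publisher A, English-language name
    "pub_a_sv": [0.5, 0.5, 0.2],     # publisher A, Swedish-language name
    "pub_b":    [0.85, 0.2, 0.05],   # different publisher, similar name
    "pub_c":    [0.0, 0.2, 0.9],     # unrelated publisher
}
query = [1.0, 0.0, 0.0]
relevant = {"pub_a_en", "pub_a_sv"}  # manually verified matches

retrieved = top_k(query, index, k=2)
print(retrieved, recall_at_k(retrieved, relevant))
```

In this toy index the top two results contain one false match (`pub_b`) and miss the Swedish-language record for the same publisher, so recall@2 is 0.5 even though both returned similarities are high. The labeled `relevant` set is what turns similarity scores into a testable claim.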

ML Model Validation Case Study: Residential Property Values

Outcome: Compared Linear Regression, Random Forest, and Gradient Boosting models for residential property value prediction. Random Forest performed best among the tested models with RMSE of approximately $290,919, MAE of approximately $184,569, and R² of approximately 0.517. The project is framed as a model validation case study, showing how I evaluate performance, limitations, fairness risk, and production readiness.

The strongest insight was not that the model was production-ready, but that validation exposed clear limits: high prediction error, need for baseline comparison, possible location-proxy risk, and the importance of residual analysis before deployment.

Validation result: Random Forest (RMSE ~$290,919, MAE ~$184,569, R² ~0.517).

What the Model Results Actually Mean

The Random Forest model performed best among the tested models, but the result also shows why validation matters. An R² of approximately 0.517 means the model explains some meaningful signal, but it is not strong enough for high-stakes appraisal decisions without additional feature engineering, market-specific validation, fairness review, and monitoring.

Model	RMSE	MAE	R²	Validation takeaway
Random Forest	~$290,919	~$184,569	~0.517	Best tested model, but still limited for production use.
Gradient Boosting	~$294,663	Not reported	~0.504	Similar performance, slightly weaker than Random Forest.
Linear Regression	~$336,465	Not reported	~0.354	Baseline interpretable model; weaker fit for non-linear housing patterns.

A future improvement is to add a naive mean-prediction baseline so the model lift can be quantified against a simple non-ML benchmark.
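
Until that baseline is measured directly, a rough estimate can be backed out of the reported numbers. Since R² = 1 − MSE(model) / MSE(mean predictor), the mean-prediction RMSE is approximately RMSE(model) / sqrt(1 − R²). This sketch assumes RMSE and R² were computed on the same evaluation set and that R² uses the standard definition against the evaluation-set mean. The result is an implied figure, not a measured one.

```python
import math

def implied_mean_baseline_rmse(model_rmse, r2):
    """Back out the mean-predictor RMSE implied by a model's RMSE and R²."""
    return model_rmse / math.sqrt(1.0 - r2)

# Approximate values as reported in the table above.
reported = {
    "Random Forest": (290_919, 0.517),
    "Gradient Boosting": (294_663, 0.504),
    "Linear Regression": (336_465, 0.354),
}

implied = {
    name: implied_mean_baseline_rmse(rmse, r2)
    for name, (rmse, r2) in reported.items()
}
for name, value in implied.items():
    print(f"{name}: implied baseline RMSE ~ ${value:,.0f}")
```

All three rows imply a baseline RMSE of roughly $418,000 to $419,000, which is a useful consistency check on the table. Under these assumptions, the Random Forest reduces RMSE by about 30% relative to predicting the mean. A baseline fitted on the training set could differ slightly and should replace this estimate once it is available.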

ML QA Review: What I Would Test Before Production

Data quality: I would verify missing values, duplicate records, outliers, stale property records, and inconsistent square-footage fields. Housing data can look numeric and clean while still containing unrealistic values that distort model behavior.
Residual analysis: I would compare prediction errors by region, price band, property type, and property age. A global RMSE can hide the fact that the model performs well on average homes but poorly on luxury homes, rural properties, or older housing stock.
Fairness and proxy risk: Location variables such as latitude, longitude, region, and ZIP-like fields can act as proxies for socioeconomic patterns. Before any production use, I would check whether error rates are higher for specific neighborhoods or property groups.
Baseline comparison: I would require a naive mean-prediction baseline and a simple interpretable model before treating the Random Forest result as meaningful. If the more complex model does not clearly improve over simple baselines, it should not be promoted.
Monitoring: Housing markets change over time. I would monitor drift in feature distributions, prediction errors, and regional market behavior after release. A model trained on older market conditions may become unreliable when interest rates or demand patterns change.
Release readiness: Before release, I would require documented validation data, model limitations, fairness checks, monitoring thresholds, rollback criteria, and a clear statement that the model supports analysis only — not automated appraisal decisions.
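
The residual-analysis and fairness items in this list both come down to computing error metrics per segment rather than globally. A minimal sketch, with hypothetical records and a hypothetical `price_band` field:

```python
import math
from collections import defaultdict

def error_by_group(records, group_key):
    """Compute per-group MAE, RMSE, and mean residual (actual - predicted)."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row[group_key]].append(row["actual"] - row["predicted"])

    report = {}
    for group, residuals in grouped.items():
        n = len(residuals)
        report[group] = {
            "n": n,
            "mae": sum(abs(e) for e in residuals) / n,
            "rmse": math.sqrt(sum(e * e for e in residuals) / n),
            # Positive mean residual: the model under-predicts this group.
            "mean_residual": sum(residuals) / n,
        }
    return report

# Hypothetical predictions, for illustration only.
records = [
    {"price_band": "mid", "actual": 400_000, "predicted": 410_000},
    {"price_band": "mid", "actual": 450_000, "predicted": 440_000},
    {"price_band": "luxury", "actual": 2_000_000, "predicted": 1_400_000},
    {"price_band": "luxury", "actual": 1_800_000, "predicted": 1_500_000},
]

for band, stats in error_by_group(records, "price_band").items():
    print(band, stats)
```

In this made-up sample the mid band looks unbiased with small errors, while the luxury band is systematically under-predicted. The same function can be pointed at region, property type, or property age by changing `group_key`.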

Capstone framing

Primary metric: RMSE, supported by MAE and R².

Business use case: Valuation analysis practice for understanding model behavior, not automated appraisal decisions.

Ethics lens: Location features can proxy socioeconomic patterns; monitoring, fairness review, and retraining are important.

Applied Portfolio Cards

Applied ML + Data Quality Flashcards

Case-based cards connecting machine learning concepts to dataset readiness, model validation, and QA risk.

These flashcards are not generic study notes. They connect machine learning concepts to real product and QA problems: translation metadata, publisher entity resolution, model validation, fairness risk, and production-readiness checks.

The cards are organized in four levels: definition recall (Level 1), tradeoffs (Level 2), when to use what (Level 3), and failure cases and production risk (Level 4).

Featured decks

Fundamentals for Study

General ML cards are available for context, but the featured decks prioritize applied QA, data quality, and project-specific reasoning.

WPH Dataset Readiness · Level 4: Failure Case · Card 1 of 28

reader buckets → class balance → metric risk

Majority-class accuracy can hide weak model behavior.

A majority bucket can make a weak classifier look accurate.

WPH has 70 works across the Nordic dataset. You want to train a classifier to predict whether a book will reach English readers. What is the immediate validation risk?

Answer

The immediate risk is class imbalance and small sample size. A model can look accurate if most records fall into the same reader bucket, even when it learns little about the minority cases. Accuracy is the wrong metric by itself; precision, recall, and minority-class counts matter more.

QA Angle

This is a data quality and validation problem before it is a modeling problem. An ML QA review should check class balance before training and flag the dataset if minority classes are too small to support meaningful evaluation.

Where This Shows Up

reader_bucket.value_counts()
Review majority bucket size
Review minority bucket counts
Compare accuracy with precision / recall
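
The checklist above can be sketched without any ML library. `Counter` stands in for `reader_bucket.value_counts()`, and the 60/10 split of the 70 works is an assumed distribution for illustration only; the actual bucket counts are not stated here.

```python
from collections import Counter

def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Assumed split of 70 works, for illustration only.
y_true = ["not_english"] * 60 + ["english"] * 10

counts = Counter(y_true)                   # stdlib analogue of value_counts()
majority = counts.most_common(1)[0][0]
y_pred = [majority] * len(y_true)          # a "model" that learned nothing

metrics = binary_metrics(y_true, y_pred, positive="english")
print(counts)
print(metrics)
```

Under this assumed split, the do-nothing predictor reaches about 86% accuracy while precision and recall for the minority bucket are both zero. That gap is the validation risk the card describes.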

Tools and concepts

Technologies and methods I use or study in my QA-to-DS transition.

Python data stack

Pandas, NumPy, scikit-learn, notebooks, data cleaning, model comparison, and evaluation.

ML concepts

Regression, classification, clustering, embeddings, decision trees, gradient boosting, cross-validation, and error analysis.

Engineering context

SQL, API testing, logs, automation, CI/CD awareness, cloud/data pipeline concepts, and production QA thinking.