
15 Common Mistakes When Studying Machine Learning (And How to Fix Them) | LearnByTeaching.ai

Machine learning sits at the intersection of linear algebra, probability, calculus, and programming, which means mistakes can come from any of these foundations. Students often jump to running sklearn models without understanding the math beneath, then struggle to debug when things go wrong. Here are 15 common mistakes and how to fix them.

#1 · Critical · Conceptual

Training and evaluating on the same data

Students skip train/test splitting and evaluate model performance on training data, producing misleadingly optimistic metrics that don't reflect real-world performance.

A student trains a decision tree on the full dataset, gets 99% accuracy, and concludes the model is excellent -- but it achieves only 60% on new data because it memorized the training set.

How to fix it

Always split data into train, validation, and test sets before any model fitting. Use k-fold cross-validation for more reliable estimates. Never touch the test set until final evaluation.
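To make this concrete, here is a minimal sketch on synthetic data (the dataset and tree settings are illustrative, not from a real project). The unconstrained tree scores perfectly on its own training data, while held-out data and cross-validation give honest estimates:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data: 200 samples, 5 features, noisy binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out a test set BEFORE any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # optimistic
print("test accuracy: ", model.score(X_test, y_test))    # realistic

# k-fold cross-validation on the training data gives a more stable estimate
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X_train, y_train, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())
```

The gap between train and test accuracy is exactly the memorization effect described above.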

#2 · Critical · Conceptual

Data leakage from preprocessing before splitting

Applying transformations like normalization or feature selection on the entire dataset before splitting leaks information from the test set into training.

A student normalizes all features using the mean and standard deviation of the entire dataset, then splits into train/test. The test set statistics influenced the normalization, inflating performance.

How to fix it

Fit all preprocessing steps (scaling, imputation, encoding) on training data only, then apply those same transformations to the test set. Use sklearn Pipelines to automate this.
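A short sketch of the Pipeline approach (toy data, illustrative model choice). The scaler inside the pipeline is fit on the training split only, and its training-set statistics are reused on the test set:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = (X[:, 0] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit() runs the scaler on X_train only; predict/score on X_test reuses
# the training mean and std, so no test-set information leaks in.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))

# The fitted scaler statistics come from X_train, not the full dataset
scaler = pipe.named_steps["standardscaler"]
print("scaler means:", scaler.mean_)
```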

#3 · Major · Conceptual

Ignoring the bias-variance tradeoff

Students increase model complexity to reduce training error without realizing this increases variance and overfitting on unseen data.

A student keeps adding polynomial features until a regression model fits the training data perfectly, but validation error skyrockets due to overfitting.

How to fix it

Plot learning curves (training and validation error vs. model complexity). High training error signals underfitting (bias); a large gap between training and validation error signals overfitting (variance).
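The tradeoff can be demonstrated numerically with polynomial regression on noisy data (synthetic example; the degrees chosen are illustrative). Watch the train/validation errors diverge as complexity grows:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=2)

results = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),  # bias signal
        mean_squared_error(y_val, model.predict(X_val)),      # variance signal
    )
    train_err, val_err = results[degree]
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

Degree 1 underfits (high train error), while a high degree drives training error toward zero at the cost of validation error, which is the variance side of the tradeoff.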

#4 · Major · Conceptual

Using accuracy as the metric for imbalanced datasets

On datasets where one class dominates, accuracy is misleading because a model that predicts the majority class every time achieves high accuracy.

A fraud detection model predicts 'not fraud' for every transaction. On a dataset with 99% legitimate transactions, it reports 99% accuracy but catches zero fraud.

How to fix it

Use precision, recall, F1-score, or AUC-ROC for imbalanced problems. Choose the metric that aligns with the business cost of false positives vs. false negatives.
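The fraud example from above can be reproduced in a few lines. The always-negative model scores 99% accuracy while recall exposes that it catches nothing:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# 1,000 transactions, 1% fraud (label 1)
y_true = np.array([1] * 10 + [0] * 990)
# A useless model that always predicts "not fraud"
y_pred = np.zeros(1000, dtype=int)

print("accuracy: ", accuracy_score(y_true, y_pred))   # looks great
print("recall:   ", recall_score(y_true, y_pred, zero_division=0))
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, zero_division=0))
```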

#5 · Major · Study Habit

Treating ML as a black box without understanding the math

Students call model.fit() and model.predict() without understanding gradient descent, loss functions, or regularization, leaving them unable to debug poor performance.

A student's neural network loss plateaus, and they have no idea whether the issue is a bad learning rate, vanishing gradients, or insufficient data because they never studied the optimization process.

How to fix it

Implement linear regression, logistic regression, and a simple neural network from scratch in NumPy before using frameworks. Understanding the math lets you diagnose problems.
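As a starting point, here is one way linear regression by gradient descent can look in plain NumPy (a teaching sketch with synthetic, noiseless data; the learning rate and epoch count are illustrative):

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, epochs=500):
    """Gradient descent on mean squared error, written out explicitly."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        y_hat = X @ w + b
        error = y_hat - y
        grad_w = (2 / n) * X.T @ error   # dMSE/dw
        grad_b = (2 / n) * error.sum()   # dMSE/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.5]) + 0.5   # known true weights and bias

w, b = fit_linear_regression(X, y)
print("weights:", w, "bias:", b)  # should approach [3.0, -1.5] and 0.5
```

Once you have written the gradients yourself, diagnosing a plateauing loss (learning rate? gradient magnitude?) becomes a concrete, checkable question rather than a mystery.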

#6 · Major · Conceptual

Not scaling features before distance-based algorithms

Algorithms like k-NN, SVM, and k-means depend on distances between points. If features have wildly different scales, larger-magnitude features dominate.

A k-NN model uses 'age' (0-100) and 'income' (0-1,000,000). Income dominates the distance calculation, making age effectively irrelevant.

How to fix it

Standardize (zero mean, unit variance) or normalize (min-max scaling) features before running distance-based algorithms. Tree-based methods are generally scale-invariant.
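The age/income example is easy to verify directly (hypothetical numbers). Before scaling, the dollar-scale feature swamps the distance; after standardization, both features contribute comparably:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age in years, income in dollars
X = np.array([[25, 40_000],
              [30, 45_000],
              [60, 42_000]], dtype=float)

# Raw Euclidean distance between persons 0 and 2 is dominated by income:
# the 35-year age gap is invisible next to a $2,000 income gap
raw = np.linalg.norm(X[0] - X[2])

X_scaled = StandardScaler().fit_transform(X)
scaled = np.linalg.norm(X_scaled[0] - X_scaled[2])
print("raw distance:   ", raw)
print("scaled distance:", scaled)
```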

#7 · Major · Conceptual

Confusing correlation with causation in feature importance

Students interpret high feature importance as proof that a feature causes the outcome, when it only indicates statistical association.

A model predicting ice cream sales shows high importance for 'sunscreen purchases.' A student concludes sunscreen causes ice cream sales, ignoring that both are caused by hot weather.

How to fix it

Feature importance shows predictive power, not causal relationships. For causal claims, you need experimental design or causal inference techniques, not just ML models.

#8 · Critical · Conceptual

Hyperparameter tuning on the test set

Students use test set performance to select hyperparameters, effectively fitting to the test set and eliminating its value as an unbiased evaluation.

A student tries 50 different regularization values, picks the one with the best test accuracy, and reports that test accuracy as the model's expected performance.

How to fix it

Use a three-way split: train, validation, and test. Tune hyperparameters using only the validation set. Report final performance on the test set, which the model has never seen during development.
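A minimal version of the three-way workflow, with synthetic data and an illustrative grid of regularization strengths. The test set is scored exactly once, after tuning is finished:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# 60/20/20 split: carve off the test set first, then split the remainder
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=4)

# Tune regularization strength C using the VALIDATION set only
best_C, best_val = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val:
        best_C, best_val = C, val_acc

# The test set is touched exactly once, at the very end
final = LogisticRegression(C=best_C).fit(X_train, y_train)
print(f"best C={best_C}  val acc={best_val:.3f}  "
      f"test acc={final.score(X_test, y_test):.3f}")
```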

#9 · Major · Study Habit

Neglecting exploratory data analysis

Students jump straight to modeling without examining the data for missing values, outliers, class imbalances, or unexpected distributions.

A student trains a model on medical data with 30% missing values in a key feature. A default zero-fill imputation silently treats every missing measurement as zero, introducing systematic bias.

How to fix it

Before any modeling, run summary statistics, plot distributions, check for missing data, and examine correlations. Spend at least 30% of project time on EDA and data cleaning.
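A first-pass EDA can be three lines of pandas (the tiny medical-style dataset here is invented for illustration). The missing-value check alone would have caught the zero-imputation problem above:

```python
import numpy as np
import pandas as pd

# Hypothetical medical dataset with missing values in 'biomarker'
df = pd.DataFrame({
    "age":       [34, 51, 29, 62, 45, 70],
    "biomarker": [1.2, np.nan, 0.8, np.nan, 1.5, 2.1],
    "diagnosis": [0, 1, 0, 1, 0, 1],
})

print(df.describe())                 # summary statistics per column
print(df.isna().mean())              # fraction missing per column
print(df.corr(numeric_only=True))    # quick look at pairwise correlations
```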

#10 · Major · Conceptual

Overfitting by using too many features with too few samples

The curse of dimensionality means that as feature count grows relative to sample count, models find spurious patterns that don't generalize.

A student builds a model with 500 features from gene expression data on only 50 patients. The model memorizes the training data perfectly but fails on new patients.

How to fix it

Apply dimensionality reduction (PCA), feature selection, or regularization when features outnumber samples. As a rule of thumb, you need at least 10 samples per feature for stable models.
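Both remedies can be sketched on a 50-samples-by-500-features dataset like the gene-expression example (synthetic here; the component count and Lasso alpha are illustrative, not tuned):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
# 50 samples, 500 features: far more features than samples
X = rng.normal(size=(50, 500))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)  # only feature 0 matters

# Option 1: PCA compresses 500 columns into a handful of components
X_reduced = PCA(n_components=10).fit_transform(X)
print("reduced shape:", X_reduced.shape)

# Option 2: L1 regularization drives most coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
n_used = int(np.sum(lasso.coef_ != 0))
print(f"Lasso kept {n_used} of 500 features")
```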

#11 · Minor · Conceptual

Misinterpreting regularization

Students know L1 and L2 regularization reduce overfitting but don't understand the mechanism: penalizing large weights to encourage simpler models.

A student sets regularization strength to a very large value, finds training accuracy drops to near chance, and concludes regularization is harmful rather than recognizing it was set too high.

How to fix it

Understand regularization as a tradeoff: too little allows overfitting, too much causes underfitting. Tune the regularization parameter using cross-validation, not intuition.
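One way to tune by cross-validation rather than intuition is `RidgeCV`, which searches a grid of strengths internally (synthetic data; the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]          # only 3 of 20 features matter
y = X @ true_w + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

# Cross-validation picks the regularization strength from a log-spaced grid,
# spanning "too little" (1e-3) to "too much" (1e3)
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas).fit(X_train, y_train)
print("chosen alpha:", model.alpha_)
print("test R^2:", model.score(X_test, y_test))
```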

#12 · Minor · Study Habit

Failing to set random seeds for reproducibility

ML results depend on random initialization, data shuffling, and train/test splits. Without fixed seeds, results change with every run, making debugging impossible.

A student gets great results once but cannot reproduce them and doesn't know if the improvement came from their code change or random variation.

How to fix it

Set random seeds at the start of every script (numpy.random.seed, torch.manual_seed). Log the seed along with results. For research, report mean and standard deviation across multiple seeds.
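A simple pattern that makes this routine: funnel all randomness through a single seeded generator, and sweep several seeds when reporting (the toy "experiment" here is just a placeholder for a training run):

```python
import numpy as np

def noisy_experiment(seed):
    """Stand-in for a training run: all randomness flows from one seed."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=100)
    return data.mean()

# Same seed -> byte-identical result, every run
assert noisy_experiment(42) == noisy_experiment(42)

# For reporting: mean and standard deviation across several seeds
results = [noisy_experiment(s) for s in range(5)]
print(f"mean={np.mean(results):.4f}  std={np.std(results):.4f}")
```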

#13 · Minor · Conceptual

Ignoring the no-free-lunch theorem

Students believe one algorithm (often neural networks) is universally best and apply it to every problem without considering simpler alternatives.

A student uses a deep neural network for a tabular dataset with 500 rows, when gradient-boosted trees or even logistic regression would perform better and train in seconds.

How to fix it

Always try simple baselines first (logistic regression, random forest). More complex models are only justified when simpler ones demonstrably underperform on your specific data.
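Baselines cost almost nothing to run. Here a majority-class dummy and plain logistic regression bracket the problem before any complex model is considered (synthetic, linearly separable data for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Baseline 1: always predict the most frequent class
dummy = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
# Baseline 2: logistic regression
logreg = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(f"dummy baseline:      {dummy:.3f}")
print(f"logistic regression: {logreg:.3f}")
# Only reach for deeper models if simple ones clearly fall short
```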

#14 · Minor · Conceptual

Not understanding gradient descent convergence

Students run gradient descent with a fixed learning rate without understanding how the rate affects convergence, oscillation, or divergence.

A student sets the learning rate to 1.0 for a neural network. The loss immediately explodes to infinity because updates overshoot the minimum.

How to fix it

Start with a small learning rate (1e-3) and use learning rate schedules or adaptive optimizers (Adam). Plot the loss curve to verify smooth convergence.
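The three regimes (too slow, just right, divergent) show up even on the simplest possible objective, f(x) = x², whose gradient is 2x. Here the update x ← x − lr·2x contracts toward 0 when lr < 1 and overshoots with growing amplitude when lr > 1:

```python
def gradient_descent(lr, steps=50):
    """Minimize f(x) = x^2 from x=10 with a fixed learning rate."""
    x = 10.0
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x^2 is 2x
    return x

print("lr=0.001:", gradient_descent(0.001))  # crawls: barely moved after 50 steps
print("lr=0.1:  ", gradient_descent(0.1))    # converges quickly toward 0
print("lr=1.1:  ", gradient_descent(1.1))    # diverges: each step overshoots
```

An exploding neural-network loss is this same picture in many dimensions, which is why halving the learning rate is usually the first thing to try.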

#15 · Minor · Time Management

Poor time allocation in ML projects

Students spend most time on model architecture and tuning but neglect data collection, cleaning, and feature engineering, which typically have the largest impact on performance.

A student spends two weeks tuning a neural network architecture but only one hour on data preprocessing. Cleaning the data and engineering one good feature would have improved results more.

How to fix it

Follow the 80/20 rule: spend 80% of time on data quality and feature engineering, 20% on model selection and tuning. Better data almost always beats a better model.

Quick Self-Check

  1. Can you explain the bias-variance tradeoff using a concrete example?
  2. Do you know why you must fit preprocessing only on training data?
  3. Can you implement gradient descent for linear regression from scratch?
  4. Do you understand when to use precision vs. recall vs. F1 vs. AUC?
  5. Can you explain what regularization does to model weights and why that reduces overfitting?

Pro Tips

  • ✓ Implement core algorithms (linear regression, logistic regression, k-means, a basic neural net) from scratch in NumPy before using sklearn or PyTorch. This builds irreplaceable intuition.
  • ✓ Always establish a simple baseline model first. If logistic regression gets 90% accuracy, a complex model that gets 91% may not be worth the complexity.
  • ✓ Use sklearn Pipelines to prevent data leakage. They ensure preprocessing is fit only on training folds during cross-validation.
  • ✓ Read the scikit-learn documentation beyond the API reference -- the user guide sections explain the mathematical foundations clearly.
  • ✓ When debugging a model, systematically check: data quality first, then features, then model choice, then hyperparameters. Most problems are in the data.

More Machine Learning Resources

Avoid machine learning mistakes by teaching it

Upload your notes and explain machine learning concepts to AI students. They'll catch the gaps you didn't know you had.

Try LearnByTeaching.ai — It's Free