Most ML engineers can build models, but interviews hinge on how you reason about randomness and avoid statistical traps. Rather than derivations, interviewers want clear explanations of conditional probability, confidence intervals, and the right metrics when data are imbalanced. This guide covers the essentials and includes practice questions with answers so you can test yourself as you go.
Probability Basics Interviewers Actually Test
Conditional Probability
Conditional probability answers the question: "What's the probability of event A given that event B already happened?" Formally,
P(A | B) = P(A and B) / P(B).
In interviews you'll see scenarios like: a moderation team labels posts as spam or not spam, or a churn alert fires when a predicted probability exceeds a threshold. To compute P(A | B), always find the joint probability P(A and B) and divide by the known probability P(B). When we update our belief about a hypothesis given new evidence (e.g. probability it's truly defective given that it was flagged), that conditional probability is called the posterior. In Bayes-style problems you'll also hear prior (probability before seeing evidence), likelihood (probability of the evidence given the hypothesis), and posterior (probability of the hypothesis given the evidence).
Interview Tip: Draw a simple table or probability tree. Visualising the joint and conditional parts prevents algebra mistakes.
Practice Question: A quality control system flags about 5% of manufactured parts as defective. Only 0.5% of parts are truly defective. The system catches 95% of defective parts and has a 5% false positive rate. If a part is flagged, what is the probability it's truly defective?
Step-by-step solution
Use a concrete count (e.g. 100,000 parts) so you don't get lost in decimals.
- Truly defective: 0.5% of 100,000 = 500. Not defective: 99,500.
- Of the 500 defective, the system catches 95%: 500 × 0.95 = 475 true positives (flagged and truly defective).
- Of the 99,500 not defective, 5% are false positives: 99,500 × 0.05 = 4,975 flagged but not defective.
- Total flagged: 475 + 4,975 = 5,450.
- Among flagged parts, the fraction that are truly defective: 475 / 5,450 ≈ 0.087.
So the probability a flagged part is truly defective is about 8.7%. That is the posterior: P(truly defective | flagged).
Answer: Posterior probability ≈ 8.7%.
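The arithmetic above can be packaged as a one-line Bayes update. This is a minimal sketch; the function name `posterior` and its parameter names are my own, not from the original.

```python
def posterior(prior, tpr, fpr):
    """P(hypothesis | evidence) = TPR * prior / (TPR * prior + FPR * (1 - prior))."""
    evidence = tpr * prior + fpr * (1 - prior)  # total probability of being flagged
    return tpr * prior / evidence

# The defective-part question: prior 0.5%, 95% catch rate, 5% false positive rate.
p = posterior(prior=0.005, tpr=0.95, fpr=0.05)
print(round(p, 3))  # ≈ 0.087
```

Swapping in different priors makes the base-rate effect easy to explore during an interview discussion.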
Independence
Two events are independent if knowing one doesn't change the probability of the other: P(A and B) = P(A) × P(B). Tossing a fair coin repeatedly yields independent flips. Drawing cards without replacement breaks independence because each draw alters the deck.
Common Mistake: Assuming independence when sampling without replacement. If you draw balls from an urn and don't put them back, adjust your calculations because probabilities change on each draw.
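A quick sketch of how sampling without replacement shifts probabilities, using the classic two-aces example (my own illustration, not from the original text). Exact fractions make the difference obvious:

```python
from fractions import Fraction

# P(two aces) when drawing from a standard 52-card deck.
with_replacement = Fraction(4, 52) * Fraction(4, 52)      # draws are independent
without_replacement = Fraction(4, 52) * Fraction(3, 51)   # second draw sees a smaller deck

print(with_replacement)     # 1/169
print(without_replacement)  # 1/221
```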
Bayes Rule and Base Rate Intuition
Bayes rule flips conditional probabilities: P(B|A) is the posterior (probability of hypothesis B given evidence A); P(B) is the prior (base rate before seeing A); P(A|B) is the likelihood (probability of evidence A if B is true).
P(B | A) = P(A | B) × P(B) / P(A) (posterior = likelihood × prior / P(evidence)).
When the base rate of an event is tiny (fraud detection, rare disease screening), high accuracy can still produce more false positives than true positives, so always multiply by the prior probability before interpreting model predictions.
To build intuition, picture a two-level probability tree for a fraud alert example. The first split shows base rates (fraud vs. not fraud); the second split applies model behavior (true positives, false negatives, false positives, true negatives). The key idea is that a tiny base rate can make false positives dominate alerts even when model recall is high.
Practice Question: In a payment system, fraud affects 0.1% of transactions. Your classifier catches 99% of fraud but has a 1% false positive rate. Out of 10,000 transactions, what fraction of alerts correspond to real fraud?
Step-by-step solution
Use the 10,000-trick. Of 10,000 transactions: 0.1% are fraud → 10 fraudulent, 9,990 not. True positives: 99% of 10 = 9.9. False positives: 1% of 9,990 = 99.9. Total alerts: 9.9 + 99.9 = 109.8. Fraction of alerts that are real fraud: 9.9 / 109.8 ≈ 0.09 (about 9%). So even with 99% recall and 1% FP rate, the low base rate means most alerts are false positives.
Answer: About 9% of alerts correspond to real fraud.
Common Mistake: Many candidates compute P(alert | fraud) but forget to multiply by the prior P(fraud). Always weight by the base rate when applying Bayes.
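The fraud-alert arithmetic generalises to a small helper that returns the precision of alerts given a base rate, recall, and false positive rate. A sketch under my own naming assumptions:

```python
def alert_precision(base_rate, recall, fpr):
    """Fraction of alerts that are real positives (per unit of traffic)."""
    tp = base_rate * recall          # true positives
    fp = (1 - base_rate) * fpr       # false positives
    return tp / (tp + fp)

# The payment-fraud question: 0.1% base rate, 99% recall, 1% FP rate.
print(round(alert_precision(0.001, 0.99, 0.01), 2))  # ≈ 0.09
```

Note how driving recall from 0.99 to 1.0 barely helps; only reducing the false positive rate (or raising the base rate via pre-filtering) changes the answer materially.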
Distributions That Show Up in Interviews
Interviewers don't expect derivations; they want you to recognise which model fits a situation and make quick calculations. For each distribution below we include a brief description and a practice question.
Bernoulli, Binomial and Geometric
Bernoulli is a single yes-no trial with success probability p. Binomial counts the number of successes in n independent Bernoulli trials; use it for the number of clicks or positive labels out of a fixed number of observations. Geometric counts how many trials until the first success and has mean 1/p. A quick shortcut: the chance of no successes in n trials is (1−p)^n, so the chance of at least one success is 1 − (1−p)^n.
Practice Question: A fraud detector flags each of 10 transactions independently with probability 0.1. What is the probability of exactly three flags? What is the probability of at least one flag?
Step-by-step solution
X = number of flags is Binomial(n=10, p=0.1). Exactly 3: P(X=3) = C(10,3) × 0.1^3 × 0.9^7. C(10,3) = 120; 0.1^3 = 0.001; 0.9^7 ≈ 0.478. So 120 × 0.001 × 0.478 ≈ 0.057. At least one: P(X≥1) = 1 − P(X=0) = 1 − 0.9^10 ≈ 1 − 0.349 = 0.651.
Answer: P(X=3) ≈ 0.057; P(X≥1) ≈ 0.651.
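The binomial PMF is short enough to write from scratch with `math.comb`; a minimal sketch checking both parts of the question:

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

exactly_three = binom_pmf(10, 3, 0.1)        # ≈ 0.057
at_least_one = 1 - binom_pmf(10, 0, 0.1)     # ≈ 0.651
print(round(exactly_three, 3), round(at_least_one, 3))
```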
Poisson
Poisson models counts of events in a fixed interval when events occur independently at a constant average rate λ. The mean and variance are both λ. A handy shortcut: P(X=0) = e^−λ, so P(X≥1) = 1 − e^−λ.
Practice Question: The number of fraud incidents detected per day follows a Poisson distribution with λ=3. What's the probability of seeing at least two frauds in a day?
Step-by-step solution
P(X≥2) = 1 − P(X=0) − P(X=1). For Poisson(λ), P(X=k) = λ^k e^−λ / k!. So P(0) = e^−3 and P(1) = 3e^−3. Thus P(X≥2) = 1 − e^−3 − 3e^−3 = 1 − 4e^−3. Using e^−3 ≈ 0.0498: 4 × 0.0498 ≈ 0.199, so 1 − 0.199 = 0.801.
Answer: ≈ 0.801.
Practice Question: A call center receives on average 4 calls per hour. What is the probability that no calls arrive in a 30-minute period?
Step-by-step solution
Calls in a fixed interval follow Poisson(λ). Rate is 4 per hour, so over 30 minutes the expected count is λ = 4 × (30/60) = 2. P(X=0) = e^−λ = e^−2 ≈ 0.135 (about 13.5%).
Answer: P(X=0) = e^−2 ≈ 0.135 (about 13.5%).
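Both Poisson questions above reduce to evaluating the PMF, which needs only `math.exp` and `math.factorial`. A minimal sketch:

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

# At least two frauds per day when lambda = 3.
at_least_two = 1 - poisson_pmf(3, 0) - poisson_pmf(3, 1)   # ≈ 0.801

# No calls in 30 minutes at 4 calls/hour: rescale the rate to the interval.
no_calls_half_hour = poisson_pmf(4 * 0.5, 0)               # ≈ 0.135
print(round(at_least_two, 3), round(no_calls_half_hour, 3))
```

Rescaling λ to the interval length (the `4 * 0.5` step) is the part interviewers most often watch for.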
Normal and the Central Limit Theorem
The Normal distribution is defined by its mean and variance and often arises when you average many independent variables. The Central Limit Theorem says that as sample size grows, the distribution of the sample mean approaches normal no matter what the original data look like.
Practice Question: Daily impressions (in thousands) for an ad campaign have mean 100 and standard deviation 20. Using the CLT, approximate the probability that the average over 30 days exceeds 105.
Step-by-step solution
By the CLT, the sample mean (n=30) is approximately normal with mean μ = 100 and standard error σ/√n = 20/√30 ≈ 3.65. Standardise: z = (105 − 100) / 3.65 ≈ 1.37. From the normal table, P(Z > 1.37) ≈ 0.085, so about 8.5% of the time the 30-day average exceeds 105.
Answer: The z-score is (105−100)/(20/√30) ≈ 1.37, giving P(Z > 1.37) ≈ 0.085 (about 8.5%).
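The z-score calculation can be checked without normal tables using the error function; `normal_tail` is my own helper name for the standard normal upper tail.

```python
from math import sqrt, erf

def normal_tail(z):
    """P(Z > z) for a standard normal, via the error function."""
    return 0.5 * (1 - erf(z / sqrt(2)))

se = 20 / sqrt(30)            # standard error of the 30-day mean
z = (105 - 100) / se          # ≈ 1.37
print(round(normal_tail(z), 3))  # ≈ 0.085
```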
Exponential
The Exponential distribution models waiting times between events in a Poisson process. If events occur at rate λ, the time between events follows an exponential distribution with mean 1/λ. It has the memoryless property: P(T > s + t | T > s) = P(T > t), meaning the distribution forgets how long you've already waited.
Practice Question: Latency between API requests follows an exponential distribution with mean 200 ms. What's the probability that the next request arrives after more than 500 ms?
Step-by-step solution
Exponential with mean 200 ms has rate λ = 1/(0.2 s) = 5 per second. For the exponential, P(T > t) = e^−λt. We want P(T > 0.5 s) = e^−(5 × 0.5) = e^−2.5 ≈ 0.082. So about 8% of inter-request intervals exceed 500 ms.
Answer: Rate λ = 1/0.2 = 5 requests per second. P(T > 0.5) = e^−2.5 ≈ 0.082; about 8% of intervals exceed 0.5 s.
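The exponential survival function P(T > t) = e^−t/mean makes this a one-liner; a sketch with my own helper name:

```python
from math import exp

def exponential_tail(t, mean):
    """P(T > t) for an exponential distribution with the given mean (same units as t)."""
    return exp(-t / mean)   # equivalently exp(-rate * t) with rate = 1 / mean

# Mean inter-request time 0.2 s; probability the gap exceeds 0.5 s.
print(round(exponential_tail(0.5, 0.2), 3))  # ≈ 0.082
```

Keeping t and the mean in the same units (seconds here) avoids the most common slip in this question.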
Expectation, Variance and Covariance
Expectation is the long-run average. It is linear: E[aX + bY] = aE[X] + bE[Y], even when variables are dependent. Variance measures spread; scaling by a multiplies variance by a^2, and variances of independent variables add. Covariance describes how two variables change together; dividing covariance by the product of standard deviations gives correlation, a dimensionless number between −1 and 1.
Practice Question: A recommendation engine shows 5 items to a user. Each item has an independent 10% chance of being clicked. What is the expected number of clicks? What is the variance?
Step-by-step solution
Each item is Bernoulli(p=0.1). Total clicks X is Binomial(n=5, p=0.1). Expectation: E[X] = np = 5 × 0.1 = 0.5. Variance: For independent trials, Var(X) = np(1−p) = 5 × 0.1 × 0.9 = 0.45. (You can also derive from linearity: E[X] = Σ E[Xi] = 5×0.1; Var(X) = Σ Var(Xi) because they're independent.)
Answer: Expected clicks = 0.5; variance = 0.45.
Practice Question: Your team tracks churn probabilities for users each month. User i churns with probability pi. Assuming independence, derive the expected number of churns and the variance across n users.
Step-by-step solution
Let Xi = 1 if user i churns (Bernoulli(pi)), so X = Σ Xi is total churns. Expectation: By linearity, E[X] = Σ E[Xi] = Σ pi. Variance: For independent random variables, variances add: Var(X) = Σ Var(Xi). For Bernoulli(pi), Var(Xi) = pi(1−pi). So Var(X) = Σ pi(1−pi).
Answer: E[X] = Σ pi and Var(X) = Σ pi(1−pi).
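Both questions in this section follow from the same two sums, so they fit one small helper (my own sketch; the name `churn_moments` is not from the original):

```python
def churn_moments(probs):
    """Mean and variance of the total count for independent Bernoulli(p_i) trials."""
    mean = sum(probs)                          # linearity of expectation
    var = sum(p * (1 - p) for p in probs)      # independent variances add
    return mean, var

# The 5-item recommendation example is the special case p_i = 0.1 for all i.
mean, var = churn_moments([0.1] * 5)
print(round(mean, 2), round(var, 2))  # ≈ 0.5, 0.45
```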
Estimation and Confidence Intervals
When you estimate a metric (such as a mean latency or conversion rate), you should report a confidence interval to quantify uncertainty. A 95% interval means that, over many repeated experiments, about 95% of the intervals will cover the true value; it does not assign a 95% probability to a specific interval.
Point estimates alone can be misleading. A tiny lift may vanish once you account for uncertainty. Visualise uncertainty with overlapping intervals:
Model A 0.842 |----[=====]------|
Model B 0.846 |------[=====]----|
Practice Question: An A/B test measures click-through rate (CTR). The control has CTR 5% over 10,000 impressions. The variant has CTR 5.2% over 10,000 impressions. Compute an approximate 95% CI for the difference in proportions and interpret.
Step-by-step solution
p1 = 0.05, p2 = 0.052, n = 10,000 each. Standard error of each proportion: SE ≈ √(p(1−p)/n). For control: √(0.05 × 0.95 / 10,000) ≈ 0.00218; for variant: √(0.052 × 0.948 / 10,000) ≈ 0.00222. SE of the difference: √(SE1^2 + SE2^2) ≈ 0.0031 (a pooled SE gives a nearly identical result). Difference: 0.052 − 0.05 = 0.002. 95% CI: 0.002 ± 1.96 × 0.0031 ≈ 0.002 ± 0.006, so roughly (−0.004, 0.008). The interval contains 0, so we cannot reject "no difference"; the lift is not statistically significant.
Answer: The 95% CI for the difference contains 0, so the lift is not significant.
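The same unpooled normal-approximation interval can be computed directly; `diff_ci` is my own helper name for this sketch:

```python
from math import sqrt

def diff_ci(p1, n1, p2, n2, z=1.96):
    """Approximate 95% CI for p2 - p1 using the unpooled normal approximation."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p2 - p1
    return d - z * se, d + z * se

lo, hi = diff_ci(0.05, 10_000, 0.052, 10_000)
print(round(lo, 4), round(hi, 4))  # interval straddles 0
```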
Common Mistake: Interpreting a 95% CI as "there's a 95% chance the true value lies in the interval." The interval is random; the true value is fixed. Over many repeated experiments, 95% of the intervals will cover the truth.
Hypothesis Testing and p-Values
In hypothesis testing you start with a null hypothesis (no effect) and an alternative (an effect exists). You compute a statistic and its p-value: the probability of obtaining data as extreme as yours if the null is true. A small p-value suggests your result is unlikely under the null.
Testing comes with two error types: Type I (false positive) controlled by the significance level α, and Type II (false negative) related to the power 1−β. Larger samples boost power, while the chosen α caps the Type I error rate regardless of sample size.
Multiple Comparisons and Peeking
Running multiple tests inflates false positives: five comparisons at α=0.05 give a family-wise error of 1 − 0.95^5 ≈ 22.6%. Corrections like Bonferroni (divide α by the number of tests) control this. Peeking at interim results also raises Type I error; sequential testing methods adjust for peeks.
A simple A/B test decision process follows a minimum valid flow: collect data, wait until the pre-registered sample size is reached, compute the test statistic, then compare the p-value to α. The point is to avoid early peeking and to make the ship/no-ship decision only after a pre-defined stopping rule.
Practice Question: You run five variants against a control in an A/B test. Each pairwise comparison uses α = 0.05. What is the approximate family-wise error rate? If you apply a Bonferroni correction, what should be the new significance threshold per comparison?
Step-by-step solution
Family-wise error rate (FWER): With 5 comparisons, each at α=0.05, the probability of at least one false positive = 1 − P(no false positives). If we assume the null is true for all, P(no FP) = 0.95^5 ≈ 0.774. So FWER ≈ 1 − 0.774 = 0.226 (about 22.6%). Bonferroni: To control FWER at level α, use α/n per test. So the new threshold = 0.05/5 = 0.01. Each comparison is significant only if p < 0.01.
Answer: Family-wise error ≈ 1 − 0.95^5 ≈ 0.226. Bonferroni threshold = 0.05/5 = 0.01.
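Both quantities are one-liners, which makes them easy to sanity-check for any number of comparisons; a minimal sketch (helper names are my own):

```python
def fwer(alpha, m):
    """Family-wise error rate for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni(alpha, m):
    """Per-test threshold that caps the FWER at alpha."""
    return alpha / m

print(round(fwer(0.05, 5), 3), round(bonferroni(0.05, 5), 3))  # ≈ 0.226, 0.01
```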
Exam Pitfall: A p-value is not the probability that the null hypothesis is true. It assumes the null and quantifies how surprising your data are under that assumption. Always define your null and alternative hypotheses, mention the significance level, and consider sample size and power when explaining a test.
Metrics and Evaluation for ML
In classification tasks, accuracy often hides the real story, especially when positives are rare. A few other metrics matter more:
Precision, Recall and F1
Precision measures how many predicted positives are correct (TP / (TP + FP)); recall measures how many actual positives are caught (TP / (TP + FN)); F1 is their harmonic mean. In imbalanced problems a model that never predicts the minority class achieves high accuracy but zero recall, so precision and recall matter far more.
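The three definitions translate directly into code; a sketch with hypothetical confusion-matrix counts of my own choosing:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # of predicted positives, how many are right
    recall = tp / (tp + fn)                     # of actual positives, how many are caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for illustration: 40 TP, 10 FP, 20 FN.
p, r, f = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 2), round(r, 2), round(f, 2))
```

Because F1 is a harmonic mean, it is dragged toward the weaker of precision and recall, which is exactly why it is preferred over a plain average for imbalanced problems.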
ROC vs. Precision–Recall (PR)
A ROC curve plots recall (true positive rate) versus false positive rate and measures overall ranking performance; ROC-AUC is fairly insensitive to class imbalance. A PR curve plots precision versus recall and is more informative when positives are rare. Use PR-AUC for rare events like fraud or churn.
| Metric | Best For | Base-Rate Sensitive |
|---|---|---|
| ROC-AUC | Overall ranking | No |
| PR-AUC | Rare event detection | Yes |
Calibration
Calibration means that predicted probabilities align with actual frequencies: for example, among cases predicted at 0.7, about 70% should be positive. Calibration matters when probabilities drive decisions, such as risk scoring or pricing. A reliability diagram plots predicted probabilities against observed frequencies; a calibrated model lies on the diagonal.
Practice Question: In the 0.8–1.0 probability bin of a credit scoring model, 200 loans were assigned probabilities averaging 0.9. Of these, 140 actually defaulted. Is the model calibrated in this bin? How might you adjust the probabilities?
Step-by-step solution
Calibration: In this bin the model predicts ~0.9 (90% default probability). The observed default rate is 140/200 = 0.70 (70%). Calibration means predicted probability matches observed frequency. Here 0.9 ≠ 0.70, so the model is not calibrated in this bin; it overestimates risk. Adjustment: A simple fix is to rescale predictions in this bin by 0.70/0.90 ≈ 0.78, so that after adjustment the effective prediction is about 0.70, matching the observed rate. In practice, recalibration methods such as Platt scaling or isotonic regression perform this kind of correction across all bins at once.
Answer: Observed default rate = 140/200 = 0.70; model predicts 0.9, so it overestimates. Scaling by 0.70/0.90 ≈ 0.78 would bring the bin closer to calibration.
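The per-bin check reduces to comparing an average prediction against an observed rate; a minimal sketch (the helper name and crude rescaling factor are my own illustration, not a substitute for proper recalibration):

```python
def bin_calibration(predicted_mean, positives, total):
    """Observed rate in a probability bin, plus a crude per-bin rescaling factor."""
    observed = positives / total
    scale = observed / predicted_mean   # multiply bin predictions by this to match reality
    return observed, scale

# The credit-scoring bin: 200 loans predicted ~0.9, 140 defaulted.
obs, scale = bin_calibration(predicted_mean=0.9, positives=140, total=200)
print(round(obs, 2), round(scale, 2))  # ≈ 0.7, 0.78
```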
Exam Pitfall: Many candidates rely solely on ROC-AUC to judge model quality. A model can have high ROC-AUC but poor precision at the chosen threshold or terrible calibration. Always check multiple metrics and connect them to the business cost of false positives versus false negatives.
10-Day Interview Prep Plan
This plan structures your study into manageable daily topics. Allocate at least an hour per day to reading, solving problems, and writing down your reasoning.
- Day 1 – Random variables and conditional probability; draw tables and trees.
- Day 2 – Independence and Bayes; test independence and apply Bayes with base rates.
- Day 3 – Bernoulli, Binomial and Geometric; compute counts and waiting times.
- Day 4 – Poisson and Exponential; count rare events and model waiting times.
- Day 5 – Normal distribution and CLT; use z-scores to approximate sample means.
- Day 6 – Expectation and Variance; practise linearity and variance addition.
- Day 7 – Confidence intervals and bootstrap; build intervals for proportions and means.
- Day 8 – Hypothesis testing and multiple comparisons; design an A/B test and set α.
- Day 9 – Metrics and calibration; compute precision, recall, F1 and plot reliability.
- Day 10 – Full mock interview covering all topics; review mistakes and refine.
Last-Minute Interview Cheat Sheet
- Conditional Probability: Use tables or trees to compute P(A | B).
- Independence: Events multiply; sampling without replacement breaks it.
- Bayes: Posterior ∝ likelihood × prior; remember base rates.
- Bernoulli/Binomial/Geometric: Single trial, count successes, count trials till first success.
- Poisson/Exponential: Counts and waiting times; exponential is memoryless.
- Normal and CLT: Sample means tend toward normal; use z-scores.
- Expectation and Variance: Linearity always; variances add if independent; correlation rescales covariance.
- Confidence Intervals: Repeat experiments many times; intervals cover the truth about 95% of the time.
- Hypothesis Testing: Null vs. alternative; small p-value means data unlikely under null; correct for multiple tests.
- Metrics: Prioritise precision, recall and F1 in imbalanced problems; PR-AUC reveals rare class performance; check calibration.
- Prep: Follow the 10-day plan and practise on real-world scenarios like CTR, fraud, churn and latency.
With these concepts and practice questions, you're ready to tackle probability and statistics questions in ML interviews. Explain your reasoning clearly, connect abstract ideas to product scenarios, and highlight trade-offs instead of reciting formulas. That clarity will set you apart from other candidates.