Top Statistics Every Data Analyst Should Know

🔍 1. Hypothesis Testing & Statistical Significance

p-value
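- Definition: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
- Interpretation: By convention, a p-value < 0.05 is treated as statistically significant; the threshold (α) should be chosen before looking at the data.
- Example: p = 0.01 → if the null hypothesis were true, results this extreme would occur about 1% of the time → reject the null hypothesis at α = 0.05.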
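
One way to see what a p-value measures is a permutation test, sketched below. It is an illustration rather than a prescribed method: the campaign figures are hypothetical, and `permutation_p_value` is a helper name invented for this example. Group labels are shuffled many times, and the p-value is the share of shuffles whose difference in means is at least as large as the observed one.

```python
import random

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel groups at random, as the null implies
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical daily conversions for two campaigns.
campaign_a = [12, 15, 14, 10, 13, 16, 14, 12]
campaign_b = [18, 17, 21, 16, 19, 20, 17, 18]
p_value = permutation_p_value(campaign_a, campaign_b)
print(f"p = {p_value:.4f}")
```

A small result here says only that such a gap would be rare if the campaigns were interchangeable; it does not give the probability that the null hypothesis is true.
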
t-test
- Use: Compare means between two groups (independent or paired).
- Example: Comparing two marketing campaigns’ effectiveness.
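
As a rough sketch of the independent-groups case, the snippet below computes Welch's t statistic (the unequal-variance form of the two-sample t-test) and its approximate degrees of freedom. The data are hypothetical and `welch_t` is a name chosen for this example. Turning the statistic into a p-value needs the t distribution, which the standard library lacks; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full test.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    se_a2 = variance(a) / len(a)  # squared standard error of each mean
    se_b2 = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a2 + se_b2)
    df = (se_a2 + se_b2) ** 2 / (
        se_a2 ** 2 / (len(a) - 1) + se_b2 ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical daily conversions for two campaigns.
campaign_a = [12, 15, 14, 10, 13, 16, 14, 12]
campaign_b = [18, 17, 21, 16, 19, 20, 17, 18]
t_stat, df = welch_t(campaign_a, campaign_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```
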
z-test
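- Use: Like the t-test, but appropriate when the population variance is known (or the sample is large enough that the estimated variance can be treated as known).
- Example: Testing whether average customer spending differs from a target when its standard deviation is known from historical data.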
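
A minimal one-sample z-test might look like the following. The spending figures (target of 50, known standard deviation of 12, sample of 64 customers averaging 53) are assumptions made up for the example, as is the helper name `z_test`.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test with a known population sd."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical: 64 customers, mean spend 53, target 50, known sd 12.
z, p = z_test(sample_mean=53.0, mu0=50.0, sigma=12.0, n=64)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```
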
Type I & Type II Errors
- Type I: False positive – reject true null hypothesis.
- Type II: False negative – fail to reject false null hypothesis.
- Example: Type I = wrongly think a campaign works; Type II = miss a good campaign.
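
Both error rates can be estimated by simulation. The sketch below assumes a z-test of "mean = 0" with a known standard deviation of 1, samples of 30, and a hypothetical true effect of 0.5 for the Type II case; all of these numbers are illustrative choices, not values from the text above.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(42)
ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)

def rejects_null(true_mean, n=30, sigma=1.0):
    """Draw one sample and test H0: mean = 0 with a two-sided z-test."""
    sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
    z = sample_mean / (sigma / sqrt(n))
    return abs(z) > Z_CRIT

trials = 2000
# Null is true: every rejection is a Type I error.
type_1_rate = sum(rejects_null(0.0) for _ in range(trials)) / trials
# Null is false: every non-rejection is a Type II error.
type_2_rate = 1 - sum(rejects_null(0.5) for _ in range(trials)) / trials
print(f"Type I rate ~ {type_1_rate:.3f}, Type II rate ~ {type_2_rate:.3f}")
```

The Type I rate should land near α, while the Type II rate depends on the effect size and sample size assumed.
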
Power & Power Analysis
- Power: Probability of rejecting the null hypothesis when a true effect exists (1 – β, where β is the Type II error rate).
- Power Analysis: Used to calculate required sample size.
- Example: Detecting a 5% sales increase with 80% power.
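
For a two-group comparison of means, a common normal-approximation formula for the per-group sample size is n = 2 · ((z₁₋α/₂ + z_power) · σ / δ)². The sketch below applies it under assumed numbers: baseline sales of 100 with a standard deviation of 10, so a 5% lift is δ = 5. The function name is invented for this example, and an exact t-based calculation gives a slightly larger n.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)

# Hypothetical: detect a lift of 5 units when the sd is 10.
n = sample_size_per_group(effect=5.0, sigma=10.0)
print(n)  # 63
```
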
Confidence Interval (CI)
- Definition: A range of plausible values for a population parameter, built by a procedure that captures the true value in a stated share of repeated samples (e.g., 95%).
- Example: CI of [80%, 90%] for customer satisfaction.
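
A minimal sketch of a normal-approximation (Wald) interval for a proportion follows. The survey counts, 85 satisfied customers out of 100, are hypothetical, and `proportion_ci` is a name chosen for the example. The Wald interval is the simplest option and can be poor for small samples or proportions near 0 or 1.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Wald confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf((1 + level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical survey: 85 of 100 customers satisfied.
low, high = proportion_ci(85, 100)
print(f"[{low:.3f}, {high:.3f}]")  # [0.780, 0.920]
```
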
Multiple Testing
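- Concern: Running many tests inflates the chance of at least one false positive (the family-wise error rate).
- Solution: Use corrections: Bonferroni controls the family-wise error rate; Benjamini-Hochberg controls the false discovery rate (FDR).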
- Example: Testing 10 campaigns with FDR control at 5%.
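
The two corrections can be sketched in a few lines. The ten p-values below are made up for illustration, and the function names are this example's own. Bonferroni compares each p-value with α/m, while Benjamini-Hochberg finds the largest rank k whose sorted p-value is at most (k/m)·α and rejects everything up to that rank.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank  # largest rank that passes its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject

# Hypothetical p-values from testing 10 campaigns.
p_values = [0.001, 0.008, 0.012, 0.019, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9]
print(sum(bonferroni(p_values)))          # 1
print(sum(benjamini_hochberg(p_values)))  # 4
```

Bonferroni is the stricter of the two, which is why it flags fewer campaigns on the same inputs.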

Central Limit Theorem (CLT)
- Concept: The distribution of the sample mean approaches a normal distribution as n increases, provided observations are independent and have finite variance.
- Use: Justifies using normal approximation in many tests.
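
A quick simulation can make the theorem concrete. The sketch below draws samples from a strongly skewed exponential distribution (rate 1, sample size 50, both arbitrary choices for illustration) and checks that the sample means centre on the population mean with a spread close to σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(7)
RATE = 1.0  # exponential rate; population mean and sd are both 1 / RATE
N = 50      # observations per sample

sample_means = [
    sum(rng.expovariate(RATE) for _ in range(N)) / N
    for _ in range(2000)
]

print(f"mean of sample means: {mean(sample_means):.3f}")  # close to 1.0
print(f"sd of sample means:   {stdev(sample_means):.3f}")
print(f"theory (sigma/sqrt n): {1 / RATE / sqrt(N):.3f}")
```
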
Expectation (Expected Value)
- Definition: The probability-weighted average of a random variable’s possible values, i.e. its long-run mean.
- Example: Using the sample mean to estimate the expected salary in a population.
Exponential Distribution
- Use: Time between events that occur independently at a constant average rate (e.g., customer purchases).
- Parameter: Rate (λ); the mean time between events is 1/λ.
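
As an illustration, the sketch below simulates gaps between purchases at an assumed rate of λ = 2 per day and compares them with two textbook results for the exponential distribution: mean gap 1/λ and P(gap > t) = e^(−λt).

```python
import math
import random

RATE = 2.0  # hypothetical: 2 purchases per day on average
rng = random.Random(1)
gaps = [rng.expovariate(RATE) for _ in range(20_000)]

mean_gap = sum(gaps) / len(gaps)                   # theory: 1 / RATE = 0.5
theory_over_1 = math.exp(-RATE * 1.0)              # P(gap > 1 day)
empirical_over_1 = sum(g > 1.0 for g in gaps) / len(gaps)

print(f"mean gap: {mean_gap:.3f} days")
print(f"P(gap > 1): theory {theory_over_1:.3f}, simulated {empirical_over_1:.3f}")
```
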
Skewed Distribution
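- Definition: Asymmetry in a distribution; a long right tail pulls the mean above the median, and a long left tail pulls it below.
- Use: Recognize skew and adjust the modeling strategy (e.g., report the median or log-transform the variable).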
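
A small made-up income sample shows the effect: one large value drags the mean well above the median, and a log transform narrows the gap.

```python
from math import log
from statistics import mean, median

# Hypothetical incomes (in $1000s) with one very high earner.
incomes = [30, 32, 35, 38, 40, 45, 250]

print(f"mean = {mean(incomes):.1f}, median = {median(incomes)}")  # mean = 67.1, median = 38

log_incomes = [log(x) for x in incomes]
print(f"log scale: mean = {mean(log_incomes):.2f}, median = {median(log_incomes):.2f}")
```
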

Linear Regression
- Use: Predict a continuous outcome from one or more independent variables.
- Example: Predicting sales from marketing budget.
Coefficients
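- Definition: Quantify the expected change in the outcome for a one-unit increase in an independent variable, holding the others constant.
- Example: Coefficient of 0.5 means a $1 increase in budget is associated with a $0.50 increase in sales, on average.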
R-Squared (R²)
- Definition: Proportion of variance explained by the model.
- Range: 0 to 1 for a least-squares model with an intercept, evaluated on its training data.
- Example: R² = 0.5 → 50% of variation explained.
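
Linear regression, its coefficients, and R² can be tied together in one short sketch. It fits a one-variable least-squares line by hand on hypothetical budget and sales figures; `fit_line` is a name invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical marketing budget and sales, both in $1000s.
budget = [10, 20, 30, 40, 50]
sales = [25, 31, 34, 41, 44]
slope, intercept, r2 = fit_line(budget, sales)
print(f"sales = {intercept:.1f} + {slope:.2f} * budget, R^2 = {r2:.3f}")
# sales = 20.6 + 0.48 * budget, R^2 = 0.985
```

Here the slope of 0.48 reads as: each extra $1 of budget is associated with about $0.48 more sales in this made-up data.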
Covariance
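- Definition: Its sign gives the direction of the linear relationship between two variables; its magnitude depends on their units, so it does not measure strength on its own.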
- Positive: Move together; Negative: Move oppositely.
Correlation Coefficient
- Definition: Strength & direction of linear relationship (-1 to 1).
- Example: 0.8 = strong positive correlation.
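
The difference between the two measures shows up when units change. In the sketch below (hypothetical data, helper names chosen for the example), converting budget from thousands of dollars to dollars multiplies the covariance by 1000 but leaves the correlation untouched.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

budget = [10, 20, 30, 40, 50]  # hypothetical, in $1000s
sales = [25, 31, 34, 41, 44]
budget_dollars = [b * 1000 for b in budget]

print(covariance(budget, sales), covariance(budget_dollars, sales))  # 120.0 120000.0
print(round(correlation(budget, sales), 3), round(correlation(budget_dollars, sales), 3))
```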

Mann-Whitney U Test
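- Use: Compare two independent groups using ranks; it tests whether values in one group tend to be larger than in the other, and compares medians only when both distributions have the same shape.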
- Advantage: No assumption of normality.
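
The U statistic is simply a count of pairs, which the sketch below makes explicit. The data are hypothetical and `mann_whitney_u` is this example's own helper. The p-value uses the large-sample normal approximation without tie or continuity corrections, an assumption made for brevity; small samples are usually handled with exact tables or a library routine.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(a, b):
    """U for group a and a two-sided p-value via the normal approximation."""
    # Count pairs where the value from a exceeds the value from b; ties count half.
    u_a = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u_a - mu) / sigma
    return u_a, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical session lengths (minutes) for two page designs.
design_a = [1.1, 2.3, 2.9, 3.4, 4.0, 4.8]
design_b = [3.9, 5.1, 5.6, 6.2, 7.0, 8.3]
u, p = mann_whitney_u(design_a, design_b)
print(f"U = {u}, p = {p:.3f}")
```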

Bootstrap
- Method: Resample with replacement to estimate uncertainty.
- Use: Estimate CI, standard error, model stability.
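
A percentile bootstrap interval, one of several bootstrap variants, can be sketched as follows. The order values are hypothetical and `bootstrap_ci` is a name chosen for the example.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on resamples drawn with replacement.
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical order values.
orders = [23, 45, 31, 50, 28, 39, 41, 35, 60, 27, 33, 48]
low, high = bootstrap_ci(orders)
print(f"mean = {mean(orders):.1f}, 95% CI ~ [{low:.1f}, {high:.1f}]")
```

Passing a different `stat` (for example `statistics.median`) reuses the same procedure for statistics that have no simple standard-error formula.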

Simpson’s Paradox
- Definition: A trend that appears within every subgroup reverses or disappears when the subgroups are combined.
- Example: One campaign seems better overall, but worse within gender subgroups.
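
The reversal is easiest to believe with numbers. The counts below are invented for illustration: campaign A converts better among both men and women, yet campaign B looks better overall because most of B's traffic comes from the subgroup that converts more easily.

```python
# Hypothetical (conversions, visitors) per campaign and subgroup.
data = {
    "A": {"men": (81, 87), "women": (192, 263)},
    "B": {"men": (234, 270), "women": (55, 80)},
}

def rate(conversions, visitors):
    return conversions / visitors

def overall(campaign):
    """Conversion rate after pooling the subgroups."""
    conversions = sum(c for c, _ in data[campaign].values())
    visitors = sum(v for _, v in data[campaign].values())
    return rate(conversions, visitors)

for group in ("men", "women"):
    print(group, {c: round(rate(*data[c][group]), 3) for c in data})
print("overall", {c: round(overall(c), 3) for c in data})
```
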
Overfitting
- Definition: Model performs well on training data but poorly on new data.
- Fix: Use regularization, cross-validation, simpler models.
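
A small experiment can show the train/test gap. The sketch below assumes a made-up linear relationship with noise and compares a 1-nearest-neighbour predictor, which memorises the training set, with a plain least-squares line. All names and numbers are illustrative.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Noisy samples from a hypothetical true relationship y = 2x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 2.0) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(40)
test_x, test_y = make_data(200)

def nearest_neighbour(x):
    """Flexible model: return the y of the closest training x."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

# Simple model: least-squares line fitted to the training data.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def line(x):
    return intercept + slope * x

for name, model in [("1-NN", nearest_neighbour), ("line", line)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 2),
          "test MSE:", round(mse(model, test_x, test_y), 2))
```

The nearest-neighbour model scores a perfect zero error on the data it memorised but not on fresh data, whereas the line typically shows similar error on both sets.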