🔍 1. Hypothesis Testing & Statistical Significance
- p-value
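- Definition: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
- Interpretation: A p-value below the chosen significance level α (conventionally 0.05) is treated as statistically significant.
- Example: p = 0.01 → if the null hypothesis were true, results this extreme would occur about 1% of the time → reject the null hypothesis at α = 0.05. This is not the probability that the results are due to chance.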
- t-test
- Use: Compare means between two groups (independent or paired).
- Example: Comparing two marketing campaigns’ effectiveness.
- z-test
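- Use: Compare means, like the t-test, when the population variance is known; with large samples the sample variance is often used as an approximation.
- Example: Testing mean customer spending when the population standard deviation is known.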
- Type I & Type II Errors
- Type I: False positive (rejecting a true null hypothesis; probability α).
- Type II: False negative (failing to reject a false null hypothesis; probability β).
- Example: Type I = concluding a campaign works when it does not; Type II = missing a campaign that does work.
- Power & Power Analysis
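- Power: Probability of rejecting the null hypothesis when a true effect exists (1 − β, where β is the Type II error rate).
- Power Analysis: Calculates the sample size required to detect a given effect size at a chosen significance level and power.
- Example: Finding the sample size needed to detect a 5% sales increase with 80% power.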
- Confidence Interval (CI)
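- Definition: A range computed from the sample by a procedure that captures the true population parameter in a given share of repeated samples (e.g., 95%). It is not a 95% probability statement about one particular interval.
- Example: A 95% CI of [80%, 90%] for customer satisfaction.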
- Multiple Testing
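- Concern: Running many tests inflates the chance of false positives across the whole set of tests.
- Solution: Use corrections: Bonferroni controls the family-wise error rate (FWER); Benjamini-Hochberg controls the false discovery rate (FDR).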
- Example: Testing 10 campaigns with FDR control at 5%.
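
The multiple-testing corrections above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the ten p-values are made up, and `bonferroni` and `benjamini_hochberg` are hypothetical helper names written for this example.

```python
# Hypothetical p-values from testing 10 campaigns (made-up numbers).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]


def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= max_rank
    return reject


uncorrected = [p <= 0.05 for p in p_values]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

print("uncorrected rejections:", sum(uncorrected))
print("Bonferroni rejections: ", sum(bonf))
print("BH rejections:         ", sum(bh))
```

On these made-up numbers Bonferroni is the most conservative, Benjamini-Hochberg rejects slightly more, and both reject fewer hypotheses than uncorrected testing at 0.05.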
📈 2. Probability Distributions & Expectations
- Central Limit Theorem (CLT)
- Concept: The distribution of the sample mean approaches a normal distribution as n increases, for independent draws from a distribution with finite variance.
- Use: Justifies using normal approximation in many tests.
- Expectation (Expected Value)
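- Definition: The probability-weighted average of the values a random variable can take; its long-run mean.
- Example: The expected salary of a randomly chosen employee, estimated by the sample average salary.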
- Exponential Distribution
- Use: Models the time between events that occur independently at a constant average rate (e.g., customer purchases).
- Parameter: Rate (λ); the mean waiting time is 1/λ.
- Skewed Distribution
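- Definition: Asymmetry in a distribution; the mean is pulled toward the long tail, so it typically exceeds the median under right skew and falls below it under left skew.
- Use: Recognize skew and adjust the modeling strategy (e.g., transform the variable or report the median).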
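
The ideas in this section can be checked with a small simulation. The sketch below assumes a rate of λ = 2 and a sample size of 50, both arbitrary choices for illustration; it draws exponential waiting times, shows that the raw data are right-skewed, and compares the behaviour of the sample means with what the CLT predicts.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

RATE = 2.0          # lambda: events per unit time (assumed value)
N = 50              # observations per sample
NUM_SAMPLES = 5000  # number of samples drawn

samples = [[random.expovariate(RATE) for _ in range(N)]
           for _ in range(NUM_SAMPLES)]
sample_means = [statistics.fmean(s) for s in samples]

# Raw exponential draws are right-skewed: the mean sits above the median.
raw = [x for s in samples for x in s]
raw_mean = statistics.fmean(raw)
raw_median = statistics.median(raw)

# CLT prediction for the sample mean:
# centre 1/lambda, spread (1/lambda) / sqrt(N), roughly normal shape.
expected_mean = 1 / RATE
expected_sd = (1 / RATE) / math.sqrt(N)
observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)

# Under a normal shape, about 95% of sample means fall within 1.96 sd.
within = sum(abs(m - expected_mean) <= 1.96 * expected_sd
             for m in sample_means) / NUM_SAMPLES

print(f"raw mean {raw_mean:.3f} vs raw median {raw_median:.3f}")
print(f"mean of sample means {observed_mean:.3f} (theory {expected_mean:.3f})")
print(f"sd of sample means   {observed_sd:.3f} (theory {expected_sd:.3f})")
print(f"share within 1.96 sd {within:.3f}")
```

Even though each individual waiting time is strongly skewed, the averages of 50 of them already behave close to a normal distribution.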
📐 3. Regression & Relationships
- Linear Regression
- Use: Predict a continuous outcome from one or more independent variables.
- Example: Predicting sales from marketing budget.
- Coefficients
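- Definition: Quantify the expected change in the outcome for a one-unit increase in an independent variable, holding the other variables constant.
- Example: A coefficient of 0.5 means a $1 increase in budget is associated with a $0.50 increase in predicted sales.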
- R-Squared (R²)
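- Definition: Proportion of the variance in the outcome explained by the model.
- Range: 0 to 1 for least-squares regression with an intercept, evaluated on the training data.
- Example: R² = 0.5 → 50% of the variation in the outcome is explained.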
- Covariance
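- Definition: Measures how two variables vary together; its sign gives the direction of the linear relationship, while its magnitude depends on the variables' units.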
- Positive: Move together; Negative: Move oppositely.
- Correlation Coefficient
- Definition: Strength & direction of linear relationship (-1 to 1).
- Example: 0.8 = strong positive correlation.
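
The quantities in this section are closely related, which a by-hand calculation makes visible. The sketch below fits a one-variable least-squares line on made-up budget and sales figures; the data and variable names are assumptions for illustration only.

```python
import math

# Hypothetical data: marketing budget (x) and sales (y), in the same units.
budget = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 2.4, 3.1, 3.4, 4.0]

n = len(budget)
mean_x = sum(budget) / n
mean_y = sum(sales) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in budget)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(budget, sales))

covariance = sxy / (n - 1)                # sample covariance (unit-dependent)
correlation = sxy / math.sqrt(sxx * syy)  # Pearson r, always in [-1, 1]

# Least-squares coefficients for: sales = intercept + slope * budget.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared: 1 minus residual variation over total variation.
predicted = [intercept + slope * x for x in budget]
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
r_squared = 1 - ss_res / syy

print(f"covariance  {covariance:.3f}")
print(f"correlation {correlation:.3f}")
print(f"slope       {slope:.3f}")
print(f"intercept   {intercept:.3f}")
print(f"R-squared   {r_squared:.3f}")
```

With a single predictor, R² equals the squared correlation coefficient, and the slope is the covariance divided by the variance of the predictor.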
⚙️ 4. Non-Parametric Tests
- Mann-Whitney U Test
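- Use: Compare two independent groups by rank, testing whether values in one group tend to be larger than in the other; this amounts to comparing medians only when both distributions have the same shape.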
- Advantage: No assumption of normality.
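
The U statistic behind the test can be computed by counting pairs. The sketch below uses made-up data and a hypothetical helper `mann_whitney_u`; the p-value comes from the large-sample normal approximation without tie or continuity correction, which is only rough for samples this small.

```python
import math
from statistics import NormalDist

# Hypothetical measurements from two independent groups.
group_a = [12, 15, 17, 19, 22]
group_b = [8, 9, 11, 13, 14]


def mann_whitney_u(a, b):
    """U for group a: number of (a, b) pairs where a is larger; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


n1, n2 = len(group_a), len(group_b)
u_a = mann_whitney_u(group_a, group_b)
u_b = n1 * n2 - u_a  # the two U values always sum to n1 * n2

# Normal approximation under H0 (assumes no ties; no continuity correction).
mean_u = n1 * n2 / 2
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u_a - mean_u) / sd_u
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# Share of pairs in which the group A value is the larger one.
prob_a_greater = u_a / (n1 * n2)

print(f"U_a = {u_a}, U_b = {u_b}")
print(f"z = {z:.3f}, approximate two-sided p = {p_two_sided:.4f}")
print(f"P(A > B) estimate = {prob_a_greater:.2f}")
```

Only the ordering of the values enters the calculation, which is why the test does not rely on normality.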
🔁 5. Sampling & Bootstrapping
- Bootstrap
- Method: Resample the observed data with replacement many times and recompute the statistic on each resample to estimate its sampling variability.
- Use: Estimate CI, standard error, model stability.
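
A percentile bootstrap for the mean can be sketched as follows. The data are made up, `bootstrap_ci` is a hypothetical helper, and the percentile interval is only one of several bootstrap interval methods.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical sample of 20 observations.
data = [23, 19, 31, 25, 28, 22, 35, 27, 24, 30,
        21, 26, 29, 33, 20, 27, 25, 32, 24, 28]


def bootstrap_ci(values, stat=statistics.fmean, n_resamples=2000, level=0.95):
    """Percentile bootstrap interval and bootstrap standard error for `stat`."""
    estimates = []
    for _ in range(n_resamples):
        # Draw a resample of the same size, with replacement.
        resample = random.choices(values, k=len(values))
        estimates.append(stat(resample))
    estimates.sort()
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    std_error = statistics.stdev(estimates)
    return lower, upper, std_error


sample_mean = statistics.fmean(data)
lower, upper, std_error = bootstrap_ci(data)

print(f"sample mean {sample_mean:.2f}")
print(f"95% percentile CI [{lower:.2f}, {upper:.2f}]")
print(f"bootstrap standard error {std_error:.2f}")
```

Passing a different function as `stat`, for example `statistics.median`, gives an interval for that statistic without any new formula.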
📉 6. Data Pitfalls & Paradoxes
- Simpson’s Paradox
- Definition: A trend that appears within every subgroup reverses or disappears when the groups are combined.
- Example: One campaign seems better overall, but worse within gender subgroups.
- Overfitting
- Definition: Model performs well on training data but poorly on new data.
- Fix: Use regularization or simpler models; use cross-validation to detect overfitting and tune model complexity.
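
Simpson's paradox is easiest to see with numbers. The sketch below uses made-up conversion counts for two campaigns split by gender; all figures and names are assumptions chosen to produce the reversal.

```python
# Made-up data: (conversions, customers reached) per campaign and gender.
results = {
    "A": {"women": (90, 100), "men": (300, 1000)},
    "B": {"women": (800, 1000), "men": (20, 100)},
}


def rate(conversions, customers):
    """Conversion rate for one cell of the table."""
    return conversions / customers


def overall_rate(campaign):
    """Conversion rate after pooling both subgroups of a campaign."""
    conversions = sum(c for c, _ in results[campaign].values())
    customers = sum(n for _, n in results[campaign].values())
    return conversions / customers


for group in ("women", "men"):
    a = rate(*results["A"][group])
    b = rate(*results["B"][group])
    print(f"{group:>5}: A = {a:.2f}, B = {b:.2f}")

print(f"overall: A = {overall_rate('A'):.2f}, B = {overall_rate('B'):.2f}")
```

Campaign A converts better within each gender, yet campaign B looks better overall, because B was shown mostly to the subgroup that converts at a high rate regardless of campaign.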
🧠 7. Core Subcategories to Master
- Probability
- Sampling
- Hypothesis Testing
- Confidence Intervals
- Regression Analysis
- Time Series Analysis
- Machine Learning