Descriptive and Inferential Statistics

📊 Statistics: From Description to Inference

Statistics is the science of collecting, analyzing, and interpreting data. It is divided into two main branches: Descriptive Statistics, which summarizes the features of a dataset, and Inferential Statistics, which allows us to draw conclusions about a population based on a sample.

🟢 Level 1: Descriptive Statistics (Summarizing Data)

Descriptive statistics provides a quick overview of the “shape” and “center” of your data.

1. Central Tendency

Mean ( $\mu$ ): The average value. Highly sensitive to outliers.
Median: The middle value when data is sorted. Robust to outliers (ideal for income data).
Mode: The most frequent value. Useful for categorical data.
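
To make the outlier sensitivity concrete, here is a minimal sketch using Python's standard `statistics` module. The dataset is illustrative, and the variable names (`values`, `trimmed`) are assumptions made for this example.

```python
import statistics

# Illustrative sample: eight typical values plus one extreme value (100).
values = [10, 12, 12, 13, 15, 16, 18, 20, 100]
trimmed = values[:-1]  # the same data with the extreme value removed

mean_with = statistics.mean(values)
mean_without = statistics.mean(trimmed)
median_with = statistics.median(values)
median_without = statistics.median(trimmed)
mode_value = statistics.mode(values)

print(f"Mean   with / without outlier: {mean_with} / {mean_without}")
print(f"Median with / without outlier: {median_with} / {median_without}")
print(f"Mode: {mode_value}")
```

Removing the single extreme value moves the mean from 24 to 14.5, while the median only moves from 15 to 14. This is why the median is often preferred for skewed data.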

2. Dispersion (Spread)

Variance ( $\sigma^2$ ): The average squared deviation from the mean.
Standard Deviation ( $\sigma$ ): The square root of variance. It describes how “spread out” the data is in the original units.
Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1). A common rule flags values below $Q1 - 1.5 \times \text{IQR}$ or above $Q3 + 1.5 \times \text{IQR}$ as outliers.

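import pandas as pd

# Small sample with one extreme value (100) to show its effect on each statistic
data = [10, 12, 12, 13, 15, 16, 18, 20, 100]
df = pd.DataFrame(data, columns=['Value'])

# The mean is pulled toward the outlier; the median is not
print(f"Mean: {df['Value'].mean()}")
print(f"Median: {df['Value'].median()}")
# pandas computes the sample standard deviation (ddof=1) by default
print(f"Std Dev: {df['Value'].std()}")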
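
The IQR can also be used to flag outliers in code. The sketch below uses only the standard library and assumes the common convention of fences at 1.5 × IQR beyond the quartiles, which is a rule of thumb rather than something fixed by the definitions above. The `method="inclusive"` setting is chosen because it should match the linear interpolation that pandas uses by default; treat that as an assumption to verify for your version.

```python
import statistics

data = [10, 12, 12, 13, 15, 16, 18, 20, 100]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```

For this data the fences are 3 and 27, so only the value 100 is flagged.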

🟡 Level 2: Inferential Statistics (Making Predictions)

Inferential statistics uses probability to determine how likely it is that a pattern observed in a sample also exists in the larger population.

3. Population vs. Sample

Population: The entire group you want to study (e.g., all visitors to your website).
Sample: A subset of the population used for analysis (e.g., 1,000 random visitors).
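
The relationship between the two can be simulated. The sketch below builds a hypothetical population of page load times (the normal distribution, its parameters, and the population size are all assumptions for illustration), then draws a simple random sample of 1,000 and compares the means.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: load times (ms) for 100,000 site visits.
population = [random.gauss(210, 50) for _ in range(100_000)]

# Simple random sample of 1,000 visits, drawn without replacement.
sample = random.sample(population, k=1000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"Population mean: {population_mean:.1f} ms")
print(f"Sample mean:     {sample_mean:.1f} ms")
```

The sample mean will usually land close to the population mean without matching it exactly. Inferential statistics is about quantifying that gap.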

4. Confidence Intervals (CI)

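A confidence interval provides a range of values that is likely to contain the true population parameter (e.g., “We are 95% confident the average load time is between 200ms and 220ms”). For a population mean with a known standard deviation $\sigma$, the interval is:

$\text{CI} = \bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$

Here $\bar{x}$ is the sample mean, $n$ is the sample size, and $Z$ is the critical value for the chosen confidence level (about 1.96 for 95%).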
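
The formula can be evaluated directly. This is a minimal sketch that assumes the population standard deviation is known. When it has to be estimated from a small sample, a t-based interval is generally more appropriate. The function name `z_confidence_interval` and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (low, high) for the mean, assuming sigma is known."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: mean load time 210 ms, sigma 50 ms, 100 measurements.
low, high = z_confidence_interval(210, 50, 100)
print(f"95% CI: {low:.1f} ms to {high:.1f} ms")
```

With these inputs the interval is roughly 200.2 ms to 219.8 ms. Quadrupling `n` would halve the margin, since the margin shrinks with $\sqrt{n}$.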

🔴 Level 3: Hypothesis Testing and A/B Testing

5. The Hypothesis Testing Framework

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics.

Null Hypothesis ( $H_0$ ): The “status quo”; it assumes no effect or difference.
Alternative Hypothesis ( $H_1$ ): The claim we want to test; it assumes there is an effect.

6. P-Values and Significance ( $\alpha$ )

The p-value is the probability of observing results at least as extreme as ours, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true.

If $p < \alpha$ (usually 0.05), we reject the null hypothesis.
If $p \ge \alpha$ , we fail to reject the null hypothesis.
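
As an illustration of the decision rule, the sketch below runs a two-sided z-test for a mean. It assumes a known population standard deviation, and the numbers (a null mean of 200 ms, an observed mean of 210 ms) and the function name `two_sided_z_test` are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, null_mean, sigma, n):
    """Return (z, p_value) for H0: population mean == null_mean."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # Probability of a z-score at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05
z, p_value = two_sided_z_test(sample_mean=210, null_mean=200, sigma=50, n=100)
reject_null = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Here the z-score is 2.0 and the p-value is about 0.0455, so the null hypothesis is rejected at $\alpha = 0.05$ but would not be at $\alpha = 0.01$.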

7. Common Tests

T-Test: Compares the means of two groups (e.g., does Version A of a button result in more clicks than Version B?).
Chi-Square Test: Used for categorical data (e.g., is there a relationship between user location and device type?).
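
Both test statistics can be computed by hand. The sketch below uses only the standard library: a Welch t statistic (a variant of the t-test that does not assume equal variances, chosen here as an assumption) and a chi-square statistic for a 2x2 table without continuity correction. The data and function names are hypothetical. In practice a statistics library such as SciPy would typically be used, and its defaults (for example, whether a continuity correction is applied) may give slightly different numbers.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test."""
    var_a = statistics.variance(a) / len(a)  # squared standard error of mean a
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df

def chi_square_2x2(table):
    """Return (chi2, p_value) for a 2x2 table, no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) equals P(|Z| > sqrt(x)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical clicks per session for two button versions.
t_stat, df = welch_t([2, 4, 4, 5, 6], [1, 2, 3, 3, 4])
print(f"t = {t_stat:.3f}, df = {df:.2f}")

# Hypothetical counts: rows are locations, columns are device types.
chi2, p = chi_square_2x2([[30, 20], [20, 30]])
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Turning the t statistic into a p-value requires the t distribution, which is not in the standard library, so the sketch stops at the statistic and its degrees of freedom.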