Descriptive and Inferential Statistics
📊 Statistics: From Description to Inference
Statistics is the science of collecting, analyzing, and interpreting data. It is divided into two main branches: Descriptive Statistics, which summarizes the features of a dataset, and Inferential Statistics, which allows us to draw conclusions about a population based on a sample.
🟢 Level 1: Descriptive Statistics (Summarizing Data)
Descriptive statistics provides a quick overview of the “shape” and “center” of your data.
1. Central Tendency
- Mean ($\mu$): The average value. Highly sensitive to outliers.
- Median: The middle value when data is sorted. Robust to outliers (ideal for skewed data such as income).
- Mode: The most frequent value. Useful for categorical data.
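As a small illustration of the mode on categorical data, here is a minimal sketch using Python's standard `statistics` module. The `devices` list is made-up example data, not taken from any real dataset.

```python
from statistics import mode, multimode

# Hypothetical categorical data: device type recorded per visit
devices = ["mobile", "desktop", "mobile", "tablet", "mobile", "desktop"]

# mode() returns the single most frequent value
most_common = mode(devices)
print(most_common)

# multimode() returns every value tied for the highest count
ties = multimode([1, 1, 2, 2, 3])
print(ties)
```

A mean or median would be meaningless for labels like these, which is why the mode is the usual summary for categories.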
2. Dispersion (Spread)
- Variance ($\sigma^2$): The average squared deviation from the mean.
- Standard Deviation ($\sigma$): The square root of the variance. It describes how “spread out” the data is, in the original units.
- Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is commonly used to flag outliers, for example points that lie more than 1.5 × IQR below Q1 or above Q3.
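The IQR-based outlier check can be sketched as follows. This is a minimal example, assuming the common 1.5 × IQR rule of thumb and the `inclusive` quartile method of the standard library; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

data = [10, 12, 12, 13, 15, 16, 18, 20, 100]

# Quartiles; method="inclusive" interpolates between the observed data points
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers}")
```

Because the quartiles ignore the extreme tails, the single value of 100 does not inflate the IQR the way it inflates the standard deviation.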
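import pandas as pd
# Small dataset; 100 is an outlier that pulls the mean upward
data = [10, 12, 12, 13, 15, 16, 18, 20, 100]
df = pd.DataFrame(data, columns=['Value'])
print(f"Mean: {df['Value'].mean()}")      # 24.0, inflated by the outlier
print(f"Median: {df['Value'].median()}")  # 15.0, unaffected by the outlier
# Note: pandas computes the sample standard deviation (ddof=1) by default
print(f"Std Dev: {df['Value'].std()}")
🟡 Level 2: Inferential Statistics (Making Predictions)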
Inferential statistics uses probability to determine how likely it is that a pattern observed in a sample also exists in the larger population.
3. Population vs. Sample
- Population: The entire group you want to study (e.g., all visitors to your website).
- Sample: A subset of the population used for analysis (e.g., 1,000 random visitors).
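To make the distinction concrete, the sketch below simulates a population and draws a simple random sample from it. The population here is synthetic (normally distributed load times with made-up parameters), so the numbers are purely illustrative.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: load times (ms) for 100,000 visitors
population = [random.gauss(210, 40) for _ in range(100_000)]

# Simple random sample of 1,000 visitors, drawn without replacement
sample = random.sample(population, k=1_000)

population_mean = mean(population)
sample_mean = mean(sample)
print(f"Population mean: {population_mean:.1f} ms")
print(f"Sample mean:     {sample_mean:.1f} ms")
```

In practice the population mean is unknown; the sample mean serves as an estimate of it, and the two are usually close but not identical.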
4. Confidence Intervals (CI)
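A confidence interval is a range of values, computed from a sample, that is likely to contain the true population parameter (e.g., “We are 95% confident the average load time is between 200ms and 220ms”). The 95% describes the procedure: if we repeated the sampling many times, about 95% of the intervals built this way would contain the true value.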
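One common way to compute such an interval for a mean is the normal approximation: the sample mean plus or minus a critical value times the standard error. The sketch below assumes a large sample (so the normal approximation is reasonable) and uses simulated load times; the helper name `mean_confidence_interval` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation CI for the mean (reasonable for large samples)."""
    n = len(values)
    center = mean(values)
    standard_error = stdev(values) / sqrt(n)  # stdev() is the sample std (n - 1)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return center - margin, center + margin

random.seed(0)
# Hypothetical sample of 1,000 page load times in milliseconds
load_times = [random.gauss(210, 50) for _ in range(1_000)]

low, high = mean_confidence_interval(load_times)
print(f"95% CI for mean load time: {low:.1f} ms to {high:.1f} ms")
```

For small samples, the critical value would normally come from the t-distribution instead of the normal distribution, which gives a somewhat wider interval.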
🔴 Level 3: Hypothesis Testing and A/B Testing
5. The Hypothesis Testing Framework
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics.
- Null Hypothesis ($H_0$): The “status quo”. It assumes no effect or difference.
- Alternative Hypothesis ($H_1$): The claim we want to test. It assumes there is an effect.
6. P-Values and Significance ($\alpha$)
The p-value is the probability of observing our results (or more extreme results) if the null hypothesis is true.
- If $p \le \alpha$ (usually $\alpha = 0.05$), we reject the null hypothesis.
- If $p > \alpha$, we fail to reject the null hypothesis.
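The definition of the p-value can be illustrated with a permutation test, which estimates it directly by simulation: if the null hypothesis is true, group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference is at least as extreme as the observed one. This is a sketch of one possible approach, not the only way to obtain a p-value; the data and the helper `permutation_p_value` are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for a difference in means under H0: no difference."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under H0 the group labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical measurements for two groups
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.3, 12.6]

alpha = 0.05
p_value = permutation_p_value(control, treatment)
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"p = {p_value:.4f}: {decision}")
```

Note that "fail to reject" is not the same as "accept": a large p-value only means the data are compatible with the null hypothesis, not that the null hypothesis has been shown to be true.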
7. Common Tests
- T-Test: Compares the means of two groups (e.g., does Version A of a button result in more clicks than Version B?).
- Chi-Square Test: Used for categorical data (e.g., is there a relationship between user location and device type?).
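Both tests reduce to computing a test statistic from the data. The sketch below computes Welch's t statistic (a t-test variant that does not assume equal variances) and a Pearson chi-square test for a 2x2 table, using only the standard library. The function names and all data are hypothetical, and the chi-square p-value shortcut shown here is valid only for one degree of freedom.

```python
from math import erfc, sqrt
from statistics import mean, variance

def welch_t_statistic(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with df = 1
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical session durations (seconds) for two page versions
version_a = [31.0, 28.5, 35.2, 30.1, 29.8, 33.4]
version_b = [36.9, 34.2, 38.5, 35.7, 39.1, 37.3]
t_stat = welch_t_statistic(version_a, version_b)
print(f"Welch t = {t_stat:.2f}")

# Hypothetical counts: rows = location (US, EU), columns = device (mobile, desktop)
counts = [[120, 80], [90, 110]]
chi2, p = chi_square_2x2(counts)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.chi2_contingency(table)` cover the same ground and also return p-values for general cases. Note that `chi2_contingency` applies a continuity correction to 2x2 tables by default, so its result can differ slightly from the uncorrected statistic above.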