Probability & Statistics — Course Notes
These are my notes from the Probability and Statistics course at UCM. The course splits into two halves — descriptive statistics (summarizing data you already have) and probability theory (reasoning about uncertainty). Here's how I organized the ideas.
Describing a Single Variable
A variable is any characteristic that varies across a population. The first distinction is type:
- Quantitative: discrete (counts) or continuous (measurements)
- Qualitative: nominal (no ordering) or ordinal (ordered categories)
The appropriate visualization follows from the type: bar charts for qualitative and discrete quantitative; histograms for continuous quantitative.
Frequency Tables
For ungrouped data, each observation xᵢ gets:
- nᵢ — absolute frequency (how many times xᵢ appears)
- fᵢ = nᵢ/n — relative frequency
- Nᵢ — cumulative absolute frequency
- Fᵢ = Nᵢ/n — cumulative relative frequency
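The four columns can be built in one pass over the sorted distinct values. This is a minimal Python sketch; the function name `frequency_table` and the sample data are made up for illustration.

```python
from collections import Counter

def frequency_table(data):
    """Return rows (x_i, n_i, f_i, N_i, F_i), sorted by value."""
    n = len(data)
    counts = Counter(data)
    rows = []
    cumulative = 0  # running N_i
    for x in sorted(counts):
        n_i = counts[x]
        cumulative += n_i
        rows.append((x, n_i, n_i / n, cumulative, cumulative / n))
    return rows

# Hypothetical sample: number of siblings reported by 10 students.
sample = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
table = frequency_table(sample)
for row in table:
    print(row)
```

The last row always ends with Nᵢ = n and Fᵢ = 1, which is a quick sanity check on any hand-built table.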
For grouped data (class intervals), pick K ≈ √n intervals (or use Sturges' rule: K ≈ 1 + log₂ n ≈ 1 + 3.322 log₁₀ n), assign each observation to an interval, and use the class mark (midpoint) as the representative value.
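As a quick comparison of the two rules of thumb, here is a small Python sketch. It uses Sturges' rule in its base-2 form, K = 1 + log₂ n; rounding to the nearest integer is my own choice, not something the rules prescribe.

```python
import math

def class_counts(n):
    """Suggested number of intervals: (square-root rule, Sturges' rule)."""
    k_sqrt = round(math.sqrt(n))
    k_sturges = round(1 + math.log2(n))  # rounding convention is an assumption
    return k_sqrt, k_sturges

print(class_counts(100))  # (10, 8)
```

For large n the square-root rule grows much faster than Sturges' rule, so the two can disagree noticeably.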
Measures of Centre
| Measure | Formula | Use case |
|---|---|---|
| Arithmetic mean | x̄ = (1/n) Σ nᵢxᵢ | General purpose |
| Weighted mean | x̄ₚ = Σ wᵢxᵢ / Σ wᵢ | Non-uniform importance |
| Geometric mean | x̄G = (∏ xᵢⁿⁱ)^(1/n) | Consecutive growth rates |
| Harmonic mean | x̄A = n / Σ (nᵢ/xᵢ) | Averaging rates (e.g. speeds over equal distances) |
| Median | Middle value after sorting | Robust to outliers |
| Mode | Most frequent value | Categorical data |
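For grouped data in intervals, the median is computed by linear interpolation inside the median class, the first interval whose cumulative frequency Nᵢ reaches n/2. Here lᵢ₋₁ is the lower bound of that interval and Lᵢ its width: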
Me = (n/2 − Nᵢ₋₁) · Lᵢ / (Nᵢ − Nᵢ₋₁) + lᵢ₋₁
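The interpolation can be sketched in Python as follows. The function name `grouped_median` and the interval data are hypothetical, and the sketch assumes the intervals are sorted and contiguous.

```python
def grouped_median(intervals, counts):
    """Median by linear interpolation for data grouped in class intervals.

    intervals: list of (lower, upper) bounds, sorted and contiguous.
    counts:    absolute frequency n_i of each interval.
    """
    n = sum(counts)
    cumulative_before = 0  # N_{i-1}
    for (lower, upper), n_i in zip(intervals, counts):
        # The median class is the first one whose cumulative count reaches n/2.
        if n_i > 0 and cumulative_before + n_i >= n / 2:
            width = upper - lower  # L_i
            return lower + (n / 2 - cumulative_before) * width / n_i
        cumulative_before += n_i
    raise ValueError("no observations")

# Hypothetical grouped data with n = 30.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
counts = [5, 8, 12, 5]
median = grouped_median(intervals, counts)
print(median)
```

Here n/2 = 15 falls in the class [20, 30), where Nᵢ₋₁ = 13 and nᵢ = 12, so Me = 20 + 2 · 10 / 12 ≈ 21.67.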
Measures of Dispersion
Variance is the average squared deviation from the mean:
V² = (1/n) Σ nᵢ(xᵢ − x̄)²
The quasi-variance S² = (n/(n−1)) · V² is the unbiased estimator of the population variance; use it when the data are a sample drawn from a larger population.
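A short Python check of both quantities on a made-up sample, cross-checked against the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n − 1):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)
mean = sum(data) / n

variance = sum((x - mean) ** 2 for x in data) / n  # V^2, divides by n
quasi_variance = variance * n / (n - 1)            # S^2, divides by n - 1

print(variance, statistics.pvariance(data))
print(quasi_variance, statistics.variance(data))
```

For this sample the mean is 5, V² = 32/8 = 4, and S² = 32/7 ≈ 4.57.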
The interquartile range IQR = Q₃ − Q₁ captures the spread of the central 50% of data. It's more robust than variance because it ignores the tails entirely.
Box plots use Q₁, Me, Q₃, and whiskers extending to the most extreme values within Q₁ − 1.5·IQR and Q₃ + 1.5·IQR. Points beyond those limits are outliers.
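The box-plot ingredients can be computed as in the sketch below, on a made-up sample. Quartile conventions differ between textbooks and software; this uses `statistics.quantiles` with `method="inclusive"`, which is an assumption and may not match the convention used in the course.

```python
import statistics

data = sorted([2, 4, 5, 7, 8, 9, 11, 12, 30])  # hypothetical sample

# Quartiles under the "inclusive" interpolation convention (an assumption).
q1, me, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme observations still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, me, q3, iqr, whiskers, outliers)
```

With these nine values the quartiles are 5, 8, and 11, the fences are −4 and 20, the whiskers stop at 2 and 12, and 30 is flagged as an outlier.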
Shape: Skewness and Kurtosis
In a symmetric unimodal distribution the mean, median, and mode coincide (the converse does not always hold). When they diverge:
- Mo < Me < x̄ → right-skewed (positive skew, long tail to the right)
- x̄ < Me < Mo → left-skewed (negative skew)
The Pearson skewness coefficient Aₚ = (x̄ − Mo) / V, with V the standard deviation, quantifies this. The Fisher skewness coefficient uses the third central moment, m₃ / V³; the Bowley coefficient uses quartiles.
Kurtosis measures peakedness relative to the Normal distribution. The relative kurtosis coefficient is Ap² = m₄/V⁴ − 3, where m₄ is the fourth central moment:
- Ap² > 0 → leptokurtic (sharper peak, heavier tails)
- Ap² < 0 → platykurtic (flatter peak, lighter tails)
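Both moment-based shape coefficients reduce to central moments mₖ = (1/n) Σ (xᵢ − x̄)ᵏ. The Python sketch below uses population moments (dividing by n) in both numerator and denominator, which is my assumption about the convention. The function names and samples are made up.

```python
def central_moment(data, k):
    """k-th central moment m_k = (1/n) * sum((x - mean)^k)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def fisher_skewness(data):
    """m_3 / m_2^(3/2); positive for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """m_4 / m_2^2 - 3; zero for the Normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
print(fisher_skewness(symmetric), excess_kurtosis(symmetric))
print(fisher_skewness(right_skewed))
```

The symmetric sample has zero skewness and excess kurtosis 6.8/4 − 3 = −1.3 (platykurtic). The second sample has positive skewness because of the value 10 in the right tail.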
Describing Two Variables
A double-entry frequency table records nᵢⱼ — the count of individuals with X = xᵢ and Y = yⱼ simultaneously. Marginal frequencies sum across rows or columns.
Covariance measures linear association:
Cov = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ) = (1/n) Σᵢ xᵢyᵢ − x̄ȳ
Positive covariance → direct relationship; negative → inverse; near zero → no linear relationship (though non-linear relationships may still exist).
The Pearson correlation coefficient r = Cov / (Vₓ · Vᵧ) is dimensionless and lives in [−1, 1]. A value of ±1 indicates an exact linear relationship.
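A minimal Python sketch of both quantities, using the shortcut form of the covariance and the fact that the variance of X is its covariance with itself. The data are invented for illustration.

```python
import math

def covariance(xs, ys):
    """Cov = (1/n) * sum(x_i * y_i) - mean_x * mean_y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

def pearson_r(xs, ys):
    """r = Cov / (V_x * V_y), where V is the standard deviation."""
    v_x = math.sqrt(covariance(xs, xs))
    v_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (v_x * v_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
cov = covariance(xs, ys)
r = pearson_r(xs, ys)
print(cov, r)
```

Here Cov = 66/5 − 3 · 4 = 1.2 and r = 1.2 / √(2 · 1.2) ≈ 0.775, a fairly strong direct linear relationship.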
Linear Regression
The model Y = β₀ + β₁X is fit by minimizing the sum of squared residuals (OLS). The estimates are:
β̂₁ = Cov / Vₓ²
β̂₀ = ȳ − β̂₁x̄
The residual for each observation is eᵢ = yᵢ − ŷᵢ.
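The two estimates translate directly into code. This is a sketch on made-up data, with `fit_line` as a hypothetical helper name.

```python
def fit_line(xs, ys):
    """OLS estimates (beta0, beta1) for the model Y = beta0 + beta1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    beta1 = cov / var_x
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta0, beta1 = fit_line(xs, ys)

# Residuals e_i = y_i - y_hat_i; with an intercept they sum to zero.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
print(beta0, beta1, sum(residuals))
```

For these points the fitted line is ŷ = 2.2 + 0.6x. The residuals summing to (numerically) zero is a useful check on hand calculations.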
Combinatorics
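Three families of counting objects (permutations, variations, combinations), where n is the number of available elements, r the number chosen, and n₁, ..., nₖ the multiplicities of repeated elements: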
| Type | Order? | Repetition? | Formula |
|---|---|---|---|
| Permutations | Yes | No | n! |
| Permutations with rep. | Yes | Yes (within counts) | n! / (n₁! · n₂! · ... · nₖ!) |
| Variations without rep. | Yes | No | n! / (n−r)! |
| Variations with rep. | Yes | Yes | nʳ |
| Combinations | No | No | n! / (r! · (n−r)!) |
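Each row of the table maps to a one-liner in Python's `math` module (`math.perm` and `math.comb` need Python 3.8 or later). The values n = 5, r = 3 and the word BANANA are just illustrative choices.

```python
import math

n, r = 5, 3

permutations = math.factorial(n)  # n!            -> 120
variations = math.perm(n, r)      # n! / (n-r)!   -> 60
variations_rep = n ** r           # n^r           -> 125
combinations = math.comb(n, r)    # n! / (r!(n-r)!) -> 10

def permutations_with_repetition(counts):
    """n! / (n_1! * ... * n_k!) where n = sum of the multiplicities."""
    total = math.factorial(sum(counts))
    for c in counts:
        total //= math.factorial(c)
    return total

# Distinct arrangements of "BANANA": A x3, N x2, B x1.
banana = permutations_with_repetition([3, 2, 1])  # 720 / 12 -> 60
print(permutations, variations, variations_rep, combinations, banana)
```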
Probability
Axioms
The Kolmogorov definition: probability is any function P satisfying:
- P(A) ≥ 0 for every event A
- P(E) = 1 (E = sample space)
- P(⋃ Aᵢ) = Σ P(Aᵢ) for pairwise disjoint events
The Laplace definition P(A) = |A| / |E| is the special case of equally likely outcomes.
Conditional Probability
P(A|B) = P(A ∩ B) / P(B)
Two events are independent when P(A|B) = P(A), or equivalently P(A ∩ B) = P(A) · P(B): the occurrence of B gives no information about A. Don't confuse independence with incompatibility: incompatible events (A ∩ B = ∅) are independent only if at least one of them has probability zero, since otherwise P(A ∩ B) = 0 ≠ P(A) · P(B).
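The difference between independence and incompatibility is easy to check on a fair six-sided die, using the Laplace definition and exact fractions. The events below are my own illustrative choices.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # fair six-sided die

def prob(event):
    """Laplace probability: favourable outcomes / possible outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

def conditional(a, b):
    """P(A | B) = P(A and B) / P(B), assuming P(B) > 0."""
    return prob(a & b) / prob(b)

even = {2, 4, 6}
at_most_4 = {1, 2, 3, 4}
odd = {1, 3, 5}

# Independent: knowing the roll is at most 4 does not change P(even).
independent = prob(even & at_most_4) == prob(even) * prob(at_most_4)

# Incompatible (disjoint) but NOT independent: P(even and odd) = 0.
incompatible_independent = prob(even & odd) == prob(even) * prob(odd)

print(conditional(even, at_most_4), independent, incompatible_independent)
```

Here P(even | at most 4) = 1/2 = P(even), so those two events are independent, while "even" and "odd" are incompatible and fail the product test (0 ≠ 1/4).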
Total Probability and Bayes' Theorem
Given a partition A₁, ..., Aₙ of the sample space and any event B:
P(B) = Σᵢ P(B|Aᵢ) · P(Aᵢ)
Bayes' theorem inverts this — it computes the probability of a cause given an observed effect:
P(Aᵢ|B) = P(B|Aᵢ) · P(Aᵢ) / Σⱼ P(B|Aⱼ) · P(Aⱼ)
This is sometimes called the "a posteriori probability" — you update your belief in the cause after observing the consequence.
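A worked example may help. Suppose, hypothetically, three machines produce 50%, 30%, and 20% of a factory's output with defect rates of 1%, 2%, and 3%. The Python sketch below computes the total probability of a defect and the posterior probability of each machine given a defect. All numbers and the helper name `bayes` are invented for illustration.

```python
from fractions import Fraction

def bayes(priors, likelihoods):
    """Return (P(B), [P(A_i | B)]) for a partition A_1, ..., A_n."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) * P(A_i)
    total = sum(joint)                                    # total probability P(B)
    return total, [j / total for j in joint]

# Hypothetical figures: share of output and defect rate per machine.
priors = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

p_defect, posteriors = bayes(priors, likelihoods)
print(p_defect, posteriors)
```

The defect probability is 17/1000, and the posteriors are 5/17, 6/17, and 6/17. The first machine makes half the output but is the least likely source of a given defect.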
Discrete Random Variables
A random variable X assigns a real number to each outcome in the sample space. For discrete variables:
- The mass function p(x) gives P(X = x) for each possible value
- The distribution function F(x) = P(X ≤ x) is non-decreasing and right-continuous, with limits 0 as x → −∞ and 1 as x → +∞
Expectation: E[X] = Σ xᵢ p(xᵢ)
Variance: Var[X] = E[X²] − (E[X])²
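As a concrete case, here are the expectation and variance of a fair die computed from its mass function with exact fractions. The example is my own, not from the course.

```python
from fractions import Fraction

# Mass function of a fair six-sided die: p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

expectation = sum(x * p for x, p in pmf.items())       # E[X]
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
variance = second_moment - expectation**2              # E[X^2] - (E[X])^2

print(expectation, second_moment, variance)
```

This gives E[X] = 7/2, E[X²] = 91/6, and variance 91/6 − 49/4 = 35/12.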
Key Discrete Distributions
Bernoulli — Be(p): a single trial, success probability p.
- E[X] = p, Var[X] = p(1−p)
Binomial — B(n, p): number of successes in n independent identical Bernoulli trials.
P(X=x) = C(n,x) · pˣ · (1−p)^(n−x)
- E[X] = np, Var[X] = np(1−p)
Geometric — Ge(p): number of trials until the first success.
P(X=x) = p(1−p)^(x−1), for x = 1, 2, ...
- E[X] = 1/p, Var[X] = (1−p)/p²
Poisson — Po(λ): count of events per unit time/space, average rate λ.
P(X=x) = e^(−λ) · λˣ / x!
- E[X] = λ, Var[X] = λ
When n ≥ 30 and p ≤ 0.1, B(n, p) can be approximated by Poisson with λ = np.
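To see how close the approximation is, the sketch below compares the two mass functions for n = 100 and p = 0.02, values I picked to satisfy the rule of thumb.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

n, p = 100, 0.02
lam = n * p

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))

# Largest pointwise gap between the two mass functions for x = 0..10.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11))
print(max_gap)
```

For these parameters the two columns agree to about two decimal places.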
Continuous Random Variables
For continuous variables, probabilities are computed as areas under the density function f(x):
P(a < X < b) = ∫ₐᵇ f(x) dx
The probability at any single point is 0, that is, P(X = x) = 0 for all x, so strict and non-strict inequalities give the same probability.
Uniform — U(a, b): constant density over (a, b).
- E[X] = (a + b) / 2, Var[X] = (b − a)² / 12
Exponential — Exp(λ): time until an event; used in reliability modeling.
f(x) = λ e^(−λx) for x > 0
- E[X] = 1/λ, Var[X] = 1/λ²
Normal — N(μ, σ²): the bell curve.
f(x) = (1 / (σ√(2π))) · exp(−(x−μ)² / (2σ²))
- E[X] = μ (also the mode and median), Var[X] = σ²
To compute probabilities, standardize: Z = (X − μ) / σ ~ N(0, 1), then look up in a table.
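In code, the table lookup can be replaced by the standard library's `statistics.NormalDist` (Python 3.8+). Note that it takes the standard deviation σ, not the variance σ². The parameters below are hypothetical (heights with μ = 170 and σ = 10).

```python
from statistics import NormalDist

mu, sigma = 170.0, 10.0  # hypothetical parameters; sigma is the std. deviation
a, b = 160.0, 185.0

# Standardize the endpoints, then use the N(0, 1) distribution function.
z_a = (a - mu) / sigma   # -1.0
z_b = (b - mu) / sigma   #  1.5
standard = NormalDist()  # N(0, 1)
prob = standard.cdf(z_b) - standard.cdf(z_a)

# Same probability computed directly on N(mu, sigma^2), as a cross-check.
original = NormalDist(mu, sigma)
direct = original.cdf(b) - original.cdf(a)

print(prob, direct)
```

Both routes give P(160 < X < 185) ≈ 0.7745, matching Φ(1.5) − Φ(−1) from a standard Normal table.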
Approximation Summary for B(n, p)
| Condition | Use |
|---|---|
| n ≥ 30, p ≤ 0.1 | Poisson with λ = np |
| n ≥ 30, 0.1 < p < 0.9 | Normal N(np, np(1−p)) |
| n ≥ 30, p ≥ 0.9 | Count failures n − X, apply Poisson with λ = n(1−p) |
When approximating a discrete (integer-valued) distribution X with a continuous one Y, apply a continuity correction: P(X ≤ x) is approximated by P(Y ≤ x + 0.5), and P(X ≥ x) by P(Y ≥ x − 0.5).
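The effect of the correction can be checked numerically. The sketch below compares the exact binomial probability P(X ≤ 18) for B(40, 0.5) with the Normal approximation, with and without the +0.5 shift. The parameters are my own choice.

```python
import math
from statistics import NormalDist

n, p = 40, 0.5
x = 18

# Exact binomial probability P(X <= x).
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

# Normal approximation N(np, np(1-p)); NormalDist takes the std. deviation.
approx = NormalDist(n * p, math.sqrt(n * p * (1 - p)))
without_correction = approx.cdf(x)
with_correction = approx.cdf(x + 0.5)

print(exact, without_correction, with_correction)
```

The exact value is about 0.318. The corrected approximation lands within about 0.001 of it, while the uncorrected one is off by roughly 0.05.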