Probability Distributions and Random Variables - Machine Learning Courses

In this lab we get acquainted with the fundamentals of probability theory by working with discrete and continuous random variables. We visualize several probability distributions and compute probabilities using Python.

Learning objectives:

Define and simulate random variables
Implement Probability Mass Functions (PMF) and Probability Density Functions (PDF)
Use discrete and continuous uniform distributions
Compute and visualize probabilities

import numpy as np
import pandas as pd
import plotly.express as px
from scipy import stats

# Set random seed for reproducibility
rng = np.random.default_rng(42)

Discrete Random Variables¶

Fair die¶

A classic example of a discrete random variable is the outcome of a die roll. For a fair die, every value is equally likely: $P(x) = \frac{1}{6}$.

✍️¶

Implement the PMF for a fair die and visualize the distribution.

# Define outcomes for a fair six-sided die
outcomes = np.arange(1, 7)

# Calculate probabilities (uniform distribution)
probabilities = np.ones(6) / 6

print("Probability Mass Function (PMF):")
for outcome, prob in zip(outcomes, probabilities, strict=False):
    print(f"P(x={outcome}) = {prob:.4f}")

# Verify that probabilities sum to 1
print(f"\nSum of probabilities: {probabilities.sum():.1f}")

Probability Mass Function (PMF):
P(x=1) = 0.1667
P(x=2) = 0.1667
P(x=3) = 0.1667
P(x=4) = 0.1667
P(x=5) = 0.1667
P(x=6) = 0.1667

Sum of probabilities: 1.0

# Visualize the PMF
df = pd.DataFrame({"Outcome": outcomes, "Probability": probabilities})
fig = px.bar(df, x="Outcome", y="Probability", title="PMF: Fair Six-Sided Die")
fig.show()
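
To connect the PMF to simulation, the sketch below rolls a fair die many times and compares the observed relative frequencies with $1/6$. The names `sim_rng`, `n_rolls` and `empirical_pmf`, as well as the seed and the number of rolls, are illustrative choices and not part of the lab.

```python
import numpy as np

# Illustrative seed and sample size (assumptions, not from the lab)
sim_rng = np.random.default_rng(0)
n_rolls = 60_000

# Draw integers 1..6 with equal probability (the upper bound is exclusive)
rolls = sim_rng.integers(1, 7, size=n_rolls)

# Count occurrences of each face; index 0 is never rolled, so drop it
counts = np.bincount(rolls, minlength=7)[1:]
empirical_pmf = counts / n_rolls

for face, freq in zip(range(1, 7), empirical_pmf):
    print(f"P(x={face}) ≈ {freq:.4f} (theoretical: {1 / 6:.4f})")
```

With more rolls, the relative frequencies should move closer to the theoretical value of $1/6$.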

✍️¶

Implement the PMF for an unfair die with $P(6) = 2 * P(other)$ and visualize the distribution.

\begin{align} 5 * P(other) + 2 * P(other) &= 1 \\ 7 * P(other) &= 1 \\ P(other) &= 1/7 \\ P(6) &= 2/7 \end{align}

(1)

# Calculate probabilities for unfair die

prob_other = 1 / 7
prob_six = 2 / 7

unfair_probabilities = np.array([prob_other] * 5 + [prob_six])

print("Unfair Die PMF:")
for outcome, prob in zip(outcomes, unfair_probabilities, strict=False):
    print(f"P(x={outcome}) = {prob:.4f}")

print(f"\nSum of probabilities: {unfair_probabilities.sum():.1f}")

Unfair Die PMF:
P(x=1) = 0.1429
P(x=2) = 0.1429
P(x=3) = 0.1429
P(x=4) = 0.1429
P(x=5) = 0.1429
P(x=6) = 0.2857

Sum of probabilities: 1.0

# Compare fair vs unfair die
df = pd.DataFrame(
    {
        "Outcome": list(outcomes) + list(outcomes),
        "Probability": list(probabilities) + list(unfair_probabilities),
        "Type": ["Fair"] * 6 + ["Unfair"] * 6,
    }
)
fig = px.bar(
    df, x="Outcome", y="Probability", color="Type", barmode="group", title="Fair vs Unfair Die"
)
fig.show()

✍️¶

Implement and visualize the PMF for the following probability distribution. We are dealing with a traffic light with three states: red, orange and green. On average, it stays green twice as long as red, and orange one third as long as red.

\begin{align} P(red) + P(orange) + P(green) &= 1 \\ P(red) + \frac{1}{3}P(red) + 2P(red) &= 1 \\ (1 + \frac{1}{3} + 2) * P(red) &= 1 \\ \frac{10}{3} * P(red) &= 1 \\ P(red) &= \frac{3}{10} = 0.3 \\ P(orange) &= \frac{1}{3}P(red) = 0.1 \\ P(green) &= 2P(red) = 0.6 \end{align}

(2)

p_red = 0.3
p_orange = 0.1
p_green = 0.6

fig = px.bar(
    x=["red", "orange", "green"],
    y=[p_red, p_orange, p_green],
    labels={"x": "State", "y": "Probability"},
    color=["red", "orange", "green"],
    color_discrete_map={"red": "red", "orange": "orange", "green": "green"},
)
fig.show()
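
Both the unfair die and the traffic light follow the same pattern: relative weights are divided by their sum so that the probabilities add up to 1. A minimal sketch of that idea follows; the helper `normalize_weights` is a hypothetical name introduced here for illustration.

```python
import numpy as np


def normalize_weights(weights):
    """Turn non-negative relative weights into probabilities that sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum()


# Unfair die: a six is twice as likely as any other face
unfair_pmf = normalize_weights([1, 1, 1, 1, 1, 2])

# Traffic light, expressed relative to red: orange = 1/3, green = 2
light_pmf = normalize_weights([1, 1 / 3, 2])  # order: red, orange, green

print("Unfair die:   ", np.round(unfair_pmf, 4))
print("Traffic light:", np.round(light_pmf, 4))
```

This reproduces the hand-derived values ($1/7$ and $2/7$ for the die; 0.3, 0.1 and 0.6 for the traffic light) without solving the equation manually.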

Computing probabilities with the sum rule¶

The sum rule states: $P(x \in A \cup B) = P(x \in A) + P(x \in B) - P(x \in A \cap B)$

For non-overlapping (mutually exclusive) events, it suffices to add the individual probabilities.

✍️¶

For the fair die, compute:

The probability of an odd outcome
The probability of an outcome greater than 4
The probability of an odd outcome OR an outcome greater than 4

# 1. Probability of odd outcome
odd_outcomes = [1, 3, 5]
p_odd = sum(probabilities[i - 1] for i in odd_outcomes)
print(f"P(odd) = P(x∈{{1,3,5}}) = {p_odd:.4f}")

# 2. Probability of outcome > 4
greater_than_4 = [5, 6]
p_greater_4 = sum(probabilities[i - 1] for i in greater_than_4)
print(f"P(x > 4) = P(x∈{{5,6}}) = {p_greater_4:.4f}")

# 3. Probability of odd OR greater than 4
# Outcomes: {1, 3, 5, 6}
# Note: 5 is in both sets, but we count it only once
union_outcomes = [1, 3, 5, 6]
p_union = sum(probabilities[i - 1] for i in union_outcomes)
print(f"P(odd ∪ >4) = P(x∈{{1,3,5,6}}) = {p_union:.4f}")

# Verify using sum rule
intersection_outcomes = [5]  # outcomes that are both odd AND > 4
p_intersection = sum(probabilities[i - 1] for i in intersection_outcomes)
p_union_formula = p_odd + p_greater_4 - p_intersection
print("\nVerification using sum rule:")
print("P(odd ∪ >4) = P(odd) + P(>4) - P(odd ∩ >4)")
print(f"            = {p_odd:.4f} + {p_greater_4:.4f} - {p_intersection:.4f}")
print(f"            = {p_union_formula:.4f}")

P(odd) = P(x∈{1,3,5}) = 0.5000
P(x > 4) = P(x∈{5,6}) = 0.3333
P(odd ∪ >4) = P(x∈{1,3,5,6}) = 0.6667

Verification using sum rule:
P(odd ∪ >4) = P(odd) + P(>4) - P(odd ∩ >4)
            = 0.5000 + 0.3333 - 0.1667
            = 0.6667
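
The same check can be written with Python sets, where `|` is the union and `&` the intersection. The sketch below uses exact fractions to avoid rounding; the names `pmf` and `prob` are illustrative and not part of the lab.

```python
from fractions import Fraction

# Fair die: every outcome has probability 1/6 (exact arithmetic)
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return sum((pmf[x] for x in event), Fraction(0))


odd = {1, 3, 5}
greater_than_4 = {5, 6}

# Left-hand side: probability of the union, counting each outcome once
lhs = prob(odd | greater_than_4)
# Right-hand side: sum rule with the overlap subtracted
rhs = prob(odd) + prob(greater_than_4) - prob(odd & greater_than_4)

print(f"P(A ∪ B) = {lhs}, sum rule = {rhs}")
```

Because the arithmetic is exact, both sides can be compared with `==` rather than with a tolerance.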

Continuous Random Variables¶

For continuous random variables we use a Probability Density Function (PDF) instead of a PMF. The PDF does not give a probability directly but a probability density; to obtain an actual probability, we integrate the PDF over an interval.

Uniform distribution¶

A continuous uniform distribution on the interval $[a, b]$ has PDF:

u(x; a, b) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a,b] \\ 0 & \text{otherwise} \end{cases}

(3)

✍️¶

Implement and visualize a uniform distribution on the interval $[2, 8]$.

# Define uniform distribution parameters
a, b = 2, 8

# Create x values for plotting
x = np.linspace(0, 10, 1000)
pdf = np.where((x >= a) & (x <= b), 1 / (b - a), 0)

# Plot the PDF
df = pd.DataFrame({"x": x, "p(x)": pdf})
fig = px.line(df, x="x", y="p(x)", title="Continuous Uniform Distribution PDF")
fig.show()

print(f"PDF height in [{a},{b}]: {1 / (b - a):.4f}")
print(f"Total probability (integral over domain): {(b - a) * (1 / (b - a)):.1f}")

PDF height in [2,8]: 0.1667
Total probability (integral over domain): 1.0

✍️¶

Compute the probability that $x \in [3, 5]$ for this uniform distribution. Do this both with the formula and with scipy.stats.

# Manual calculation
interval_start, interval_end = 3, 5
p_interval_manual = (interval_end - interval_start) / (b - a)
print("Manual calculation:")
print(f"P(x ∈ [{interval_start},{interval_end}]) = ({interval_end}-{interval_start})/({b}-{a}) = {p_interval_manual:.4f}")

# Using scipy.stats
uniform_dist = stats.uniform(loc=a, scale=b - a)
p_interval_scipy = uniform_dist.cdf(interval_end) - uniform_dist.cdf(interval_start)
print("\nUsing scipy.stats:")
print(f"P(x ∈ [{interval_start},{interval_end}]) = {p_interval_scipy:.4f}")

Manual calculation:
P(x ∈ [3,5]) = (5-3)/(8-2) = 0.3333

Using scipy.stats:
P(x ∈ [3,5]) = 0.3333

# Visualize the probability
df_interval = df[(df["x"] >= interval_start) & (df["x"] <= interval_end)]

# Show full distribution
fig = px.line(
    df, x="x", y="p(x)", title=f"P(x ∈ [{interval_start},{interval_end}]) = {p_interval_manual:.3f}"
)
# Add highlighted interval
fig.add_scatter(
    x=df_interval["x"], y=df_interval["p(x)"], fill="tozeroy", name="Probability", mode="lines"
)
fig.show()
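
Since a probability is the integral of the PDF over an interval, the result can also be checked by numerical integration. The sketch below uses `scipy.integrate.quad` on the uniform distribution on $[2, 8]$; the variable names are illustrative.

```python
from scipy import integrate, stats

# Uniform distribution on [2, 8], matching the exercise
dist = stats.uniform(loc=2, scale=6)

# Integrate the density over [3, 5]; quad returns (value, error estimate)
p_numeric, abs_error = integrate.quad(dist.pdf, 3, 5)
p_cdf = dist.cdf(5) - dist.cdf(3)

print(f"Numerical integral: {p_numeric:.4f}")
print(f"CDF difference:     {p_cdf:.4f}")
```

For distributions without a simple closed-form CDF, numerical integration of the PDF is one possible fallback, although the CDF is usually the more accurate route when it is available.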

PDF values can exceed 1¶

An important property of PDFs is that the function value can be greater than 1, because it is a density and not a probability.

✍️¶

Create a uniform distribution on the interval $[0, 0.5]$ and show that the PDF value is greater than 1 while the total probability is still 1.

# Narrow uniform distribution
a_narrow, b_narrow = 0, 0.5

x_narrow = np.linspace(-0.5, 1, 1000)
pdf_narrow = np.where((x_narrow >= a_narrow) & (x_narrow <= b_narrow), 1 / (b_narrow - a_narrow), 0)

pdf_height = 1 / (b_narrow - a_narrow)
print(f"PDF height: {pdf_height:.4f}")
print(f"This is greater than 1: {pdf_height > 1}")
print("\nBut the total probability is:")
print(
    f"∫ p(x)dx = {pdf_height:.1f} × {b_narrow - a_narrow:.1f} = {pdf_height * (b_narrow - a_narrow):.1f}"
)

# Visualize
df = pd.DataFrame({"x": x_narrow, "p(x)": pdf_narrow})
fig = px.area(df, x="x", y="p(x)", title="PDF Values Can Exceed 1")
fig.show()

PDF height: 2.0000
This is greater than 1: True

But the total probability is:
∫ p(x)dx = 2.0 × 0.5 = 1.0
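
The same effect appears for any sufficiently narrow distribution, not only the uniform one. As an illustration, the sketch below uses a normal distribution with standard deviation 0.1 (an arbitrary choice made here, not taken from the lab).

```python
import numpy as np
from scipy import stats

# Illustrative choice: a normal distribution with a small standard deviation
sigma_narrow = 0.1
narrow_normal = stats.norm(loc=0, scale=sigma_narrow)

# Peak density is 1 / (sigma * sqrt(2*pi)), which exceeds 1 when sigma is small
peak_density = narrow_normal.pdf(0)

# Probability within 10 standard deviations of the mean (essentially all of it)
total_mass = narrow_normal.cdf(1) - narrow_normal.cdf(-1)

print(f"Peak density: {peak_density:.4f}")
print(f"Probability within [-1, 1]: {total_mass:.6f}")
```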

Normal distribution (Gaussian distribution)¶

The normal distribution is one of the most important continuous distributions in statistics and machine learning:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}

(4)

where $\mu$ is the mean and $\sigma$ the standard deviation.
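
To make the formula concrete, the sketch below evaluates it directly with NumPy and compares the result with `scipy.stats.norm.pdf`. The function name `normal_pdf` and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def normal_pdf(x, mu, sigma):
    """Evaluate the normal density directly from its formula."""
    coefficient = 1 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coefficient * np.exp(exponent)


x_check = np.linspace(-6, 8, 200)
manual = normal_pdf(x_check, mu=2, sigma=1.5)
reference = stats.norm.pdf(x_check, loc=2, scale=1.5)

print(f"Maximum absolute difference: {np.max(np.abs(manual - reference)):.2e}")
```

Note that `scipy.stats.norm` takes the standard deviation as `scale`, not the variance.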

✍️¶

Visualize normal distributions with different parameters and compute probabilities.

# Define different normal distributions
x = np.linspace(-6, 8, 1000)

# Create dataframe with all distributions
data = []
for mu, sigma, label in [(0, 1, "N(0,1²)"), (0, 2, "N(0,2²)"), (2, 1, "N(2,1²)")]:
    pdf = stats.norm.pdf(x, mu, sigma)
    data.extend(
        [{"x": xi, "p(x)": pi, "Distribution": label} for xi, pi in zip(x, pdf, strict=False)]
    )

df = pd.DataFrame(data)
fig = px.line(
    df,
    x="x",
    y="p(x)",
    color="Distribution",
    title="Normal (Gaussian) Distributions with Different Parameters",
)
fig.show()

✍️¶

For a standard normal distribution $\mathcal{N}(0, 1^2)$, compute:

$P(x \leq 0)$
$P(-1 \leq x \leq 1)$
$P(x > 2)$

# Standard normal distribution
std_normal = stats.norm(loc=0, scale=1)

# 1. P(x ≤ 0)
p1 = std_normal.cdf(0)
print(f"1. P(x ≤ 0) = {p1:.4f}")

# 2. P(-1 ≤ x ≤ 1)
p2 = std_normal.cdf(1) - std_normal.cdf(-1)
print(f"2. P(-1 ≤ x ≤ 1) = {p2:.4f}")

# 3. P(x > 2)
p3 = 1 - std_normal.cdf(2)
print(f"3. P(x > 2) = {p3:.4f}")

# Visualize each probability
x_plot = np.linspace(-4, 4, 1000)
pdf_plot = std_normal.pdf(x_plot)
df = pd.DataFrame({"x": x_plot, "p(x)": pdf_plot})

# 1. P(x ≤ 0)
df_p1 = df[df["x"] <= 0]
fig = px.line(df, x="x", y="p(x)", title=f"P(x ≤ 0) = {p1:.4f}")
fig.add_scatter(x=df_p1["x"], y=df_p1["p(x)"], fill="tozeroy", name="Probability")
fig.show()

# 2. P(-1 ≤ x ≤ 1)
df_p2 = df[(df["x"] >= -1) & (df["x"] <= 1)]
fig = px.line(df, x="x", y="p(x)", title=f"P(-1 ≤ x ≤ 1) = {p2:.4f}")
fig.add_scatter(x=df_p2["x"], y=df_p2["p(x)"], fill="tozeroy", name="Probability")
fig.show()

# 3. P(x > 2)
df_p3 = df[df["x"] > 2]
fig = px.line(df, x="x", y="p(x)", title=f"P(x > 2) = {p3:.4f}")
fig.add_scatter(x=df_p3["x"], y=df_p3["p(x)"], fill="tozeroy", name="Probability")
fig.show()

1. P(x ≤ 0) = 0.5000
2. P(-1 ≤ x ≤ 1) = 0.6827
3. P(x > 2) = 0.0228

Empirical vs. Theoretical Distributions¶

In practice we work with samples. We can visualize the empirical distribution with histograms and compare it with theoretical distributions.

✍️¶

Draw a sample from a normal distribution and examine how the choice of bins affects the empirical distribution.

# Population parameters
mu, sigma = 175, 10  # height in cm

# Draw samples and compare
sample_sizes = [50, 500]
x_theory = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 1000)
y_theory = stats.norm.pdf(x_theory, mu, sigma)

# Create sample
sample = rng.normal(mu, sigma, sample_sizes[1])

# Plot histogram
fig = px.histogram(
    sample, nbins=20, histnorm="probability density", title="Empirical vs Theoretical Distribution"
)
# Add theoretical line
df_theory = pd.DataFrame({"Height (cm)": x_theory, "Density": y_theory})
fig.add_scatter(
    x=df_theory["Height (cm)"], y=df_theory["Density"], mode="lines", name="Theoretical"
)
fig.show()

✍️¶

Investigate the effect of different bin sizes on the interpretation of a histogram.

# Fixed sample size, varying bins
sample_size = 200
sample = rng.normal(mu, sigma, sample_size)

# Try different bin sizes
for n_bins in [5, 20, 50]:
    fig = px.histogram(
        sample, nbins=n_bins, histnorm="probability density", title=f"Histogram with {n_bins} bins"
    )
    fig.show()
    bin_width = (sample.max() - sample.min()) / n_bins
    print(f"Number of bins: {n_bins}, Bin width: {bin_width:.1f} cm")

print(f"\nSample mean: {np.mean(sample):.2f} cm (theoretical: {mu} cm)")
print(f"Sample std: {np.std(sample):.2f} cm (theoretical: {sigma} cm)")

Number of bins: 5, Bin width: 11.0 cm

Number of bins: 20, Bin width: 2.8 cm

Number of bins: 50, Bin width: 1.1 cm

Sample mean: 173.98 cm (theoretical: 175 cm)
Sample std: 10.39 cm (theoretical: 10 cm)
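
A density-normalized histogram rescales the bar heights so that the total area equals 1, whatever the number of bins. The sketch below checks this with `np.histogram(..., density=True)`; the seed and the names `hist_rng`, `heights` and `areas` are illustrative, and the numbers will differ from the plotly histograms above.

```python
import numpy as np

hist_rng = np.random.default_rng(0)  # illustrative seed
heights = hist_rng.normal(175, 10, 200)

areas = []
for n_bins in [5, 20, 50]:
    # density=True rescales counts so that the bars integrate to 1
    density, edges = np.histogram(heights, bins=n_bins, density=True)
    widths = np.diff(edges)
    area = np.sum(density * widths)
    areas.append(area)
    print(f"{n_bins:2d} bins: max density = {density.max():.4f}, total area = {area:.4f}")
```

With few bins the histogram hides detail; with many bins the bar heights fluctuate more because each bin holds fewer observations, while the total area stays at 1.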

Binomial distribution¶

The binomial distribution describes the number of successes in $n$ independent experiments with success probability $p$.

✍️¶

Simulate 100 tosses of a fair coin ($p=0.5$, $n=100$) and visualize the distribution of the number of “heads”.

# Binomial distribution: 100 coin flips
n_trials = 100
p_success = 0.5
n_experiments = 1000

# Simulate multiple experiments
results = [rng.binomial(n_trials, p_success) for _ in range(n_experiments)]

# Visualize
fig = px.histogram(
    results, nbins=30, histnorm="probability", title=f"Binomial Distribution: {n_trials} coin flips"
)
fig.show()

# Compare with theoretical distribution
x_theory = np.arange(0, n_trials + 1)
pmf_theory = stats.binom.pmf(x_theory, n_trials, p_success)

print(f"Theoretical mean: {n_trials * p_success:.1f}")
print(f"Empirical mean: {np.mean(results):.2f}")
print(f"Theoretical std: {np.sqrt(n_trials * p_success * (1 - p_success)):.2f}")
print(f"Empirical std: {np.std(results):.2f}")

Theoretical mean: 50.0
Empirical mean: 49.97
Theoretical std: 5.00
Empirical std: 5.14
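
Beyond comparing means and standard deviations, the empirical frequencies can be compared with the theoretical PMF value by value. The sketch below is self-contained; the seed and the names `binom_rng`, `heads`, `empirical` and `theoretical` are illustrative.

```python
import numpy as np
from scipy import stats

n_flips, p_heads, n_runs = 100, 0.5, 1000
binom_rng = np.random.default_rng(0)  # illustrative seed

# One draw per experiment: number of heads in n_flips tosses
heads = binom_rng.binomial(n_flips, p_heads, size=n_runs)

# Relative frequency of each possible count 0..n_flips
empirical = np.bincount(heads, minlength=n_flips + 1) / n_runs
theoretical = stats.binom.pmf(np.arange(n_flips + 1), n_flips, p_heads)

max_gap = np.max(np.abs(empirical - theoretical))
print(f"Largest gap between empirical and theoretical PMF: {max_gap:.4f}")
```

Increasing `n_runs` should shrink the gap, since each relative frequency is itself an average over more experiments.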

Exponential distribution¶

The exponential distribution models the time between events in a Poisson process (e.g., the time between arrivals).

PDF: $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$

✍️¶

Visualize exponential distributions with different rate parameters $\lambda \in \{0.5, 1, 2\}$ and compute $P(X > 2)$ for each.

# Exponential distributions with different rate parameters
x = np.linspace(0, 8, 1000)

# Create dataframe with all distributions
data = []
for lambda_rate in [0.5, 1, 2]:
    pdf = stats.expon.pdf(x, scale=1 / lambda_rate)
    data.extend(
        [{"x": xi, "p(x)": pi, "λ": f"λ={lambda_rate}"} for xi, pi in zip(x, pdf, strict=False)]
    )

df = pd.DataFrame(data)
fig = px.line(df, x="x", y="p(x)", color="λ", title="Exponential Distributions")
fig.show()

# Calculate P(X > 2) for each lambda
print("Probability that X > 2:")
for lambda_rate in [0.5, 1, 2]:
    exp_dist = stats.expon(scale=1 / lambda_rate)
    p_greater_2 = 1 - exp_dist.cdf(2)
    print(f"λ={lambda_rate}: P(X > 2) = {p_greater_2:.4f}")

Probability that X > 2:
λ=0.5: P(X > 2) = 0.3679
λ=1: P(X > 2) = 0.1353
λ=2: P(X > 2) = 0.0183
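
Integrating the PDF gives the CDF $1 - e^{-\lambda x}$, so the tail probability has the closed form $P(X > t) = e^{-\lambda t}$. The sketch below compares this with the survival function `sf` of `scipy.stats.expon`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

t = 2
lambdas = np.array([0.5, 1.0, 2.0])

# Closed form: P(X > t) = exp(-lambda * t)
closed_form = np.exp(-lambdas * t)

# scipy parametrizes the exponential by scale = 1 / lambda; sf is 1 - CDF
via_scipy = stats.expon.sf(t, scale=1 / lambdas)

for lam, cf, sp in zip(lambdas, closed_form, via_scipy):
    print(f"λ={lam}: closed form = {cf:.4f}, scipy sf = {sp:.4f}")
```

Using `sf` instead of `1 - cdf` tends to be numerically safer far out in the tail, where the CDF is very close to 1.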

Simulation and the law of large numbers¶

In ML Principles we already saw an illustration of the Law of Large Numbers.

✍️¶

Draw samples of different sizes from a uniform distribution $\mathcal{U}(0, 10)$ and compute the sample mean. Show that the mean converges to the theoretical expected value as the sample grows.

# Law of Large Numbers demonstration
a_uniform, b_uniform = 0, 10
theoretical_mean = (a_uniform + b_uniform) / 2

# Different sample sizes
sample_sizes = [10, 50, 100, 500, 1000, 5000, 10000]
sample_means = []

for n in sample_sizes:
    sample = rng.uniform(a_uniform, b_uniform, n)
    sample_means.append(np.mean(sample))
    print(f"n={n:5d}: mean = {sample_means[-1]:.4f}")

print(f"\nTheoretical mean: {theoretical_mean:.1f}")

# Visualize convergence
df = pd.DataFrame({"Sample Size": sample_sizes, "Sample Mean": sample_means})
fig = px.line(df, x="Sample Size", y="Sample Mean", title="Law of Large Numbers")
fig.add_hline(y=theoretical_mean, line_dash="dash", annotation_text="Theoretical mean")
fig.show()

n=   10: mean = 7.5263
n=   50: mean = 5.3856
n=  100: mean = 4.8818
n=  500: mean = 4.9201
n= 1000: mean = 4.7795
n= 5000: mean = 4.9974
n=10000: mean = 5.0299

Theoretical mean: 5.0
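
The run above draws a fresh sample for each size, so the means do not approach 5 monotonically. An alternative view, sketched below, tracks the running mean of a single growing sample; the seed and the names `lln_rng`, `draws` and `running_mean` are illustrative.

```python
import numpy as np

lln_rng = np.random.default_rng(0)  # illustrative seed
n_total = 100_000
draws = lln_rng.uniform(0, 10, n_total)

# Running mean after 1, 2, ..., n_total draws from the same sample
running_mean = np.cumsum(draws) / np.arange(1, n_total + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n={n:6d}: running mean = {running_mean[n - 1]:.4f}")
```

The fluctuations around the theoretical mean of 5 shrink roughly in proportion to $1/\sqrt{n}$, which is why early values can still be noticeably off.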