In this lab we study how to characterize probability distributions using summary statistics. We compute expected values, variances, covariances, and correlations.
Learning objectives:
Compute and interpret the expected value (mean)
Use variance and standard deviation as measures of dispersion
Determine the median, mode, and percentiles
Compute covariance and correlation for linear association
Understand the difference between uncorrelated and independent
import numpy as np
import pandas as pd
import plotly.express as px
from scipy import stats
# Set random seed
rng = np.random.default_rng(42)
Measures of Central Tendency¶
Measures of central tendency indicate the “center of mass” of a probability distribution.
Expected Value¶
The expected value $E[X]$, also written $\mu$, is the mean we expect over infinitely many observations.
Discrete: $E[X] = \sum_x x \, P(X = x)$
Continuous: $E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$
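The continuous formula can be checked numerically. A minimal sketch, using an exponential distribution with scale 2 (an arbitrary choice for illustration) whose exact mean is known to be 2:

```python
import numpy as np
from scipy import stats, integrate

# E[X] = ∫ x f(x) dx, evaluated by numerical integration,
# compared against scipy's closed-form mean.
dist = stats.expon(scale=2)
expectation, _ = integrate.quad(lambda x: x * dist.pdf(x), 0, np.inf)
print(round(expectation, 4))  # 2.0
print(dist.mean())            # 2.0
```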
✍️¶
The theoretical (population) distribution of the number of customers a particular shop receives per hour looks like this:
P(5 customers) = 0.10
P(10 customers) = 0.20
P(15 customers) = 0.35
P(20 customers) = 0.25
P(25 customers) = 0.10
Compute the expected value (the average number of customers expected per hour).
# Number of customers per hour distribution
customers = np.array([5, 10, 15, 20, 25])
probabilities = np.array([0.10, 0.20, 0.35, 0.25, 0.10])
expectation = np.sum(customers * probabilities)
print(f"\nExpected number of customers per hour E[x] = {expectation:.2f}")
Expected number of customers per hour E[x] = 15.25
# Visualize with expected value
df = pd.DataFrame({"Customers": customers, "Probability": probabilities})
fig = px.bar(df, x="Customers", y="Probability", title="Distribution of Customers per Hour")
fig.add_vline(
x=expectation,
line_dash="dash",
line_color="red",
annotation_text=f"E[x] = {expectation:.1f}",
annotation_position="top right",
)
fig.update_layout(
yaxis_range=[0, 0.40], xaxis_title="Number of Customers", yaxis_title="Probability"
)
fig.show()
✍️¶
Simulate 10,000 hours and show that the sample mean converges to the expected value (Law of Large Numbers).
# Simulate customer counts over many hours
n_hours = 10000
observed_customers = rng.choice(customers, size=n_hours, p=probabilities)
# Calculate cumulative average
cumulative_avg = np.cumsum(observed_customers) / np.arange(1, n_hours + 1)
# Plot convergence
df = pd.DataFrame({"Number of Hours": range(1, n_hours + 1), "Average Customers": cumulative_avg})
fig = px.line(
df,
x="Number of Hours",
y="Average Customers",
title="Law of Large Numbers: Sample Mean Converges to E[x]",
)
fig.add_hline(
y=expectation,
line_dash="dash",
line_color="red",
annotation_text=f"E[x] = {expectation:.1f}",
annotation_position="right",
)
fig.update_layout(
yaxis_range=[expectation - 5, expectation + 5], yaxis_title="Average Customers per Hour"
)
fig.show()
Median and Mode¶
Median: the value with 50% of the probability mass on either side
Mode: the most likely value (highest probability for a discrete distribution, maximum of the PDF for a continuous one)
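For a discrete distribution both measures follow directly from the probability table. A small sketch, restating the customer distribution from above so the snippet is self-contained:

```python
import numpy as np

values = np.array([5, 10, 15, 20, 25])
probs = np.array([0.10, 0.20, 0.35, 0.25, 0.10])

# Mode: the value with the highest probability
mode = values[np.argmax(probs)]
# Median: the first value where the CDF reaches 0.5
cdf = np.cumsum(probs)
median = values[np.searchsorted(cdf, 0.5)]
print(mode, median)  # 15 15
```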
✍️¶
Compare the expected value, median, and mode for the following three distributions:
Symmetric: normal distribution $N(0, 1)$
Right-skewed: log-normal distribution with parameters $\mu = 0$, $\sigma = 0.8$
Left-skewed: beta distribution with parameters $\alpha = 8$, $\beta = 2$
For each distribution, compute the mean, median, and mode, and visualize them.
# Define three specific distributions to compare
x_range = np.linspace(-4, 6, 1000)
# 1. Symmetric: Normal distribution N(0, 1)
dist1 = stats.norm(loc=0, scale=1)
pdf1 = dist1.pdf(x_range)
mean1 = dist1.mean()
median1 = dist1.median()
mode1_idx = np.argmax(pdf1)
mode1 = x_range[mode1_idx]
# 2. Right-skewed: Log-Normal distribution
dist2 = stats.lognorm(s=0.8, scale=1)
x_range2 = np.linspace(0, 6, 1000)
pdf2 = dist2.pdf(x_range2)
mean2 = dist2.mean()
median2 = dist2.median()
mode2_idx = np.argmax(pdf2)
mode2 = x_range2[mode2_idx]
# 3. Left-skewed: Beta distribution Beta(8, 2)
dist3 = stats.beta(a=8, b=2)
x_range3 = np.linspace(0, 1, 1000)
pdf3 = dist3.pdf(x_range3)
mean3 = dist3.mean()
median3 = dist3.median()
mode3_idx = np.argmax(pdf3)
mode3 = x_range3[mode3_idx]
# Create dataframes for each distribution
df1 = pd.DataFrame({"x": x_range, "Density": pdf1, "Distribution": "Symmetric: N(0,1)"})
df1_stats = pd.DataFrame(
{
"x": [mean1, median1, mode1],
"y": [0, 0, 0],
"Measure": ["Mean", "Median", "Mode"],
"Value": [mean1, median1, mode1],
}
)
df2 = pd.DataFrame({"x": x_range2, "Density": pdf2, "Distribution": "Right-skewed: LogNormal"})
df2_stats = pd.DataFrame(
{
"x": [mean2, median2, mode2],
"y": [0, 0, 0],
"Measure": ["Mean", "Median", "Mode"],
"Value": [mean2, median2, mode2],
}
)
df3 = pd.DataFrame({"x": x_range3, "Density": pdf3, "Distribution": "Left-skewed: Beta(8,2)"})
df3_stats = pd.DataFrame(
{
"x": [mean3, median3, mode3],
"y": [0, 0, 0],
"Measure": ["Mean", "Median", "Mode"],
"Value": [mean3, median3, mode3],
}
)
# Plot 1: Symmetric
fig1 = px.area(df1, x="x", y="Density", title="Symmetric: N(0,1)")
for _, row in df1_stats.iterrows():
fig1.add_vline(
x=row["x"],
line_dash="dash",
annotation_text=f"{row['Measure']}={row['Value']:.2f}",
annotation_position="top",
)
fig1.show()
# Plot 2: Right-skewed
fig2 = px.area(
df2, x="x", y="Density", title="Right-skewed: LogNormal", color_discrete_sequence=["green"]
)
for _, row in df2_stats.iterrows():
fig2.add_vline(
x=row["x"],
line_dash="dash",
annotation_text=f"{row['Measure']}={row['Value']:.2f}",
annotation_position="top",
)
fig2.show()
# Plot 3: Left-skewed
fig3 = px.area(
df3, x="x", y="Density", title="Left-skewed: Beta(8,2)", color_discrete_sequence=["purple"]
)
for _, row in df3_stats.iterrows():
fig3.add_vline(
x=row["x"],
line_dash="dash",
annotation_text=f"{row['Measure']}={row['Value']:.2f}",
annotation_position="top",
)
fig3.show()
Measures of Dispersion¶
Measures of dispersion indicate how strongly values vary around the center.
Variance and Standard Deviation¶
Population variance (theoretical distribution): $\sigma^2 = E[(X - \mu)^2]$
Sample variance (empirical data): $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation: $\sigma$ (population) or $s$ (sample), the square root of the variance
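Note that NumPy divides by $n$ by default (population variance) and only divides by $n - 1$ when you pass `ddof=1` (sample variance). A quick sketch of the difference:

```python
import numpy as np

data = np.array([5.0, 10.0, 15.0, 20.0, 25.0])  # mean = 15
pop_var = np.var(data)             # ddof=0: divide by n
sample_var = np.var(data, ddof=1)  # divide by n - 1
print(pop_var, sample_var)  # 50.0 62.5
```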
✍️¶
First compute the population variance and standard deviation for the theoretical probability distribution of the number of customers per hour.
Then simulate data and compute the sample variance and standard deviation.
# Population
print("1: Population")
# E[(x - μ)²]
deviations_squared = (customers - expectation) ** 2
population_variance = np.sum(deviations_squared * probabilities)
population_std = np.sqrt(population_variance)
print(f"Variance: {population_variance:.2f}")
print(f"Std: {population_std:.2f}")
# Sample
print("2: Sample")
# Simulate a sample of customer counts
n_sample = 100
sample_data = rng.choice(customers, size=n_sample, p=probabilities)
# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_variance = np.sum((sample_data - sample_mean) ** 2) / (n_sample - 1)
sample_std = np.sqrt(sample_variance)
print(f"Variance: {sample_variance:.2f}")
print(f"Std: {sample_std:.2f}")
1: Population
Variance: 31.19
Std: 5.58
2: Sample
Variance: 32.88
Std: 5.73
Comparing Dispersion Across Continuous Distributions¶
✍️¶
Visualize normal distributions with different standard deviations (e.g. 2, 5, 10, and 15) but the same mean (e.g. 50 customers per day).
# Normal distributions with different standard deviations
mu = 50 # average customers per day
sigmas = [2, 5, 10, 15]
x = np.linspace(mu - 50, mu + 50, 1000)
# Create combined dataframe
data = []
for sigma in sigmas:
pdf = stats.norm.pdf(x, mu, sigma)
for xi, pi in zip(x, pdf, strict=False):
data.append({"x": xi, "Density": pi, "σ": f"σ = {sigma}"})
df = pd.DataFrame(data)
fig = px.line(
df, x="x", y="Density", color="σ", title="Effect of Standard Deviation on Distribution Spread"
)
fig.add_vline(
x=mu,
line_dash="dash",
line_color="black",
annotation_text=f"μ = {mu}",
annotation_position="top right",
)
fig.update_layout(
xaxis_range=[mu - 50, mu + 50], xaxis_title="Customers per Day", yaxis_title="Density"
)
fig.show()
Interquartile Range (IQR)¶
The IQR is a robust measure of dispersion, less sensitive to outliers: $IQR = Q_3 - Q_1$
Outliers are often defined as values outside $[Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]$
✍️¶
Compute the IQR and detect the outliers in the following dataset with extreme values.
# Generate data with outliers
normal_data = rng.normal(0, 1, 200)
outliers = np.array([5, 5.5, -4.5, -5, 6])
data = np.concatenate([normal_data, outliers])
# Calculate statistics
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50) # median
q3 = np.percentile(data, 75)
iqr = q3 - q1
# Outlier bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Identify outliers
outlier_mask = (data < lower_bound) | (data > upper_bound)
outlier_values = data[outlier_mask]
normal_values = data[~outlier_mask]
print("Quartile Analysis:")
print(f" Q1 (25th percentile) = {q1:.3f}")
print(f" Q2 (50th percentile, median) = {q2:.3f}")
print(f" Q3 (75th percentile) = {q3:.3f}")
print(f" IQR = Q3 - Q1 = {iqr:.3f}")
print("\nOutlier Detection:")
print(f" Lower bound = Q1 - 1.5×IQR = {lower_bound:.3f}")
print(f" Upper bound = Q3 + 1.5×IQR = {upper_bound:.3f}")
print(f" Number of outliers: {len(outlier_values)}")
print(f" Outlier values: {sorted(outlier_values)}")
Quartile Analysis:
Q1 (25th percentile) = -0.716
Q2 (50th percentile, median) = -0.016
Q3 (75th percentile) = 0.732
IQR = Q3 - Q1 = 1.447
Outlier Detection:
Lower bound = Q1 - 1.5×IQR = -2.887
Upper bound = Q3 + 1.5×IQR = 2.903
Number of outliers: 7
Outlier values: [np.float64(-5.0), np.float64(-4.5), np.float64(2.9446914996193203), np.float64(3.512096117817214), np.float64(5.0), np.float64(5.5), np.float64(6.0)]
# Visualize with box plot
fig1 = px.box(y=data, title="Box Plot with Outliers", points="all")
fig1.update_layout(yaxis_title="Value", showlegend=False)
fig1.show()
# Histogram with IQR and outliers
df_normal = pd.DataFrame({"Value": normal_values, "Type": "Normal"})
df_outliers = pd.DataFrame({"Value": outlier_values, "Type": "Outliers"})
df_combined = pd.concat([df_normal, df_outliers])
fig2 = px.histogram(df_normal, x="Value", nbins=30, title="Histogram with IQR and Outliers")
fig2.add_scatter(
x=outlier_values,
y=[5] * len(outlier_values),
mode="markers",
name="Outliers",
marker={"color": "red", "size": 12, "symbol": "triangle-up"},
)
fig2.add_vline(x=q2, line_dash="dash", line_color="red", annotation_text=f"Median={q2:.2f}")
fig2.add_vrect(
x0=q1,
x1=q3,
fillcolor="green",
opacity=0.2,
annotation_text=f"IQR={iqr:.2f}",
annotation_position="top left",
)
fig2.add_vline(x=lower_bound, line_dash="dot", line_color="orange")
fig2.add_vline(x=upper_bound, line_dash="dot", line_color="orange")
fig2.update_layout(yaxis_title="Frequency")
fig2.show()
Covariance and Correlation¶
For two random variables we measure the linear association.
Covariance¶
$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
Positive: simultaneous deviations tend in the same direction
Negative: simultaneous deviations tend in opposite directions
~0: no linear association
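The sample covariance can be written out by hand and checked against `np.cov`, which also divides by $n - 1$ by default. A minimal sketch (the seed and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 0.5, 500)

# Sample covariance by hand: sum of products of deviations, divided by n - 1
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
cov_numpy = np.cov(x, y)[0, 1]
print(np.isclose(cov_manual, cov_numpy))  # True
```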
✍️¶
Given are three datasets:
Positive correlation: $y = 2x + \varepsilon$
Negative correlation: $y = -1.5x + \varepsilon$
No correlation: $x$ and $y$ are independent and normally distributed
Compute the covariance for each of these datasets.
# Generating three datasets with different relationships
n = 1000
# Dataset 1: Positive correlation - y = 2x + noise
x1 = rng.normal(0, 1, n)
y1 = 2 * x1 + rng.normal(0, 0.5, n)
# Dataset 2: Negative correlation - y = -1.5x + noise
x2 = rng.normal(0, 1, n)
y2 = -1.5 * x2 + rng.normal(0, 0.5, n)
# Dataset 3: No correlation - independent
x3 = rng.normal(0, 1, n)
y3 = rng.normal(0, 1, n)
# Calculate covariances
cov1 = np.cov(x1, y1)[0, 1]
cov2 = np.cov(x2, y2)[0, 1]
cov3 = np.cov(x3, y3)[0, 1]
# Visualize
df1 = pd.DataFrame({"x": x1, "y": y1, "Dataset": "Positive: y = 2x + ε"})
df2 = pd.DataFrame({"x": x2, "y": y2, "Dataset": "Negative: y = -1.5x + ε"})
df3 = pd.DataFrame({"x": x3, "y": y3, "Dataset": "No correlation"})
fig1 = px.scatter(
df1,
x="x",
y="y",
title=f"Positive Covariance: Cov(x,y) = {cov1:.3f}",
color_discrete_sequence=["green"],
)
fig1.add_hline(y=0, line_dash="dash", line_color="gray")
fig1.add_vline(x=0, line_dash="dash", line_color="gray")
fig1.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig1.show()
fig2 = px.scatter(
df2,
x="x",
y="y",
title=f"Negative Covariance: Cov(x,y) = {cov2:.3f}",
color_discrete_sequence=["red"],
)
fig2.add_hline(y=0, line_dash="dash", line_color="gray")
fig2.add_vline(x=0, line_dash="dash", line_color="gray")
fig2.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig2.show()
fig3 = px.scatter(
df3,
x="x",
y="y",
title=f"Near Zero Covariance: Cov(x,y) = {cov3:.3f}",
color_discrete_sequence=["blue"],
)
fig3.add_hline(y=0, line_dash="dash", line_color="gray")
fig3.add_vline(x=0, line_dash="dash", line_color="gray")
fig3.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig3.show()
Correlation (Pearson)¶
Correlation coefficient: normalized covariance, $\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
Properties:
$\rho = \pm 1$: perfect linear relationship
$\rho = 0$: no linear correlation
Scale-invariant (unlike covariance)
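Scale invariance is easy to verify numerically: rescaling $x$ by a factor of 100 (say, metres to centimetres) multiplies the covariance by 100 but leaves the correlation unchanged. A sketch with arbitrary seed and noise level:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1000)
y = x + rng.normal(0, 1, 1000)

cov_before = np.cov(x, y)[0, 1]
cov_after = np.cov(100 * x, y)[0, 1]   # covariance scales with x
corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(100 * x, y)[0, 1]  # correlation does not
print(np.isclose(cov_after, 100 * cov_before))  # True
print(np.isclose(corr_after, corr_before))      # True
```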
✍️¶
Compute the correlations and compare them with the covariances.
# Calculate correlations
corr1 = np.corrcoef(x1, y1)[0, 1]
corr2 = np.corrcoef(x2, y2)[0, 1]
corr3 = np.corrcoef(x3, y3)[0, 1]
# Also calculate manually
std_x1, std_y1 = np.std(x1, ddof=1), np.std(y1, ddof=1)  # ddof=1 to match np.cov
corr1_manual = cov1 / (std_x1 * std_y1)
# Create comparison table
comparison_df = pd.DataFrame(
{
"Dataset": ["Positive", "Negative", "No correlation"],
"Covariance": [cov1, cov2, cov3],
"Correlation": [corr1, corr2, corr3],
}
)
print("\nComparison Table:")
print(comparison_df.to_string(index=False))
Comparison Table:
Dataset Covariance Correlation
Positive 1.927147 0.969135
Negative -1.475871 -0.948614
No correlation -0.024658 -0.025136
Correlation at Different Strengths¶
✍️¶
Given are four datasets with different correlation strengths:
Dataset 1: very strong positive correlation (ρ ≈ 0.95)
Dataset 2: moderate positive correlation (ρ ≈ 0.7)
Dataset 3: weak positive correlation (ρ ≈ 0.3)
Dataset 4: no correlation (ρ ≈ 0)
Visualize these datasets and compare the correlation coefficients.
# Generate data with four specific correlation strengths
n = 500
rho_targets = [0.95, 0.7, 0.3, 0]
datasets = []
for rho_true in rho_targets:
# Generate correlated data
x = rng.normal(0, 1, n)
noise_scale = np.sqrt(1 - rho_true**2) if rho_true < 1 else 0.01
y = rho_true * x + noise_scale * rng.normal(0, 1, n)
# Calculate actual correlation and covariance
rho_actual = np.corrcoef(x, y)[0, 1]
cov = np.cov(x, y)[0, 1]
datasets.append({"x": x, "y": y, "rho_target": rho_true, "rho_actual": rho_actual, "cov": cov})
# Plot each dataset
for i, ds in enumerate(datasets):
df = pd.DataFrame({"x": ds["x"], "y": ds["y"]})
fig = px.scatter(
df,
x="x",
y="y",
opacity=0.5,
title=f"Target ρ = {ds['rho_target']:.2f}, Actual ρ = {ds['rho_actual']:.3f}, Cov = {ds['cov']:.3f}",
)
# Add regression line
z = np.polyfit(ds["x"], ds["y"], 1)
p = np.poly1d(z)
x_line = np.linspace(ds["x"].min(), ds["x"].max(), 100)
df_line = pd.DataFrame({"x": x_line, "y": p(x_line)})
fig.add_scatter(
x=df_line["x"],
y=df_line["y"],
mode="lines",
line={"color": "red", "width": 2},
name="Best fit",
)
# Add reference lines
fig.add_hline(y=0, line_dash="dash", line_color="gray")
fig.add_vline(x=0, line_dash="dash", line_color="gray")
fig.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig.show()
Covariance ≈ 0 does NOT mean independence!¶
Covariance/correlation only measures linear association. A strong non-linear relationship can exist even when the correlation is ≈ 0.
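Why this happens for $y = x^2$: if $X$ is symmetric around 0, then $\mathrm{Cov}(X, X^2) = E[X^3] - E[X]\,E[X^2] = 0$, even though $Y$ is completely determined by $X$. A quick numerical check (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 100_000)  # symmetric around 0
y = x**2                         # fully dependent on x

# Correlation is near zero despite the deterministic relationship
rho = np.corrcoef(x, y)[0, 1]
print(abs(rho) < 0.02)  # True
```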
✍️¶
Given are three datasets with non-linear relationships but correlation ≈ 0:
Circular relationship: $(x, y)$ lie on concentric circles
Quadratic relationship: $y = x^2$
Absolute value: $y = |x|$
Compute the correlation for each of these datasets and observe that it is nearly zero despite the obvious dependence.
# Generate three datasets with non-linear relationships
n = 1000
# Dataset 1: Circular pattern
theta = rng.uniform(0, 2 * np.pi, n)
r = rng.uniform(0.5, 2, n)
x_circle = r * np.cos(theta)
y_circle = r * np.sin(theta)
# Dataset 2: Quadratic relationship y = x²
x_quad = rng.uniform(-3, 3, n)
y_quad = x_quad**2 + rng.normal(0, 0.5, n)
# Dataset 3: Absolute value y = |x|
x_abs = rng.uniform(-3, 3, n)
y_abs = np.abs(x_abs) + rng.normal(0, 0.3, n)
# Calculate correlations
corr_circle = np.corrcoef(x_circle, y_circle)[0, 1]
corr_quad = np.corrcoef(x_quad, y_quad)[0, 1]
corr_abs = np.corrcoef(x_abs, y_abs)[0, 1]
# Visualize
df1 = pd.DataFrame({"x": x_circle, "y": y_circle})
fig1 = px.scatter(
df1,
x="x",
y="y",
opacity=0.5,
title=f"Circular Pattern: ρ = {corr_circle:.4f} ≈ 0",
color_discrete_sequence=["purple"],
)
fig1.add_hline(y=0, line_dash="dash", line_color="gray")
fig1.add_vline(x=0, line_dash="dash", line_color="gray")
fig1.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig1.show()
df2 = pd.DataFrame({"x": x_quad, "y": y_quad})
fig2 = px.scatter(
df2,
x="x",
y="y",
opacity=0.5,
title=f"Quadratic y = x²: ρ = {corr_quad:.4f} ≈ 0",
color_discrete_sequence=["purple"],
)
fig2.add_hline(y=0, line_dash="dash", line_color="gray")
fig2.add_vline(x=0, line_dash="dash", line_color="gray")
fig2.show()
df3 = pd.DataFrame({"x": x_abs, "y": y_abs})
fig3 = px.scatter(
df3,
x="x",
y="y",
opacity=0.5,
title=f"Absolute y = |x|: ρ = {corr_abs:.4f} ≈ 0",
color_discrete_sequence=["purple"],
)
fig3.add_hline(y=0, line_dash="dash", line_color="gray")
fig3.add_vline(x=0, line_dash="dash", line_color="gray")
fig3.show()
Covariance Matrix¶
For a vector of random variables we can collect all covariances in a matrix $\Sigma$ with $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$:
Diagonal: variances
Off-diagonal: covariances
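The diagonal property is easy to verify: since $\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$, the diagonal of the matrix returned by `np.cov` must equal the per-variable sample variances. A small sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 3))  # 1000 observations of 3 variables

cov = np.cov(data.T)                         # 3x3 covariance matrix
variances = np.var(data, axis=0, ddof=1)     # ddof=1 matches np.cov
print(np.allclose(np.diag(cov), variances))  # True
```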
✍️¶
Compute and visualize a covariance matrix for multiple variables.
# Generate multivariate data with specific covariance structure
n = 1000
mean = [0, 0, 0]
cov_true = np.array(
[
[1.0, 0.7, -0.3], # var(x1), cov(x1,x2), cov(x1,x3)
[0.7, 1.5, 0.5], # cov(x2,x1), var(x2), cov(x2,x3)
[-0.3, 0.5, 0.8], # cov(x3,x1), cov(x3,x2), var(x3)
]
)
data = rng.multivariate_normal(mean, cov_true, n)
# Calculate sample covariance matrix
cov_sample = np.cov(data.T)
# Calculate correlation matrix
corr_sample = np.corrcoef(data.T)
# Visualize covariance matrix
labels = ["x₁", "x₂", "x₃"]
df_cov = pd.DataFrame(cov_sample, index=labels, columns=labels)
fig1 = px.imshow(
df_cov,
text_auto=".3f",
color_continuous_scale="RdBu_r",
title="Covariance Matrix",
zmin=-1.5,
zmax=1.5,
labels={"color": "Covariance"},
)
fig1.show()
# Visualize correlation matrix
df_corr = pd.DataFrame(corr_sample, index=labels, columns=labels)
fig2 = px.imshow(
df_corr,
text_auto=".3f",
color_continuous_scale="RdBu_r",
title="Correlation Matrix",
zmin=-1,
zmax=1,
labels={"color": "Correlation"},
)
fig2.show()