
Measures of Central Tendency, Spread, and Association

In this lab we study how probability distributions can be characterized using summary statistics. We compute expected values, variances, covariances, and correlations.

Learning objectives:

  • Compute and interpret the expected value (mean)

  • Use variance and standard deviation as measures of spread

  • Determine the median, mode, and percentiles

  • Compute covariance and correlation for linear association

  • Understand the difference between uncorrelated and independent

import numpy as np
import pandas as pd
import plotly.express as px
from scipy import stats

# Set random seed
rng = np.random.default_rng(42)

Measures of Central Tendency

Measures of central tendency indicate the “center of mass” of a probability distribution.

Expected Value

The expected value \mathbb{E}[x], or \mu, is the average we expect over infinitely many observations.

  • Discrete: \mathbb{E}[x] = \sum_x x \cdot P(x)

  • Continuous: \mathbb{E}[x] = \int x \cdot p(x)\,dx
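
For the continuous case, the integral can be approximated numerically. A minimal sketch; the choice of a \mathcal{N}(2, 1^2) density here is just an illustration:

```python
import numpy as np
from scipy import integrate, stats

# E[x] = ∫ x·p(x) dx, evaluated numerically for a N(2, 1) density
dist = stats.norm(loc=2, scale=1)
expected, _ = integrate.quad(lambda x: x * dist.pdf(x), -np.inf, np.inf)

print(f"E[x] ≈ {expected:.4f}")  # close to the true mean, 2
```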

✍️

The theoretical (population) distribution of the number of customers a certain store receives per hour is as follows:

  • P(5 customers) = 0.10

  • P(10 customers) = 0.20

  • P(15 customers) = 0.35

  • P(20 customers) = 0.25

  • P(25 customers) = 0.10

Compute the expected value (the expected average number of customers per hour).

# Number of customers per hour distribution
customers = np.array([5, 10, 15, 20, 25])
probabilities = np.array([0.10, 0.20, 0.35, 0.25, 0.10])

expectation = np.sum(customers * probabilities)

print(f"\nExpected number of customers per hour E[x] = {expectation:.2f}")

Expected number of customers per hour E[x] = 15.25
# Visualize with expected value
df = pd.DataFrame({"Customers": customers, "Probability": probabilities})
fig = px.bar(df, x="Customers", y="Probability", title="Distribution of Customers per Hour")
fig.add_vline(
    x=expectation,
    line_dash="dash",
    line_color="red",
    annotation_text=f"E[x] = {expectation:.1f}",
    annotation_position="top right",
)
fig.update_layout(
    yaxis_range=[0, 0.40], xaxis_title="Number of Customers", yaxis_title="Probability"
)
fig.show()

✍️

Simulate 10,000 hours and show that the sample mean converges to the expected value (Law of Large Numbers).

# Simulate customer counts over many hours
n_hours = 10000
observed_customers = rng.choice(customers, size=n_hours, p=probabilities)

# Calculate cumulative average
cumulative_avg = np.cumsum(observed_customers) / np.arange(1, n_hours + 1)

# Plot convergence
df = pd.DataFrame({"Number of Hours": range(1, n_hours + 1), "Average Customers": cumulative_avg})
fig = px.line(
    df,
    x="Number of Hours",
    y="Average Customers",
    title="Law of Large Numbers: Sample Mean Converges to E[x]",
)
fig.add_hline(
    y=expectation,
    line_dash="dash",
    line_color="red",
    annotation_text=f"E[x] = {expectation:.1f}",
    annotation_position="right",
)
fig.update_layout(
    yaxis_range=[expectation - 5, expectation + 5], yaxis_title="Average Customers per Hour"
)
fig.show()

Median and Mode

  • Median: the value with 50% of the probability mass on each side

  • Mode: the most probable value (highest probability for a discrete distribution, maximum of the PDF for a continuous one)
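
For a discrete distribution, both measures follow directly from the probabilities. A small sketch using the customer distribution from the earlier exercise:

```python
import numpy as np

# Discrete distribution of customers per hour (same as above)
customers = np.array([5, 10, 15, 20, 25])
probabilities = np.array([0.10, 0.20, 0.35, 0.25, 0.10])

# Mode: the value with the highest probability
mode = customers[np.argmax(probabilities)]

# Median: the first value where the cumulative probability reaches 0.5
cdf = np.cumsum(probabilities)
median = customers[np.searchsorted(cdf, 0.5)]

print(f"Mode = {mode}, Median = {median}")  # both 15 here
```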

✍️

Compare the expected value, median, and mode for the following three distributions:

  1. Symmetric distribution: normal distribution \mathcal{N}(0, 1^2)

  2. Right-skewed distribution: log-normal distribution with parameters s = 0.8, \text{scale} = 1

  3. Left-skewed distribution: beta distribution with parameters \alpha = 8, \beta = 2

For each distribution, compute the mean, median, and mode, and visualize them.

# Define three specific distributions to compare
x_range = np.linspace(-4, 6, 1000)

# 1. Symmetric: Normal distribution N(0, 1)
dist1 = stats.norm(loc=0, scale=1)
pdf1 = dist1.pdf(x_range)
mean1 = dist1.mean()
median1 = dist1.median()
mode1_idx = np.argmax(pdf1)
mode1 = x_range[mode1_idx]

# 2. Right-skewed: Log-Normal distribution
dist2 = stats.lognorm(s=0.8, scale=1)
x_range2 = np.linspace(0, 6, 1000)
pdf2 = dist2.pdf(x_range2)
mean2 = dist2.mean()
median2 = dist2.median()
mode2_idx = np.argmax(pdf2)
mode2 = x_range2[mode2_idx]

# 3. Left-skewed: Beta distribution Beta(8, 2)
dist3 = stats.beta(a=8, b=2)
x_range3 = np.linspace(0, 1, 1000)
pdf3 = dist3.pdf(x_range3)
mean3 = dist3.mean()
median3 = dist3.median()
mode3_idx = np.argmax(pdf3)
mode3 = x_range3[mode3_idx]

# Create dataframes for each distribution
df1 = pd.DataFrame({"x": x_range, "Density": pdf1, "Distribution": "Symmetric: N(0,1)"})
df1_stats = pd.DataFrame(
    {
        "x": [mean1, median1, mode1],
        "y": [0, 0, 0],
        "Measure": ["Mean", "Median", "Mode"],
        "Value": [mean1, median1, mode1],
    }
)

df2 = pd.DataFrame({"x": x_range2, "Density": pdf2, "Distribution": "Right-skewed: LogNormal"})
df2_stats = pd.DataFrame(
    {
        "x": [mean2, median2, mode2],
        "y": [0, 0, 0],
        "Measure": ["Mean", "Median", "Mode"],
        "Value": [mean2, median2, mode2],
    }
)

df3 = pd.DataFrame({"x": x_range3, "Density": pdf3, "Distribution": "Left-skewed: Beta(8,2)"})
df3_stats = pd.DataFrame(
    {
        "x": [mean3, median3, mode3],
        "y": [0, 0, 0],
        "Measure": ["Mean", "Median", "Mode"],
        "Value": [mean3, median3, mode3],
    }
)

# Plot 1: Symmetric
fig1 = px.area(df1, x="x", y="Density", title="Symmetric: N(0,1)")
for _, row in df1_stats.iterrows():
    fig1.add_vline(
        x=row["x"],
        line_dash="dash",
        annotation_text=f"{row['Measure']}={row['Value']:.2f}",
        annotation_position="top",
    )
fig1.show()

# Plot 2: Right-skewed
fig2 = px.area(
    df2, x="x", y="Density", title="Right-skewed: LogNormal", color_discrete_sequence=["green"]
)
for _, row in df2_stats.iterrows():
    fig2.add_vline(
        x=row["x"],
        line_dash="dash",
        annotation_text=f"{row['Measure']}={row['Value']:.2f}",
        annotation_position="top",
    )
fig2.show()

# Plot 3: Left-skewed
fig3 = px.area(
    df3, x="x", y="Density", title="Left-skewed: Beta(8,2)", color_discrete_sequence=["purple"]
)
for _, row in df3_stats.iterrows():
    fig3.add_vline(
        x=row["x"],
        line_dash="dash",
        annotation_text=f"{row['Measure']}={row['Value']:.2f}",
        annotation_position="top",
    )
fig3.show()

Measures of Spread

Measures of spread indicate how strongly values vary around the center.

Variance and Standard Deviation

Population variance (theoretical distribution): \sigma^2 = \text{Var}(x) = \mathbb{E}[(x - \mu)^2]

Sample variance (empirical data): s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

Standard deviation: \sigma = \sqrt{\text{Var}(x)} (population) or s = \sqrt{s^2} (sample)
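
In NumPy, the difference between the two denominators is controlled by the ddof argument (np.var defaults to the population formula). A quick check on a toy sample:

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 7.0])
n = len(data)

var_pop = np.var(data)             # divides by n (ddof=0, the default)
var_sample = np.var(data, ddof=1)  # divides by n - 1

print(f"Population variance: {var_pop:.2f}")    # 2.00
print(f"Sample variance:     {var_sample:.2f}")  # 2.50

# The two differ exactly by the factor n / (n - 1)
assert np.isclose(var_sample, var_pop * n / (n - 1))
```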

✍️

  1. First compute the population variance and standard deviation for the theoretical probability distribution of the number of customers per hour.

  2. Then simulate data and compute the sample variance and standard deviation.

# Population
print("1: Population")

# E[(x - μ)²]
deviations_squared = (customers - expectation) ** 2
population_variance = np.sum(deviations_squared * probabilities)
population_std = np.sqrt(population_variance)

print(f"Variance: {population_variance:.2f}")
print(f"Std: {population_std:.2f}")


# Sample
print("2: Sample")

# Simulate a sample of customer counts
n_sample = 100
sample_data = rng.choice(customers, size=n_sample, p=probabilities)

# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_variance = np.sum((sample_data - sample_mean) ** 2) / (n_sample - 1)
sample_std = np.sqrt(sample_variance)

print(f"Variance: {sample_variance:.2f}")
print(f"Std: {sample_std:.2f}")
1: Population
Variance: 31.19
Std: 5.58
2: Sample
Variance: 32.88
Std: 5.73

Comparing spread across continuous distributions

✍️

Visualize normal distributions with different standard deviations (e.g. 2, 5, 10, and 15) but the same mean (e.g. 50 customers per day).

# Normal distributions with different standard deviations
mu = 50  # average customers per day
sigmas = [2, 5, 10, 15]

x = np.linspace(mu - 50, mu + 50, 1000)

# Create combined dataframe
data = []
for sigma in sigmas:
    pdf = stats.norm.pdf(x, mu, sigma)
    for xi, pi in zip(x, pdf, strict=False):
        data.append({"x": xi, "Density": pi, "σ": f"σ = {sigma}"})

df = pd.DataFrame(data)
fig = px.line(
    df, x="x", y="Density", color="σ", title="Effect of Standard Deviation on Distribution Spread"
)
fig.add_vline(
    x=mu,
    line_dash="dash",
    line_color="black",
    annotation_text=f"μ = {mu}",
    annotation_position="top right",
)
fig.update_layout(
    xaxis_range=[mu - 50, mu + 50], xaxis_title="Customers per Day", yaxis_title="Density"
)
fig.show()

Interquartile Range (IQR)

The IQR is a robust measure of spread, less sensitive to outliers:

\text{IQR} = Q_3 - Q_1 = P_{75} - P_{25}

Outliers are often defined as values outside [Q_1 - 1.5 \times \text{IQR},\ Q_3 + 1.5 \times \text{IQR}].
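
A tiny worked example of the fence rule on hand-picked numbers (the value 100 is planted as an obvious outlier):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=[{lower}, {upper}]")
print(f"Outliers: {outliers}")  # only the planted 100
```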

✍️

Compute the IQR and detect the outliers in the following dataset with extreme values.

# Generate data with outliers
normal_data = rng.normal(0, 1, 200)
outliers = np.array([5, 5.5, -4.5, -5, 6])
data = np.concatenate([normal_data, outliers])
# Calculate statistics
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)  # median
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Outlier bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Identify outliers
outlier_mask = (data < lower_bound) | (data > upper_bound)
outlier_values = data[outlier_mask]
normal_values = data[~outlier_mask]

print("Quartile Analysis:")
print(f"  Q1 (25th percentile) = {q1:.3f}")
print(f"  Q2 (50th percentile, median) = {q2:.3f}")
print(f"  Q3 (75th percentile) = {q3:.3f}")
print(f"  IQR = Q3 - Q1 = {iqr:.3f}")
print("\nOutlier Detection:")
print(f"  Lower bound = Q1 - 1.5×IQR = {lower_bound:.3f}")
print(f"  Upper bound = Q3 + 1.5×IQR = {upper_bound:.3f}")
print(f"  Number of outliers: {len(outlier_values)}")
print(f"  Outlier values: {sorted(outlier_values)}")
Quartile Analysis:
  Q1 (25th percentile) = -0.716
  Q2 (50th percentile, median) = -0.016
  Q3 (75th percentile) = 0.732
  IQR = Q3 - Q1 = 1.447

Outlier Detection:
  Lower bound = Q1 - 1.5×IQR = -2.887
  Upper bound = Q3 + 1.5×IQR = 2.903
  Number of outliers: 7
  Outlier values: [np.float64(-5.0), np.float64(-4.5), np.float64(2.9446914996193203), np.float64(3.512096117817214), np.float64(5.0), np.float64(5.5), np.float64(6.0)]
# Visualize with box plot
fig1 = px.box(y=data, title="Box Plot with Outliers", points="all")
fig1.update_layout(yaxis_title="Value", showlegend=False)
fig1.show()

# Histogram with IQR and outliers
df_normal = pd.DataFrame({"Value": normal_values, "Type": "Normal"})
df_outliers = pd.DataFrame({"Value": outlier_values, "Type": "Outliers"})
df_combined = pd.concat([df_normal, df_outliers])

fig2 = px.histogram(df_normal, x="Value", nbins=30, title="Histogram with IQR and Outliers")
fig2.add_scatter(
    x=outlier_values,
    y=[5] * len(outlier_values),
    mode="markers",
    name="Outliers",
    marker={"color": "red", "size": 12, "symbol": "triangle-up"},
)
fig2.add_vline(x=q2, line_dash="dash", line_color="red", annotation_text=f"Median={q2:.2f}")
fig2.add_vrect(
    x0=q1,
    x1=q3,
    fillcolor="green",
    opacity=0.2,
    annotation_text=f"IQR={iqr:.2f}",
    annotation_position="top left",
)
fig2.add_vline(x=lower_bound, line_dash="dot", line_color="orange")
fig2.add_vline(x=upper_bound, line_dash="dot", line_color="orange")
fig2.update_layout(yaxis_title="Frequency")
fig2.show()

Covariance and Correlation

For two random variables, we measure the linear association between them.

Covariance

\text{Cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])]

  • Positive: simultaneous deviations point in the same direction

  • Negative: simultaneous deviations point in opposite directions

  • ~0: no linear association
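
The definition translates directly into code. A sketch comparing a manual computation with np.cov, which uses the n − 1 denominator by default (the seed and coefficients here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 0.5, 500)

# Sample covariance straight from the definition (n - 1 denominator)
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 matrix; the off-diagonal entry is Cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(f"Manual: {cov_manual:.4f}, NumPy: {cov_numpy:.4f}")
```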

✍️

Three datasets are given:

  1. Positive correlation: y = 2x + \epsilon with \epsilon \sim \mathcal{N}(0, 0.5)

  2. Negative correlation: y = -1.5x + \epsilon with \epsilon \sim \mathcal{N}(0, 0.5)

  3. No correlation: x and y are independently normally distributed

Compute the covariance for each of these datasets.

# Generating three datasets with different relationships
n = 1000

# Dataset 1: Positive correlation - y = 2x + noise
x1 = rng.normal(0, 1, n)
y1 = 2 * x1 + rng.normal(0, 0.5, n)

# Dataset 2: Negative correlation - y = -1.5x + noise
x2 = rng.normal(0, 1, n)
y2 = -1.5 * x2 + rng.normal(0, 0.5, n)

# Dataset 3: No correlation - independent
x3 = rng.normal(0, 1, n)
y3 = rng.normal(0, 1, n)
# Calculate covariances
cov1 = np.cov(x1, y1)[0, 1]
cov2 = np.cov(x2, y2)[0, 1]
cov3 = np.cov(x3, y3)[0, 1]

# Visualize
df1 = pd.DataFrame({"x": x1, "y": y1, "Dataset": "Positive: y = 2x + ε"})
df2 = pd.DataFrame({"x": x2, "y": y2, "Dataset": "Negative: y = -1.5x + ε"})
df3 = pd.DataFrame({"x": x3, "y": y3, "Dataset": "No correlation"})

fig1 = px.scatter(
    df1,
    x="x",
    y="y",
    title=f"Positive Covariance: Cov(x,y) = {cov1:.3f}",
    color_discrete_sequence=["green"],
)
fig1.add_hline(y=0, line_dash="dash", line_color="gray")
fig1.add_vline(x=0, line_dash="dash", line_color="gray")
fig1.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig1.show()

fig2 = px.scatter(
    df2,
    x="x",
    y="y",
    title=f"Negative Covariance: Cov(x,y) = {cov2:.3f}",
    color_discrete_sequence=["red"],
)
fig2.add_hline(y=0, line_dash="dash", line_color="gray")
fig2.add_vline(x=0, line_dash="dash", line_color="gray")
fig2.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig2.show()

fig3 = px.scatter(
    df3,
    x="x",
    y="y",
    title=f"Near Zero Covariance: Cov(x,y) = {cov3:.3f}",
    color_discrete_sequence=["blue"],
)
fig3.add_hline(y=0, line_dash="dash", line_color="gray")
fig3.add_vline(x=0, line_dash="dash", line_color="gray")
fig3.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig3.show()

Correlation (Pearson)

Correlation coefficient: normalized covariance

\rho_{xy} = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}

Properties:

  • -1 \leq \rho \leq 1

  • |\rho| = 1: perfect linear relationship

  • \rho = 0: no linear correlation

  • Scale-invariant (unlike covariance)
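
The scale-invariance property is easy to verify: rescaling x (say, metres to centimetres) multiplies the covariance by the same factor but leaves the correlation untouched. A sketch with an arbitrary seed and coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1000)
y = 0.8 * x + rng.normal(0, 0.6, 1000)

cov_before = np.cov(x, y)[0, 1]
cov_after = np.cov(100 * x, y)[0, 1]        # covariance scales with x
corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(100 * x, y)[0, 1]  # correlation does not

print(f"Covariance:  {cov_before:.3f} -> {cov_after:.3f}")
print(f"Correlation: {corr_before:.3f} -> {corr_after:.3f}")
```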

✍️

Compute the correlations and compare them with the covariances.

# Calculate correlations
corr1 = np.corrcoef(x1, y1)[0, 1]
corr2 = np.corrcoef(x2, y2)[0, 1]
corr3 = np.corrcoef(x3, y3)[0, 1]

# Also calculate manually (ddof=1 matches np.cov's n-1 denominator)
std_x1, std_y1 = np.std(x1, ddof=1), np.std(y1, ddof=1)
corr1_manual = cov1 / (std_x1 * std_y1)

# Create comparison table
comparison_df = pd.DataFrame(
    {
        "Dataset": ["Positive", "Negative", "No correlation"],
        "Covariance": [cov1, cov2, cov3],
        "Correlation": [corr1, corr2, corr3],
    }
)
print("\nComparison Table:")
print(comparison_df.to_string(index=False))

Comparison Table:
       Dataset  Covariance  Correlation
      Positive    1.927147     0.969135
      Negative   -1.475871    -0.948614
No correlation   -0.024658    -0.025136

Correlation at different strengths

✍️

Four datasets with different correlation strengths are given:

  • Dataset 1: very strong positive correlation (ρ ≈ 0.95)

  • Dataset 2: moderate positive correlation (ρ ≈ 0.7)

  • Dataset 3: weak positive correlation (ρ ≈ 0.3)

  • Dataset 4: no correlation (ρ ≈ 0)

Visualize these datasets and compare the correlation coefficients.

# Generate data with four specific correlation strengths
n = 500
rho_targets = [0.95, 0.7, 0.3, 0]

datasets = []
for rho_true in rho_targets:
    # Generate correlated data
    x = rng.normal(0, 1, n)
    noise_scale = np.sqrt(1 - rho_true**2) if rho_true < 1 else 0.01
    y = rho_true * x + noise_scale * rng.normal(0, 1, n)

    # Calculate actual correlation and covariance
    rho_actual = np.corrcoef(x, y)[0, 1]
    cov = np.cov(x, y)[0, 1]

    datasets.append({"x": x, "y": y, "rho_target": rho_true, "rho_actual": rho_actual, "cov": cov})
# Plot each dataset
for i, ds in enumerate(datasets):
    df = pd.DataFrame({"x": ds["x"], "y": ds["y"]})

    fig = px.scatter(
        df,
        x="x",
        y="y",
        opacity=0.5,
        title=f"Target ρ = {ds['rho_target']:.2f}, Actual ρ = {ds['rho_actual']:.3f}, Cov = {ds['cov']:.3f}",
    )

    # Add regression line
    z = np.polyfit(ds["x"], ds["y"], 1)
    p = np.poly1d(z)
    x_line = np.linspace(ds["x"].min(), ds["x"].max(), 100)
    df_line = pd.DataFrame({"x": x_line, "y": p(x_line)})
    fig.add_scatter(
        x=df_line["x"],
        y=df_line["y"],
        mode="lines",
        line={"color": "red", "width": 2},
        name="Best fit",
    )

    # Add reference lines
    fig.add_hline(y=0, line_dash="dash", line_color="gray")
    fig.add_vline(x=0, line_dash="dash", line_color="gray")
    fig.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
    fig.show()

Covariance ≈ 0 does NOT imply independence!

Covariance and correlation only measure linear association. A strong non-linear relationship can exist even when the correlation is ≈ 0.

✍️

Three datasets with non-linear relationships but correlation ≈ 0 are given:

  1. Circular relationship: (x, y) lie on concentric circles

  2. Quadratic relationship: y = x^2 + \epsilon

  3. Absolute value: y = |x| + \epsilon

Compute the correlation for each of these datasets and observe that it is nearly zero despite the obvious dependence.

# Generate three datasets with non-linear relationships
n = 1000

# Dataset 1: Circular pattern
theta = rng.uniform(0, 2 * np.pi, n)
r = rng.uniform(0.5, 2, n)
x_circle = r * np.cos(theta)
y_circle = r * np.sin(theta)

# Dataset 2: Quadratic relationship y = x²
x_quad = rng.uniform(-3, 3, n)
y_quad = x_quad**2 + rng.normal(0, 0.5, n)

# Dataset 3: Absolute value y = |x|
x_abs = rng.uniform(-3, 3, n)
y_abs = np.abs(x_abs) + rng.normal(0, 0.3, n)
# Calculate correlations
corr_circle = np.corrcoef(x_circle, y_circle)[0, 1]
corr_quad = np.corrcoef(x_quad, y_quad)[0, 1]
corr_abs = np.corrcoef(x_abs, y_abs)[0, 1]

# Visualize
df1 = pd.DataFrame({"x": x_circle, "y": y_circle})
fig1 = px.scatter(
    df1,
    x="x",
    y="y",
    opacity=0.5,
    title=f"Circular Pattern: ρ = {corr_circle:.4f} ≈ 0",
    color_discrete_sequence=["purple"],
)
fig1.add_hline(y=0, line_dash="dash", line_color="gray")
fig1.add_vline(x=0, line_dash="dash", line_color="gray")
fig1.update_layout(yaxis={"scaleanchor": "x", "scaleratio": 1})
fig1.show()

df2 = pd.DataFrame({"x": x_quad, "y": y_quad})
fig2 = px.scatter(
    df2,
    x="x",
    y="y",
    opacity=0.5,
    title=f"Quadratic y = x²: ρ = {corr_quad:.4f} ≈ 0",
    color_discrete_sequence=["purple"],
)
fig2.add_hline(y=0, line_dash="dash", line_color="gray")
fig2.add_vline(x=0, line_dash="dash", line_color="gray")
fig2.show()

df3 = pd.DataFrame({"x": x_abs, "y": y_abs})
fig3 = px.scatter(
    df3,
    x="x",
    y="y",
    opacity=0.5,
    title=f"Absolute y = |x|: ρ = {corr_abs:.4f} ≈ 0",
    color_discrete_sequence=["purple"],
)
fig3.add_hline(y=0, line_dash="dash", line_color="gray")
fig3.add_vline(x=0, line_dash="dash", line_color="gray")
fig3.show()

Covariance Matrix

For a vector of random variables, all pairwise covariances can be collected in a matrix:

\pmb{\Sigma}_{ij} = \text{Cov}(x_i, x_j)

  • Diagonal: variances

  • Off-diagonal: covariances
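
The matrix can also be built directly from the definition, \pmb{\Sigma} = \frac{1}{n-1}(X - \bar{X})^\top (X - \bar{X}) with the column means subtracted. A sketch verifying this against np.cov on random data (arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))  # 1000 observations of 3 variables

# Covariance matrix from the definition (columns = variables)
centered = X - X.mean(axis=0)
sigma_manual = centered.T @ centered / (len(X) - 1)

# np.cov expects variables in rows, hence the transpose
sigma_numpy = np.cov(X.T)

print(np.round(sigma_manual, 3))
```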

✍️

Compute and visualize a covariance matrix for multiple variables.

# Generate multivariate data with specific covariance structure
n = 1000
mean = [0, 0, 0]
cov_true = np.array(
    [
        [1.0, 0.7, -0.3],  # var(x1), cov(x1,x2), cov(x1,x3)
        [0.7, 1.5, 0.5],  # cov(x2,x1), var(x2), cov(x2,x3)
        [-0.3, 0.5, 0.8],  # cov(x3,x1), cov(x3,x2), var(x3)
    ]
)

data = rng.multivariate_normal(mean, cov_true, n)
# Calculate sample covariance matrix
cov_sample = np.cov(data.T)

# Calculate correlation matrix
corr_sample = np.corrcoef(data.T)

# Visualize covariance matrix
labels = ["x₁", "x₂", "x₃"]
df_cov = pd.DataFrame(cov_sample, index=labels, columns=labels)
fig1 = px.imshow(
    df_cov,
    text_auto=".3f",
    color_continuous_scale="RdBu_r",
    title="Covariance Matrix",
    zmin=-1.5,
    zmax=1.5,
    labels={"color": "Covariance"},
)
fig1.show()

# Visualize correlation matrix
df_corr = pd.DataFrame(corr_sample, index=labels, columns=labels)
fig2 = px.imshow(
    df_corr,
    text_auto=".3f",
    color_continuous_scale="RdBu_r",
    title="Correlation Matrix",
    zmin=-1,
    zmax=1,
    labels={"color": "Correlation"},
)
fig2.show()