Based on: Airbnb : price prediction using XGBoost 🔥
Problem definition¶
Airbnb uses machine learning to suggest optimal prices to hosts. A well-calibrated pricing model helps hosts maximize their revenue while remaining competitive.
Task, Experience¶
The goal is to predict numerical values (price), which makes this a regression task. We want to train on actual observed prices, so we are dealing with supervised learning.
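This setup can be sketched in a few lines on toy numbers (not the Airbnb data): feature vectors describe a listing, the target is its (log) price, and a regressor learns the mapping from labeled examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy supervised-regression setup: two features per listing (e.g. bedrooms,
# bathrooms) and an observed log price as the training label.
X = np.array([[1, 1.0], [2, 1.0], [3, 2.0], [4, 2.5]])
y = np.array([4.0, 4.5, 5.1, 5.6])

model = LinearRegression().fit(X, y)  # learn from labeled examples
pred = model.predict([[2, 1.5]])      # predict a price for an unseen listing
```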
Data collection¶
We use an Airbnb dataset available on Kaggle, with prices from several large US cities. We focus on the following variables:
Numerical: bedrooms, bathrooms, review scores, etc.
Categorical: property type, room type, city
Text: amenities
Geographic: latitude, longitude, neighborhood
Target variable:
log_price
import os
import re
import kagglehub
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import xgboost as xgb
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder
# Load data
path = kagglehub.dataset_download("stevezhenghp/airbnb-price-prediction")
Downloading from https://www.kaggle.com/api/v1/datasets/download/stevezhenghp/airbnb-price-prediction?dataset_version_number=1...
100%|██████████| 31.3M/31.3M [00:02<00:00, 13.1MB/s]
Extracting files...
df = pd.read_csv(os.path.join(path, "train.csv"))
df.head()
Data exploration¶
# Basic information about our dataset
print("Dataset Information:")
print("=" * 50)
df.info()
print("\nTarget Variable Statistics:")
print("=" * 50)
print(df["log_price"].describe())
# Check for missing values
print("\nMissing Values:")
print("=" * 50)
missing_info = df.isnull().sum()
missing_info = missing_info[missing_info > 0].sort_values(ascending=False)
print(missing_info)
Dataset Information:
==================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74111 entries, 0 to 74110
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 74111 non-null int64
1 log_price 74111 non-null float64
2 property_type 74111 non-null object
3 room_type 74111 non-null object
4 amenities 74111 non-null object
5 accommodates 74111 non-null int64
6 bathrooms 73911 non-null float64
7 bed_type 74111 non-null object
8 cancellation_policy 74111 non-null object
9 cleaning_fee 74111 non-null bool
10 city 74111 non-null object
11 description 74111 non-null object
12 first_review 58247 non-null object
13 host_has_profile_pic 73923 non-null object
14 host_identity_verified 73923 non-null object
15 host_response_rate 55812 non-null object
16 host_since 73923 non-null object
17 instant_bookable 74111 non-null object
18 last_review 58284 non-null object
19 latitude 74111 non-null float64
20 longitude 74111 non-null float64
21 name 74111 non-null object
22 neighbourhood 67239 non-null object
23 number_of_reviews 74111 non-null int64
24 review_scores_rating 57389 non-null float64
25 thumbnail_url 65895 non-null object
26 zipcode 73145 non-null object
27 bedrooms 74020 non-null float64
28 beds 73980 non-null float64
dtypes: bool(1), float64(7), int64(3), object(18)
memory usage: 15.9+ MB
Target Variable Statistics:
==================================================
count 74111.000000
mean 4.782069
std 0.717394
min 0.000000
25% 4.317488
50% 4.709530
75% 5.220356
max 7.600402
Name: log_price, dtype: float64
Missing Values:
==================================================
host_response_rate 18299
review_scores_rating 16722
first_review 15864
last_review 15827
thumbnail_url 8216
neighbourhood 6872
zipcode 966
bathrooms 200
host_has_profile_pic 188
host_identity_verified 188
host_since 188
beds 131
bedrooms 91
dtype: int64
Distribution of the target variable log_price¶
The log transformation produces a more symmetric distribution. At lower prices, differences of a given absolute size matter more than in the higher ranges; the log transformation maps this onto the numerical scale of the target variable.
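A quick numeric check of this point, assuming the natural logarithm as used for log_price: the same $25 gap is a far bigger move in log space at the low end of the market than at the high end.

```python
import numpy as np

# Identical absolute price differences, very different log differences.
low = np.log(75) - np.log(50)     # $50 -> $75 at the low end
high = np.log(525) - np.log(500)  # $500 -> $525 at the high end
print(round(low, 3), round(high, 3))  # the low-end gap is ~8x larger in log space
```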
# Histogram of log prices
fig = px.histogram(
df,
x="log_price",
nbins=50,
title="Distribution of Log-Transformed Airbnb Prices",
labels={"log_price": "Log Price", "count": "Number of Listings"},
opacity=0.7,
marginal="box", # Add box plot on top
)
fig.update_layout(showlegend=False, height=500)
fig.show()
# Calculate actual price statistics (inverse log transformation)
df["price"] = np.exp(df["log_price"])
print("Actual Price Statistics:")
print(f"Mean: ${df.price.mean():.2f}")
print(f"Median: ${df.price.median():.2f}")
print(f"Min: ${df.price.min():.2f}")
print(f"Max: ${df.price.max():.2f}")
Actual Price Statistics:
Mean: $160.37
Median: $111.00
Min: $1.00
Max: $1999.00
# Histogram of actual prices
fig = px.histogram(
df,
x="price",
nbins=50,
title="Distribution of Actual Airbnb Prices",
labels={"price": "Actual Price", "count": "Number of Listings"},
opacity=0.7,
marginal="box", # Add box plot on top
)
fig.update_layout(showlegend=False, height=500)
fig.show()
Categorical features¶
# Room type distribution
room_type_counts = df["room_type"].value_counts()
fig = px.bar(
x=room_type_counts.index,
y=room_type_counts.values,
title="Distribution of Room Types",
labels={"x": "Room Type", "y": "Count"},
color=room_type_counts.index,
text=room_type_counts.values,
)
fig.update_traces(texttemplate="%{text:.0f}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
# Property type distribution
property_type_counts = df["property_type"].value_counts().head(10)
fig = px.bar(
x=property_type_counts.index,
y=property_type_counts.values,
title="Top 10 Property Types",
labels={"x": "Property Type", "y": "Count"},
color=property_type_counts.index,
text=property_type_counts.values,
)
fig.update_traces(texttemplate="%{text:.0f}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.update_xaxes(tickangle=45)
fig.show()
# City distribution
city_counts = df["city"].value_counts()
fig = px.bar(
x=city_counts.index,
y=city_counts.values,
title="Distribution of Listings by City",
labels={"x": "City", "y": "Number of Listings"},
color=city_counts.index,
text=city_counts.values,
)
fig.update_traces(texttemplate="%{text:.0f}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
# Average price by city
city_avg_price = df.groupby("city")["log_price"].mean().sort_values(ascending=False)
fig = px.bar(
x=city_avg_price.index,
y=city_avg_price.values,
title="Average Log Price by City",
labels={"x": "City", "y": "Average Log Price"},
color=city_avg_price.values,
color_continuous_scale="Viridis",
text=[f"{val:.3f}" for val in city_avg_price.values],
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
Geographical analysis¶
def create_price_map(city_name, df_sample):
    """Create an interactive map showing Airbnb prices for a specific city."""
    # Sample data for performance (use fraction based on city size)
    sample_frac = 0.3 if city_name in ["NYC", "LA"] else 0.8
    city_data = df_sample[df_sample["city"] == city_name].sample(frac=sample_frac, random_state=42)
    # Create the map
    fig = px.scatter_map(
        city_data,
        lat="latitude",
        lon="longitude",
        color="log_price",
        color_continuous_scale="Viridis",
        range_color=[df["log_price"].min(), df["log_price"].max()],
        hover_data={
            "log_price": ":.3f",
            "room_type": True,
            "bedrooms": True,
            "neighbourhood": True,
        },
        title=f"Airbnb Prices in {city_name}",
        labels={"log_price": "Log Price", "room_type": "Room Type"},
        zoom=10,
        height=600,
    )
    fig.update_layout(
        map_style="open-street-map",  # px.scatter_map uses the map_* layout keys, not mapbox_*
        coloraxis_colorbar={
            "title": "Log Price",
            "thicknessmode": "pixels",
            "thickness": 30,
            "lenmode": "fraction",
            "len": 0.8,
        },
    )
    return fig
# Create maps for major cities
cities_to_visualize = ["NYC", "LA", "Chicago", "Boston"]
for city in cities_to_visualize:
    if city in df["city"].unique():
        fig = create_price_map(city, df)
        fig.show()
Numerical features: correlation analysis¶
# Select numerical columns for correlation analysis
numerical_cols = [
"log_price",
"bedrooms",
"bathrooms",
"review_scores_rating",
"number_of_reviews",
"beds",
]
# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()
# Create heatmap
fig = px.imshow(
corr_matrix,
text_auto=True,
aspect="auto",
title="Correlation Heatmap of Numerical Features",
color_continuous_scale="RdBu_r",
zmin=-1,
zmax=1,
)
fig.update_layout(height=800, width=800)
fig.show()
# Show correlation with target variable
target_corr = corr_matrix["log_price"].drop("log_price").sort_values(ascending=False)
fig = px.bar(
x=target_corr.index,
y=target_corr.values,
title="Feature Correlation with Log Price",
labels={"x": "Feature", "y": "Correlation Coefficient"},
color=target_corr.values,
color_continuous_scale="RdBu_r",
text=[f"{val:.3f}" for val in target_corr.values],
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
# 1. Bathrooms - Fill with median (1.0 is most common)
print(f"Bathrooms missing: {df['bathrooms'].isnull().sum()}")
print(f"Bathrooms distribution:\n{df['bathrooms'].value_counts()}")
df["bathrooms"] = df["bathrooms"].fillna(1.0)
print(f"After filling: {df['bathrooms'].isnull().sum()} missing\n")
# 2. Review scores - Fill with 0 (indicates no reviews)
print(f"Review scores missing: {df['review_scores_rating'].isnull().sum()}")
print(f"Review scores distribution:\n{df['review_scores_rating'].value_counts().head()}")
df["review_scores_rating"] = df["review_scores_rating"].fillna(0)
print(f"After filling: {df['review_scores_rating'].isnull().sum()} missing\n")
# 3. Bedrooms - Fill with median (1.0 is most common)
print(f"Bedrooms missing: {df['bedrooms'].isnull().sum()}")
df["bedrooms"] = df["bedrooms"].fillna(1.0)
print(f"After filling: {df['bedrooms'].isnull().sum()} missing\n")
# 4. Beds - Fill with median (1.0 is most common)
print(f"Beds missing: {df['beds'].isnull().sum()}")
df["beds"] = df["beds"].fillna(1.0)
print(f"After filling: {df['beds'].isnull().sum()} missing\n")
# 5. Host response rate - Fill with mean
print(f"Host response rate missing: {df['host_response_rate'].isnull().sum()}")
if df["host_response_rate"].isnull().sum() > 0:
    # Convert percentage strings to numeric
    df["host_response_rate"] = df["host_response_rate"].apply(
        lambda x: float(str(x).rstrip("%")) / 100 if pd.notnull(x) and isinstance(x, str) else x
    )
    mean_response_rate = df["host_response_rate"].mean()
    df["host_response_rate"] = df["host_response_rate"].fillna(mean_response_rate)
print(f"After filling: {df['host_response_rate'].isnull().sum()} missing\n")
# 6. Host has profile pic - Fill with mode
print(f"Host has profile pic missing: {df['host_has_profile_pic'].isnull().sum()}")
if df["host_has_profile_pic"].isnull().sum() > 0:
    mode_profile_pic = df["host_has_profile_pic"].mode()[0]
    df["host_has_profile_pic"] = df["host_has_profile_pic"].fillna(mode_profile_pic)
print(f"After filling: {df['host_has_profile_pic'].isnull().sum()} missing\n")
# 7. Host identity verified - Fill with mode
print(f"Host identity verified missing: {df['host_identity_verified'].isnull().sum()}")
if df["host_identity_verified"].isnull().sum() > 0:
    mode_identity_verified = df["host_identity_verified"].mode()[0]
    df["host_identity_verified"] = df["host_identity_verified"].fillna(mode_identity_verified)
print(f"After filling: {df['host_identity_verified'].isnull().sum()} missing\n")
print("Final missing values check:")
print(df.isnull().sum().sort_values(ascending=False).head(10))
Bathrooms missing: 200
Bathrooms distribution:
bathrooms
1.0 58099
2.0 7936
1.5 3801
2.5 1567
3.0 1066
3.5 429
4.0 286
0.5 209
0.0 198
4.5 116
5.0 72
8.0 41
5.5 39
6.0 24
6.5 12
7.0 10
7.5 6
Name: count, dtype: int64
After filling: 0 missing
Review scores missing: 16722
Review scores distribution:
review_scores_rating
100.0 16215
98.0 4374
97.0 4087
96.0 4081
95.0 3713
Name: count, dtype: int64
After filling: 0 missing
Bedrooms missing: 91
After filling: 0 missing
Beds missing: 131
After filling: 0 missing
Host response rate missing: 18299
After filling: 0 missing
Host has profile pic missing: 188
After filling: 0 missing
Host identity verified missing: 188
After filling: 0 missing
Final missing values check:
first_review 15864
last_review 15827
thumbnail_url 8216
neighbourhood 6872
zipcode 966
host_since 188
accommodates 0
bathrooms 0
cancellation_policy 0
id 0
dtype: int64
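The per-column fills above can also be expressed with scikit-learn's SimpleImputer, which has the advantage that the fill statistics can later be fitted on training data only. A minimal sketch on a toy frame (strategy="median" roughly mirrors the manual fills):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the bathrooms/bedrooms columns with a few gaps.
toy = pd.DataFrame({"bathrooms": [1.0, 2.0, np.nan], "bedrooms": [np.nan, 1.0, 3.0]})

imputer = SimpleImputer(strategy="median")  # per-column median fill
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```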
Data types¶
# Convert boolean columns from 't/f' strings to 1/0 integers
boolean_columns = [
"cleaning_fee",
"instant_bookable",
"host_has_profile_pic",
"host_identity_verified",
]
for col in boolean_columns:
    print(f"Converting {col}:")
    print(f"Before: {df[col].unique()}")
    df[col] = df[col].map({"t": 1, "f": 0, True: 1, False: 0}).astype(int)
    print(f"After: {df[col].unique()}")
    print(f"Data type: {df[col].dtype}\n")
# Convert review scores from 0-100 scale to 0-1 scale
print(
f"Review scores range before: {df.review_scores_rating.min()} - {df.review_scores_rating.max()}"
)
df["review_scores_rating"] = df["review_scores_rating"] / 100
print(
f"Review scores range after: {df.review_scores_rating.min()} - {df.review_scores_rating.max()}"
)
# Normalize number of reviews (divide by max to get 0-1 scale)
max_reviews = df.number_of_reviews.max()
print(f"Max reviews: {max_reviews}")
df["number_of_reviews"] = df.number_of_reviews / max_reviews
print(f"Normalized reviews range: {df.number_of_reviews.min()} - {df.number_of_reviews.max()}")
Converting cleaning_fee:
Before: [ True False]
After: [1 0]
Data type: int64
Converting instant_bookable:
Before: ['f' 't']
After: [0 1]
Data type: int64
Converting host_has_profile_pic:
Before: ['t' 'f']
After: [1 0]
Data type: int64
Converting host_identity_verified:
Before: ['t' 'f']
After: [1 0]
Data type: int64
Review scores range before: 0.0 - 100.0
Review scores range after: 0.0 - 1.0
Max reviews: 605
Normalized reviews range: 0.0 - 1.0
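One caveat: dividing by the maximum of the full dataset bakes information from future test rows into the features. A leakage-safe sketch (assuming scikit-learn's MinMaxScaler and toy review counts) fits the scaling statistics on the training portion only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy review counts standing in for number_of_reviews.
reviews = np.array([[0.0], [5.0], [20.0], [100.0], [605.0]])
train, test = train_test_split(reviews, test_size=0.4, random_state=42)

scaler = MinMaxScaler().fit(train)    # statistics come from training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # may leave [0, 1] if test exceeds the train max
```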
Feature engineering¶
Amenities¶
This variable consists of a list of amenities and is not usable in that form. We convert the most frequent ones into binary (present/absent) features.
df.amenities.values[:3]
array(['{"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}',
'{"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}',
'{TV,"Cable TV","Wireless Internet","Air conditioning",Kitchen,Breakfast,"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"}'],
dtype=object)
def extract_amenities(amenities_str):
    """Extract individual amenities from the amenities string."""
    if pd.isna(amenities_str):
        return []
    # Remove quotes and braces, split by comma
    amenities_list = re.sub(r'["{}]', "", amenities_str).split(",")
    # Clean and filter out empty strings and "translation missing" entries
    amenities_list = [
        amenity.strip()
        for amenity in amenities_list
        if amenity.strip() and "translation missing" not in amenity
    ]
    return amenities_list
# Extract all unique amenities across the dataset
all_amenities = set()
for amenities_str in df["amenities"]:
    all_amenities.update(extract_amenities(amenities_str))
print(f"Total unique amenities found: {len(all_amenities)}")
print("\nTop 20 most common amenities:")
amenity_counts = {}
for amenities_str in df["amenities"]:
    for amenity in extract_amenities(amenities_str):
        amenity_counts[amenity] = amenity_counts.get(amenity, 0) + 1
# Sort amenities by count and display top 20
top_amenities = sorted(amenity_counts.items(), key=lambda x: x[1], reverse=True)[:20]
for amenity, count in top_amenities:
    print(f"- {amenity}: {count} listings ({count / len(df) * 100:.1f}%)")
Total unique amenities found: 128
Top 20 most common amenities:
- Wireless Internet: 71265 listings (96.2%)
- Kitchen: 67526 listings (91.1%)
- Heating: 67073 listings (90.5%)
- Essentials: 64005 listings (86.4%)
- Smoke detector: 61727 listings (83.3%)
- Air conditioning: 55210 listings (74.5%)
- TV: 52458 listings (70.8%)
- Shampoo: 49465 listings (66.7%)
- Hangers: 49173 listings (66.4%)
- Carbon monoxide detector: 47190 listings (63.7%)
- Internet: 44648 listings (60.2%)
- Laptop friendly workspace: 43703 listings (59.0%)
- Hair dryer: 43330 listings (58.5%)
- Washer: 43169 listings (58.2%)
- Dryer: 42711 listings (57.6%)
- Iron: 41687 listings (56.2%)
- Family/kid friendly: 37026 listings (50.0%)
- Fire extinguisher: 30724 listings (41.5%)
- First aid kit: 27532 listings (37.1%)
- Cable TV: 24253 listings (32.7%)
# Create binary features for top amenities
top_amenities_list = [amenity for amenity, count in top_amenities[:15]] # Top 15 amenities
for amenity in top_amenities_list:
    feature_name = (
        f"amenity_{amenity.lower().replace(' ', '_').replace('/', '_').replace('-', '_')}"
    )
    df[feature_name] = df["amenities"].apply(lambda x: 1 if amenity in str(x) else 0)
print(f"Added {len(top_amenities_list)} amenity features")
print("New columns:", [col for col in df.columns if col.startswith("amenity_")][:10])
Added 15 amenity features
New columns: ['amenity_wireless_internet', 'amenity_kitchen', 'amenity_heating', 'amenity_essentials', 'amenity_smoke_detector', 'amenity_air_conditioning', 'amenity_tv', 'amenity_shampoo', 'amenity_hangers', 'amenity_carbon_monoxide_detector']
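Note that the substring check `amenity in str(x)` also fires for overlapping names ("Internet" matches "Wireless Internet", "TV" matches "Cable TV"). An alternative sketch using scikit-learn's MultiLabelBinarizer on the parsed lists matches whole tokens instead:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy parsed amenity lists, as extract_amenities() would return them.
parsed = [["Wireless Internet", "Kitchen"], ["Internet", "TV"], ["Cable TV"]]

mlb = MultiLabelBinarizer()
binary = pd.DataFrame(mlb.fit_transform(parsed), columns=mlb.classes_)
# "Internet" and "Wireless Internet" now stay distinct columns.
```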
Categorical encoding¶
# Property type: group rare categories
print("Property type distribution:")
property_counts = df["property_type"].value_counts()
print(property_counts)
# Group rare property types
threshold = 300 # Group types with fewer than 300 listings
rare_types = property_counts[property_counts < threshold].index
df["property_type"] = df["property_type"].replace(rare_types, "Other")
Property type distribution:
property_type
Apartment 49003
House 16511
Condominium 2658
Townhouse 1692
Loft 1244
Other 607
Guesthouse 498
Bed & Breakfast 462
Bungalow 366
Villa 179
Dorm 142
Guest suite 123
Camper/RV 94
Timeshare 77
Cabin 72
In-law 71
Hostel 70
Boutique hotel 69
Boat 65
Serviced apartment 21
Tent 18
Castle 13
Vacation home 11
Yurt 9
Hut 8
Treehouse 7
Chalet 6
Earth House 4
Tipi 3
Cave 2
Train 2
Casa particular 1
Parking Space 1
Lighthouse 1
Island 1
Name: count, dtype: int64
# One-hot encoding
categorical_columns = ["cancellation_policy", "city", "property_type", "room_type"]
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_categorical = encoder.fit_transform(df[categorical_columns])
# Get feature names
feature_names = encoder.get_feature_names_out(categorical_columns)
# Create final DataFrame
df_cat = pd.DataFrame(encoded_categorical, columns=feature_names)
# Bed type encoding (simplified)
print("Bed type distribution:")
print(df["bed_type"].value_counts())
# Convert to binary: Real Bed (2) vs Other (1)
df["bed_type"] = (
df["bed_type"]
.map({"Real Bed": 2, "Futon": 1, "Pull-out Sofa": 1, "Airbed": 1, "Couch": 1})
.fillna(1)
)
print("\nConverted bed_type to ordinal scale (1-2):")
print(df["bed_type"].value_counts())
Bed type distribution:
bed_type
Real Bed 72028
Futon 753
Pull-out Sofa 585
Airbed 477
Couch 268
Name: count, dtype: int64
Converted bed_type to ordinal scale (1-2):
bed_type
2 72028
1 2083
Name: count, dtype: int64
df_cat.head()
Neighborhood pricing¶
A new feature with ordinal, neighborhood-specific price levels.
# Create price per bedroom feature
df["price_per_bedroom"] = df["log_price"] / df["bedrooms"]
# Handle infinite values (division by zero)
df["price_per_bedroom"] = df["price_per_bedroom"].replace([np.inf, -np.inf], np.nan)
# Calculate average price per bedroom by neighborhood
neighborhood_avg = df.groupby("neighbourhood")["price_per_bedroom"].mean()
# Handle any remaining infinite values in neighborhood averages
neighborhood_avg = neighborhood_avg.replace([np.inf, -np.inf], np.nan)
neighborhood_avg = neighborhood_avg.fillna(neighborhood_avg.mean())
print("Top 10 most expensive neighborhoods:")
print(neighborhood_avg.sort_values(ascending=False).head(10))
print("\nTop 10 least expensive neighborhoods:")
print(neighborhood_avg.sort_values(ascending=True).head(10))
Top 10 most expensive neighborhoods:
neighbourhood
Wilmington 7.170120
Harvard Square 5.480639
Coolidge Corner 5.416100
Government Center 5.357209
Lighthouse HIll 5.298317
Watertown 5.293305
Fort Wadsworth 5.101796
Printers Row 5.075174
Bethesda, MD 5.043425
Mission Bay 5.008402
Name: price_per_bedroom, dtype: float64
Top 10 least expensive neighborhoods:
neighbourhood
Mt Rainier/Brentwood, MD 0.945725
Mill Basin 1.035768
Rossville 1.100252
West Athens 1.288258
Castleton Corners 1.398678
Chevy Chase, MD 1.426180
Emerson Hill 1.534508
Observatory Circle 1.678846
Gateway 1.684834
Galewood 1.705165
Name: price_per_bedroom, dtype: float64
# Create neighborhood price level categories
percentiles = neighborhood_avg.quantile([0.25, 0.5, 0.75])  # compute cutpoints once
def categorize_neighborhood(price_per_bedroom):
    """Categorize a listing into a price level using neighborhood quantiles."""
    if pd.isna(price_per_bedroom):
        return 2  # Default to middle category
    if price_per_bedroom <= percentiles[0.25]:
        return 1  # Low price area
    if price_per_bedroom <= percentiles[0.75]:
        return 2  # Medium price area
    return 3  # High price area
# Apply categorization
df["neighborhood_price_level"] = df["price_per_bedroom"].map(categorize_neighborhood)
print("Neighborhood price level distribution:")
print(df["neighborhood_price_level"].value_counts())
# Clean up - remove temporary columns
df = df.drop(["price_per_bedroom", "neighbourhood"], axis=1)
Neighborhood price level distribution:
neighborhood_price_level
3 39928
1 18978
2 15205
Name: count, dtype: int64
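The per-row function above can also be written in vectorized form; a sketch with np.digitize on toy values (right=True keeps the same <= boundary behavior as the function):

```python
import numpy as np
import pandas as pd

# Toy price-per-bedroom values.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
q25, q75 = values.quantile([0.25, 0.75])

# Bin into 1 = low, 2 = medium, 3 = high using the quartile cutpoints.
levels = np.digitize(values, bins=[q25, q75], right=True) + 1
```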
Final data¶
cols = [
"log_price",
"accommodates",
"bathrooms",
"bed_type",
"city", # keep city in for now, to be used for stratified sampling (see below)
"cleaning_fee",
"host_has_profile_pic",
"host_identity_verified",
"host_response_rate",
"instant_bookable",
"number_of_reviews",
"review_scores_rating",
"bedrooms",
"beds",
"amenity_wireless_internet",
"amenity_kitchen",
"amenity_heating",
"amenity_essentials",
"amenity_smoke_detector",
"amenity_air_conditioning",
"amenity_tv",
"amenity_shampoo",
"amenity_hangers",
"amenity_carbon_monoxide_detector",
"amenity_internet",
"amenity_laptop_friendly_workspace",
"amenity_washer",
"amenity_hair_dryer",
"amenity_dryer",
"neighborhood_price_level",
]
df = pd.concat([df[cols], df_cat], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74111 entries, 0 to 74110
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 log_price 74111 non-null float64
1 accommodates 74111 non-null int64
2 bathrooms 74111 non-null float64
3 bed_type 74111 non-null int64
4 city 74111 non-null object
5 cleaning_fee 74111 non-null int64
6 host_has_profile_pic 74111 non-null int64
7 host_identity_verified 74111 non-null int64
8 host_response_rate 74111 non-null float64
9 instant_bookable 74111 non-null int64
10 number_of_reviews 74111 non-null float64
11 review_scores_rating 74111 non-null float64
12 bedrooms 74111 non-null float64
13 beds 74111 non-null float64
14 amenity_wireless_internet 74111 non-null int64
15 amenity_kitchen 74111 non-null int64
16 amenity_heating 74111 non-null int64
17 amenity_essentials 74111 non-null int64
18 amenity_smoke_detector 74111 non-null int64
19 amenity_air_conditioning 74111 non-null int64
20 amenity_tv 74111 non-null int64
21 amenity_shampoo 74111 non-null int64
22 amenity_hangers 74111 non-null int64
23 amenity_carbon_monoxide_detector 74111 non-null int64
24 amenity_internet 74111 non-null int64
25 amenity_laptop_friendly_workspace 74111 non-null int64
26 amenity_washer 74111 non-null int64
27 amenity_hair_dryer 74111 non-null int64
28 amenity_dryer 74111 non-null int64
29 neighborhood_price_level 74111 non-null int64
30 cancellation_policy_flexible 74111 non-null float64
31 cancellation_policy_moderate 74111 non-null float64
32 cancellation_policy_strict 74111 non-null float64
33 cancellation_policy_super_strict_30 74111 non-null float64
34 cancellation_policy_super_strict_60 74111 non-null float64
35 city_Boston 74111 non-null float64
36 city_Chicago 74111 non-null float64
37 city_DC 74111 non-null float64
38 city_LA 74111 non-null float64
39 city_NYC 74111 non-null float64
40 city_SF 74111 non-null float64
41 property_type_Apartment 74111 non-null float64
42 property_type_Bed & Breakfast 74111 non-null float64
43 property_type_Bungalow 74111 non-null float64
44 property_type_Condominium 74111 non-null float64
45 property_type_Guesthouse 74111 non-null float64
46 property_type_House 74111 non-null float64
47 property_type_Loft 74111 non-null float64
48 property_type_Other 74111 non-null float64
49 property_type_Townhouse 74111 non-null float64
50 room_type_Entire home/apt 74111 non-null float64
51 room_type_Private room 74111 non-null float64
52 room_type_Shared room 74111 non-null float64
dtypes: float64(30), int64(22), object(1)
memory usage: 30.0+ MB
# Create train/test split (validation folds come from cross-validation below)
train_df, test_df = train_test_split(
df,
test_size=0.2,
random_state=42,
stratify=df["city"], # Stratify by city for balanced representation
)
print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"\nSplit verification: {train_df.shape[0] + test_df.shape[0]} total samples")
Training set shape: (59288, 53)
Test set shape: (14823, 53)
Split verification: 74111 total samples
# Drop stratification column from features
train_df = train_df.drop("city", axis=1)
X_train = train_df.loc[:, train_df.columns != "log_price"]
y_train = train_df["log_price"]
print(f"Training features shape: {X_train.shape}")
print(f"Training target shape: {y_train.shape}")
test_df = test_df.drop("city", axis=1)
X_test = test_df.loc[:, test_df.columns != "log_price"]
y_test = test_df["log_price"]
print(f"Test features shape: {X_test.shape}")
print(f"Test target shape: {y_test.shape}")
Training features shape: (59288, 51)
Training target shape: (59288,)
Test features shape: (14823, 51)
Test target shape: (14823,)
# Linear Regression with cross-validation
linear_model = LinearRegression()
# 5-fold cross-validation
cv_scores = cross_val_score(linear_model, X_train, y_train, cv=5, scoring="r2")
print(f"Linear Regression Cross-Validation R² Scores: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
Linear Regression Cross-Validation R² Scores: [0.59495647 0.60093226 0.59434806 0.5850233 0.60406696]
Mean R²: 0.5959 (+/- 0.0131)
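Ridge is imported at the top but never used; a regularized linear baseline can be scored with the same cross_val_score pattern. A self-contained sketch on synthetic data (alpha=1.0 is an arbitrary choice, and these numbers have nothing to do with the Airbnb set):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train).
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

cv_ridge = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=5, scoring="r2")
print(f"Ridge mean R²: {cv_ridge.mean():.4f}")
```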
XGBoost Regression¶
# XGBoost with cross-validation
xgb_model = xgb.XGBRegressor(random_state=42, n_estimators=100, learning_rate=0.1, max_depth=6)
cv_scores_xgb = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring="r2")
print(f"XGBoost Cross-Validation R² Scores: {cv_scores_xgb}")
print(f"Mean R²: {cv_scores_xgb.mean():.4f} (+/- {cv_scores_xgb.std() * 2:.4f})")
# Compare all models
print("\nFinal Model Comparison:")
print(f"Linear Regression: {cv_scores.mean():.4f}")
print(f"XGBoost: {cv_scores_xgb.mean():.4f}")
# Train final model on full training data
xgb_model.fit(X_train, y_train)
XGBoost Cross-Validation R² Scores: [0.71489725 0.7151093 0.71387543 0.70976343 0.71964383]
Mean R²: 0.7147 (+/- 0.0063)
Final Model Comparison:
Linear Regression: 0.5959
XGBoost: 0.7147
Scoring¶
# Make predictions on the test set
test_predictions = xgb_model.predict(X_test)
# Calculate metrics
mae = mean_absolute_error(y_test, test_predictions)
mse = mean_squared_error(y_test, test_predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, test_predictions)
print("Test Set Performance:")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")
# Convert back to actual prices for interpretation
y_test_actual = np.exp(y_test)
test_predictions_actual = np.exp(test_predictions)
mae_actual = mean_absolute_error(y_test_actual, test_predictions_actual)
print("\nIn actual dollars:")
print(f"Mean actual price: ${y_test_actual.mean():.2f}")
print(f"MAE in dollars: ${mae_actual:.2f}")
print(f"This means our model is off by about ${mae_actual:.2f} on average")
Test Set Performance:
MAE: 0.2808
MSE: 0.1494
RMSE: 0.3866
R²: 0.7144
In actual dollars:
Mean actual price: $160.56
MAE in dollars: $53.48
This means our model is off by about $53.48 on average
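A caveat on converting log predictions back with np.exp: if the model targets the conditional mean of log_price, exp of that prediction estimates the median price and systematically undershoots the mean. For lognormal errors the standard correction is exp(mu + sigma²/2); a quick numeric check with simulated log prices (mu and sigma chosen only to roughly match the scale of log_price):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 4.7, 0.4                     # roughly the scale of log_price
log_prices = rng.normal(mu, sigma, 100_000)

naive = np.exp(mu)                       # exp of the mean log price ~ the median price
corrected = np.exp(mu + sigma**2 / 2)    # lognormal mean correction
actual_mean = np.exp(log_prices).mean()  # simulated true mean price
```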
# Feature importance from XGBoost
feature_importance = pd.DataFrame(
{"feature": X_train.columns, "importance": xgb_model.feature_importances_}
).sort_values("importance", ascending=False)
# Plot top 15 most important features
top_features = feature_importance.head(15)
fig = px.bar(
x=top_features["importance"],
y=top_features["feature"],
orientation="h",
title="Top 15 Most Important Features (XGBoost)",
labels={"x": "Feature Importance", "y": "Feature"},
text=[f"{val:.4f}" for val in top_features["importance"]],
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(height=600, yaxis={"categoryorder": "total ascending"})
fig.show()
print("\nBusiness Insights from Feature Importance:")
for rank, (_, row) in enumerate(top_features.head(10).iterrows(), start=1):
    print(f"{rank}. {row['feature']}: {row['importance']:.4f}")
Business Insights from Feature Importance:
1. room_type_Entire home/apt: 0.5896
2. neighborhood_price_level: 0.1311
3. bathrooms: 0.0718
4. bedrooms: 0.0272
5. city_LA: 0.0188
6. city_SF: 0.0147
7. accommodates: 0.0107
8. city_Chicago: 0.0097
9. city_DC: 0.0092
10. amenity_tv: 0.0079
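XGBoost's built-in importances are gain-based and can overweight features the trees split on most often; permutation importance is a common cross-check. A sketch with scikit-learn's permutation_importance on synthetic data (the same call accepts the fitted XGBRegressor in place of the toy model):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: only 2 of the 4 features actually drive the target.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)

# Shuffle each column in turn and measure how much the score drops.
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```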
# Create prediction vs actual plot
fig = px.scatter(
x=y_test,
y=test_predictions,
title="Predicted vs Actual Log Prices (Test Set)",
labels={"x": "Actual Log Price", "y": "Predicted Log Price"},
trendline="ols",
opacity=0.6,
)
# Add perfect prediction line
fig.add_trace(
go.Scatter(
x=[y_test.min(), y_test.max()],
y=[y_test.min(), y_test.max()],
mode="lines",
name="Perfect Prediction",
line={"color": "red", "dash": "dash"},
)
)
fig.update_layout(height=500, width=800)
fig.show()
# Residual plot
residuals = y_test - test_predictions
fig = px.scatter(
x=test_predictions,
y=residuals,
title="Residual Plot (Test Set)",
labels={"x": "Predicted Log Price", "y": "Residual (Actual - Predicted)"},
opacity=0.6,
)
# Add zero line
fig.add_hline(y=0, line_dash="dash", line_color="red")
fig.update_layout(height=500, width=800)
fig.show()