Based on: Airbnb : price prediction using XGBoost 🔥
Problem definition¶
Airbnb uses machine learning to suggest optimal prices to hosts. A well-calibrated pricing model helps hosts maximize their revenue while remaining competitive.
Task, Experience¶
The goal is to predict numerical values (price), which makes this a regression task. We want to train on actual observed prices, so we are dealing with supervised learning.
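This setup can be sketched in a few lines on toy numbers (not the Airbnb data): feature vectors describe a listing, the target is its (log) price, and a regressor learns the mapping from labeled examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy supervised-regression setup: two features per listing (e.g. bedrooms,
# bathrooms) and an observed log price as the training label.
X = np.array([[1, 1.0], [2, 1.0], [3, 2.0], [4, 2.5]])
y = np.array([4.0, 4.5, 5.1, 5.6])

model = LinearRegression().fit(X, y)  # learn from labeled examples
pred = model.predict([[2, 1.5]])      # predict a price for an unseen listing
```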
Data collection¶
We use an Airbnb dataset available on Kaggle, with prices from several large US cities. We focus on the following variables:
Numerical: bedrooms, bathrooms, review scores, etc.
Categorical: property type, room type, city
Text: amenities
Geographic: latitude, longitude, neighborhood
Target variable:
log_price
import os
import re
import kagglehub
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import xgboost as xgb
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder
# Load data
path = kagglehub.dataset_download("stevezhenghp/airbnb-price-prediction")
Downloading from https://www.kaggle.com/api/v1/datasets/download/stevezhenghp/airbnb-price-prediction?dataset_version_number=1...
100%|██████████| 31.3M/31.3M [00:02<00:00, 13.1MB/s]
Extracting files...
df = pd.read_csv(os.path.join(path, "train.csv"))
df.head()
Data exploration¶
# Basic information about our dataset
print("Dataset Information:")
print("=" * 50)
df.info()
print("\nTarget Variable Statistics:")
print("=" * 50)
print(df["log_price"].describe())
# Check for missing values
print("\nMissing Values:")
print("=" * 50)
missing_info = df.isnull().sum()
missing_info = missing_info[missing_info > 0].sort_values(ascending=False)
print(missing_info)
Dataset Information:
==================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74111 entries, 0 to 74110
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 74111 non-null int64
1 log_price 74111 non-null float64
2 property_type 74111 non-null object
3 room_type 74111 non-null object
4 amenities 74111 non-null object
5 accommodates 74111 non-null int64
6 bathrooms 73911 non-null float64
7 bed_type 74111 non-null object
8 cancellation_policy 74111 non-null object
9 cleaning_fee 74111 non-null bool
10 city 74111 non-null object
11 description 74111 non-null object
12 first_review 58247 non-null object
13 host_has_profile_pic 73923 non-null object
14 host_identity_verified 73923 non-null object
15 host_response_rate 55812 non-null object
16 host_since 73923 non-null object
17 instant_bookable 74111 non-null object
18 last_review 58284 non-null object
19 latitude 74111 non-null float64
20 longitude 74111 non-null float64
21 name 74111 non-null object
22 neighbourhood 67239 non-null object
23 number_of_reviews 74111 non-null int64
24 review_scores_rating 57389 non-null float64
25 thumbnail_url 65895 non-null object
26 zipcode 73145 non-null object
27 bedrooms 74020 non-null float64
28 beds 73980 non-null float64
dtypes: bool(1), float64(7), int64(3), object(18)
memory usage: 15.9+ MB
Target Variable Statistics:
==================================================
count 74111.000000
mean 4.782069
std 0.717394
min 0.000000
25% 4.317488
50% 4.709530
75% 5.220356
max 7.600402
Name: log_price, dtype: float64
Missing Values:
==================================================
host_response_rate 18299
review_scores_rating 16722
first_review 15864
last_review 15827
thumbnail_url 8216
neighbourhood 6872
zipcode 966
bathrooms 200
host_has_profile_pic 188
host_identity_verified 188
host_since 188
beds 131
bedrooms 91
dtype: int64
Distribution of the target variable log_price¶
The log transformation produces a more symmetric distribution. At lower prices, differences of a given absolute size matter more than in the higher ranges; the log transformation maps this onto the numerical scale of the target variable.
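A quick numeric check of this point, assuming the natural logarithm as used for log_price: the same $25 gap is a far bigger move in log space at the low end of the market than at the high end.

```python
import numpy as np

# Identical absolute price differences, very different log differences.
low = np.log(75) - np.log(50)     # $50 -> $75 at the low end
high = np.log(525) - np.log(500)  # $500 -> $525 at the high end
print(round(low, 3), round(high, 3))  # the low-end gap is ~8x larger in log space
```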
# Histogram of log prices
fig = px.histogram(
df,
x="log_price",
nbins=50,
title="Distribution of Log-Transformed Airbnb Prices",
labels={"log_price": "Log Price", "count": "Number of Listings"},
opacity=0.7,
marginal="box", # Add box plot on top
)
fig.update_layout(showlegend=False, height=500)
fig.show()
# Calculate actual price statistics (inverse log transformation)
df["price"] = np.exp(df["log_price"])
print("Actual Price Statistics:")
print(f"Mean: ${df.price.mean():.2f}")
print(f"Median: ${df.price.median():.2f}")
print(f"Min: ${df.price.min():.2f}")
print(f"Max: ${df.price.max():.2f}")
Actual Price Statistics:
Mean: $160.37
Median: $111.00
Min: $1.00
Max: $1999.00
# Histogram of actual prices
fig = px.histogram(
df,
x="price",
nbins=50,
title="Distribution of Actual Airbnb Prices",
labels={"price": "Actual Price", "count": "Number of Listings"},
opacity=0.7,
marginal="box", # Add box plot on top
)
fig.update_layout(showlegend=False, height=500)
fig.show()
Categorical features¶
# Room type distribution
room_type_counts = df["room_type"].value_counts()
fig = px.bar(
x=room_type_counts.index,
y=room_type_counts.values,
title="Distribution of Room Types",
labels={"x": "Room Type", "y": "Count"},
color=room_type_counts.index,
text=room_type_counts.values,
)
fig.update_traces(texttemplate="%{text:.0f}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
# Property type distribution
property_type_counts = df["property_type"].value_counts().head(10)
fig = px.bar(
x=property_type_counts.index,
y=property_type_counts.values,
title="Top 10 Property Types",
labels={"x": "Property Type", "y": "Count"},
color=property_type_counts.index,
text=property_type_counts.values,
)
fig.update_traces(texttemplate="%{text:.0f}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.update_xaxes(tickangle=45)
fig.show()
# City distribution
city_counts = df["city"].value_counts()
fig = px.bar(
x=city_counts.index,
y=city_counts.values,
title="Distribution of Listings by City",
labels={"x": "City", "y": "Number of Listings"},
color=city_counts.index,
text=city_counts.values,
)
fig.update_traces(texttemplate="%{text:.0f}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
# Average price by city
city_avg_price = df.groupby("city")["log_price"].mean().sort_values(ascending=False)
fig = px.bar(
x=city_avg_price.index,
y=city_avg_price.values,
title="Average Log Price by City",
labels={"x": "City", "y": "Average Log Price"},
color=city_avg_price.values,
color_continuous_scale="Viridis",
text=[f"{val:.3f}" for val in city_avg_price.values],
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
Geographical analysis¶
def create_price_map(city_name, df_sample):
    """Create an interactive map showing Airbnb prices for a specific city."""
    # Sample data for performance (use fraction based on city size)
    sample_frac = 0.3 if city_name in ["NYC", "LA"] else 0.8
    city_data = df_sample[df_sample["city"] == city_name].sample(frac=sample_frac, random_state=42)
    # Create the map
    fig = px.scatter_map(
        city_data,
        lat="latitude",
        lon="longitude",
        color="log_price",
        color_continuous_scale="Viridis",
        range_color=[df["log_price"].min(), df["log_price"].max()],
        hover_data={
            "log_price": ":.3f",
            "room_type": True,
            "bedrooms": True,
            "neighbourhood": True,
        },
        title=f"Airbnb Prices in {city_name}",
        labels={"log_price": "Log Price", "room_type": "Room Type"},
        zoom=10,
        height=600,
    )
    fig.update_layout(
        map_style="open-street-map",  # px.scatter_map uses the map_* layout keys, not mapbox_*
        coloraxis_colorbar={
            "title": "Log Price",
            "thicknessmode": "pixels",
            "thickness": 30,
            "lenmode": "fraction",
            "len": 0.8,
        },
    )
    return fig
# Create maps for major cities
cities_to_visualize = ["NYC", "LA", "Chicago", "Boston"]
for city in cities_to_visualize:
    if city in df["city"].unique():
        fig = create_price_map(city, df)
        fig.show()
Numerical features: correlation analysis¶
# Select numerical columns for correlation analysis
numerical_cols = [
"log_price",
"bedrooms",
"bathrooms",
"review_scores_rating",
"number_of_reviews",
"beds",
]
# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()
# Create heatmap
fig = px.imshow(
corr_matrix,
text_auto=True,
aspect="auto",
title="Correlation Heatmap of Numerical Features",
color_continuous_scale="RdBu_r",
zmin=-1,
zmax=1,
)
fig.update_layout(height=800, width=800)
fig.show()
# Show correlation with target variable
target_corr = corr_matrix["log_price"].drop("log_price").sort_values(ascending=False)
fig = px.bar(
x=target_corr.index,
y=target_corr.values,
title="Feature Correlation with Log Price",
labels={"x": "Feature", "y": "Correlation Coefficient"},
color=target_corr.values,
color_continuous_scale="RdBu_r",
text=[f"{val:.3f}" for val in target_corr.values],
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(showlegend=False, height=500)
fig.show()
# 1. Bathrooms - Fill with median (1.0 is most common)
print(f"Bathrooms missing: {df['bathrooms'].isnull().sum()}")
print(f"Bathrooms distribution:\n{df['bathrooms'].value_counts()}")
df["bathrooms"] = df["bathrooms"].fillna(1.0)
print(f"After filling: {df['bathrooms'].isnull().sum()} missing\n")
# 2. Review scores - Fill with 0 (indicates no reviews)
print(f"Review scores missing: {df['review_scores_rating'].isnull().sum()}")
print(f"Review scores distribution:\n{df['review_scores_rating'].value_counts().head()}")
df["review_scores_rating"] = df["review_scores_rating"].fillna(0)
print(f"After filling: {df['review_scores_rating'].isnull().sum()} missing\n")
# 3. Bedrooms - Fill with median (1.0 is most common)
print(f"Bedrooms missing: {df['bedrooms'].isnull().sum()}")
df["bedrooms"] = df["bedrooms"].fillna(1.0)
print(f"After filling: {df['bedrooms'].isnull().sum()} missing\n")
# 4. Beds - Fill with median (1.0 is most common)
print(f"Beds missing: {df['beds'].isnull().sum()}")
df["beds"] = df["beds"].fillna(1.0)
print(f"After filling: {df['beds'].isnull().sum()} missing\n")
# 5. Host response rate - Fill with mean
print(f"Host response rate missing: {df['host_response_rate'].isnull().sum()}")
if df["host_response_rate"].isnull().sum() > 0:
    # Convert percentage strings to numeric
    df["host_response_rate"] = df["host_response_rate"].apply(
        lambda x: float(str(x).rstrip("%")) / 100 if pd.notnull(x) and isinstance(x, str) else x
    )
    mean_response_rate = df["host_response_rate"].mean()
    df["host_response_rate"] = df["host_response_rate"].fillna(mean_response_rate)
print(f"After filling: {df['host_response_rate'].isnull().sum()} missing\n")
# 6. Host has profile pic - Fill with mode
print(f"Host has profile pic missing: {df['host_has_profile_pic'].isnull().sum()}")
if df["host_has_profile_pic"].isnull().sum() > 0:
    mode_profile_pic = df["host_has_profile_pic"].mode()[0]
    df["host_has_profile_pic"] = df["host_has_profile_pic"].fillna(mode_profile_pic)
print(f"After filling: {df['host_has_profile_pic'].isnull().sum()} missing\n")
# 7. Host identity verified - Fill with mode
print(f"Host identity verified missing: {df['host_identity_verified'].isnull().sum()}")
if df["host_identity_verified"].isnull().sum() > 0:
    mode_identity_verified = df["host_identity_verified"].mode()[0]
    df["host_identity_verified"] = df["host_identity_verified"].fillna(mode_identity_verified)
print(f"After filling: {df['host_identity_verified'].isnull().sum()} missing\n")
print("Final missing values check:")
print(df.isnull().sum().sort_values(ascending=False).head(10))
Bathrooms missing: 200
Bathrooms distribution:
bathrooms
1.0 58099
2.0 7936
1.5 3801
2.5 1567
3.0 1066
3.5 429
4.0 286
0.5 209
0.0 198
4.5 116
5.0 72
8.0 41
5.5 39
6.0 24
6.5 12
7.0 10
7.5 6
Name: count, dtype: int64
After filling: 0 missing
Review scores missing: 16722
Review scores distribution:
review_scores_rating
100.0 16215
98.0 4374
97.0 4087
96.0 4081
95.0 3713
Name: count, dtype: int64
After filling: 0 missing
Bedrooms missing: 91
After filling: 0 missing
Beds missing: 131
After filling: 0 missing
Host response rate missing: 18299
After filling: 0 missing
Host has profile pic missing: 188
After filling: 0 missing
Host identity verified missing: 188
After filling: 0 missing
Final missing values check:
first_review 15864
last_review 15827
thumbnail_url 8216
neighbourhood 6872
zipcode 966
host_since 188
accommodates 0
bathrooms 0
cancellation_policy 0
id 0
dtype: int64
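The per-column fills above can also be expressed with scikit-learn's SimpleImputer, which has the advantage that the fill statistics can later be fitted on training data only. A minimal sketch on a toy frame (strategy="median" roughly mirrors the manual fills):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the bathrooms/bedrooms columns with a few gaps.
toy = pd.DataFrame({"bathrooms": [1.0, 2.0, np.nan], "bedrooms": [np.nan, 1.0, 3.0]})

imputer = SimpleImputer(strategy="median")  # per-column median fill
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```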
Data types¶
# Convert boolean columns from 't/f' strings to 1/0 integers
boolean_columns = [
"cleaning_fee",
"instant_bookable",
"host_has_profile_pic",
"host_identity_verified",
]
for col in boolean_columns:
    print(f"Converting {col}:")
    print(f"Before: {df[col].unique()}")
    df[col] = df[col].map({"t": 1, "f": 0, True: 1, False: 0}).astype(int)
    print(f"After: {df[col].unique()}")
    print(f"Data type: {df[col].dtype}\n")
# Convert review scores from 0-100 scale to 0-1 scale
print(
f"Review scores range before: {df.review_scores_rating.min()} - {df.review_scores_rating.max()}"
)
df["review_scores_rating"] = df["review_scores_rating"] / 100
print(
f"Review scores range after: {df.review_scores_rating.min()} - {df.review_scores_rating.max()}"
)
# Normalize number of reviews (divide by max to get 0-1 scale)
max_reviews = df.number_of_reviews.max()
print(f"Max reviews: {max_reviews}")
df["number_of_reviews"] = df.number_of_reviews / max_reviews
print(f"Normalized reviews range: {df.number_of_reviews.min()} - {df.number_of_reviews.max()}")
Converting cleaning_fee:
Before: [ True False]
After: [1 0]
Data type: int64
Converting instant_bookable:
Before: ['f' 't']
After: [0 1]
Data type: int64
Converting host_has_profile_pic:
Before: ['t' 'f']
After: [1 0]
Data type: int64
Converting host_identity_verified:
Before: ['t' 'f']
After: [1 0]
Data type: int64
Review scores range before: 0.0 - 100.0
Review scores range after: 0.0 - 1.0
Max reviews: 605
Normalized reviews range: 0.0 - 1.0
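One caveat: dividing by the maximum of the full dataset bakes information from future test rows into the features. A leakage-safe sketch (assuming scikit-learn's MinMaxScaler and toy review counts) fits the scaling statistics on the training portion only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy review counts standing in for number_of_reviews.
reviews = np.array([[0.0], [5.0], [20.0], [100.0], [605.0]])
train, test = train_test_split(reviews, test_size=0.4, random_state=42)

scaler = MinMaxScaler().fit(train)    # statistics come from training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # may leave [0, 1] if test exceeds the train max
```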
Feature engineering¶
Amenities¶
This variable consists of a list of amenities and is not usable in that form. We convert the most frequent ones into binary (present/absent) features.
df.amenities.values[:3]
array(['{"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}',
'{"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}',
'{TV,"Cable TV","Wireless Internet","Air conditioning",Kitchen,Breakfast,"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"}'],
dtype=object)
def extract_amenities(amenities_str):
    """Extract individual amenities from the amenities string."""
    if pd.isna(amenities_str):
        return []
    # Remove quotes and braces, split by comma
    amenities_list = re.sub(r'["{}]', "", amenities_str).split(",")
    # Clean and filter out empty strings and "translation missing" entries
    amenities_list = [
        amenity.strip()
        for amenity in amenities_list
        if amenity.strip() and "translation missing" not in amenity
    ]
    return amenities_list
# Extract all unique amenities across the dataset
all_amenities = set()
for amenities_str in df["amenities"]:
    all_amenities.update(extract_amenities(amenities_str))
print(f"Total unique amenities found: {len(all_amenities)}")
print("\nTop 20 most common amenities:")
amenity_counts = {}
for amenities_str in df["amenities"]:
    for amenity in extract_amenities(amenities_str):
        amenity_counts[amenity] = amenity_counts.get(amenity, 0) + 1
# Sort amenities by count and display top 20
top_amenities = sorted(amenity_counts.items(), key=lambda x: x[1], reverse=True)[:20]
for amenity, count in top_amenities:
    print(f"- {amenity}: {count} listings ({count / len(df) * 100:.1f}%)")
Total unique amenities found: 128
Top 20 most common amenities:
- Wireless Internet: 71265 listings (96.2%)
- Kitchen: 67526 listings (91.1%)
- Heating: 67073 listings (90.5%)
- Essentials: 64005 listings (86.4%)
- Smoke detector: 61727 listings (83.3%)
- Air conditioning: 55210 listings (74.5%)
- TV: 52458 listings (70.8%)
- Shampoo: 49465 listings (66.7%)
- Hangers: 49173 listings (66.4%)
- Carbon monoxide detector: 47190 listings (63.7%)
- Internet: 44648 listings (60.2%)
- Laptop friendly workspace: 43703 listings (59.0%)
- Hair dryer: 43330 listings (58.5%)
- Washer: 43169 listings (58.2%)
- Dryer: 42711 listings (57.6%)
- Iron: 41687 listings (56.2%)
- Family/kid friendly: 37026 listings (50.0%)
- Fire extinguisher: 30724 listings (41.5%)
- First aid kit: 27532 listings (37.1%)
- Cable TV: 24253 listings (32.7%)
# Create binary features for top amenities
top_amenities_list = [amenity for amenity, count in top_amenities[:15]] # Top 15 amenities
for amenity in top_amenities_list:
    feature_name = (
        f"amenity_{amenity.lower().replace(' ', '_').replace('/', '_').replace('-', '_')}"
    )
    df[feature_name] = df["amenities"].apply(lambda x: 1 if amenity in str(x) else 0)
print(f"Added {len(top_amenities_list)} amenity features")
print("New columns:", [col for col in df.columns if col.startswith("amenity_")][:10])
Added 15 amenity features
New columns: ['amenity_wireless_internet', 'amenity_kitchen', 'amenity_heating', 'amenity_essentials', 'amenity_smoke_detector', 'amenity_air_conditioning', 'amenity_tv', 'amenity_shampoo', 'amenity_hangers', 'amenity_carbon_monoxide_detector']
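Note that the substring check `amenity in str(x)` also fires for overlapping names ("Internet" matches "Wireless Internet", "TV" matches "Cable TV"). An alternative sketch using scikit-learn's MultiLabelBinarizer on the parsed lists matches whole tokens instead:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy parsed amenity lists, as extract_amenities() would return them.
parsed = [["Wireless Internet", "Kitchen"], ["Internet", "TV"], ["Cable TV"]]

mlb = MultiLabelBinarizer()
binary = pd.DataFrame(mlb.fit_transform(parsed), columns=mlb.classes_)
# "Internet" and "Wireless Internet" now stay distinct columns.
```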
Categorical encoding¶
# Property type: group rare categories
print("Property type distribution:")
property_counts = df["property_type"].value_counts()
print(property_counts)
# Group rare property types
threshold = 300 # Group types with fewer than 300 listings
rare_types = property_counts[property_counts < threshold].index
df["property_type"] = df["property_type"].replace(rare_types, "Other")
Property type distribution:
property_type
Apartment 49003
House 16511
Condominium 2658
Townhouse 1692
Loft 1244
Other 607
Guesthouse 498
Bed & Breakfast 462
Bungalow 366
Villa 179
Dorm 142
Guest suite 123
Camper/RV 94
Timeshare 77
Cabin 72
In-law 71
Hostel 70
Boutique hotel 69
Boat 65
Serviced apartment 21
Tent 18
Castle 13
Vacation home 11
Yurt 9
Hut 8
Treehouse 7
Chalet 6
Earth House 4
Tipi 3
Cave 2
Train 2
Casa particular 1
Parking Space 1
Lighthouse 1
Island 1
Name: count, dtype: int64
# One-hot encoding
categorical_columns = ["cancellation_policy", "city", "property_type", "room_type"]
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_categorical = encoder.fit_transform(df[categorical_columns])
# Get feature names
feature_names = encoder.get_feature_names_out(categorical_columns)
# Create final DataFrame
df_cat = pd.DataFrame(encoded_categorical, columns=feature_names)
# Bed type encoding (simplified)
print("Bed type distribution:")
print(df["bed_type"].value_counts())
# Convert to binary: Real Bed (2) vs Other (1)
df["bed_type"] = (
df["bed_type"]
.map({"Real Bed": 2, "Futon": 1, "Pull-out Sofa": 1, "Airbed": 1, "Couch": 1})
.fillna(1)
)
print("\nConverted bed_type to ordinal scale (1-2):")
print(df["bed_type"].value_counts())
Bed type distribution:
bed_type
Real Bed 72028
Futon 753
Pull-out Sofa 585
Airbed 477
Couch 268
Name: count, dtype: int64
Converted bed_type to ordinal scale (1-2):
bed_type
2 72028
1 2083
Name: count, dtype: int64
df_cat.head()
Neighborhood pricing¶
A new feature with ordinal, neighborhood-specific price levels.
# Create price per bedroom feature
df["price_per_bedroom"] = df["log_price"] / df["bedrooms"]
# Handle infinite values (division by zero)
df["price_per_bedroom"] = df["price_per_bedroom"].replace([np.inf, -np.inf], np.nan)
# Calculate average price per bedroom by neighborhood
neighborhood_avg = df.groupby("neighbourhood")["price_per_bedroom"].mean()
# Handle any remaining infinite values in neighborhood averages
neighborhood_avg = neighborhood_avg.replace([np.inf, -np.inf], np.nan)
neighborhood_avg = neighborhood_avg.fillna(neighborhood_avg.mean())
print("Top 10 most expensive neighborhoods:")
print(neighborhood_avg.sort_values(ascending=False).head(10))
print("\nTop 10 least expensive neighborhoods:")
print(neighborhood_avg.sort_values(ascending=True).head(10))
Top 10 most expensive neighborhoods:
neighbourhood
Wilmington 7.170120
Harvard Square 5.480639
Coolidge Corner 5.416100
Government Center 5.357209
Lighthouse HIll 5.298317
Watertown 5.293305
Fort Wadsworth 5.101796
Printers Row 5.075174
Bethesda, MD 5.043425
Mission Bay 5.008402
Name: price_per_bedroom, dtype: float64
Top 10 least expensive neighborhoods:
neighbourhood
Mt Rainier/Brentwood, MD 0.945725
Mill Basin 1.035768
Rossville 1.100252
West Athens 1.288258
Castleton Corners 1.398678
Chevy Chase, MD 1.426180
Emerson Hill 1.534508
Observatory Circle 1.678846
Gateway 1.684834
Galewood 1.705165
Name: price_per_bedroom, dtype: float64
# Create neighborhood price level categories
percentiles = neighborhood_avg.quantile([0.25, 0.5, 0.75])  # compute cutpoints once
def categorize_neighborhood(price_per_bedroom):
    """Categorize a listing into a price level using neighborhood quantiles."""
    if pd.isna(price_per_bedroom):
        return 2  # Default to middle category
    if price_per_bedroom <= percentiles[0.25]:
        return 1  # Low price area
    if price_per_bedroom <= percentiles[0.75]:
        return 2  # Medium price area
    return 3  # High price area
# Apply categorization
df["neighborhood_price_level"] = df["price_per_bedroom"].map(categorize_neighborhood)
print("Neighborhood price level distribution:")
print(df["neighborhood_price_level"].value_counts())
# Clean up - remove temporary columns
df = df.drop(["price_per_bedroom", "neighbourhood"], axis=1)
Neighborhood price level distribution:
neighborhood_price_level
3 39928
1 18978
2 15205
Name: count, dtype: int64
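The per-row function above can also be written in vectorized form; a sketch with np.digitize on toy values (right=True keeps the same <= boundary behavior as the function):

```python
import numpy as np
import pandas as pd

# Toy price-per-bedroom values.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
q25, q75 = values.quantile([0.25, 0.75])

# Bin into 1 = low, 2 = medium, 3 = high using the quartile cutpoints.
levels = np.digitize(values, bins=[q25, q75], right=True) + 1
```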
Final data¶
cols = [
"log_price",
"accommodates",
"bathrooms",
"bed_type",
"city", # keep city in for now, to be used for stratified sampling (see below)
"cleaning_fee",
"host_has_profile_pic",
"host_identity_verified",
"host_response_rate",
"instant_bookable",
"number_of_reviews",
"review_scores_rating",
"bedrooms",
"beds",
"amenity_wireless_internet",
"amenity_kitchen",
"amenity_heating",
"amenity_essentials",
"amenity_smoke_detector",
"amenity_air_conditioning",
"amenity_tv",
"amenity_shampoo",
"amenity_hangers",
"amenity_carbon_monoxide_detector",
"amenity_internet",
"amenity_laptop_friendly_workspace",
"amenity_washer",
"amenity_hair_dryer",
"amenity_dryer",
"neighborhood_price_level",
]
df = pd.concat([df[cols], df_cat], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74111 entries, 0 to 74110
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 log_price 74111 non-null float64
1 accommodates 74111 non-null int64
2 bathrooms 74111 non-null float64
3 bed_type 74111 non-null int64
4 city 74111 non-null object
5 cleaning_fee 74111 non-null int64
6 host_has_profile_pic 74111 non-null int64
7 host_identity_verified 74111 non-null int64
8 host_response_rate 74111 non-null float64
9 instant_bookable 74111 non-null int64
10 number_of_reviews 74111 non-null float64
11 review_scores_rating 74111 non-null float64
12 bedrooms 74111 non-null float64
13 beds 74111 non-null float64
14 amenity_wireless_internet 74111 non-null int64
15 amenity_kitchen 74111 non-null int64
16 amenity_heating 74111 non-null int64
17 amenity_essentials 74111 non-null int64
18 amenity_smoke_detector 74111 non-null int64
19 amenity_air_conditioning 74111 non-null int64
20 amenity_tv 74111 non-null int64
21 amenity_shampoo 74111 non-null int64
22 amenity_hangers 74111 non-null int64
23 amenity_carbon_monoxide_detector 74111 non-null int64
24 amenity_internet 74111 non-null int64
25 amenity_laptop_friendly_workspace 74111 non-null int64
26 amenity_washer 74111 non-null int64
27 amenity_hair_dryer 74111 non-null int64
28 amenity_dryer 74111 non-null int64
29 neighborhood_price_level 74111 non-null int64
30 cancellation_policy_flexible 74111 non-null float64
31 cancellation_policy_moderate 74111 non-null float64
32 cancellation_policy_strict 74111 non-null float64
33 cancellation_policy_super_strict_30 74111 non-null float64
34 cancellation_policy_super_strict_60 74111 non-null float64
35 city_Boston 74111 non-null float64
36 city_Chicago 74111 non-null float64
37 city_DC 74111 non-null float64
38 city_LA 74111 non-null float64
39 city_NYC 74111 non-null float64
40 city_SF 74111 non-null float64
41 property_type_Apartment 74111 non-null float64
42 property_type_Bed & Breakfast 74111 non-null float64
43 property_type_Bungalow 74111 non-null float64
44 property_type_Condominium 74111 non-null float64
45 property_type_Guesthouse 74111 non-null float64
46 property_type_House 74111 non-null float64
47 property_type_Loft 74111 non-null float64
48 property_type_Other 74111 non-null float64
49 property_type_Townhouse 74111 non-null float64
50 room_type_Entire home/apt 74111 non-null float64
51 room_type_Private room 74111 non-null float64
52 room_type_Shared room 74111 non-null float64
dtypes: float64(30), int64(22), object(1)
memory usage: 30.0+ MB
# Create train/test split (validation folds come from cross-validation below)
train_df, test_df = train_test_split(
df,
test_size=0.2,
random_state=42,
stratify=df["city"], # Stratify by city for balanced representation
)
print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"\nSplit verification: {train_df.shape[0] + test_df.shape[0]} total samples")
Training set shape: (59288, 53)
Test set shape: (14823, 53)
Split verification: 74111 total samples
# Drop stratification column from features
train_df = train_df.drop("city", axis=1)
X_train = train_df.loc[:, train_df.columns != "log_price"]
y_train = train_df["log_price"]
print(f"Training features shape: {X_train.shape}")
print(f"Training target shape: {y_train.shape}")
test_df = test_df.drop("city", axis=1)
X_test = test_df.loc[:, test_df.columns != "log_price"]
y_test = test_df["log_price"]
print(f"Test features shape: {X_test.shape}")
print(f"Test target shape: {y_test.shape}")
Training features shape: (59288, 51)
Training target shape: (59288,)
Test features shape: (14823, 51)
Test target shape: (14823,)
# Linear Regression with cross-validation
linear_model = LinearRegression()
# 5-fold cross-validation
cv_scores = cross_val_score(linear_model, X_train, y_train, cv=5, scoring="r2")
print(f"Linear Regression Cross-Validation R² Scores: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
Linear Regression Cross-Validation R² Scores: [0.59495647 0.60093226 0.59434806 0.5850233 0.60406696]
Mean R²: 0.5959 (+/- 0.0131)
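Ridge is imported at the top but never used; a regularized linear baseline can be scored with the same cross_val_score pattern. A self-contained sketch on synthetic data (alpha=1.0 is an arbitrary choice, and these numbers have nothing to do with the Airbnb set):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train).
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

cv_ridge = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=5, scoring="r2")
print(f"Ridge mean R²: {cv_ridge.mean():.4f}")
```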
XGBoost Regression¶
# XGBoost with cross-validation
xgb_model = xgb.XGBRegressor(random_state=42, n_estimators=100, learning_rate=0.1, max_depth=6)
cv_scores_xgb = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring="r2")
print(f"XGBoost Cross-Validation R² Scores: {cv_scores_xgb}")
print(f"Mean R²: {cv_scores_xgb.mean():.4f} (+/- {cv_scores_xgb.std() * 2:.4f})")
# Compare all models
print("\nFinal Model Comparison:")
print(f"Linear Regression: {cv_scores.mean():.4f}")
print(f"XGBoost: {cv_scores_xgb.mean():.4f}")
# Train final model on full training data
xgb_model.fit(X_train, y_train)
XGBoost Cross-Validation R² Scores: [0.71489725 0.7151093 0.71387543 0.70976343 0.71964383]
Mean R²: 0.7147 (+/- 0.0063)
Final Model Comparison:
Linear Regression: 0.5959
XGBoost: 0.7147
Scoring¶
# Make predictions on the test set
test_predictions = xgb_model.predict(X_test)
# Calculate metrics
mae = mean_absolute_error(y_test, test_predictions)
mse = mean_squared_error(y_test, test_predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, test_predictions)
print("Test Set Performance:")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")
# Convert back to actual prices for interpretation
y_test_actual = np.exp(y_test)
test_predictions_actual = np.exp(test_predictions)
mae_actual = mean_absolute_error(y_test_actual, test_predictions_actual)
print("\nIn actual dollars:")
print(f"Mean actual price: ${y_test_actual.mean():.2f}")
print(f"MAE in dollars: ${mae_actual:.2f}")
print(f"This means our model is off by about ${mae_actual:.2f} on average")
Test Set Performance:
MAE: 0.2808
MSE: 0.1494
RMSE: 0.3866
R²: 0.7144
In actual dollars:
Mean actual price: $160.56
MAE in dollars: $53.48
This means our model is off by about $53.48 on average
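A caveat on converting log predictions back with np.exp: if the model targets the conditional mean of log_price, exp of that prediction estimates the median price and systematically undershoots the mean. For lognormal errors the standard correction is exp(mu + sigma²/2); a quick numeric check with simulated log prices (mu and sigma chosen only to roughly match the scale of log_price):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 4.7, 0.4                     # roughly the scale of log_price
log_prices = rng.normal(mu, sigma, 100_000)

naive = np.exp(mu)                       # exp of the mean log price ~ the median price
corrected = np.exp(mu + sigma**2 / 2)    # lognormal mean correction
actual_mean = np.exp(log_prices).mean()  # simulated true mean price
```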
# Feature importance from XGBoost
feature_importance = pd.DataFrame(
{"feature": X_train.columns, "importance": xgb_model.feature_importances_}
).sort_values("importance", ascending=False)
# Plot top 15 most important features
top_features = feature_importance.head(15)
fig = px.bar(
x=top_features["importance"],
y=top_features["feature"],
orientation="h",
title="Top 15 Most Important Features (XGBoost)",
labels={"x": "Feature Importance", "y": "Feature"},
text=[f"{val:.4f}" for val in top_features["importance"]],
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(height=600, yaxis={"categoryorder": "total ascending"})
fig.show()
print("\nBusiness Insights from Feature Importance:")
for rank, (_, row) in enumerate(top_features.head(10).iterrows(), start=1):
    print(f"{rank}. {row['feature']}: {row['importance']:.4f}")
Business Insights from Feature Importance:
1. room_type_Entire home/apt: 0.5896
2. neighborhood_price_level: 0.1311
3. bathrooms: 0.0718
4. bedrooms: 0.0272
5. city_LA: 0.0188
6. city_SF: 0.0147
7. accommodates: 0.0107
8. city_Chicago: 0.0097
9. city_DC: 0.0092
10. amenity_tv: 0.0079
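XGBoost's built-in importances are gain-based and can overweight features the trees split on most often; permutation importance is a common cross-check. A sketch with scikit-learn's permutation_importance on synthetic data (the same call accepts the fitted XGBRegressor in place of the toy model):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: only 2 of the 4 features actually drive the target.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)

# Shuffle each column in turn and measure how much the score drops.
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```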
# Create prediction vs actual plot
fig = px.scatter(
x=y_test,
y=test_predictions,
title="Predicted vs Actual Log Prices (Test Set)",
labels={"x": "Actual Log Price", "y": "Predicted Log Price"},
trendline="ols",
opacity=0.6,
)
# Add perfect prediction line
fig.add_trace(
go.Scatter(
x=[y_test.min(), y_test.max()],
y=[y_test.min(), y_test.max()],
mode="lines",
name="Perfect Prediction",
line={"color": "red", "dash": "dash"},
)
)
fig.update_layout(height=500, width=800)
fig.show()
# Residual plot
residuals = y_test - test_predictions
fig = px.scatter(
x=test_predictions,
y=residuals,
title="Residual Plot (Test Set)",
labels={"x": "Predicted Log Price", "y": "Residual (Actual - Predicted)"},
opacity=0.6,
)
# Add zero line
fig.add_hline(y=0, line_dash="dash", line_color="red")
fig.update_layout(height=500, width=800)
fig.show()