Write each answer after the ✍️. In the accompanying code cell(s), show how you arrived at that answer. Comment your code!
Dataset Schema
| Column Name | Description |
|---|---|
| Date | Date of the booking |
| Time | Time of the booking |
| Booking ID | Unique identifier for each ride booking |
| Booking Status | Status of booking (Completed, Cancelled by Customer, Cancelled by Driver, etc.) |
| Customer ID | Unique identifier for customers |
| Vehicle Type | Type of vehicle (Go Mini, Go Sedan, Auto, eBike/Bike, UberXL, Premier Sedan) |
| Pickup Location | Starting location of the ride |
| Drop Location | Destination location of the ride |
| Avg VTAT | Average time for driver to reach pickup location (in minutes) |
| Avg CTAT | Average trip duration from pickup to destination (in minutes) |
| Cancelled Rides by Customer | Customer-initiated cancellation flag |
| Reason for cancelling by Customer | Reason for customer cancellation |
| Cancelled Rides by Driver | Driver-initiated cancellation flag |
| Driver Cancellation Reason | Reason for driver cancellation |
| Incomplete Rides | Incomplete ride flag |
| Incomplete Rides Reason | Reason for incomplete rides |
| Booking Value | Total fare amount for the ride |
| Ride Distance | Distance covered during the ride (in km) |
| Driver Ratings | Rating given to driver (1-5 scale) |
| Customer Rating | Rating given by customer (1-5 scale) |
| Payment Method | Method used for payment (UPI, Cash, Credit Card, Uber Wallet, Debit Card) |
import os
import kagglehub
import numpy as np
import pandas as pd
import plotly.express as px

# Download data from kaggle
path = kagglehub.dataset_download("yashdevladdha/uber-ride-analytics-dashboard")
# Load data into Pandas DataFrame
csv_file = os.path.join(path, "ncr_ride_bookings.csv")
df = pd.read_csv(csv_file)
print("✅ Data loaded successfully!")
✅ Data loaded successfully!
Q1
❓ Which variable(s) from the dataframe are missing from the list:
Time, Booking ID, Booking Status, Customer ID, Vehicle Type, Pickup Location, Drop Location, Avg VTAT, Avg CTAT, Cancelled Rides by Customer, Cancelled Rides by Driver, Incomplete Rides, Incomplete Rides Reason, Booking Value, Ride Distance, Driver Ratings, Customer Rating, Payment Method
✍️
Date, Reason for cancelling by Customer, Driver Cancellation Reason
# Display DataFrame information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 150000 non-null object
1 Time 150000 non-null object
2 Booking ID 150000 non-null object
3 Booking Status 150000 non-null object
4 Customer ID 150000 non-null object
5 Vehicle Type 150000 non-null object
6 Pickup Location 150000 non-null object
7 Drop Location 150000 non-null object
8 Avg VTAT 139500 non-null float64
9 Avg CTAT 102000 non-null float64
10 Cancelled Rides by Customer 10500 non-null float64
11 Reason for cancelling by Customer 10500 non-null object
12 Cancelled Rides by Driver 27000 non-null float64
13 Driver Cancellation Reason 27000 non-null object
14 Incomplete Rides 9000 non-null float64
15 Incomplete Rides Reason 9000 non-null object
16 Booking Value 102000 non-null float64
17 Ride Distance 102000 non-null float64
18 Driver Ratings 93000 non-null float64
19 Customer Rating 93000 non-null float64
20 Payment Method 102000 non-null object
dtypes: float64(9), object(12)
memory usage: 24.0+ MB
Q2
❓ Give an example (variable name) of a variable with a nominal measurement scale.
✍️ Vehicle Type
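Vehicle Type only labels categories with no inherent order, which is what makes it nominal. A minimal sketch on toy values (hypothetical, not the Kaggle data) shows the kind of operation a nominal scale supports:

```python
import pandas as pd

# Toy stand-in for the Vehicle Type column (hypothetical values)
toy = pd.Series(["Auto", "Go Mini", "Auto", "Go Sedan"])

# Nominal data supports counting per category, not ordering or arithmetic
counts = toy.value_counts()
print(counts)
```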
# Display first few rows of the DataFrame
df.head()

# Count unique vehicle types
len(df["Vehicle Type"].unique())
7

# Display DataFrame information
# df.info()
Q5
❓ How many observations would we have left if we ran an analysis with the following variables, dropping all observations with missing values?
Vehicle Type, Avg CTAT
✍️
102000
# Create a subset and drop any missings
subset_df = df.loc[
~((df["Vehicle Type"].isna()) | (df["Avg CTAT"].isna())), ("Vehicle Type", "Avg CTAT")
]
subset_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 102000 entries, 1 to 149999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Vehicle Type 102000 non-null object
1 Avg CTAT 102000 non-null float64
dtypes: float64(1), object(1)
memory usage: 2.3+ MB
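The boolean-mask subset above can also be written with `dropna(subset=...)`, which drops any row missing either column. A minimal sketch on a toy frame (hypothetical data, not the Uber dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical)
toy = pd.DataFrame({
    "Vehicle Type": ["Auto", None, "Go Mini", "Auto"],
    "Avg CTAT": [30.0, 25.0, np.nan, 40.0],
})

# Keep only rows where both columns are non-missing
kept = toy.dropna(subset=["Vehicle Type", "Avg CTAT"])
print(len(kept))  # rows 0 and 3 survive
```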
Q6
❓ What is the Timestamp of the oldest observation of a payment (see Payment Method) with a Debit Card?
✍️
2024-01-01 05:28:38
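Combining the separate Date and Time strings before parsing is what allows taking a single chronological minimum; a sketch on toy values (format assumed to match the dataset's "YYYY-MM-DD" / "HH:MM:SS"):

```python
import pandas as pd

# Toy Date/Time strings (hypothetical values)
dates = pd.Series(["2024-01-02", "2024-01-01"])
times = pd.Series(["10:00:00", "05:28:38"])

# String concatenation, then one parse into datetime64
ts = pd.to_datetime(dates + " " + times)
earliest = ts.min()
print(earliest)
```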
# Convert Date and Time columns to datetime by concatenating strings
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"])
# Calculate the minimum DateTime for a payment with debit card
df.loc[df["Payment Method"] == "Debit Card", "DateTime"].min()
Timestamp('2024-01-01 05:28:38')
Q7
❓ Which Payment Method has the smallest share of the observations (name + percentage)?
✍️ Debit Card: 5.49%
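Instead of dividing `value_counts()` by `len(df)`, `normalize=True` returns the shares directly, and since `value_counts` sorts descending, the smallest share sits at the bottom. A sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy payment methods (hypothetical)
pay = pd.Series(["UPI", "UPI", "Cash", "Cash", "Debit Card"])

# normalize=True yields proportions; *100 turns them into percentages
pct = pay.value_counts(normalize=True) * 100
smallest = pct.idxmin()
print(smallest, round(pct.min(), 2))
```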
px.pie(df, names="Payment Method")

df["Payment Method"].value_counts() / len(df) * 100
Payment Method
UPI 30.606000
Cash 16.911333
Uber Wallet 8.184000
Credit Card 6.806000
Debit Card 5.492667
Name: count, dtype: float64

df["Pickup Location"].value_counts()
Pickup Location
Khandsa 949
Barakhamba Road 946
Saket 931
Badarpur 921
Pragati Maidan 920
...
Mandi House 800
Noida Sector 18 799
Laxmi Nagar 797
Ashok Vihar 796
Ghitorni Village 790
Name: count, Length: 176, dtype: int64
Q9
❓ Give an example (variable name) of a variable with an interval measurement scale.
✍️ Booking Value
Q10
❓ Give the following descriptive values for this variable (precision: 1 decimal):
Number of (non-missing) observations
Mean
Standard deviation
Minimum
Maximum
25th percentile
50th percentile
75th percentile
✍️
Number of (non-missing) observations: 102000
Mean: 508.3
Standard deviation: 395.8
Minimum: 50.0
Maximum: 4277.0
25th percentile: 234.0
50th percentile: 414.0
75th percentile: 689.0
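All eight values come out of a single `describe()` call on the column; a sketch on a toy series (hypothetical values):

```python
import pandas as pd

# describe() bundles count, mean, std, min, quartiles and max in one call
s = pd.Series([50.0, 234.0, 414.0, 689.0, 4277.0])
stats = s.describe().round(1)
print(stats)
```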
# Number of non-missing values for Booking Value (count() already skips NaN)
df["Booking Value"].count()
np.int64(102000)

# Get summary statistics for numerical columns
df.describe().T
Q11
❓ Determine whether there are outliers for the variable Booking Value and, if so, give their count according to the method below. Use the following bounds:
$$
\text{lower} = Q_1 - 3 \cdot \text{IQR}, \qquad \text{upper} = Q_3 + 3 \cdot \text{IQR}
$$
Give both the bounds and the number of outliers in your answer.
✍️
Lower bound: -1131.0
Upper bound: 2054.0
Number of outliers: 911
# Calculate outliers for Booking Value
Q1 = df["Booking Value"].quantile(0.25)
Q3 = df["Booking Value"].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR
# Count outliers
outliers = df.loc[
(df["Booking Value"] < lower_bound) | (df["Booking Value"] > upper_bound), "Booking Value"
]
num_outliers = len(outliers)
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
print(f"Number of outliers: {num_outliers}")
Lower bound: -1131.0
Upper bound: 2054.0
Number of outliers: 911
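The same 3 × IQR fences can be verified on toy data with one extreme point, mirroring the computation above (hypothetical values):

```python
import pandas as pd

# Toy values with one extreme point (hypothetical)
s = pd.Series([10, 12, 14, 16, 100])

# Fences at 3 * IQR beyond the quartiles flag extreme outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, len(outliers))
```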
Q12
❓ Compute the mean Avg VTAT for each day of the week for Uber XL at the most popular pickup location (precision: 1 decimal).
✍️
Friday: 8.2
Monday: 9.9
Saturday: 8.0
Sunday: 6.8
Thursday: 5.8
Tuesday: 13.3
Wednesday: 13.0
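The Date-to-weekday step hinges on `.dt.day_name()`; a sketch on toy dates (2024-01-01 and 2024-01-08 are both Mondays, hypothetical VTAT values):

```python
import pandas as pd

# Toy frame: two Mondays and one Tuesday (hypothetical values)
toy = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-08"],
    "Avg VTAT": [8.0, 12.0, 10.0],
})

# Parse dates, map to weekday labels, then average per weekday
toy["Day of Week"] = pd.to_datetime(toy["Date"]).dt.day_name()
means = toy.groupby("Day of Week")["Avg VTAT"].mean()
print(means)
```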
# Most frequent Pickup Location
location = df["Pickup Location"].mode()[0]
print(location)
# Subset of the Date and Avg VTAT data for the most frequent Pickup Location with Uber XL vehicle type
subset_df = df.loc[
    (df["Pickup Location"] == location) & (df["Vehicle Type"] == "Uber XL"), ("Date", "Avg VTAT")
].copy()  # copy so adding a column below does not trigger a SettingWithCopyWarning
# Translate Date to day of the week
subset_df["Day of Week"] = pd.to_datetime(subset_df["Date"]).dt.day_name()
# Aggregate per day of the week
subset_df.groupby("Day of Week")["Avg VTAT"].mean()
Khandsa
Day of Week
Friday 8.175000
Monday 9.933333
Saturday 8.014286
Sunday 6.800000
Thursday 5.800000
Tuesday 13.300000
Wednesday 12.966667
Name: Avg VTAT, dtype: float64
Q13
❓ Consider the observations for Avg CTAT, Booking Value, and Ride Distance. Apply a natural log transformation and impute all missing values with the median of each variable.
What is the highest absolute pairwise correlation (precision: 4 decimals)?
Is it a positive or negative relationship?
✍️
Highest absolute pairwise correlation: 0.1216
Relationship: Positive
# Create a subset for Avg CTAT, Booking Value, and Ride Distance
subset_df = df.loc[:, ("Avg CTAT", "Booking Value", "Ride Distance")]
# Log transform
subset_df_log = subset_df.apply(np.log)
# Impute missing values with the median value
subset_df_log["Avg CTAT"] = subset_df_log["Avg CTAT"].fillna(subset_df_log["Avg CTAT"].median())
subset_df_log["Booking Value"] = subset_df_log["Booking Value"].fillna(
subset_df_log["Booking Value"].median()
)
subset_df_log["Ride Distance"] = subset_df_log["Ride Distance"].fillna(
subset_df_log["Ride Distance"].median()
)
subset_df_log.corr()
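Reading the largest absolute pairwise correlation off the matrix can also be done programmatically by masking the diagonal and stacking; a sketch on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy frame: a and b nearly proportional, c unrelated (hypothetical)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.1, 3.9, 6.2, 8.0],
    "c": [5.0, 1.0, 4.0, 2.0],
})

# Absolute correlations, diagonal masked so a variable never pairs with itself
corr = toy.corr().abs()
mask = ~np.eye(len(corr), dtype=bool)
stacked = corr.where(mask).stack()  # stack() drops the masked diagonal

# idxmax() names the strongest pair; max() gives its magnitude
pair = stacked.idxmax()
print(pair, round(stacked.max(), 4))
```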