Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Je geeft je antwoord telkens na de ✍️. In de bijhorende code-cell(en) toon je hoe je aan dat antwoord kwam. Becommentarieer je code!

Dataset Schema

Column NameDescription
DateDate of the booking
TimeTime of the booking
Booking IDUnique identifier for each ride booking
Booking StatusStatus of booking (Completed, Cancelled by Customer, Cancelled by Driver, etc.)
Customer IDUnique identifier for customers
Vehicle TypeType of vehicle (Go Mini, Go Sedan, Auto, eBike/Bike, UberXL, Premier Sedan)
Pickup LocationStarting location of the ride
Drop LocationDestination location of the ride
Avg VTATAverage time for driver to reach pickup location (in minutes)
Avg CTATAverage trip duration from pickup to destination (in minutes)
Cancelled Rides by CustomerCustomer-initiated cancellation flag
Reason for cancelling by CustomerReason for customer cancellation
Cancelled Rides by DriverDriver-initiated cancellation flag
Driver Cancellation ReasonReason for driver cancellation
Incomplete RidesIncomplete ride flag
Incomplete Rides ReasonReason for incomplete rides
Booking ValueTotal fare amount for the ride
Ride DistanceDistance covered during the ride (in km)
Driver RatingsRating given to driver (1-5 scale)
Customer RatingRating given by customer (1-5 scale)
Payment MethodMethod used for payment (UPI, Cash, Credit Card, Uber Wallet, Debit Card)
import os

import kagglehub
import numpy as np
import pandas as pd
import plotly.express as px
# Download data from kaggle
path = kagglehub.dataset_download("yashdevladdha/uber-ride-analytics-dashboard")

# Load data into Pandas DataFrame
csv_file = os.path.join(path, "ncr_ride_bookings.csv")
df = pd.read_csv(csv_file)
print("✅ Data loaded successfully!")
✅ Data loaded successfully!

Q1

❓ Welke variabele(n) uit de dataframe ontbreken in de lijst:

  • Time

  • Booking ID

  • Booking Status

  • Customer ID

  • Vehicle Type

  • Pickup Location

  • Drop Location

  • Avg VTAT

  • Avg CTAT

  • Cancelled Rides by Customer

  • Cancelled Rides by Driver

  • Incomplete Rides

  • Incomplete Rides Reason

  • Booking Value

  • Ride Distance

  • Driver Ratings

  • Customer Rating

  • Payment Method

✍️

  • Date

  • Reason for cancelling by Customer

  • Driver Cancellation Reason

# Display DataFrame information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Date                               150000 non-null  object 
 1   Time                               150000 non-null  object 
 2   Booking ID                         150000 non-null  object 
 3   Booking Status                     150000 non-null  object 
 4   Customer ID                        150000 non-null  object 
 5   Vehicle Type                       150000 non-null  object 
 6   Pickup Location                    150000 non-null  object 
 7   Drop Location                      150000 non-null  object 
 8   Avg VTAT                           139500 non-null  float64
 9   Avg CTAT                           102000 non-null  float64
 10  Cancelled Rides by Customer        10500 non-null   float64
 11  Reason for cancelling by Customer  10500 non-null   object 
 12  Cancelled Rides by Driver          27000 non-null   float64
 13  Driver Cancellation Reason         27000 non-null   object 
 14  Incomplete Rides                   9000 non-null    float64
 15  Incomplete Rides Reason            9000 non-null    object 
 16  Booking Value                      102000 non-null  float64
 17  Ride Distance                      102000 non-null  float64
 18  Driver Ratings                     93000 non-null   float64
 19  Customer Rating                    93000 non-null   float64
 20  Payment Method                     102000 non-null  object 
dtypes: float64(9), object(12)
memory usage: 24.0+ MB

Q2

❓ Geef een voorbeeld (variabele naam) van een variabele met een nominale meetschaal.

✍️
Vehicle Type

# Display first few rows of the DataFrame
df.head()
Loading...

Q3

❓ Hoeveel unieke waarden telt de variabele die je opgaf bij Q2?

✍️
7

# Count unique vehicle types
len(df["Vehicle Type"].unique())
7

Q4

❓ Welke variabele(n) hebben het meeste missing values?

✍️

  • Incomplete Rides

  • Incomplete Rides Reason

# Display DataFrame information
# df.info()

Q5

❓ Hoeveel observaties zouden we overhouden als we een analyse maken met de volgende variabelen waarbij we alle observaties met missing values weglaten?

  • Vehicle Type

  • Avg CTAT

✍️
102000

# Create a subset and drop any missings
subset_df = df.loc[
    ~((df["Vehicle Type"].isna()) | (df["Avg CTAT"].isna())), ("Vehicle Type", "Avg CTAT")
]

subset_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 102000 entries, 1 to 149999
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Vehicle Type  102000 non-null  object 
 1   Avg CTAT      102000 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.3+ MB

Q6

❓ Wat is de Timestamp van de oudste observatie van een betaling (zie Payment Method) met een Debit Card?

✍️
2024-01-01 05:28:38

# Convert Date and Time columns to datetime by concatenating strings
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# Calculate the minimum DateTime for a payment with debit card
df.loc[df["Payment Method"] == "Debit Card", "DateTime"].min()
Timestamp('2024-01-01 05:28:38')

Q7

❓ Welke soort Payment Method heeft het kleinste aandeel in de observaties (naam + percentage)?

✍️
Debit Card: 5.49%

px.pie(df, names="Payment Method")
Loading...
df["Payment Method"].value_counts() / len(df) * 100
Payment Method UPI 30.606000 Cash 16.911333 Uber Wallet 8.184000 Credit Card 6.806000 Debit Card 5.492667 Name: count, dtype: float64

Q8

❓Welke Pickup Location is het meest populair (naam + aantal observaties)?

✍️
Khandsa: 949

df["Pickup Location"].value_counts()
Pickup Location Khandsa 949 Barakhamba Road 946 Saket 931 Badarpur 921 Pragati Maidan 920 ... Mandi House 800 Noida Sector 18 799 Laxmi Nagar 797 Ashok Vihar 796 Ghitorni Village 790 Name: count, Length: 176, dtype: int64

Q9

❓Geef een voorbeeld (variabele naam) van een variabele met een interval meetschaal.

✍️
Booking Value

Q10

❓Geef volgende descriptieve waarden voor deze variabele (precisie: 1 decimaal):

  • Aantal (non-missing) observaties

  • Gemiddelde

  • Standaardafwijking

  • Minimum

  • Maximum

  • 25% Percentiel

  • 50% Percentiel

  • 75% Percentiel

✍️

  • Aantal (non-missing) observaties: 102000

  • Gemiddelde: 508.3

  • Standaardafwijking: 395.8

  • Minimum: 50.0

  • Maximum: 4277.0

  • 25% Percentiel: 234.0

  • 50% Percentiel: 414.0

  • 75% Percentiel: 689.0

# Number of non-missing values for Booking Value
df["Booking Value"].dropna().count()
np.int64(102000)
# Get summary statistics for numerical columns
df.describe().T
Loading...

Q11

❓ Bepaal of er uitschieters zijn voor de variabele Booking Value en zo ja, geef hun aantal volgens de onderstaande methode. Je gebruikt hiervoor volgende grenzen: $$

lower:Q13×IQRupper:Q3+3×IQR\begin{align} \text{lower}&: Q1 - 3 \times IQR \cr \text{upper}&: Q3 + 3 \times IQR \end{align}

$$ Geef zowel de grenzen als het aantal outliers in je antwoord.

✍️

  • Lower bound: -1131.0

  • Upper bound: 2054.0

  • Number of outliers: 911

# Calculate outliers for Booking Value
Q1 = df["Booking Value"].quantile(0.25)
Q3 = df["Booking Value"].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

# Count outliers
outliers = df.loc[
    (df["Booking Value"] < lower_bound) | (df["Booking Value"] > upper_bound), "Booking Value"
]
num_outliers = len(outliers)

print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
print(f"Number of outliers: {num_outliers}")
Lower bound: -1131.0
Upper bound: 2054.0
Number of outliers: 911

Q12

❓ Bereken de gemiddelde Avg VTAT voor iedere dag van de week voor Uber XL op de meest populaire pickup locatie (precisie: 1 decimaal).

✍️

  • Friday: 8.2

  • Monday: 9.9

  • Saturday: 8.0

  • Sunday: 6.8

  • Thursday: 5.8

  • Tuesday: 13.3

  • Wednesday: 13.0

# Most frequent Pickup Location
location = df["Pickup Location"].mode()[0]
print(location)

# Subset of the Date and Avg VTAT data for the most frequent Pickup Location with Uber XL vehicle type
subset_df = df.loc[
    (df["Pickup Location"] == location) & (df["Vehicle Type"] == "Uber XL"), ("Date", "Avg VTAT")
]

# Translate Date to day of the week
subset_df["Day of Week"] = pd.to_datetime(subset_df["Date"]).dt.day_name()

# Aggregate per day of the week
subset_df.groupby("Day of Week")["Avg VTAT"].mean()
Khandsa
Day of Week Friday 8.175000 Monday 9.933333 Saturday 8.014286 Sunday 6.800000 Thursday 5.800000 Tuesday 13.300000 Wednesday 12.966667 Name: Avg VTAT, dtype: float64

Q13

❓ Beschouw de observaties voor Avg CTAT, Booking Value en Ride Distance. Pas een natuurlijke log-transformatie toe en imputeer alle missing values door de mediaan van iedere variabele.

  • Wat is de hoogste absolute paarsgewijze correlatie (1<x<1-1 < x < 1; precisie: 4 decimalen)?

  • Gaat het over een positief of negatief verband?

✍️

  • Hoogste absolute paarsgewijze correlatie: 0.1216

  • Verband: Positief

# Create a subset for Avg CTAT, Booking Value, and Ride Distance
subset_df = df.loc[:, ("Avg CTAT", "Booking Value", "Ride Distance")]

# Log transform
subset_df_log = subset_df.apply(np.log)

# Impute missing values with the median value
subset_df_log["Avg CTAT"] = subset_df_log["Avg CTAT"].fillna(subset_df_log["Avg CTAT"].median())
subset_df_log["Booking Value"] = subset_df_log["Booking Value"].fillna(
    subset_df_log["Booking Value"].median()
)
subset_df_log["Ride Distance"] = subset_df_log["Ride Distance"].fillna(
    subset_df_log["Ride Distance"].median()
)

subset_df_log.corr()
Loading...