Write each answer after the ✍️. In the accompanying code cell(s), show how you arrived at that answer. Comment your code!
Dataset Schema
| Column Name | Description |
|---|---|
| Date | Date of the booking |
| Time | Time of the booking |
| Booking ID | Unique identifier for each ride booking |
| Booking Status | Status of booking (Completed, Cancelled by Customer, Cancelled by Driver, etc.) |
| Customer ID | Unique identifier for customers |
| Vehicle Type | Type of vehicle (Go Mini, Go Sedan, Auto, eBike/Bike, UberXL, Premier Sedan) |
| Pickup Location | Starting location of the ride |
| Drop Location | Destination location of the ride |
| Avg VTAT | Average time for driver to reach pickup location (in minutes) |
| Avg CTAT | Average trip duration from pickup to destination (in minutes) |
| Cancelled Rides by Customer | Customer-initiated cancellation flag |
| Reason for cancelling by Customer | Reason for customer cancellation |
| Cancelled Rides by Driver | Driver-initiated cancellation flag |
| Driver Cancellation Reason | Reason for driver cancellation |
| Incomplete Rides | Incomplete ride flag |
| Incomplete Rides Reason | Reason for incomplete rides |
| Booking Value | Total fare amount for the ride |
| Ride Distance | Distance covered during the ride (in km) |
| Driver Ratings | Rating given to driver (1-5 scale) |
| Customer Rating | Rating given by customer (1-5 scale) |
| Payment Method | Method used for payment (UPI, Cash, Credit Card, Uber Wallet, Debit Card) |
import os
import kagglehub
import numpy as np
import pandas as pd
import plotly.express as px

# Download data from kaggle
path = kagglehub.dataset_download("yashdevladdha/uber-ride-analytics-dashboard")
# Load data into Pandas DataFrame
csv_file = os.path.join(path, "ncr_ride_bookings.csv")
df = pd.read_csv(csv_file)
print("✅ Data loaded successfully!")
✅ Data loaded successfully!
Q1
❓ Which variable(s) from the dataframe are missing from the list:
Time, Booking ID, Booking Status, Customer ID, Vehicle Type, Pickup Location, Drop Location, Avg VTAT, Avg CTAT, Cancelled Rides by Customer, Cancelled Rides by Driver, Incomplete Rides, Incomplete Rides Reason, Booking Value, Ride Distance, Driver Ratings, Customer Rating, Payment Method
✍️
Date, Reason for cancelling by Customer, Driver Cancellation Reason
# Display DataFrame information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 150000 non-null object
1 Time 150000 non-null object
2 Booking ID 150000 non-null object
3 Booking Status 150000 non-null object
4 Customer ID 150000 non-null object
5 Vehicle Type 150000 non-null object
6 Pickup Location 150000 non-null object
7 Drop Location 150000 non-null object
8 Avg VTAT 139500 non-null float64
9 Avg CTAT 102000 non-null float64
10 Cancelled Rides by Customer 10500 non-null float64
11 Reason for cancelling by Customer 10500 non-null object
12 Cancelled Rides by Driver 27000 non-null float64
13 Driver Cancellation Reason 27000 non-null object
14 Incomplete Rides 9000 non-null float64
15 Incomplete Rides Reason 9000 non-null object
16 Booking Value 102000 non-null float64
17 Ride Distance 102000 non-null float64
18 Driver Ratings 93000 non-null float64
19 Customer Rating 93000 non-null float64
20 Payment Method 102000 non-null object
dtypes: float64(9), object(12)
memory usage: 24.0+ MB
Q2
❓ Give an example (variable name) of a variable with a nominal measurement scale.
✍️ Vehicle Type
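Vehicle Type only labels categories with no inherent order, which is what makes it nominal. A minimal sketch on toy values (hypothetical, not the Kaggle data) shows the kind of operation a nominal scale supports:

```python
import pandas as pd

# Toy stand-in for the Vehicle Type column (hypothetical values)
toy = pd.Series(["Auto", "Go Mini", "Auto", "Go Sedan"])

# Nominal data supports counting per category, not ordering or arithmetic
counts = toy.value_counts()
print(counts)
```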
# Display first few rows of the DataFrame
df.head()

# Count unique vehicle types
len(df["Vehicle Type"].unique())
7

# Display DataFrame information
# df.info()
Q5
❓ How many observations would we have left if we ran an analysis with the following variables, dropping all observations with missing values?
Vehicle Type, Avg CTAT
✍️
102000
# Create a subset and drop any missings
subset_df = df.loc[
~((df["Vehicle Type"].isna()) | (df["Avg CTAT"].isna())), ("Vehicle Type", "Avg CTAT")
]
subset_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 102000 entries, 1 to 149999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Vehicle Type 102000 non-null object
1 Avg CTAT 102000 non-null float64
dtypes: float64(1), object(1)
memory usage: 2.3+ MB
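The boolean-mask subset above can also be written with `dropna(subset=...)`, which drops any row missing either column. A minimal sketch on a toy frame (hypothetical data, not the Uber dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical)
toy = pd.DataFrame({
    "Vehicle Type": ["Auto", None, "Go Mini", "Auto"],
    "Avg CTAT": [30.0, 25.0, np.nan, 40.0],
})

# Keep only rows where both columns are non-missing
kept = toy.dropna(subset=["Vehicle Type", "Avg CTAT"])
print(len(kept))  # rows 0 and 3 survive
```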
Q6
❓ What is the Timestamp of the oldest observation of a payment (see Payment Method) with a Debit Card?
✍️
2024-01-01 05:28:38
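Combining the separate Date and Time strings before parsing is what allows taking a single chronological minimum; a sketch on toy values (format assumed to match the dataset's "YYYY-MM-DD" / "HH:MM:SS"):

```python
import pandas as pd

# Toy Date/Time strings (hypothetical values)
dates = pd.Series(["2024-01-02", "2024-01-01"])
times = pd.Series(["10:00:00", "05:28:38"])

# String concatenation, then one parse into datetime64
ts = pd.to_datetime(dates + " " + times)
earliest = ts.min()
print(earliest)
```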
# Convert Date and Time columns to datetime by concatenating strings
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"])
# Calculate the minimum DateTime for a payment with debit card
df.loc[df["Payment Method"] == "Debit Card", "DateTime"].min()
Timestamp('2024-01-01 05:28:38')
Q7
❓ Which Payment Method has the smallest share of the observations (name + percentage)?
✍️ Debit Card: 5.49%
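Instead of dividing `value_counts()` by `len(df)`, `normalize=True` returns the shares directly, and since `value_counts` sorts descending, the smallest share sits at the bottom. A sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy payment methods (hypothetical)
pay = pd.Series(["UPI", "UPI", "Cash", "Cash", "Debit Card"])

# normalize=True yields proportions; *100 turns them into percentages
pct = pay.value_counts(normalize=True) * 100
smallest = pct.idxmin()
print(smallest, round(pct.min(), 2))
```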
px.pie(df, names="Payment Method")

df["Payment Method"].value_counts() / len(df) * 100
Payment Method
UPI 30.606000
Cash 16.911333
Uber Wallet 8.184000
Credit Card 6.806000
Debit Card 5.492667
Name: count, dtype: float64

df["Pickup Location"].value_counts()
Pickup Location
Khandsa 949
Barakhamba Road 946
Saket 931
Badarpur 921
Pragati Maidan 920
...
Mandi House 800
Noida Sector 18 799
Laxmi Nagar 797
Ashok Vihar 796
Ghitorni Village 790
Name: count, Length: 176, dtype: int64
Q9
❓ Give an example (variable name) of a variable with an interval measurement scale.
✍️ Booking Value
Q10
❓ Give the following descriptive values for this variable (precision: 1 decimal):
Number of (non-missing) observations
Mean
Standard deviation
Minimum
Maximum
25th percentile
50th percentile
75th percentile
✍️
Number of (non-missing) observations: 102000
Mean: 508.3
Standard deviation: 395.8
Minimum: 50.0
Maximum: 4277.0
25th percentile: 234.0
50th percentile: 414.0
75th percentile: 689.0
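All eight values come out of a single `describe()` call on the column; a sketch on a toy series (hypothetical values):

```python
import pandas as pd

# describe() bundles count, mean, std, min, quartiles and max in one call
s = pd.Series([50.0, 234.0, 414.0, 689.0, 4277.0])
stats = s.describe().round(1)
print(stats)
```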
# Number of non-missing values for Booking Value (count() already skips NaN)
df["Booking Value"].count()
np.int64(102000)

# Get summary statistics for numerical columns
df.describe().T
Q11
❓ Determine whether there are outliers for the variable Booking Value and, if so, give their count according to the method below. Use the following bounds:
$$
\text{lower} = Q_1 - 3 \cdot \text{IQR}, \qquad \text{upper} = Q_3 + 3 \cdot \text{IQR}
$$
Give both the bounds and the number of outliers in your answer.
✍️
Lower bound: -1131.0
Upper bound: 2054.0
Number of outliers: 911
# Calculate outliers for Booking Value
Q1 = df["Booking Value"].quantile(0.25)
Q3 = df["Booking Value"].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR
# Count outliers
outliers = df.loc[
(df["Booking Value"] < lower_bound) | (df["Booking Value"] > upper_bound), "Booking Value"
]
num_outliers = len(outliers)
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")
print(f"Number of outliers: {num_outliers}")
Lower bound: -1131.0
Upper bound: 2054.0
Number of outliers: 911
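The same 3 × IQR fences can be verified on toy data with one extreme point, mirroring the computation above (hypothetical values):

```python
import pandas as pd

# Toy values with one extreme point (hypothetical)
s = pd.Series([10, 12, 14, 16, 100])

# Fences at 3 * IQR beyond the quartiles flag extreme outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
outliers = s[(s < lower) | (s > upper)]
print(lower, upper, len(outliers))
```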
Q12
❓ Compute the mean Avg VTAT for each day of the week for Uber XL at the most popular pickup location (precision: 1 decimal).
✍️
Friday: 8.2
Monday: 9.9
Saturday: 8.0
Sunday: 6.8
Thursday: 5.8
Tuesday: 13.3
Wednesday: 13.0
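The Date-to-weekday step hinges on `.dt.day_name()`; a sketch on toy dates (2024-01-01 and 2024-01-08 are both Mondays, hypothetical VTAT values):

```python
import pandas as pd

# Toy frame: two Mondays and one Tuesday (hypothetical values)
toy = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-08"],
    "Avg VTAT": [8.0, 12.0, 10.0],
})

# Parse dates, map to weekday labels, then average per weekday
toy["Day of Week"] = pd.to_datetime(toy["Date"]).dt.day_name()
means = toy.groupby("Day of Week")["Avg VTAT"].mean()
print(means)
```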
# Most frequent Pickup Location
location = df["Pickup Location"].mode()[0]
print(location)
# Subset of the Date and Avg VTAT data for the most frequent Pickup Location with Uber XL vehicle type
subset_df = df.loc[
    (df["Pickup Location"] == location) & (df["Vehicle Type"] == "Uber XL"), ("Date", "Avg VTAT")
].copy()  # copy so adding a column below does not trigger a SettingWithCopyWarning
# Translate Date to day of the week
subset_df["Day of Week"] = pd.to_datetime(subset_df["Date"]).dt.day_name()
# Aggregate per day of the week
subset_df.groupby("Day of Week")["Avg VTAT"].mean()
Khandsa
Day of Week
Friday 8.175000
Monday 9.933333
Saturday 8.014286
Sunday 6.800000
Thursday 5.800000
Tuesday 13.300000
Wednesday 12.966667
Name: Avg VTAT, dtype: float64
Q13
❓ Consider the observations for Avg CTAT, Booking Value, and Ride Distance. Apply a natural log transformation and impute all missing values with the median of each variable.
What is the highest absolute pairwise correlation (precision: 4 decimals)?
Is it a positive or negative relationship?
✍️
Highest absolute pairwise correlation: 0.1216
Relationship: Positive
# Create a subset for Avg CTAT, Booking Value, and Ride Distance
subset_df = df.loc[:, ("Avg CTAT", "Booking Value", "Ride Distance")]
# Log transform
subset_df_log = subset_df.apply(np.log)
# Impute missing values with the median value
subset_df_log["Avg CTAT"] = subset_df_log["Avg CTAT"].fillna(subset_df_log["Avg CTAT"].median())
subset_df_log["Booking Value"] = subset_df_log["Booking Value"].fillna(
subset_df_log["Booking Value"].median()
)
subset_df_log["Ride Distance"] = subset_df_log["Ride Distance"].fillna(
subset_df_log["Ride Distance"].median()
)
subset_df_log.corr()
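Reading the largest absolute pairwise correlation off the matrix can also be done programmatically by masking the diagonal and stacking; a sketch on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy frame: a and b nearly proportional, c unrelated (hypothetical)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.1, 3.9, 6.2, 8.0],
    "c": [5.0, 1.0, 4.0, 2.0],
})

# Absolute correlations, diagonal masked so a variable never pairs with itself
corr = toy.corr().abs()
mask = ~np.eye(len(corr), dtype=bool)
stacked = corr.where(mask).stack()  # stack() drops the masked diagonal

# idxmax() names the strongest pair; max() gives its magnitude
pair = stacked.idxmax()
print(pair, round(stacked.max(), 4))
```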