The Example you have provided is having the different output than you mentioned in statements. For eg.) The customer id "abc" in "MX"
country is actually having hotel bookings on 1st March
and he also have ticket booking on 4th March
as per Airline data. Which is actually falling in (+/-) 30 days range.
By understanding your statements, I have made below example.
Airline Data
airline_data = pd.DataFrame({"Customer_Id":["xyz", "abc", "xyz"], "Country":["US", "MX", "CA"],"Date_Of_Travel":["10-20-2018", "03-04-2018", "04-05-2019"]})
print(airline_data)
Customer_Id Country Date_Of_Travel
xyz US 10-20-2018
abc MX 03-04-2018
xyz CA 04-05-2019
Hotel Data
Hotel_data = pd.DataFrame({"Customer_Id":["xyz","xyz", "xyz","abc", "abc"], "Country":["CA","US","US","PH","MX"],"Date_of_Booking":["03-30-2019", "05-01-2018", "10-15-2018", "10-20-2018", "03-01-2018"]})
print(Hotel_data)
Customer_Id Country Date_of_Booking
xyz CA 03-30-2019
xyz US 05-01-2018
xyz US 10-15-2018
abc PH 10-20-2018
abc MX 03-01-2018
Convert the Date column from string to Datetime object
airline_data["Date_Of_Travel"] = pd.to_datetime(airline_data["Date_Of_Travel"])
Hotel_data["Date_of_Booking"] = pd.to_datetime(Hotel_data["Date_of_Booking"])
Create two columns for Minimum date and Maximum date which is (+/-) 30 days.
airline_data["Min_date"] = (airline_data["Date_Of_Travel"]) - pd.Timedelta(days=30)
airline_data["Max_date"] = (airline_data["Date_Of_Travel"]) + pd.Timedelta(days=30)
Join two dataframes based on the Customer_id, Country to get the required data
df_1 = pd.merge(airline_data, Hotel_data, on=["Customer_Id", "Country"],how="left")
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30
Create a column which is satisfying the condition of Booking is +/- 30 days from Travel date.
df_1["Bool"] = (df_1.Date_of_Booking >= df_1.Min_date) & (df_1.Date_of_Booking <= df_1.Max_date)
df_1["Bool"] = df_1["Bool"].apply(int)
print(df_1)
Customer_Id Country Date_Of_Travel Min_date Max_date Date_of_Booking Bool
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-05-01 0
xyz US 2018-10-20 2018-09-20 2018-11-19 2018-10-15 1
abc MX 2018-03-04 2018-02-02 2018-04-03 2018-03-01 1
xyz CA 2019-04-05 2019-03-06 2019-05-05 2019-03-30 1
Now, apply groupby
on the df_1 and get the sum of the Bool to get the total bookings of Each customer in particular country.
Req_Results = df_1.groupby(["Customer_Id","Country","Date_Of_Travel","Date_of_Booking"]).sum().reset_index()
Req_Results = Req_Results[Req_Results.Bool!=0]
Req_Results.rename(columns={"Bool":"Total_Hotel_Bookings"},inplace=True)
print(Req_Results)
Customer_Id Country Date_Of_Travel Date_of_Booking Total_Hotel_Bookings
abc MX 2018-03-04 2018-03-01 1
xyz CA 2019-04-05 2019-03-30 1
xyz US 2018-10-20 2018-10-15 1
CLICK HERE to find out more related problems solutions.