Through personalized travel recommendations, Upside’s machine learning team makes it their primary mission to reduce customers’ cognitive load during the booking process. Simply put, we want to be able to predict the exact flight, hotel, and rental car that our customers ultimately choose on our site. Most online travel agencies (like Priceline or Kayak) initially present their inventory by lowest price. We’ve discovered that this approach is not ideal for business travelers, who have fundamentally different needs and constraints than leisure travelers when it comes to things like flight duration, takeoff time, and number of layovers.
One important measure used in many IR problems is called “top N recall” (sometimes referred to as “recall rate @N”). Essentially, this measures the percentage of the time that our customers purchased a flight that appeared in the top N results (where N is any positive integer). So, we can pose the flight sorting optimization problem as: “create a machine learning model that improves the top N recall as compared to the baseline sort.” For those of you who work on IR-type problems, what objective functions have you found to be most predictive?
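To make the metric concrete, here is a minimal sketch of how recall @N can be computed, assuming we have already recorded the (1-based) rank position of the purchased flight in each booking session. The function and variable names are illustrative, not our production code.

```python
from typing import Sequence

def top_n_recall(purchase_ranks: Sequence[int], n: int) -> float:
    """Fraction of sessions where the purchased flight appeared in the top n results.

    purchase_ranks: the 1-based rank of the purchased flight within each
    session's sorted result list.
    """
    if not purchase_ranks:
        return 0.0
    hits = sum(1 for rank in purchase_ranks if rank <= n)
    return hits / len(purchase_ranks)

# Example: ranks of the purchased flight across five booking sessions.
ranks = [1, 4, 12, 3, 27]
print(top_n_recall(ranks, n=10))  # 0.6 -> the purchase was in the top 10 results 60% of the time
```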
Challenger
Once we had our trained machine learning model (i.e., the “challenger”), the next step was to deploy it in production and compare its performance against the existing model (i.e., the “champion”). At Upside, we use Optimizely to run our A/B tests, with conversion rate as the primary metric for determining a winner. Our test took approximately two weeks to reach statistical significance. At completion, we found that the machine learning model increased conversion rate from 4.65% to 5.67%, which is almost a 22% improvement!
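As a sanity check on numbers like these, a two-proportion z-test is one common way to gauge significance for a conversion-rate A/B test (Optimizely handles this analysis for us). The sketch below is illustrative only: the sample sizes are placeholders, not our actual traffic; only the two conversion rates come from the experiment above.

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
    return z, p_value

# Placeholder sample sizes; the observed rates match the post (4.65% vs 5.67%).
n_a, n_b = 20_000, 20_000
conv_a, conv_b = round(0.0465 * n_a), round(0.0567 * n_b)

z, p = two_proportion_ztest(conv_a, n_a, conv_b, n_b)
lift = (conv_b / n_b) / (conv_a / n_a) - 1
print(f"relative lift: {lift:.1%}")   # ~21.9%, i.e. "almost 22%"
print(f"z = {z:.2f}, p = {p:.4f}")
```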
Conversion rate is a great business KPI, but as I mentioned earlier, we are really interested in the top N recall metric. As such, we plotted the top N recall values predicted by a simulation on the validation portion of the training data and compared them against the actual values measured from the live A/B experiment. The figure below shows that we improved top 10 recall from 22% to 38%, which is approximately a 73% improvement! It also shows that our predicted values were very close to the actual values for the challenger model. We believe the discrepancy in the champion’s performance is due to the fact that our flight inventory improved significantly between the time we trained the model and the completion of the A/B test. Simulations of champion and challenger performance (i.e., the dotted lines) were crucial in making the decision to push our machine learning model to production.
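For readers who want to try this kind of champion/challenger simulation themselves, here is a rough sketch of how recall @N curves can be generated from held-out sessions. Everything in it is hypothetical: the toy flight features, the stand-in scoring functions, and the synthetic purchase rule are placeholders for our real inventory and model.

```python
import random

def recall_at_n_curve(sessions, ranker, max_n=10):
    """Simulated recall @N for a ranker over held-out sessions.

    Each session is (candidate_flights, purchased_flight); ranker maps a
    flight to a score, and higher-scoring flights are shown earlier.
    """
    hits = [0] * (max_n + 1)
    for candidates, purchased in sessions:
        ranked = sorted(candidates, key=ranker, reverse=True)
        rank = ranked.index(purchased) + 1  # 1-based position of the purchase
        for n in range(rank, max_n + 1):    # a hit at rank r counts for every N >= r
            hits[n] += 1
    return [h / len(sessions) for h in hits[1:]]

# Toy data: 30 candidate flights per session with made-up features.
def make_session(rng):
    flights = [{"price": rng.uniform(100, 900),
                "duration_hr": rng.uniform(1, 12),
                "stops": rng.randint(0, 2)} for _ in range(30)]
    # Hypothetical purchase rule: business travelers trade price for
    # convenience, with some noise so no ranker is perfect.
    purchased = min(flights,
                    key=lambda f: f["duration_hr"] + 2 * f["stops"] + rng.gauss(0, 2))
    return flights, purchased

rng = random.Random(0)
sessions = [make_session(rng) for _ in range(500)]

champion = lambda f: -f["price"]                             # lowest-price-first baseline
challenger = lambda f: -(f["duration_hr"] + 2 * f["stops"])  # stand-in for the model's score

for name, ranker in (("champion", champion), ("challenger", challenger)):
    curve = recall_at_n_curve(sessions, ranker)
    print(name, " ".join(f"{r:.2f}" for r in curve))
```

In a real comparison, the rankers would be the live baseline sort and the trained model scoring actual validation sessions; the shape of the resulting curves is what fed the dotted lines in the figure above.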