Netflix developed a new machine learning algorithm based on reinforcement learning to create an optimal list of recommendations considering a finite time budget for the user. In a recommendation use case, often the factor of finite time to make a decision is ignored, Netflix added this dimension to its recommendation system and in general in decision problems, in addition to the relevance of the recommendations.
The evaluation problem can be measured in amount of time, because the user takes some time to choose what he or she wants to see: reading trailers, showing the preview and so on and different shows require different amounts of evaluation time. This time budget can be considered in the recommendation system, so the recommendation model should construct the recommendation list considering the relevance of the items and the evaluation cost for the user. The user’s time budget may not be directly observable (like the preferences) but the goal for the recommendation algorithm is to construct the recommendation list that has a higher chance of engagement for the user. The recommender system needs to learn the user’s time budget in addition to the user’s latent preferences.
A typical recommendation system works with a bandit style approach to list construction and it constructs a slate of K items as following:
Image source: A bandit style recommender system for slate construction
The item scorer scores all the available N items and may use the slate constructed so far as additional context. The scores are then passed through a sampler that selects an item from the available items. This is the way how the scorer and the sampling step, the main components of a recommender system, insert an element at slot K in the slate.
The recommendation system presents a one dimensional list of K elements to the user (in a simplified setting), the user has a time budget modeled as a positive real number. The problem can be modeled with two features: relevance and cost. When the user evaluates an item of the list the cost (time) is consumed and when the budget ends the user can’t evaluate other suggested items. Each item has a probability between 0 and 1 to be consumed and the probability to choose an item depends not only from the relevance of the item but also the number of items the user is able to examine.
A recommendation system trying to maximize the user’s engagement with the slate needs to pack in as many relevant items as possible within the user budget, by making a trade-off between relevance and cost.
The problem is connected to the 0/1 Knapsack problem in theoretical computer science: the goal is to find the subset of items with the highest total utility such that the total cost of the subset is not greater than the budget. The 0/1 knapsack problem is NP-Complete, there are many approximate solutions. Netflix proposes to model the budget constrained recommendation as a Markov Decision process and use a Reinforcement Learning to find a solution. In the Markov Decision process the key concept is the current state and the action taken by the agent. In this problem the recommender system is the agent and the interaction between user and recommender system is modeled as the environment. The recommender system (agent) constructs the list of recommendations by repeatedly selecting the k items deemed appropriate at each slot. The environment (interaction between user and recommendation system) is characterized by the time budget and the items examined in the list at particular steps. In the real world the user’s time budget is unknown and can be estimated by the user’s historical data (eg how long the user scrolled before abandoning in the historical data logs). The reinforcement learning algorithm used for this problem is based on estimating the return, specifically using Temporal Difference learning to estimate the value function.
The simulation is very helpful to study and better understand the problem. With simulation, various recommendation algorithms are trained and compared. To evaluate the performances a simple metric is used at is the average number of success of the generated stater referred to as play rate. In addition to play rate, it is important to consider the effective slate size: one of the ways to improve the play-rate is to construct later effective slates, this metric is important to understand the mechanism of recommendation algorithms.
Thanks to the flexibility of simulation and setting of simulation parameters, the Netflix team learned how to construct optimal slates in an on-policy manner using SARSA algorithm. Comparing the RL model and contextual bandit the performances are much better for the reinforcement learning approach both in order to effective slate size both to play rate. In particular, for play rate, the result is statistically significant increase in play rate for small to medium budget users.
the on-policy learning is easy to simulate but it is difficult to execute in realistic recommender settings because the data are generated by a different policy (behavior policy) and the goal is to learn a new (better) policy from this data. In this case the Q learning is the technique that allows the learn optimal values function in an off-policy setting.
Q-learning and SARSA of on-policy setup are compared and the result is that Q-learning seems to be generating very large effective slate sizes without much difference in play rate. This result is interesting and still unclear, this is why it needs more investigation to be fully understood.