Personalized Customer Service Request Stream Routing using Deep Reinforcement Learning
By Zining Liu, Chong Long, Xiaolu Lu, Zehong Hu, Jie Zhang, Yafang Wang
Courtesy: facebook
Abstract
Customer service is critical to every company, as it directly affects brand reputation. Because they serve a large number of customers, e-commerce companies often provide multiple communication channels to answer customers' questions, for example, a chatbot and a hotline. On one hand, each channel has limited capacity to respond to customers' requests; on the other hand, customers have different preferences over these channels. Current production systems are mainly built on business rules, which only crudely balance resources against customers' satisfaction. To achieve the optimal tradeoff between resources and customers' satisfaction, we propose a new framework based on deep reinforcement learning, which directly takes both resources and a user model into account. In addition to the framework, we also propose a new deep-reinforcement-learning-based routing method: double duelling deep Q-learning with prioritized experience replay (PER-DoDDQN).
The quality of customer service is crucial to a company's reputation: it is measured by how quickly a company responds to customers' requests and how satisfied customers are when seeking help. Obviously, a customer's satisfaction cannot be measured by problem-solving quality alone. In practice, a company also uses customers' queuing time as an indicator of satisfaction. To shorten waiting time, major companies often provide multiple communication channels for customers to choose from, for example, a mobile app, web-based messaging and the traditional hotline. Different communication channels have their own limited quotas for responding to customers' requests, and also demand different cognitive loads. Customers are impatient, especially when they have a pressing request, and they want to express their need with the least effort.
The proposed framework stands in sharp contrast to the rule-based system, as our framework directly captures (i) customers' preferences and (ii) each channel's future traffic. Customers have different preferences over different communication channels, and a simple routing method may result in low satisfaction. For example, if the problem is not urgent, a student may be willing to leave messages at the offline service desk if the current hotline is busy; someone who is doubtful about the chatbot may stick to the hotline channel regardless of the waiting time. Recommending a preferable surrogate channel to customers is essential, as it affects the overall customer experience. Another key to solving the allocation problem is being able to predict each channel's traffic over the next time window. The request stream often arrives at high speed, especially during peak hours at a large corporation, and a system can fail miserably if it ignores the time-series nature of the request stream. For example, if our prediction suggests that the hotline will not be busy in the next time window, we can let customers be served by the hotline if they prefer; similarly, if the prediction shows the hotline's capacity will be exceeded in the next time window, we may try to route customers' requests to alternative channels if possible. Hence, the framework proposed in this paper seeks a tradeoff among channel capacity, user preference and the predicted traffic of each channel.
The contributions are three-fold:
1) We model the customer service request routing problem using deep reinforcement learning, considering both channel resources and customers' satisfaction.
2) We propose the double duelling deep Q-learning with prioritized experience replay (PER-DoDDQN) method to solve the routing problem, which achieves better performance than its counterparts in practice.
3) We perform an extensive evaluation using both real and synthetic data to demonstrate the practical value of our proposed methods.
Application Background
Modern businesses often provide several communication channels for customers' convenience, ranging from traditional call center service and online chatbots to mobile apps. Besides the traditional call center (or hotline) service, other channels may be served by a mixture of automatic chatbots and human customer service representatives. Apparently, the hotline service has the least capacity to deal with customers' requests, yet it is the channel preferred by a majority of customers due to its low communication cost. Other channels face a similar trade-off. For example, it is easy for users to access a mobile app to issue customer service requests, but it requires a lot of cognitive effort for users to describe their information need precisely, and the interaction is often less efficient than the hotline service. While the mobile app may not be preferred most of the time, for simple requests customers may resolve them more efficiently by selecting or browsing pre-selected questions. Similarly, the online interface requires customers to input their requests precisely, which may deter customers from using this channel. However, compared to the mobile app and the hotline service, the web interface provides a way for customers to interact with each other, which may help reduce the workload of customer service representatives to some extent. When a large number of customers make requests to one channel, the channel's capacity may be exceeded, and customers need to wait a long time. In this case, most customer service platforms will redirect users to other channels; however, users are likely to reject such recommendations.
Baseline Customer Service Routing System
To compare with our proposed RL-based algorithm, we consider a real customer service routing system as an example, shown in Figure 1. This is a real production system adopted by a large financial technology company and is designed around business rules. The system considers two communication channels: self-service and hotline, where self-service lets customers resolve simple requests on their own and the hotline is the traditional call-based customer service. The system works as follows: when a customer calls in, the system always asks whether the customer would like to use self-service, since the hotline may be congested and the customer's question may be simple. If the customer agrees, he or she is routed to the self-service channel; otherwise, the system makes a decision depending on the current queue length of the hotline channel. A customer is always directed to the hotline service when no one is waiting in the queue. However, when the hotline channel is busy and there is a queue, the system randomly selects a set of customers and asks whether they are willing to switch to the drainage channel. The exact number of customers chosen to be asked for a second preference depends on the current length of the hotline queue: the longer the queue, the more customers are selected.
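For concreteness, here is a minimal Python sketch of the rule-based baseline described above. The queue-length-to-probability mapping and the divisor of 50 are illustrative assumptions, not values taken from the production system.

```python
import random

def rule_based_route(agrees_to_self_service: bool, hotline_queue_len: int) -> str:
    """Route one incoming call under the rule-based baseline.

    Returns the channel the customer is sent to: 'self-service', 'hotline',
    or 'drainage' (i.e., asked to switch to the drainage channel).
    """
    # Step 1: the system always offers self-service first.
    if agrees_to_self_service:
        return "self-service"

    # Step 2: if nobody is waiting, the customer goes straight to the hotline.
    if hotline_queue_len == 0:
        return "hotline"

    # Step 3: otherwise, a fraction of customers is asked to switch to the
    # drainage channel; the longer the queue, the larger that fraction.
    # The mapping below is an illustrative assumption.
    ask_probability = min(1.0, hotline_queue_len / 50.0)
    if random.random() < ask_probability:
        return "drainage"
    return "hotline"
```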
The Routing Model
The overall routing model is based on deep reinforcement learning, more specifically on DQN variants. A deep neural network with parameters θ approximates the action-value function qθ(s, a), i.e., the expected return of taking action a in state s.
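As an illustration of how such a value function can be parameterized, below is a minimal dueling Q-network sketch in PyTorch. The class name, layer sizes and hidden width are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: Q(s, a) is decomposed into a state value V(s)
    and per-action advantages A(s, a)."""

    def __init__(self, state_dim: int, n_channels: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                # V(s)
        self.advantage_head = nn.Linear(hidden, n_channels)   # A(s, a), one per channel

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        value = self.value_head(h)
        advantage = self.advantage_head(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)  (identifiability constraint)
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```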
We first describe how we formulate the three key elements of any reinforcement learning model (action, state and reward function), and then describe our proposed PER-DoDDQN in Algorithm 1.

Action. The system aims to learn a policy that determines which channel should be recommended to a user, so the action is to select a channel among n candidates, where n is the total number of channels in the real application. We represent the action as a ∈ {1, 2, ..., n}.

State. A state is expected to express the customer's channel preference and each channel's capability to handle further requests. The state therefore includes the customer's preferences over the channels u, the channel capacities c, and the channels' predicted request traffic êt. The entire state is represented as s = ⟨u, êt, c⟩.

Reward Function. There are two aspects the reward needs to capture: the customer's perspective (preference for the recommended channel) and the channel capacity, and the reward combines both.
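To make the PER-DoDDQN update concrete, below is a minimal sketch of one training step that combines a double DQN target with proportional prioritized experience replay. The buffer implementation, function names and hyperparameters (alpha, beta, gamma) are illustrative assumptions; the paper's Algorithm 1 may differ in details.

```python
import numpy as np
import torch

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay (simplified, no sum-tree)."""

    def __init__(self, capacity: int, alpha: float = 0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, transition, priority: float = 1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size: int, beta: float = 0.4):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        batch = [self.data[i] for i in idx]
        return batch, idx, torch.as_tensor(weights, dtype=torch.float32)

    def update_priorities(self, idx, td_errors, eps: float = 1e-5):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(float(err)) + eps


def per_doddqn_step(online_net, target_net, optimizer, buffer,
                    batch_size: int = 32, gamma: float = 0.99) -> float:
    """One PER-DoDDQN update: double DQN target on a prioritized batch."""
    batch, idx, weights = buffer.sample(batch_size)
    s, a, r, s_next, done = zip(*batch)
    s = torch.as_tensor(np.stack(s), dtype=torch.float32)
    s_next = torch.as_tensor(np.stack(s_next), dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double DQN: the online net selects the action, the target net evaluates it.
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next

    td_error = target - q
    loss = (weights * td_error.pow(2)).mean()   # importance-weighted MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.update_priorities(idx, td_error.detach().numpy())
    return loss.item()
```

The double DQN target decouples action selection (online network) from action evaluation (target network), which reduces the overestimation bias of vanilla DQN, while the prioritized buffer replays transitions with large TD errors more often.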
Results on Synthetic Data
We show evaluation results on the synthetic dataset in Table 2 and Figure 5. As the hotline channel is the main concern in the real production system, we first focus on the absolute congestion percentage of this particular channel, with results in Figure 5. We show the comparison among variants of our proposed PER-DoDDQN in Figure 5(a). As we can see, our environment model is crucial for finding the optimal routing plan: the variant PER-DoDDQN-e, which lacks flow estimation, is consistently worse than the others. Removing user modeling also hurts performance, but only slightly on the synthetic data. We then compare our method to other DQN variants in Figure 5(b). The trend is clear: our proposed PER-DoDDQN is the best among all. We can also see that prioritized experience replay plays an important role in the model, without which performance degrades considerably.

We then consider the overall evaluation using the proposed metrics in Table 2. Among all evaluation metrics, CCR is the most important one, and the proposed PER-DoDDQN is the best across all configurations; this trend also holds when evaluating with AC. DoDDQN performs well when considering peak congestion, achieving similar but slightly worse performance than its PER variant. A simple DQN is the best if we only consider the drainage percentage or the average free rate; however, this simply reflects a non-optimal routing plan. Further, the results of DoDDQN and DDQN suggest that our proposed PER-DoDDQN is more effective than using any of its components alone.

We also compare the RL models with heuristic and supervised baselines, including simulated annealing (SA), SVM, KNN, CNN and the rule-based production system. As shown in Figure 6 and Table 3, the traditional machine learning methods (SA, SVM, KNN, CNN) perform worse than the RL algorithms, especially on the metrics CCR, AC and PC, because these models do not consider long-term gains. The RL framework takes into account not just the immediate reward but also the future impact of the selected action, whereas the other ML models recommend channels based solely on the current state. Note that the average free rate (AFR), i.e., the average idle degree of service staff, can be very low when customers are more likely to be assigned to the hotline regardless of channel capacity, and the percentages of customers who accept the switch-to-self-service suggestions (SP) and switch-to-app suggestions (DP) are independent of the future state. Supervised machine learning methods only consider the current state, so they can perform similarly to or even better than the RL models on one or two of these three metrics, but in general they cannot beat the RL algorithms across all metrics.
Comments