Emails are the de facto standard for written communication between companies and customers. They remain the preferred channel of communication, with 91% of US consumers still using email daily. [1] As a result, businesses and individuals face massive volumes of noisy, unprioritized emails.
Customer services are no exception. They keep using emails because they are a straightforward way for clients to express themselves in natural language. In that context, being able to prioritize and classify emails is crucial for customer services. The customer service we worked with wanted emails to be automatically dispatched to the proper department thanks to machine learning: this low-added-value dispatching was previously done by hand and consumed hours of review every day.
From a data scientist’s perspective, what are the lessons learnt from eight months of working with the customer service of a leading bank to streamline their email classification pipeline?
Email classification is a classic natural language processing (NLP) task. However, you will face specific challenges when processing emails: the text is noisy (signatures, disclaimers, quoted threads), the vocabulary is specific to your business, classes are imbalanced, and the content drifts over time. [2]
To tackle all these issues, you will have to continuously annotate and train a model. No public dataset will help you train, as none of them contains your products or your vocabulary. The need to annotate and train on your very own emails emerges from the very beginning.
In our case, four parties were cooperating for the project’s success: the customer service agents annotating the emails, the business managers owning the process, the data science team, and Kili Technology providing the annotation platform.
Do not work in silos! All four parties must stay as agile as possible and cooperate throughout the project. That is why we favoured two-week Scrum iterations over monolithic work. [3]
Annotation plan. At the very beginning of the project, the annotation plan has to be formalized. The annotation plan consists of the categories of the classification along with written guidelines for annotating emails. Actually annotating 100 random emails will give you a sense of all the ambiguities and edge cases in the dataset. Too few categories would introduce high bias, and too many overlapping or ambiguous categories would lead to high variance in the dataset. It is important to co-create the annotation plan, as well as to share and explain it across the team.
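As an illustration, such a plan can live in a small versioned file next to the code. Below is a minimal sketch in Python; the category names and guidelines are hypothetical, not the actual plan of the project.

```python
# annotation_plan.py -- a minimal, versioned annotation plan (hypothetical categories).
ANNOTATION_PLAN = {
    "categories": {
        "CARD_ISSUE": "Problems with a card: loss, theft, blocked payments.",
        "ACCOUNT_INFO": "Requests about balance, statements or account settings.",
        "COMPLAINT": "Explicit dissatisfaction to be routed to the complaints team.",
        "OTHER": "Anything that does not fit the categories above.",
    },
    # Guidelines written after annotating the first ~100 random emails.
    "guidelines": [
        "When an email fits several categories, pick the one driving the customer's main request.",
        "Ignore signatures and quoted previous messages when deciding the category.",
    ],
}
```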
Scale annotation. What should you expect from an annotation platform when doing email classification? At the very least, it should keep annotators productive, let several people work on the same dataset, track the quality of their labels, and export those labels programmatically into the training pipeline.
The goal was to have 10,000 emails annotated by four people working full-time at the customer service. The customer service used to annotate with Excel, which allows none of these expectations. That is why they turned to Kili Technology and increased their productivity by a factor of 2, saving time at every step of the annotation phase.
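Once the labels are exported from the platform, turning them into a training set takes a few lines. The sketch below assumes a JSON-lines export with one record per email; the field names are assumptions, not Kili Technology’s actual schema.

```python
import json

import pandas as pd

def load_annotations(path: str) -> pd.DataFrame:
    """Load an exported label file (assumed format: one JSON object per line)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            records.append({"text": record["body"], "label": record["category"]})
    return pd.DataFrame(records)

# Example: df = load_annotations("annotations.jsonl")
```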
Pick the performance measure. As far as the evaluation metric is concerned, accuracy will not work well if classes are strongly imbalanced (which is practically always the case on real-world data). Recall, on the other hand, will only account for the coverage of the positive classes. The metric you should aim for is the F1-score, because it balances the concerns of both precision and recall. [4]
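With scikit-learn, the macro-averaged F1-score weighs every class equally, which is what you want on an imbalanced email dataset. A toy example (the labels are made up):

```python
from sklearn.metrics import f1_score

y_true = ["card", "card", "account", "complaint", "card"]
y_pred = ["card", "account", "account", "complaint", "card"]

# Macro-averaging computes the F1-score per class and then averages the scores,
# so rare classes count as much as frequent ones.
print(f1_score(y_true, y_pred, average="macro"))
```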
Development pipeline. Machine learning consists of iteratively picking data, a preprocessing scheme and a model, seeing how they perform with regard to your performance metric, and repeating.
We aimed for a fully automatic pipeline in order to be able to repeat experiments and launch new ones very quickly. The idea is to seamlessly recompute the business metrics after any change in the model or the data, so that every experiment likely to gain performance can be tried at low cost.
Every choice has to be backed by cross-validated metrics, to see how it impacts the final model.
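A minimal sketch of one such experiment with scikit-learn, assuming the annotated emails are already loaded as lists of texts and labels (a TF-IDF plus logistic regression baseline is only one possible choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def evaluate(texts, labels):
    """Cross-validate one (preprocessing, model) combination and return its mean macro F1-score."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), max_features=50_000)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
    return scores.mean()
```

Changing the vectorizer, the model or the data only changes one line, and the cross-validated metric tells you whether the change is worth keeping.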
Models. As far as the model is concerned, we focused on easy-to-use, packaged frameworks with out-of-the-box solutions and, preferably, APIs for automated machine learning (AutoML). [5] For transparency and interpretability towards the business, we began with the most traditional techniques.
Do not fine-tune any model until you have found a strategy that is better by an order of magnitude in percentage points (+5-10%). In most cases, fine-tuning a particular model will only win you a point or less (+0-1%).
Track quality all along. Your model will only perform as well as your dataset does. To track quality along the annotation phase, you can use consensus (several annotators label the same email and their agreement is measured), honeypots (emails whose ground truth is known are slipped into the annotation queue), and review (a lead annotator validates a sample of the finished annotations).
These three checkpoints are the occasion to check how relevant your annotation plan is, and whether every member of the team understood it. That is also why you should not compare your final model’s performance to some theoretical threshold, but to the annotators’ performance. Business managers of machine learning projects always expect top performance from models in production. In our use case, the model would not go to production if its F1-score was under 80%. For some categories, however, we were able to show that the inter-annotator agreement did not even meet this requirement. In other words, Kili Technology showed that not even a human was capable of completing the task with this level of confidence. An important lesson here was to compare the inter-annotator confusion matrix with the model’s confusion matrix.
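A sketch of these comparisons with scikit-learn, assuming a sample of emails labelled independently by two annotators and scored by the model (the labels below are toy values):

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Same sample of emails, in the same order.
annotator_a = ["card", "card", "complaint", "account", "other"]
annotator_b = ["card", "account", "complaint", "account", "other"]
model_preds = ["card", "card", "complaint", "card", "other"]

# Inter-annotator agreement: if humans disagree here, do not expect the model to do better.
print("Cohen's kappa (A vs B):", cohen_kappa_score(annotator_a, annotator_b))

labels = ["account", "card", "complaint", "other"]
# Confusion between the two annotators vs confusion between an annotator and the model.
print(confusion_matrix(annotator_a, annotator_b, labels=labels))
print(confusion_matrix(annotator_a, model_preds, labels=labels))
```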
Having a robust development pipeline will allow you to scale to more use cases within your company. Today, our client delivers a new email classification project in weeks.
The production pipeline typically re-uses some of the code written for the development pipeline. For instance, preprocessing steps will often be the same.
The pipeline should be thoroughly tested to guarantee its resilience to all kinds of inputs. Unit tests are particularly relevant to test all separate pieces of a pipeline.
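For instance, a preprocessing step shared between the development and production pipelines can be covered by a plain pytest test. The clean_email() helper below is hypothetical, not the project’s actual code:

```python
import re

def clean_email(body: str) -> str:
    """Hypothetical shared preprocessing: drop quoted replies and normalize whitespace."""
    body = re.split(r"\n>", body)[0]       # keep only the text before the first quoted line
    return re.sub(r"\s+", " ", body).strip()

def test_clean_email_handles_quotes_and_empty_input():
    assert clean_email("Hello,\n> previous message") == "Hello,"
    assert clean_email("") == ""            # the pipeline must survive empty emails
```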
However, unit tests protect you against code regressions, not data regressions. Robust production pipelines also encompass strong safeguards against data outage and concept drift. We will detail more about the topic of monitoring ML models in production in a future blog post.
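Without anticipating that post, even a very simple safeguard already catches obvious issues: comparing the distribution of predicted classes in production with the distribution seen at training time. A minimal sketch (the 15% threshold is an arbitrary assumption):

```python
from collections import Counter

def class_distribution(labels):
    """Share of each class in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def drift_alerts(train_labels, predicted_labels, threshold=0.15):
    """Return the classes whose share in production drifted away from the training distribution."""
    train_dist = class_distribution(train_labels)
    prod_dist = class_distribution(predicted_labels)
    return {
        label: abs(prod_dist.get(label, 0.0) - share)
        for label, share in train_dist.items()
        if abs(prod_dist.get(label, 0.0) - share) > threshold
    }
```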
In conclusion, using Kili Technology allowed the customer service to increase its productivity by a factor of 2 on the most critical phase: the annotation phase. Kili Technology made it possible to work frictionlessly in an agile setting throughout the project by synchronizing all actors, from the business people to the data science team. Once in production, automating email dispatching with machine learning saves hours of human review every day.
[1] Aufreiter, N., Boudet, J., Weng, V., Why marketers should keep sending you e-mails (2014), https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/why-marketers-should-keep-sending-you-emails
[2] Katakis, I., Tsoumakas, G. & Vlahavas, I. Tracking recurring contexts using ensemble classifiers: an application to email filtering (2010), https://doi.org/10.1007/s10115-009-0206-2
[3] Sutherland, J., Schwaber, K., The Scrum Guide, https://www.scrumguides.org
[4] Precision and recall, Wikipedia, https://en.wikipedia.org/wiki/Precision_and_recall
[5] He, X., Zhao, K., Chu, X., AutoML: A survey of the state-of-the-art (2020), https://doi.org/10.1016/j.knosys.2020.106622