It is no secret that machine learning models, especially deep learning models, need lots of training data. In the real world, unlabeled data is plentiful while labeled data is rare and costly to obtain. This is where active learning comes in: it is the task of choosing which data samples to annotate so as to minimize the number of annotations required to reach a given performance. Since most real-world applications are budget-constrained, active learning (AL) can help decrease the annotation cost.
At Kili Technology, we develop an annotation platform to quickly build production-ready datasets, covering almost all use cases. One of the features of our Python API is the ability to change the priority of the samples to annotate. Coupled with our query mechanism, which lets you import labels into your scripts, it enables the use of active learning algorithms. We show you how to do that here.
In this blog post, we will provide an overview of the current state (05/2020) of active learning. It is the first of a series on active learning.
The most common setting, where active learning is most useful, is when you have large amounts of unlabeled training data readily available. In this situation, you can iteratively select a subset of samples to annotate. You need three components:

- a pool of unlabeled samples;
- an annotation tool (or human annotators) to label the selected samples;
- a machine learning model $m$, together with a query strategy that decides which samples to send for annotation.
Then, the training process is as follows (see the schema above): train the model on the currently labeled samples, use the query strategy to select the most informative unlabeled samples, send them to the annotators, add the new labels to the training set, and repeat until the annotation budget is exhausted.
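To make this loop concrete, here is a minimal sketch in Python. The `annotate` function stands in for your labeling tool (at Kili, this would go through the API), `query_strategy` is any of the scoring functions discussed below, and a scikit-learn-style model exposing `fit` and `predict_proba` is assumed; all three are illustrative placeholders.

```python
import numpy as np

def active_learning_loop(model, X_pool, query_strategy, annotate,
                         budget, batch_size=10, seed=0):
    """Pool-based active learning loop (sketch)."""
    rng = np.random.default_rng(seed)
    # Bootstrap with a small random batch, since the model needs initial labels.
    labeled = list(rng.choice(len(X_pool), size=batch_size, replace=False))
    labels = [annotate(X_pool[i]) for i in labeled]

    while len(labeled) < budget:
        model.fit(X_pool[labeled], labels)
        # Score every still-unlabeled sample by informativeness.
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        scores = query_strategy(model.predict_proba(X_pool[unlabeled]))
        # Send the highest-scoring samples to the annotators.
        picked = unlabeled[np.argsort(scores)[-batch_size:]]
        labeled.extend(picked.tolist())
        labels.extend(annotate(X_pool[i]) for i in picked)
    return model, labeled, labels
```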
In the following, we will go through the active learning algorithms enabling this training process. Two references for the following work can be found here:
The first class of active learning algorithms is based on uncertainty sampling. They require the model $m$ to be a probabilistic learner, that is, one that outputs prediction probabilities: $m(x) = (\mathbb{P}(y_i|x))_{1 \leq i \leq n}$ if there are $n$ classes. These are the most widely used active learning algorithms, because they work directly with neural networks whose last layer has a softmax activation representing class probabilities. There are three main algorithms:

- Least confidence sampling: select the samples whose most likely predicted class has the lowest probability, $X_s = \underset{x}{\operatorname{argmax}}\, 1 - \mathbb{P}(\hat{y}|x)$, where $\hat{y} = \underset{y}{\operatorname{argmax}}\, \mathbb{P}(y|x)$.
- Margin sampling: select the samples with the smallest gap between the two most likely classes, $X_s = \underset{x}{\operatorname{argmin}}\, \mathbb{P}(\hat{y}_1|x) - \mathbb{P}(\hat{y}_2|x)$, where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most likely classes.
- Entropy sampling: select the samples whose predicted distribution has the highest entropy, $X_s = \underset{x}{\operatorname{argmax}} -\sum_y \mathbb{P}(y|x) \log \mathbb{P}(y|x)$.
To conclude: if you are interested in reducing the loss, use entropy-based uncertainty sampling; if you are interested in reducing the classification error, use margin sampling.
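Here is a minimal NumPy sketch of the three strategies, under the assumption that `probs` is the `(n_samples, n_classes)` array returned by the model. Each function returns a score where higher means more informative, so all three plug into the same selection loop.

```python
import numpy as np

def least_confidence(probs):
    # Higher score = the top prediction itself is uncertain.
    return 1.0 - probs.max(axis=1)

def margin_sampling(probs):
    # Higher score = smaller gap between the two most likely classes
    # (negated so that "higher = more informative" holds here too).
    sorted_probs = np.sort(probs, axis=1)
    return -(sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_sampling(probs):
    # Higher score = the whole predicted distribution is more spread out.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

probs = np.array([[0.50, 0.45, 0.05],   # ambiguous between two classes
                  [0.90, 0.05, 0.05]])  # confident prediction
print(least_confidence(probs))  # the first sample scores higher under all three
```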
A second class of algorithms is committee-based (also known as query by committee). They require:

- a committee of $K$ models $m_1, \ldots, m_K$ trained on the currently labeled data;
- a measure of disagreement between the committee members' predictions.
If your machine learning models are neural networks, you can either train $K$ networks with different random initializations (or on different bootstrap samples of the labeled data), or approximate a committee with a single network by keeping dropout active at inference time and sampling $K$ predictions.
To compute the disagreement, you can use the vote entropy:
$$
X_s = \underset{x}{\operatorname{argmax}} -\sum_y \frac{N_y(x)}{K}\log\frac{N_y(x)}{K}
$$
where $N_y(x)$ is the number of models that predicted class $y$ for the sample $x$.
You can also use the Kullback-Leibler divergence:
$$
X_s = \underset{x}{\operatorname{argmax}} \frac{1}{K} \sum_{k=1}^{K} D(m_k||m)(x)
$$
where $m(x)$ is the mean committee prediction, $m(x) = \frac{1}{K} \sum_{k=1}^{K} m_k(x)$, and the KL divergence is computed as $D(m_k||m)(x) = \sum_y \mathbb{P}_k(y|x) \log \frac{\mathbb{P}_k(y|x)}{\mathbb{P}(y|x)}$. You don't need dozens of models; in most cases, between $3$ and $5$ committee members are enough.
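Both disagreement measures are straightforward to implement. The sketch below assumes `committee_probs` stacks the $K$ members' predicted probabilities into a `(K, n_samples, n_classes)` array; a small constant guards the logarithms against zero probabilities.

```python
import numpy as np

def vote_entropy(committee_probs):
    K, _, n_classes = committee_probs.shape
    votes = committee_probs.argmax(axis=2)           # hard vote of each member
    # N_y(x) / K: fraction of the committee voting for each class y.
    vote_frac = np.stack([(votes == y).mean(axis=0) for y in range(n_classes)],
                         axis=1)
    return -np.sum(vote_frac * np.log(vote_frac + 1e-12), axis=1)

def kl_disagreement(committee_probs):
    consensus = committee_probs.mean(axis=0)         # m(x), the mean prediction
    # D(m_k || m) for each member k, then averaged over the committee.
    kl = np.sum(committee_probs
                * np.log((committee_probs + 1e-12) / (consensus + 1e-12)), axis=2)
    return kl.mean(axis=0)
```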
A third class of algorithms requires the model to be trained with gradient descent. Since neural networks are trained this way, these methods are directly applicable to deep learning. They are less prone to failing on outliers and have proven experimentally effective; however, they can be computationally expensive.
The first method computes a quantity called the expected model change, which, for a sample $x$, quantifies by how much the model would change if we added this sample to the training set. The question the algorithm answers is: which new labeled sample would most reduce the prediction error if we performed one optimization step with it? A common way to compute this quantity is the expected gradient length: since the true label of $x$ is unknown, we take the expectation of the norm of the loss gradient over the model's current class probabilities:
$$
X_s = \underset{x}{\operatorname{argmax}} \sum_c \mathbb{P}(c|x) \| \nabla l(\delta_c, \mathbb{P}(y|x)) \|
$$
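where $\delta_c$ is the one-hot encoding of class $c$ and $l$ is the training loss. A possible PyTorch sketch for scoring a single sample, assuming a classifier trained with cross-entropy, could look like this (note the cost the text warns about: one backward pass per class per candidate):

```python
import torch
import torch.nn.functional as F

def expected_gradient_length(model, x):
    """Expected norm of the loss gradient for one unlabeled sample x of shape (1, d)."""
    probs = F.softmax(model(x), dim=1).squeeze(0).detach()
    egl = 0.0
    for c in range(probs.shape[0]):
        model.zero_grad()
        # Pretend the label is c and measure how large the update would be.
        loss = F.cross_entropy(model(x), torch.tensor([c]))
        loss.backward()
        grad_sq = sum((p.grad ** 2).sum()
                      for p in model.parameters() if p.grad is not None)
        egl += probs[c].item() * grad_sq.sqrt().item()
    return egl
```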
The second method computes the expected error reduction. The question the algorithm answers is: by how much is the prediction error over all samples reduced if we re-train with that additional label? We want to maximize the increase in certainty across all labeled samples.
In that case:
$$
X_s = \underset{x}{\operatorname{argmin}} \sum_c \mathbb{P}(c|x) \sum_j E(x_j)
$$
where $E(x_j)$ is the error on sample $x_j$ if we trained with $x$ labeled. Of course, we don't have access to the label of $x$ and don't want to re-train for every candidate, so an approximation of $E(x_j)$ can be:
$$
E(x_j) \approx \sum_{\tilde{c}} \mathbb{P}(\tilde{c}|x_j) \nabla l(\delta_{\tilde{c}}, \mathbb{P}(\tilde{c}|x_j)) \cdot \nabla l(\delta_{c}, \mathbb{P}(c|x))
$$
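Extracting per-sample gradients is framework-specific, so the sketch below assumes they have already been flattened into NumPy arrays: `grads_x[c]` is the loss gradient for the candidate $x$ under pseudo-label $c$, and `grads_pool[j, c]` the same quantity for each sample $x_j$. Lower scores are better since we take an argmin.

```python
import numpy as np

def expected_error_reduction_score(probs_x, grads_x, probs_pool, grads_pool):
    """Gradient dot-product approximation of the expected error reduction.

    probs_x:    (n_classes,)               model probabilities for the candidate x
    grads_x:    (n_classes, n_params)      loss gradient of x for each pseudo-label
    probs_pool: (n_pool, n_classes)        model probabilities for each sample x_j
    grads_pool: (n_pool, n_classes, n_params)
    """
    score = 0.0
    for c, p_c in enumerate(probs_x):
        # sum_j E(x_j) ~= sum_j sum_c' P(c'|x_j) * <grad(x_j, c'), grad(x, c)>
        total_error = np.einsum('jc,jcp,p->', probs_pool, grads_pool, grads_x[c])
        score += p_c * total_error
    return score  # select X_s = argmin over candidates
```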
This method mixes local and global approaches: it uses the result of a local method (like uncertainty sampling) and combines it with the representativeness of the considered data point. The idea is that there is no use in knowing the label of a sample if it is an outlier. In that case:
$$
X_s = \underset{x}{\operatorname{argmax}}\ (\text{information brought by } x) \times \sum_{x_j} \text{similarity}(x, x_j)
$$
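As an illustration, here is a sketch of this density-weighted scheme, using prediction entropy as the local informativeness term and cosine similarity over feature vectors as the similarity term; both choices are assumptions, and any of the measures from the previous sections can be swapped in.

```python
import numpy as np

def density_weighted_scores(probs, X_unlabeled):
    """Local informativeness times representativeness in the pool."""
    # Local term: prediction entropy of each unlabeled sample.
    information = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Global term: mean cosine similarity to the rest of the pool,
    # so that isolated outliers get down-weighted.
    normed = X_unlabeled / np.linalg.norm(X_unlabeled, axis=1, keepdims=True)
    density = (normed @ normed.T).mean(axis=1)
    return information * density
```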
Other techniques exist; let's name a few:
Recent papers:
So far, we have only mentioned the case where the model is trained on a single-class classification task. However, there are many more settings where active learning can be useful: