It is no secret that machine learning models, especially deep learning models, need lots of training data. In the real world, unlabeled data is plentiful while labeled data is rare and costly to obtain. This is where active learning comes in: it is the task of choosing which data samples to annotate so as to minimize the number of annotations required to reach a given performance. Since most real-world applications are budget-constrained, active learning (AL) can help decrease the annotation cost.
At Kili Technology, we develop an annotation platform to quickly build production-ready datasets, covering almost all use cases. One of the features of our Python API is the ability to change the priority of the samples to annotate. Coupled with our query mechanism, which lets you import labels into your scripts, it enables the use of active learning algorithms. We show you how to do that here.
In this blog post, we will provide an overview of the current state (05/2020) of active learning. It is the first of a series on active learning.
The most common setting, where active learning is most useful, is when you have large amounts of unlabeled training data readily available. In this situation, you can iteratively select a subset of samples to annotate. You need three components:

- a pool of unlabeled samples;
- an annotation tool (or human annotators) to label the selected samples;
- a machine learning model $m$, together with a query strategy that decides which samples to send for annotation.
Then, the training process is as follows (see the schema above): train the model on the currently labeled samples, use the query strategy to select the most informative unlabeled samples, send them to the annotators, add the new labels to the training set, and repeat until the annotation budget is exhausted.
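To make this loop concrete, here is a minimal sketch in Python. The `annotate` function stands in for your labeling tool (at Kili, this would go through the API), `query_strategy` is any of the scoring functions discussed below, and a scikit-learn-style model exposing `fit` and `predict_proba` is assumed; all three are illustrative placeholders.

```python
import numpy as np

def active_learning_loop(model, X_pool, query_strategy, annotate,
                         budget, batch_size=10, seed=0):
    """Pool-based active learning loop (sketch)."""
    rng = np.random.default_rng(seed)
    # Bootstrap with a small random batch, since the model needs initial labels.
    labeled = list(rng.choice(len(X_pool), size=batch_size, replace=False))
    labels = [annotate(X_pool[i]) for i in labeled]

    while len(labeled) < budget:
        model.fit(X_pool[labeled], labels)
        # Score every still-unlabeled sample by informativeness.
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        scores = query_strategy(model.predict_proba(X_pool[unlabeled]))
        # Send the highest-scoring samples to the annotators.
        picked = unlabeled[np.argsort(scores)[-batch_size:]]
        labeled.extend(picked.tolist())
        labels.extend(annotate(X_pool[i]) for i in picked)
    return model, labeled, labels
```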
In the following, we will go through the active learning algorithms enabling this training process. Two references for the following work can be found here:
The first class of active learning algorithms is based on uncertainty sampling. They require the model $m$ to be a probabilistic learner, that is, one that outputs prediction probabilities: $m(x) = (\mathbb{P}(y_i|x))_{1 \leq i \leq n}$ if there are $n$ classes. These are the most widely used active learning algorithms, because they work directly with neural networks whose last layer has a softmax activation representing class probabilities. There are three main algorithms:

- Least confidence sampling: select the samples whose most likely predicted class has the lowest probability, $X_s = \underset{x}{\operatorname{argmax}}\, 1 - \mathbb{P}(\hat{y}|x)$, where $\hat{y} = \underset{y}{\operatorname{argmax}}\, \mathbb{P}(y|x)$.
- Margin sampling: select the samples with the smallest gap between the two most likely classes, $X_s = \underset{x}{\operatorname{argmin}}\, \mathbb{P}(\hat{y}_1|x) - \mathbb{P}(\hat{y}_2|x)$, where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most likely classes.
- Entropy sampling: select the samples whose predicted distribution has the highest entropy, $X_s = \underset{x}{\operatorname{argmax}} -\sum_y \mathbb{P}(y|x) \log \mathbb{P}(y|x)$.
To conclude: if you are interested in reducing the loss, use entropy-based uncertainty sampling; if you are interested in reducing the classification error, use margin sampling.
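Here is a minimal NumPy sketch of the three strategies, under the assumption that `probs` is the `(n_samples, n_classes)` array returned by the model. Each function returns a score where higher means more informative, so all three plug into the same selection loop.

```python
import numpy as np

def least_confidence(probs):
    # Higher score = the top prediction itself is uncertain.
    return 1.0 - probs.max(axis=1)

def margin_sampling(probs):
    # Higher score = smaller gap between the two most likely classes
    # (negated so that "higher = more informative" holds here too).
    sorted_probs = np.sort(probs, axis=1)
    return -(sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_sampling(probs):
    # Higher score = the whole predicted distribution is more spread out.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

probs = np.array([[0.50, 0.45, 0.05],   # ambiguous between two classes
                  [0.90, 0.05, 0.05]])  # confident prediction
print(least_confidence(probs))  # the first sample scores higher under all three
```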
A second class of algorithms is committee-based (also known as query by committee). They require:

- a committee of $K$ models $m_1, \ldots, m_K$ trained on the currently labeled data;
- a measure of disagreement between the committee members' predictions.
If your machine learning models are neural networks, you can either train $K$ networks with different random initializations (or on different bootstrap samples of the labeled data), or approximate a committee with a single network by keeping dropout active at inference time and sampling $K$ predictions.
To compute the disagreement, you can use the vote entropy:
$$
X_s = \underset{x}{\operatorname{argmax}} -\sum_y \frac{N_y(x)}{K}\log\frac{N_y(x)}{K}
$$
where $N_y(x)$ is the number of models that predicted class $y$ for the sample $x$.
You can also use the Kullback-Leibler divergence:
$$
X_s = \underset{x}{\operatorname{argmax}} \frac{1}{K} \sum_{k=1}^{K} D(m_k||m)(x)
$$
where $m(x)$ is the mean committee prediction, $m(x) = \frac{1}{K} \sum_{k=1}^{K} m_k(x)$, and the KL divergence is computed as $D(m_k||m)(x) = \sum_y \mathbb{P}_k(y|x) \log \frac{\mathbb{P}_k(y|x)}{\mathbb{P}(y|x)}$. You don't need dozens of models; in most cases, between $3$ and $5$ committee members are enough.
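Both disagreement measures are straightforward to implement. The sketch below assumes `committee_probs` stacks the $K$ members' predicted probabilities into a `(K, n_samples, n_classes)` array; a small constant guards the logarithms against zero probabilities.

```python
import numpy as np

def vote_entropy(committee_probs):
    K, _, n_classes = committee_probs.shape
    votes = committee_probs.argmax(axis=2)           # hard vote of each member
    # N_y(x) / K: fraction of the committee voting for each class y.
    vote_frac = np.stack([(votes == y).mean(axis=0) for y in range(n_classes)],
                         axis=1)
    return -np.sum(vote_frac * np.log(vote_frac + 1e-12), axis=1)

def kl_disagreement(committee_probs):
    consensus = committee_probs.mean(axis=0)         # m(x), the mean prediction
    # D(m_k || m) for each member k, then averaged over the committee.
    kl = np.sum(committee_probs
                * np.log((committee_probs + 1e-12) / (consensus + 1e-12)), axis=2)
    return kl.mean(axis=0)
```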
A third class of algorithms requires the model to be trained with gradient descent. Since neural networks are trained this way, these methods are directly applicable to deep learning. They are less prone to failing on outliers and have proven experimentally effective; however, they can be computationally expensive.
The first method computes a quantity called the expected model change, which, for a sample $x$, quantifies by how much the model would change if we added this sample to the training set. The question the algorithm answers is: which new labeled sample would most reduce the prediction error if we performed one optimization step with it? A common way to compute this quantity is the expected gradient length: since the true label of $x$ is unknown, we take the expectation of the norm of the loss gradient over the model's current class probabilities:
$$
X_s = \underset{x}{\operatorname{argmax}} \sum_c \mathbb{P}(c|x) \| \nabla l(\delta_c, \mathbb{P}(y|x)) \|
$$
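where $\delta_c$ is the one-hot encoding of class $c$ and $l$ is the training loss. A possible PyTorch sketch for scoring a single sample, assuming a classifier trained with cross-entropy, could look like this (note the cost the text warns about: one backward pass per class per candidate):

```python
import torch
import torch.nn.functional as F

def expected_gradient_length(model, x):
    """Expected norm of the loss gradient for one unlabeled sample x of shape (1, d)."""
    probs = F.softmax(model(x), dim=1).squeeze(0).detach()
    egl = 0.0
    for c in range(probs.shape[0]):
        model.zero_grad()
        # Pretend the label is c and measure how large the update would be.
        loss = F.cross_entropy(model(x), torch.tensor([c]))
        loss.backward()
        grad_sq = sum((p.grad ** 2).sum()
                      for p in model.parameters() if p.grad is not None)
        egl += probs[c].item() * grad_sq.sqrt().item()
    return egl
```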
The second method computes the expected error reduction. The question the algorithm answers is: by how much is the prediction error over all samples reduced if we re-train with that additional label? We want to maximize the increase in certainty across all labeled samples.
In that case:
$$
X_s = \underset{x}{\operatorname{argmin}} \sum_c \mathbb{P}(c|x) \sum_j E(x_j)
$$
where $E(x_j)$ is the error on sample $x_j$ if we trained with $x$ labeled. Of course, we don't have access to the label of $x$ and don't want to re-train for every candidate, so an approximation of $E(x_j)$ can be:
$$
E(x_j) \approx \sum_{\tilde{c}} \mathbb{P}(\tilde{c}|x_j) \nabla l(\delta_{\tilde{c}}, \mathbb{P}(\tilde{c}|x_j)) \cdot \nabla l(\delta_{c}, \mathbb{P}(c|x))
$$
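Extracting per-sample gradients is framework-specific, so the sketch below assumes they have already been flattened into NumPy arrays: `grads_x[c]` is the loss gradient for the candidate $x$ under pseudo-label $c$, and `grads_pool[j, c]` the same quantity for each sample $x_j$. Lower scores are better since we take an argmin.

```python
import numpy as np

def expected_error_reduction_score(probs_x, grads_x, probs_pool, grads_pool):
    """Gradient dot-product approximation of the expected error reduction.

    probs_x:    (n_classes,)               model probabilities for the candidate x
    grads_x:    (n_classes, n_params)      loss gradient of x for each pseudo-label
    probs_pool: (n_pool, n_classes)        model probabilities for each sample x_j
    grads_pool: (n_pool, n_classes, n_params)
    """
    score = 0.0
    for c, p_c in enumerate(probs_x):
        # sum_j E(x_j) ~= sum_j sum_c' P(c'|x_j) * <grad(x_j, c'), grad(x, c)>
        total_error = np.einsum('jc,jcp,p->', probs_pool, grads_pool, grads_x[c])
        score += p_c * total_error
    return score  # select X_s = argmin over candidates
```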
This method mixes local and global approaches: it uses the result of a local method (like uncertainty sampling) and combines it with the representativeness of the considered data point. The idea is that there is no use in knowing the label of a sample if it is an outlier. In that case:
$$
X_s = \underset{x}{\operatorname{argmax}}\ (\text{information brought by } x) \times \sum_{x_j} \text{similarity}(x, x_j)
$$
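As an illustration, here is a sketch of this density-weighted scheme, using prediction entropy as the local informativeness term and cosine similarity over feature vectors as the similarity term; both choices are assumptions, and any of the measures from the previous sections can be swapped in.

```python
import numpy as np

def density_weighted_scores(probs, X_unlabeled):
    """Local informativeness times representativeness in the pool."""
    # Local term: prediction entropy of each unlabeled sample.
    information = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Global term: mean cosine similarity to the rest of the pool,
    # so that isolated outliers get down-weighted.
    normed = X_unlabeled / np.linalg.norm(X_unlabeled, axis=1, keepdims=True)
    density = (normed @ normed.T).mean(axis=1)
    return information * density
```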
Other techniques exist; let's name a few:
Recent papers:
So far, we have only mentioned the case where the model is trained on a single-class classification task. However, there are many more settings where active learning can be useful: