Image annotation plays an important role in training a machine to automatically assign relevant metadata to a digital picture. This metadata often includes captions, keywords, location markers, or any combination of these details. The process is required for creating the datasets used to train the deep learning models of computer vision applications.
Many of these computer vision tools are used in the image retrieval systems of e-commerce platforms, social media sites, and similar organizations. They're also used in the multimedia databases of public offices and private institutions. These applications make it quicker, easier and less costly to organize, update and serve the right multimedia content from the local and remote databases of these organizations.
Computer vision applications are also used in robotics, security, manufacturing and many other industries. For example, many factories around the world use machines to supplement the productivity of their workers. Some of these tools are equipped with cameras and computer vision capabilities for automatically spotting issues in the output of manufacturing equipment, and they alert workers about these problems in real time. Computer vision is also used in self-driving cars and facial recognition apps.
Artificial neural networks were designed in the mid-1940s as algorithms that aim to mimic how the human brain processes information from sensory input. As newer techniques and more powerful computers emerged, the line between narrow AI (artificial intelligence) and artificial general intelligence (AGI) became much clearer. The industry then settled on a more appropriate name for the field of artificial neural network research geared towards narrow AI: this is now known as machine learning.
Afterwards, deep learning (DL) was born as a subfield of machine learning, driven by the advent of even more powerful computers and much quicker access to larger volumes of data.
In many ways, deep learning is about scaling up the capabilities of machine learning models. Another objective of this subfield is to work around the overfitting issues of other machine learning methods. By training larger artificial neural networks on larger quantities of data with more capable computers, the performance of these networks tends to keep increasing over time.
A digital image is represented as an array of pixels. Tools like TensorFlow and PyTorch convert it into tensor data, which is essentially a large array of matrices. For example, an 8-bit digital image is made up of pixels whose values range from 0 to 255.
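For illustration, here's a minimal sketch of this conversion using PIL and torchvision; the file name sample.jpg is just a placeholder.

```python
# Minimal sketch: loading an 8-bit RGB image and converting it to a tensor.
# The file path "sample.jpg" is a placeholder, not a real dataset file.
from PIL import Image
import torchvision.transforms as T

image = Image.open("sample.jpg").convert("RGB")   # pixel values range from 0 to 255
to_tensor = T.ToTensor()                          # scales values to the 0.0-1.0 range
tensor = to_tensor(image)                         # shape: (3, height, width)

print(tensor.shape, tensor.min().item(), tensor.max().item())
```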
Image datasets are often used for training a computer vision application's deep learning model, which is usually a convolutional neural network. The convolution process is used to efficiently perform feature extraction on images. It is often likened to convolving light of one color with a filter of a different color, which results in an energy spectrum in the filter's color. For example, when white light is passed through a blue filter, a spectrum of energy in blue is produced.
Convolution is a mathematical operation that's fundamental to the techniques used in deep learning model architectures. It maps out the similarity between two signals, which is also described as an energy function. This is why convolution simplifies the process of extracting features from training data and unseen inputs.
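As a rough illustration of how a convolution responds to a visual feature, here's a small sketch that convolves a tiny grayscale image with a vertical-edge kernel using NumPy and SciPy; the pixel values are made up.

```python
# Sketch: 2D convolution of a small grayscale image with a vertical-edge kernel.
import numpy as np
from scipy.signal import convolve2d

image = np.array([[0, 0, 255, 255],
                  [0, 0, 255, 255],
                  [0, 0, 255, 255],
                  [0, 0, 255, 255]], dtype=float)   # dark left half, bright right half

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)        # responds strongly to vertical edges

response = convolve2d(image, kernel, mode="valid")
print(response)   # large magnitudes where the dark-to-bright edge sits
```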
This process extracts features from images, such as edges, contours, colors and textures. Some of these features are discovered automatically by a deep learning model during training. A variety of algorithms are also used for explicit feature extraction, and two of the most common classical ones are SIFT and SURF.
A deep learning model tries to map these features as key descriptors of the images in the dataset during training. Meanwhile, the model searches for these key descriptors in an image from the dataset during validation, or in an unseen image during inference.
Generally, image annotation methods for datasets of computer vision applications fall under three main categories:
1. Retrieval-Based Image Annotation
Also known as content-based image retrieval (CBIR), this method classifies images as semantically relevant when they share similar visual features.
CBIR systems are often designed to use an image’s texture, color and shape to compare it against sets of ground truth images. Each set has a concept label, which is its particular semantic classification. For example, a set can fall under the concept label “dog”, while another under “cat”, and so on.
During training, an image is classified under the concept label assigned to a corresponding image set. The process completes once the CBIR system identifies the set whose visual features are most similar to those of the image. However, the dataset used to train the artificial neural network limits its ability to find similar abstract concepts. The deep learning model is also restricted from finding hidden relationships between the sets and the image, because it's bound by the pre-defined classifications provided for each set of ground truth images.
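To make the idea concrete, here's a hedged sketch of retrieval-based annotation using simple color-histogram features and a nearest-set comparison; the function names and the labelled_images structure are purely illustrative, not a reference implementation of any particular CBIR system.

```python
# Sketch: retrieval-based annotation with color-histogram features.
# `labelled_images` maps a concept label to its set of ground-truth images (NumPy arrays);
# all names here are illustrative.
import numpy as np

def color_histogram(image, bins=8):
    """Flatten an RGB image into a normalized per-channel histogram feature vector."""
    hist = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    feat = np.concatenate(hist).astype(float)
    return feat / feat.sum()

def annotate(query_image, labelled_images):
    """Return the concept label of the set whose features are most similar to the query."""
    query = color_histogram(query_image)
    best_label, best_score = None, -1.0
    for label, images in labelled_images.items():
        # average similarity (dot product) against every ground-truth image in the set
        score = np.mean([query @ color_histogram(img) for img in images])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```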
2. Classification-Based Image Annotation
This is a method where multiple classifiers are used to annotate and categorize an image. It follows a supervised learning approach: each classifier is built around a visual characteristic of the images identified through the feature extraction process during training.
Each visual feature is treated as an independent semantic concept label and is assigned a particular class, which in turn is used as a unique classifier. SVM (support vector machine) and Bayesian methods are the most common algorithms used for this approach.
SVM is known to be more efficient when a model is trained on small datasets. This algorithm is commonly used when training a deep learning model for computer vision applications through a supervised learning approach. It's primarily used for resolving classification problems, though it also works quite well for regression tasks.
With SVM, the value of each feature extracted from an image is mapped to a particular coordinate. These data points are plotted in a space whose dimensionality matches the number of visual features extracted from the images in the dataset during training. The computer vision application's deep learning model treats this dimensional space as the representation of an image.
This dimensional space consists of all extracted visual features, which are grouped and assigned a particular semantic concept label. During training, the SVM tries to find, on its own, the hyperplane that best separates each class from all other classes. During validation and inference, it classifies a new data point based on which side of that hyperplane it falls, which completes the classification task.
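Here's a minimal sketch of this classification-based approach using scikit-learn's SVC on synthetic feature vectors; the two "cat" and "dog" clusters are made up purely for illustration.

```python
# Sketch: classification-based annotation with a support vector machine.
# X would hold feature vectors extracted from training images, y their concept labels;
# the data below is synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 128)), rng.normal(3, 1, (50, 128))])  # two feature clusters
y = np.array(["cat"] * 50 + ["dog"] * 50)

clf = SVC(kernel="rbf")        # the separating boundary is found in the kernel-induced space
clf.fit(X, y)                  # training: fit the hyperplane that separates the classes

print(clf.predict(X[:1]))      # inference: assign a concept label to a feature vector
```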
Meanwhile, the Bayesian approach, as its name implies, is based on Bayes' theorem, which ties together conditional, joint and marginal probabilities. Bayes' theorem offers an alternative way to calculate a conditional probability: the reverse conditional probability of a given marginal, joint or conditional problem is used to calculate the conditional probability of interest, and vice versa.
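For comparison, here's an equally small sketch of the Bayesian route, using Gaussian naive Bayes from scikit-learn on the same kind of synthetic features; again, the data is illustrative only.

```python
# Sketch: the Bayesian alternative, using Gaussian naive Bayes on synthetic features.
# P(label | features) is proportional to P(features | label) * P(label)  (Bayes' theorem).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 128)), rng.normal(3, 1, (50, 128))])
y = np.array(["cat"] * 50 + ["dog"] * 50)

model = GaussianNB().fit(X, y)
print(model.predict_proba(X[:1]))   # posterior probability of each concept label
```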
3. Probabilistic-Based Image Annotation
This is where correlations are estimated between an image’s visual features and its most probable concept labels. Such a process is based on the term-term relationships between these visual features and semantic classifications.
A match is found based on the similarities of an image’s features and available concept labels. This method is often used for resolving homograph and synonym problems.
Algorithmic approaches that are often used for probabilistic-based image annotation include LSA (latent semantic analysis), the co-occurrence model, HMM (hidden Markov model) and PLSA (probabilistic latent semantic analysis).
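As a toy illustration of the co-occurrence idea, the sketch below estimates P(label | visual word) from made-up annotation counts; the visual words and numbers are invented for the example and don't come from any real dataset.

```python
# Sketch: a co-occurrence style estimate of P(label | visual word) from annotation counts.
# `counts[w][l]` would count how often visual word w appears in images carrying label l;
# the numbers here are made up for illustration.
counts = {
    "furry_texture": {"dog": 40, "cat": 35, "car": 1},
    "wheel_shape":   {"dog": 2,  "cat": 1,  "car": 60},
}

def label_probabilities(visual_word):
    word_counts = counts[visual_word]
    total = sum(word_counts.values())
    return {label: c / total for label, c in word_counts.items()}

print(label_probabilities("wheel_shape"))   # "car" dominates, as expected
```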
As described earlier, image annotation is the process of labeling target objects within a digital image's regions of interest. This is performed to train a machine to recognize objects of the same classes in unseen images and visual scenes. However, this method can be quite challenging, because there are different approaches to developing deep learning model architectures and techniques for training a machine to do this.
This means we should learn about today’s most frequently used image annotation types and methods. Here they are:
1. Bounding Boxes
This is a simple yet versatile type of image annotation, which is the primary reason why it's among the most widely used techniques for annotating images in a dataset for a computer vision application's deep learning model. As its name implies, objects of interest are enclosed in bounding boxes. An image is annotated with the X and Y coordinates of the top-left and bottom-right corners of the bounding box that encloses each object of interest.
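A bounding-box annotation might be stored like the following sketch; the dictionary layout and file name are hypothetical, since every annotation tool defines its own schema.

```python
# Sketch: a bounding-box annotation stored as the top-left and bottom-right corners,
# in a made-up dictionary format (real tools each define their own schema).
annotation = {
    "image": "street_scene.jpg",           # placeholder file name
    "label": "car",
    "bbox": {"x_min": 120, "y_min": 85,    # top-left corner, in pixels
             "x_max": 310, "y_max": 220},  # bottom-right corner, in pixels
}

width = annotation["bbox"]["x_max"] - annotation["bbox"]["x_min"]
height = annotation["bbox"]["y_max"] - annotation["bbox"]["y_min"]
print(width, height)   # 190 135
```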
2. Semantic Segmentation
This image annotation method is where each pixel in an image is assigned a particular semantic concept label. The image is initially marked with the objective of separating it into individual regions, which are then annotated with different semantic labels. For example, each pixel in one region is assigned the label "road", while the pixels in another region are annotated with the concept label "sky".
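In code, a semantic-segmentation annotation is simply a per-pixel mask of class IDs, as in this small sketch with made-up class indices.

```python
# Sketch: a semantic-segmentation annotation is a mask the same size as the image,
# with one class index per pixel. The class IDs here are arbitrary.
import numpy as np

CLASSES = {0: "sky", 1: "road"}

mask = np.zeros((4, 6), dtype=np.uint8)   # tiny 4x6 "image" for illustration
mask[2:, :] = 1                           # bottom rows labelled "road", top rows stay "sky"

for class_id, name in CLASSES.items():
    print(name, int((mask == class_id).sum()), "pixels")
```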
3. Polygonal Segmentation
Complex polygons are used in place of simple bounding boxes for this image annotation method. This is known to increase model accuracy, in terms of finding the locations of objects within a region of interest in the image. In turn, this is also known to improve object classification accuracy. That’s because this technique cleans up and removes the noise around the object of interest, which is the set of unnecessary pixels around the object that tends to confuse classifiers.
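A polygon annotation boils down to an ordered list of vertices tracing the object's outline, which can be rasterized into a mask when needed; the sketch below uses PIL with illustrative coordinates.

```python
# Sketch: a polygon annotation is an ordered list of (x, y) vertices along the object's
# outline; here it's rasterized into a binary mask. The coordinates are illustrative.
from PIL import Image, ImageDraw

polygon = [(30, 10), (80, 15), (90, 70), (40, 85), (20, 50)]   # made-up vertices

mask = Image.new("L", (100, 100), 0)           # blank 100x100 mask
ImageDraw.Draw(mask).polygon(polygon, fill=1)  # pixels inside the outline become 1

print(sum(mask.getdata()))   # number of pixels covered by the object
```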
4. Line Annotation
Lines and splines are used for this image annotation method to mark the boundaries of a region of interest within an image that contains the target object. This is often used when regions of interest containing target objects are too thin or too small for bounding boxes.
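A line or spline annotation can be stored as an ordered list of points, as in this illustrative sketch of a lane marking; the format is hypothetical.

```python
# Sketch: a line/spline annotation is an ordered sequence of points along a thin structure,
# here an illustrative lane marking. The file name and format are placeholders.
lane_marking = {
    "image": "highway.jpg",
    "label": "lane_boundary",
    "polyline": [(12, 340), (150, 310), (290, 285), (430, 265)],  # (x, y) pixels, left to right
}

print(len(lane_marking["polyline"]), "points along the line")
```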
5. 3D Cuboids
This is an image annotation method that’s commonly used for target objects in 3D scenes and photos. As its name implies, the difference between this method and bounding boxes is that annotations for this technique include depth, and not just height and width.
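One common way to store a cuboid is a center point, the box dimensions and a rotation angle, as in the hedged sketch below; the field names and units are assumptions, not a standard.

```python
# Sketch: a 3D cuboid annotation adds depth to a 2D box. Here it's stored as a center point,
# the box dimensions and a yaw angle; this format and its units are illustrative only.
cuboid = {
    "label": "car",
    "center": {"x": 4.2, "y": 1.1, "z": 12.7},   # position in metres (assumed units)
    "size": {"width": 1.8, "height": 1.5, "depth": 4.3},
    "yaw": 0.35,                                  # rotation around the vertical axis, in radians
}

volume = cuboid["size"]["width"] * cuboid["size"]["height"] * cuboid["size"]["depth"]
print(round(volume, 2), "cubic metres")
```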
6. Landmark Annotation
Also known as dot annotation, this method uses dots as annotations around target objects, which are enclosed by the image's individual regions of interest. This is frequently used for finding and classifying target objects that are surrounded by, or contain, much smaller objects. Plus, this is often used to mark the outline of the target object.
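A landmark annotation is essentially a list of named keypoints, as in this illustrative sketch of facial landmarks; the point names and coordinates are made up.

```python
# Sketch: a landmark (dot) annotation is a list of named keypoints, here for a face.
# The file name, point names and coordinates are invented for illustration.
landmarks = {
    "image": "portrait.jpg",
    "label": "face",
    "points": [
        {"name": "left_eye",    "x": 130, "y": 95},
        {"name": "right_eye",   "x": 180, "y": 96},
        {"name": "nose_tip",    "x": 155, "y": 130},
        {"name": "mouth_left",  "x": 138, "y": 160},
        {"name": "mouth_right", "x": 172, "y": 161},
    ],
}

print(len(landmarks["points"]), "keypoints annotated")
```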
These are the different image annotation types and methods that are commonly used today. Datasets of digital images for the deep learning models of computer vision applications are annotated through these techniques. The method used should match the architecture of the deep learning model and the use case of the computer vision tool. Any of these image annotation tasks can consume a lot of time and resources. Specially designed image annotation tools like Kili Playground offer multi-purpose benefits to data scientists, AI researchers and machine learning developers.