What is Video Annotation for Deep Learning?

Written by Fx Leduc | Mar 12, 2020 2:38:54 AM

Video annotation is the process of labelling video clips to prepare them as datasets for training deep learning (DL) and machine learning (ML) models. The trained neural networks are then used in computer vision applications, such as automatic video classification tools.

ML is a field of artificial intelligence (AI) research whose roots go back to the early 1940s, when artificial neural networks were first developed to simulate the neural functions of the human brain. Today, however, ML is categorized under narrow AI research, which remains largely distinct from AGI (artificial general intelligence).

Deep learning, meanwhile, is a subfield of ML that deals with larger artificial neural networks trained on much bigger volumes of data. The subfield took off once more powerful computers became available for training ML models.

Computer vision applications, in turn, are tools that use ML and DL models to process visual data. They include facial recognition and person identification apps, image classification tools, and automatic video labelling platforms, among others. These are now integrated into many back-end and customer-facing systems of enterprises, government offices, SMEs, and independent research groups.

What is Automatic Video Labelling?

Automatic video labelling relies on ML and DL models that have been trained on datasets prepared for this computer vision task. Video sequences fed to such a trained model are automatically categorized under a set of classes. For example, a camera security system powered by a video labelling model can identify persons and objects, recognize faces, and classify human actions or activities, among other things.

Automatic video labelling is quite similar to ML- and DL-powered image labelling. The difference is that video labelling applications process sequential visual data (2D frames plus a time dimension), often in real time. However, some data scientists and AI development groups simply process each frame of a real-time video stream individually, labelling each video sequence (group of frames) with an image classification model.

That's because these automatic video labelling models share much of their architecture with the artificial neural networks behind image classification tools and other related computer vision applications. Similar algorithms are also used in the supervised, unsupervised, and reinforcement learning modes applied while training these models. This strategy often works quite well, but for some use cases, significant visual information, especially motion across frames, is lost at this pre-processing stage.
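
To make that concrete, here is a minimal Python sketch of the frame-wise approach: each frame of a clip is passed to an image classifier and each sequence of frames gets the majority label. The classify_frame callable, the OpenCV dependency, and the 30-frame sequence length are illustrative assumptions, not part of any particular product.

```python
import cv2  # OpenCV, assumed available, for reading video frames


def label_video(path, classify_frame, frames_per_sequence=30):
    """Label a video by classifying individual frames with an image model
    and assigning each sequence (group of frames) the majority label."""
    cap = cv2.VideoCapture(path)
    frame_labels, sequence_labels = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of the video stream
        frame_labels.append(classify_frame(frame))  # per-frame prediction
        if len(frame_labels) == frames_per_sequence:
            # Most frequent per-frame label becomes the sequence label
            sequence_labels.append(max(set(frame_labels), key=frame_labels.count))
            frame_labels = []
    cap.release()
    return sequence_labels  # any trailing partial sequence is simply dropped
```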

Frame-by-Frame Video Annotation for Deep Learning

As mentioned, annotating video datasets is largely similar to preparing image datasets for the DL models behind computer vision applications. The primary difference is that videos are processed as frame-by-frame image data. For example, a 60-second video clip with a frame rate of 30 fps (frames per second) consists of 1,800 video frames, which can be treated as 1,800 static images.

It can therefore take a significant number of hours to annotate a single 60-second video clip. Now imagine doing this for a dataset that collectively holds over 100 hours of video. This is why most ML and DL development groups opt to annotate a given frame and then do so again only after a considerable number of frames have elapsed. Many keep an eye out for certain indicators, such as significant changes to the foreground or background of the current video sequence, so that they annotate only its most relevant parts.
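
As a rough illustration of that keyframe-picking strategy, the sketch below flags frames for annotation whenever the scene changes noticeably, using a simple mean pixel difference. The threshold, the minimum gap between selected frames, and the use of OpenCV are assumptions made for illustration, not a prescribed workflow.

```python
import cv2
import numpy as np


def select_frames_to_annotate(path, diff_threshold=30.0, min_gap=30):
    """Return indices of frames worth annotating, based on scene changes."""
    cap = cv2.VideoCapture(path)
    selected, prev_gray, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None:
            selected.append(index)  # always annotate the very first frame
        else:
            # Mean absolute pixel difference as a crude scene-change score
            change = float(np.mean(cv2.absdiff(gray, prev_gray)))
            if change > diff_threshold and index - selected[-1] >= min_gap:
                selected.append(index)
        prev_gray = gray
        index += 1
    cap.release()
    return selected
```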

For example, if frame 1 of a 60-second, 30-fps video shows car brand X and model Y, a variety of image annotation techniques can be used to label the region of interest for classifying the automobile's brand and model, including 2D and 3D annotation methods. However, if it's also crucial for your use case to annotate background objects, such as for semantic segmentation, then the scenes and objects around the car in the same frame are annotated as well.

Then, depending on your preferences and goals, you can annotate the next frames in which there are significant changes to the foreground or background objects. You can also choose to annotate frame X if there aren't any considerable visual changes after Y seconds.

However, important information can be lost through this method when training your ML or DL model. That's why it's recommended to incorporate interpolation techniques while annotating your video dataset. This can help you complete your annotation requirements much more quickly and cost-effectively, and it can also improve the performance of ML and DL networks for automatic video labelling applications and computer vision tools.
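
One common form of interpolation in annotation tools is filling in object labels between two manually annotated keyframes. The minimal sketch below assumes axis-aligned bounding boxes stored as (x, y, width, height) tuples and uses plain linear interpolation; real tools may rely on object tracking or optical flow instead.

```python
def interpolate_boxes(box_start, box_end, frame_start, frame_end):
    """Linearly interpolate a bounding box for every frame strictly between
    two annotated keyframes."""
    boxes = {}
    span = frame_end - frame_start
    for frame in range(frame_start + 1, frame_end):
        t = (frame - frame_start) / span
        boxes[frame] = tuple(
            (1 - t) * a + t * b for a, b in zip(box_start, box_end)
        )
    return boxes


# Example: a car annotated at frame 0 and frame 30 gets boxes for frames 1-29.
interpolated = interpolate_boxes((100, 50, 80, 40), (160, 55, 85, 42), 0, 30)
```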

How to Interpolate Data While Annotating Video Datasets

Interpolation is a technique built into many image and video editing tools. It's also included as a toolset in many digital motion graphics and 2D or 3D animation programs.

Simply put, interpolation in this context is the process of generating synthesized video data between 2 existing frames, while extrapolation involves creating synthesized frames after the existing video data. Both draw on relevant features of the original video data.

This means video interpolation can also be used to generate clearer visual data when the previous and next frames are blurred or otherwise problematic. A common algorithm for generating interpolated video frames is optical flow estimation, in which the motion of pixels between the previous and next frames is analyzed and used to predict where newly synthesized pixels should appear.
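
Here is a minimal sketch of that idea using OpenCV's Farneback optical flow estimator: it estimates dense motion between the previous and next frames, then warps the previous frame halfway along the estimated motion to approximate the in-between frame. The half-flow backward warp is a simplification; production interpolators handle occlusions and evaluate the flow more carefully.

```python
import cv2
import numpy as np


def interpolate_midframe(prev_bgr, next_bgr):
    """Approximate the frame halfway between two consecutive frames."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Dense optical flow from the previous frame to the next frame
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Sample the previous frame half a motion vector "back" for each pixel,
    # a common approximation of the true mid-frame correspondence
    map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_bgr, map_x, map_y, cv2.INTER_LINEAR)
```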

By implementing these techniques, you’ll be able to improve the overall quality of your video datasets. This applies both to annotated and un-annotated data.

State-of-the-Art Methods for Video Dataset Interpolation in Deep Learning

1. Adaptive Separable Convolution

This method relies on a DL model with a fully convolutional neural network (FCNN). The network synthesizes a pair of 1D kernels for every pixel of the output video frame, estimated from the 2 input frames that are fed to the DL model during training.

This also allows the network to synthesize the entire video frame in one pass, which makes it possible to use a perceptual loss during training. As a result, clearer, higher-quality video frames are generated for your dataset.
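
As a rough illustration, a perceptual loss compares the synthesized frame and the ground-truth frame in the feature space of a pretrained network rather than pixel by pixel. The sketch below assumes a pretrained VGG16 from torchvision and an arbitrarily chosen feature layer; it is not the exact loss used in the published method.

```python
import torch.nn as nn
import torchvision


class PerceptualLoss(nn.Module):
    """Compare two frames in the feature space of a frozen, pretrained VGG16.

    Inputs are assumed to be 3-channel tensors already normalized the way
    the VGG weights expect.
    """

    def __init__(self, layer_index=16):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="DEFAULT").features[:layer_index]
        for param in vgg.parameters():
            param.requires_grad = False  # the loss network stays frozen
        self.vgg = vgg.eval()
        self.mse = nn.MSELoss()

    def forward(self, synthesized_frame, ground_truth_frame):
        # Distance between feature maps rather than between raw pixels
        return self.mse(self.vgg(synthesized_frame), self.vgg(ground_truth_frame))
```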

A very similar method, known as spatially-adaptive separable convolution, also synthesizes new video frames between a starting input frame and an ending input frame. The input frames are convolved to calculate each newly synthesized pixel's color and other properties, which is done by estimating a pair of 2D convolution kernels.

These kernels depend on the newly synthesized pixel, which makes it important to capture both motion and re-sampling information. Information flow is directed into 4 sub-networks, each of which estimates a 1D kernel, for a total of 4 sets of 1D kernels.

The network is built from 3×3 convolutional layers, each followed by a Rectified Linear Unit (ReLU). A DL model trained with this method typically uses the AdaMax optimizer with a decaying learning rate that starts at 0.001 and a mini-batch size of 16 samples. Random cropping is also often used to augment the video dataset as a pre-processing step, which helps reduce model bias.
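
The core operation, applying an estimated pair of 1D kernels at every output pixel, can be sketched in PyTorch as follows. The kernel tensors would come from the sub-networks described above; their shapes and the use of torch.nn.functional.unfold are assumptions made for illustration. For the training setup mentioned above, torch.optim.Adamax with a learning rate of 0.001 would be the matching PyTorch optimizer.

```python
import torch
import torch.nn.functional as F


def apply_separable_kernels(frame1, frame2, k1v, k1h, k2v, k2h):
    """Synthesize an output frame from per-pixel separable kernels.

    frame1, frame2: (B, C, H, W) input frames.
    kXv, kXh: (B, n, H, W) vertical/horizontal 1D kernels per pixel (n odd).
    Each output pixel is the sum, over both frames, of its n x n neighborhood
    weighted by the outer product of that pixel's two 1D kernels.
    """
    b, c, h, w = frame1.shape
    n = k1v.shape[1]
    out = torch.zeros_like(frame1)
    for frame, kv, kh in ((frame1, k1v, k1h), (frame2, k2v, k2h)):
        # Extract an n x n patch around every pixel: (B, C*n*n, H*W)
        patches = F.unfold(frame, kernel_size=n, padding=n // 2)
        patches = patches.view(b, c, n, n, h, w)
        # Outer product of the 1D kernels forms an n x n kernel per pixel
        kernels = kv.view(b, 1, n, 1, h, w) * kh.view(b, 1, 1, n, h, w)
        out = out + (patches * kernels).sum(dim=(2, 3))
    return out
```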

2. Adaptive Convolution

This technique estimates motion and synthesizes pixels at the same time. Keep in mind that interpolating video frames is usually a 2-step process involving motion estimation followed by pixel synthesis; the adaptive convolution method reduces this to a single, simultaneous step.

This is done through a DL model with an FCNN, in which the model estimates a spatially-adaptive convolution kernel for each pixel of the interpolated video frame, drawn from the two receptive field patches centered on that pixel (one from each input frame).

The model then synthesizes each pixel by convolving the input patches with its estimated kernel, which also captures motion and re-sampling coefficients.

The FCNN is composed of several convolutional layers and down-convolutions, which are used in place of max-pooling layers. ReLUs are commonly used as activation functions, and batch normalization is applied for regularization.
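
The sketch below shows the flavor of such a network in PyTorch: strided "down" convolutions in place of max-pooling, batch normalization, and ReLU activations, ending in a layer that outputs one flattened convolution kernel per output pixel. The layer counts, channel widths, and 41×41 kernel size are assumptions for illustration, not the published configuration.

```python
import torch.nn as nn


def conv_block(in_ch, out_ch, stride=1):
    # Strided "down" convolutions stand in for max-pooling layers
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


kernel_size = 41  # assumed size of the per-pixel 2D kernel being estimated

adaptive_conv_net = nn.Sequential(
    conv_block(6, 32),             # the two RGB input frames stacked channel-wise
    conv_block(32, 64, stride=2),  # down convolution
    conv_block(64, 128, stride=2),
    conv_block(128, 128),
    nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
    # One flattened 2D kernel per output pixel, to be applied to the
    # receptive field patches of the input frames
    nn.Conv2d(128, kernel_size * kernel_size, kernel_size=3, padding=1),
)
```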

3. Deep Voxel Flow (DVF)

This method interpolates video frames through an end-to-end, fully differentiable DL model. The network is a fully convolutional encoder-decoder whose architecture consists of 3 convolution layers, 1 bottleneck layer, and 3 deconvolution layers, with a voxel flow layer integrated on top. The voxel flow is drawn from the input video frames' space and time features.

Through this technique, the pixels of an interpolated video frame are synthesized from the values of nearby pixels in the input frames. Each output pixel is produced through trilinear interpolation over the input video frames, and the DL model is commonly trained through unsupervised learning.

This trilinear interpolation often produces crisper, smoother video frames. During training, the model is fed 2 video frames as input, while the reconstruction target is taken from the frame between them. Simply put, a frame is reconstructed from voxels borrowed from nearby frames.
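
The synthesis step can be sketched with PyTorch's grid_sample: the predicted voxel flow shifts the sampling locations in the two input frames (bilinear in space), and a mask blends the two warped frames (linear in time), which together amount to trilinear interpolation. Tensor shapes and sign conventions here are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def voxel_flow_synthesis(frame0, frame1, flow, mask):
    """Synthesize an intermediate frame from a predicted voxel flow.

    frame0, frame1: (B, C, H, W) input frames.
    flow: (B, 2, H, W) spatial offsets in pixels predicted by the network.
    mask: (B, 1, H, W) values in [0, 1] blending the two warped frames
          (the temporal component of the trilinear sampling).
    """
    b, _, h, w = frame0.shape
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).to(frame0)
    # Convert pixel offsets into normalized offsets
    norm_flow = torch.stack(
        (flow[:, 0] / ((w - 1) / 2), flow[:, 1] / ((h - 1) / 2)), dim=-1
    )
    # Bilinear sampling in space: look into each input frame along the flow
    warped0 = F.grid_sample(frame0, base - norm_flow, align_corners=True)
    warped1 = F.grid_sample(frame1, base + norm_flow, align_corners=True)
    # Linear blending in time completes the trilinear interpolation
    return mask * warped0 + (1 - mask) * warped1
```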

4. Bidirectional Predictive Network (BiPN) for Long-Term Video Interpolation

This method synthesizes a series of video frames from a pair of non-consecutive input frames. Simply put, the intermediate frames are estimated and synthesized from 2 opposite directions: future-forward from the start frame and past-backward from the end frame.

The technique uses a DL model with a bidirectional predictive network, built on a convolutional encoder-decoder architecture. During training, the model is fed 2 non-consecutive video frames as input. One is set as the start frame, which the model uses to predict and generate future-forward frames, while the other is used to synthesize past-backward frames. Through regression, the network learns to generate the missing intermediate frames from both directions simultaneously.

Its bidirectional encoder processes the start and end frame inputs to generate latent frame representations. Several convolutional layers, each followed by a ReLU, make up its forward and reverse encoders.

Meanwhile, its single decoder takes these feature representations as input to predict and synthesize the multiple missing intermediate frames. Up-convolutions and ReLUs make up the decoder layers, which produce a feature map as output. The size of the predicted intermediate frames is L × W × H × C (the number of synthesized frames × their width × height × number of channels).
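
A minimal PyTorch sketch of this layout is shown below: one encoder reads the start frame, another reads the end frame, their latent representations are concatenated, and a single decoder of up-convolutions predicts all intermediate frames at once. The frame size it expects, the channel widths, and the number of intermediate frames are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn


class TinyBiPN(nn.Module):
    """Bidirectional predictive network sketch: two convolutional encoders
    (start frame and end frame) feed one up-convolutional decoder that
    predicts all intermediate frames at once."""

    def __init__(self, intermediate_frames=3):
        super().__init__()

        def encoder():
            # Convolutional encoder with ReLUs, downsampling by a factor of 4
            return nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            )

        self.forward_encoder = encoder()  # encodes the start frame
        self.reverse_encoder = encoder()  # encodes the end frame
        self.decoder = nn.Sequential(     # up-convolutions with ReLUs
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3 * intermediate_frames, 4, stride=2, padding=1),
        )
        self.intermediate_frames = intermediate_frames

    def forward(self, start_frame, end_frame):
        latent = torch.cat(
            (self.forward_encoder(start_frame), self.reverse_encoder(end_frame)), dim=1
        )
        frames = self.decoder(latent)
        b, _, h, w = frames.shape
        # Reshape to (batch, L, C, H, W): one synthesized frame per time step
        return frames.view(b, self.intermediate_frames, 3, h, w)
```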

5. PhaseNet

This method estimates the missing intermediate frame's phase decomposition values. Basically, the architecture combines a learning framework with a phase-based interpolation technique.

Many modern video interpolation methods require precise dense correspondences to generate intermediate frames. This network, by contrast, is a decoder-only model designed to progressively increase the resolution of its output at each level, using the corresponding decomposition values. With the exception of the lowest level, the structure of these levels is identical, and each level receives relevant information from the previous one.

Each resolution level contains a PhaseNet block, which processes the input frames' decomposition values for that level, together with the resized feature maps and resized predicted values from the previous resolution level. These inputs are fed to 2 convolution layers, each followed by batch normalization and a ReLU non-linearity.

Each convolution produces a set of 64 feature maps. Each PhaseNet block estimates the intermediate frame's decomposition values by passing its output feature maps through a 1×1 convolution layer followed by a hyperbolic tangent function. Once the network has estimated these output values, the synthesized intermediate frame is reconstructed from its predicted decomposition values.
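
The per-level block described above can be sketched in PyTorch as follows: the level's decomposition values, resized feature maps, and resized predictions are concatenated, passed through two convolutions with batch normalization and ReLU (64 feature maps each), and the block's prediction comes from a 1×1 convolution followed by a hyperbolic tangent. The channel counts depend on the phase decomposition used and are assumptions here.

```python
import torch
import torch.nn as nn


class PhaseNetBlock(nn.Module):
    """One resolution level of a PhaseNet-style decoder, as described above."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Two convolutions, each producing 64 feature maps, with batch
        # normalization followed by ReLU non-linearity
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        # A 1x1 convolution and a hyperbolic tangent predict this level's
        # decomposition values for the intermediate frame
        self.predict = nn.Sequential(
            nn.Conv2d(64, out_channels, kernel_size=1), nn.Tanh()
        )

    def forward(self, decomposition, resized_features, resized_prediction):
        # Concatenate the level's inputs along the channel dimension
        x = torch.cat((decomposition, resized_features, resized_prediction), dim=1)
        features = self.features(x)
        return features, self.predict(features)
```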

How to Do Video Annotation for Deep Learning

There are many strategies for annotating video datasets. One is to use a multi-purpose video annotation tool, which can make it quicker and less costly to complete your video annotation requirements.

These tools are often packaged as standalone programs for computers running Microsoft Windows, a variety of Linux distros, or macOS. Others are designed as SaaS (software as a service) platforms, accessible through modern web browsers such as Google Chrome, Mozilla Firefox, Microsoft Edge, and Apple Safari. However, most standalone video annotation tools also come with SaaS features or cloud-based, server-side functionality.

Many of these video annotation tools include features to quickly cut a long video into shorter clips of your desired length. Automatic ways of breaking these clips down into static frames are also usually available, along with quick and easy ways to label and annotate each frame.
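
If you prefer to script this step yourself, cutting a long video into fixed-length clips can also be done with ffmpeg, as in the sketch below; the file names, clip length, and total duration are placeholders, and the ffmpeg binary is assumed to be installed.

```python
import subprocess


def split_into_clips(source, clip_seconds=60, total_seconds=3600, prefix="clip"):
    """Cut `source` into consecutive fixed-length clips named clip_0000.mp4, ..."""
    for i, start in enumerate(range(0, total_seconds, clip_seconds)):
        subprocess.run(
            [
                "ffmpeg", "-ss", str(start), "-i", source,
                "-t", str(clip_seconds),
                "-c", "copy",  # stream copy is fast but cuts on keyframes
                f"{prefix}_{i:04d}.mp4",
            ],
            check=True,
        )
```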

Outsourcing your video annotation projects is another cost-effective solution. Many companies offer video annotation services, with teams of project managers, quality assurance specialists, and in-house or remote crowdsourced agents. Most also have their own video annotation tools and platforms. Plus, keep in mind that some of these providers can recommend the most suitable methods for your video annotation tasks, especially if they specialize in techniques that fit your requirements.

You can also outsource your video annotation tasks to a group of independent freelancers. There are plenty of micro-job platforms that let you build a crowdsourced team of remote workers. This is often less expensive than outsourcing your video annotation projects to a company, especially since each freelancer on your remote team can be paid per completed video clip.

However, you'll be responsible for managing and training each agent on your crowdsourced team, delegating tasks, monitoring their activities, and ensuring the quality of their work. Kili Technology offers a video annotation tool that makes all of this more straightforward, along with features such as automatic video frame labelling and interpolation.