The purpose of data annotation.
Machine learning is a core part of AI: it allows machines to perform specific tasks through training, and with annotated data it can learn about almost anything. Machine learning techniques fall into four types: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
▸Supervised Learning: Supervised learning learns from a set of labeled data: the algorithm predicts the outcome of new data based on previously labeled examples.
▸Unsupervised Learning: In unsupervised machine learning, training is based on unlabeled data: the algorithm does not know the outcome or label of the input data.
▸Semi-Supervised Learning: The AI will learn from a dataset that is partly labeled. This is the combination of the two types above.
▸Reinforcement Learning: Reinforcement learning is the algorithm that helps a system determine its behavior so as to maximize a reward. It is currently applied mainly to game playing, where the algorithm must determine the next move to achieve the highest score.
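To make the difference between the two most common types concrete, here is a toy sketch, not tied to any library: a supervised 1-nearest-neighbour classifier that uses labels, and an unsupervised grouping that never sees a label. The data points and threshold are invented for illustration:

```python
# Toy supervised learning: a 1-nearest-neighbour classifier that
# predicts the label of a new point from previously labeled data.
labeled_data = [(1.0, "cat"), (1.2, "cat"), (8.0, "dog"), (8.5, "dog")]

def predict(x):
    # Pick the label of the closest labeled example.
    nearest = min(labeled_data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Toy unsupervised learning: group unlabeled points by distance only;
# no label is known for any input.
unlabeled = [1.1, 7.9, 1.3, 8.2]

def group(points, threshold=3.0):
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= threshold:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

print(predict(1.5))      # the nearest labeled example is a "cat"
print(group(unlabeled))  # two clusters emerge without any labels
```

The supervised model can only work because someone annotated the training pairs; the unsupervised one discovers structure but cannot name it.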
Although there are four types of techniques, the most frequently used are supervised and unsupervised learning. The picture below shows how they work, following Booz Allen Hamilton's description:
What is labeled data?
Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. Labeled data helps a machine learning model "learn" the patterns in the input data and then make predictions on new datasets.
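In code, labeling can be pictured as augmenting raw samples with informative tags. The file names and labels below are invented for illustration:

```python
# Unlabeled data: just the raw samples, with no meaning attached.
unlabeled = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]

# Labeling augments each sample with an informative tag.
labels = ["cat", "dog", "cat"]
labeled = list(zip(unlabeled, labels))

# Each (sample, tag) pair is one piece of labeled data a model can learn from.
print(labeled)
```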
How is data annotation done?
Step 1: Data Collection
Data collection is the process of gathering and measuring information from countless different sources. To use the data we collect to develop practical artificial intelligence (AI) and machine learning solutions, it must be collected and stored in a way that makes sense for the business problem at hand.
There are several ways to find data. For classification tasks, you can use the class names as keywords and crawl the Internet for matching images. You can also gather photos and videos from social networking sites, satellite images from Google, or data collected freely from public cameras or cars (Waymo, Tesla); you can even buy data from third parties (paying close attention to its accuracy). Common datasets are also available for free, such as Common Objects in Context (COCO), ImageNet, and Google's Open Images.
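For a classification task, the keyword-based collection idea above can be sketched as generating search queries from class names. The class list and query templates here are made up for the example:

```python
# Hypothetical class names for a classification problem.
class_names = ["golden retriever", "tabby cat", "parrot"]

# Build crawl queries from the class names, e.g. to feed an image
# search or crawling tool.
templates = ["{} photo", "{} close-up"]
queries = [t.format(name) for name in class_names for t in templates]

print(queries[:2])
```

Queries built this way double as the eventual class labels, which is why L classification labels often fall straight out of the collection step.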
Some common data types are Image, Video, Text, Audio, and 3D sensor data.
- Image: Typically photographs of people, objects, or animals.
- Video: Footage recorded from CCTV or cameras, usually divided into scenes.
- Text: Documents of various types, containing numbers and words, possibly in multiple languages.
- Audio: Sound recordings from people of different demographics.
- 3D Sensor data: 3D models generated by sensor devices.
Step 2: Identify the problem
Knowing which problem you are dealing with will help you decide the techniques to use on the input data. In computer vision, common tasks include:
- Image classification: Collect and classify the input data by assigning a class label to an image.
- Object detection & localization: Detect and locate the presence of objects in an image and indicate their location with a bounding box, point, line, or polyline.
- Object instance / semantic segmentation: In semantic segmentation, you label each pixel with a class, covering both objects (car, person, dog, etc.) and non-objects (water, sky, road, etc.). Polygon and masking tools can be used for segmentation.
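The three task types above imply different annotation structures. A minimal sketch follows; the field names and values are illustrative, not any tool's official format:

```python
# Image classification: one class label per image.
classification = {"image": "img_001.jpg", "label": "dog"}

# Object detection & localization: each object gets a class and a
# bounding box (x, y, width, height) in pixels.
detection = {
    "image": "img_001.jpg",
    "objects": [{"label": "dog", "bbox": [34, 50, 120, 80]}],
}

# Semantic segmentation: every pixel gets a class; here a tiny 2x3
# mask where 0 = road (non-object) and 1 = car (object).
segmentation_mask = [
    [0, 1, 1],
    [0, 0, 1],
]

print(classification["label"])
print(len(detection["objects"]))
```

Note how the annotation cost grows from one label per image, to one box per object, to one label per pixel.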
Step 3: Data Annotation
After identifying the problem, you can label the data accordingly. For a classification task, the labels are the keywords used when finding and crawling the data. For a segmentation task, there should be a label for each pixel of the image. Once you have the labels, you need tools to perform image annotation (i.e., to attach labels and metadata to images). Popular tools include Comma Coloring, Annotorious, and LabelMe. You can refer to some common data annotation tools and their features in our infographic here.
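As an example of what tool output looks like, LabelMe saves annotations as JSON. The snippet below parses a LabelMe-style file with the standard library; the exact field names can vary between versions, so treat this layout as an assumption:

```python
import json

# A LabelMe-style annotation: each shape carries a class label and
# the polygon points outlining the object.
raw = """
{
  "imagePath": "img_001.jpg",
  "shapes": [
    {"label": "car", "shape_type": "polygon",
     "points": [[10, 10], [60, 10], [60, 40], [10, 40]]}
  ]
}
"""

annotation = json.loads(raw)
for shape in annotation["shapes"]:
    # Report the label and how many polygon vertices it has.
    print(shape["label"], len(shape["points"]))
```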
However, this approach is manual and time-consuming. A faster alternative is to use algorithms such as Polygon-RNN++ or Deep Extreme Cut. Polygon-RNN++ takes the object in the image as input and outputs the polygon points surrounding the object to create segments, which makes labeling more convenient. Deep Extreme Cut works on a similar principle, but takes up to four extreme points on the object as its input.
It is also possible to label data with the "transfer learning" method, using models pre-trained on large-scale datasets such as ImageNet or Open Images. Since these models have learned features from millions of different images, their accuracy is fairly high. Based on them, you can find and label each object in an image. Note that the pre-trained model's source data should be similar to your collected dataset for feature extraction or fine-tuning to work well.
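The transfer-learning idea can be sketched without any deep-learning library. Here `pretrained_features` is a purely hypothetical stand-in for a frozen pre-trained network's feature extractor, and new images are labeled by their nearest already-labeled image in feature space:

```python
def pretrained_features(image):
    # Stand-in for a frozen pretrained network's feature extractor;
    # a real one would map an image to a high-dimensional vector.
    return [float(sum(image)), float(max(image))]

# A few images we already labeled by hand (pixel lists are invented).
reference = [
    ([1, 2, 3], "cat"),
    ([9, 9, 8], "dog"),
]

def auto_label(image):
    # Label the new image with the class of the nearest reference
    # image in feature space (squared Euclidean distance).
    feats = pretrained_features(image)
    def dist(other):
        ref = pretrained_features(other[0])
        return sum((a - b) ** 2 for a, b in zip(feats, ref))
    return min(reference, key=dist)[1]

print(auto_label([1, 2, 4]))
```

The quality of the auto-labels depends entirely on how well the pretrained features transfer to your data, which is why the source dataset should resemble your own.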
Who can annotate data?
Data annotators are the people in charge of labeling the data. There are several ways to source them:
Your own data scientists and AI researchers can label the data. The advantages of this approach are easy management and a high accuracy rate. However, it wastes human resources, since data scientists must spend considerable time and effort on a manual, repetitive task.
You can hire a third party, a company that provides data annotation services. Although this option costs your team less time and effort, you need to ensure that the company commits to delivering transparent and accurate data.
Alternatively, you can use online workforce platforms such as Amazon Mechanical Turk or Crowdflower, which recruit online workers around the world to do data annotation. However, accuracy and dataset organization are issues you need to consider when purchasing this service.
The data annotation guide described here is basic and straightforward. To build machine learning systems, besides the data scientists who set up the infrastructure and scale for complex machine learning tasks, you still need data annotators to label the input data. Lotus Quality Assurance provides professional data annotation services in different domains. With our quality review process, we are committed to bringing you a high-quality and secure service. Contact us for further support!