The basic steps of data annotation for deep learning are as follows:
1. Collect Data Set
First, we need to understand the problem and its business value so that we can find suitable training data. For classification problems, the class names can serve as search keywords for crawling images from the Internet. Other sources include photos and videos from social network sites, satellite images on Google, freely available data collected from public cameras or cars (Waymo, Tesla), or data bought from third parties (in which case the accuracy of the data should be checked).
Note: After collecting data, pre-processing is necessary because most collected data is raw, with varying height, width, and aspect ratio, so it cannot be fed directly into deep learning models. Libraries such as OpenCV and scikit-image are commonly used to pre-process the images.
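As a rough illustration of the size problem described in the note, the sketch below computes how an image of arbitrary width and height could be scaled and padded to a fixed square input while keeping its aspect ratio. The function name `letterbox_geometry` and the target size of 224 are assumptions chosen for this example, not something prescribed by any particular model. The actual pixel resampling would still be done with a library such as OpenCV or scikit-image.

```python
def letterbox_geometry(width, height, target=224):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    The longer side is scaled to `target`; the shorter side is scaled by the
    same factor and then centered with padding on both sides.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x = (target - new_w) // 2  # left padding in pixels
    pad_y = (target - new_h) // 2  # top padding in pixels
    return new_w, new_h, pad_x, pad_y


# Hypothetical raw image sizes, as they might come out of a crawl.
raw_sizes = [(1920, 1080), (640, 480), (300, 900)]
for w, h in raw_sizes:
    print((w, h), "->", letterbox_geometry(w, h))
```

Padding instead of stretching is only one possible choice; cropping or plain resizing may suit other data sets better.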
2. Labeling Data Set
Most of the problems to be solved are supervised learning problems, so the collected data must be labeled.
This is an important step because label quality largely determines whether our model works well or not. Incorrect labels cause the model to predict and be evaluated incorrectly, which wastes a lot of training time and effort. There are two points to pay attention to:
・ How to label data?
・ Who will label the data?
2.1. How to label data?
After finding a data set that meets the requirements, we need to determine what type of annotation it needs, for example classification, object detection, or segmentation.
We can then label the data accordingly. For classification, the labels are the keywords used when finding and crawling data from the Internet. For instance segmentation, a label is needed for each pixel of the image.
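To make the difference between annotation types concrete, here is a minimal sketch of what a label could look like for classification, object detection, and instance segmentation. The class names `ClassificationLabel` and `BoxLabel` and the helper `instance_mask` are hypothetical, invented for this example. The mask is filled from rectangles only to keep the sketch short; real segmentation masks follow the object outline.

```python
from dataclasses import dataclass


@dataclass
class ClassificationLabel:
    """One label for the whole image, e.g. the keyword used while crawling."""
    image: str
    class_name: str


@dataclass
class BoxLabel:
    """Object detection: a class plus a bounding box (x_max/y_max exclusive)."""
    image: str
    class_name: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def instance_mask(width, height, boxes):
    """Instance segmentation: one value per pixel.

    0 means background; 1, 2, ... identify the individual object instances.
    """
    mask = [[0] * width for _ in range(height)]
    for instance_id, box in enumerate(boxes, start=1):
        for y in range(box.y_min, box.y_max):
            for x in range(box.x_min, box.x_max):
                mask[y][x] = instance_id
    return mask


whole_image = ClassificationLabel("a.jpg", "cat")
boxes = [
    BoxLabel("a.jpg", "cat", 0, 0, 2, 2),
    BoxLabel("a.jpg", "dog", 2, 1, 4, 3),
]
for row in instance_mask(4, 3, boxes):
    print(row)
```

The amount of labeling work grows in the same order: one value per image, a few numbers per object, and finally one value per pixel.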
After that, we use tools to perform image annotation (setting labels and metadata for images). Common tools include Comma Coloring, Annotorious, and LabelMe. These tools provide a GUI for labeling each segment of the image.
2.2. Who will label the data?
There are two options:
In-house: your company labels the data itself.
・Pros: the accuracy of the data is easy to control, low cost.
・Cons: collecting and labeling data takes a lot of time.
Outsourcing: third parties, such as companies that specialize in providing data for business requirements, label the data.
・Pros: data can be gathered quickly.
・Cons: the transparency and accuracy of the data must be verified, costly.
Besides, we can also use online workforce platforms such as Amazon Mechanical Turk (https://www.mturk.com/) or Crowdflower (http://www.crowdflower.com/).
In short, we ask an online community to label the data, usually for a fee. This is also how big datasets like ImageNet and Microsoft COCO were created. However, the accuracy and organization of the data are issues we need to consider.
Choose the appropriate option depending on your conditions and requirements!
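Whichever option is chosen, one common way to address the accuracy concern is to have several annotators label the same item and keep only labels with enough agreement. The sketch below shows simple majority voting. The function name `aggregate_votes`, the 0.6 agreement threshold, and the sample votes are assumptions made for illustration, not a description of how any specific platform works.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Combine labels from several annotators by majority vote.

    votes: dict mapping an item id to the list of labels it received.
    Returns (accepted, needs_review): accepted maps item id to the winning
    label; needs_review lists items whose agreement was below the threshold.
    """
    accepted = {}
    needs_review = []
    for item_id, labels in votes.items():
        if not labels:
            needs_review.append(item_id)
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            needs_review.append(item_id)
    return accepted, needs_review


# Made-up votes from three annotators per image.
votes = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["cat", "dog", "bird"],
    "img3": ["dog", "dog", "dog"],
}
accepted, needs_review = aggregate_votes(votes)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Items that end up in the review list can be sent back for another round of labeling or checked by an in-house expert.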
3. Test & Evaluate Model
Choose a suitable deep learning model -> Conduct training -> Conduct tests and assessments
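The flow above can be sketched in a few lines. This is only an outline under stated assumptions: `MajorityClassModel` is a trivial stand-in for a real deep learning model, and the helper names, the 20% test ratio, and the sample data are invented for the example. The point is the shape of the process: hold out part of the labeled data, train on the rest, and measure quality on the held-out part.

```python
import random
from collections import Counter


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the labeled samples and hold out a test portion."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


class MajorityClassModel:
    """Placeholder model: always predicts the most frequent training label."""

    def fit(self, samples):
        self.label = Counter(label for _, label in samples).most_common(1)[0][0]

    def predict(self, item):
        return self.label


# Made-up labeled data: (image name, label) pairs.
samples = [(f"cat_{i}.jpg", "cat") for i in range(8)] + [
    ("dog_0.jpg", "dog"),
    ("dog_1.jpg", "dog"),
]
train, test = train_test_split(samples)
model = MajorityClassModel()
model.fit(train)
predictions = [model.predict(image) for image, _ in test]
print("test accuracy:", accuracy(predictions, [label for _, label in test]))
```

With a real model, only the `fit` and `predict` parts change; the split and the evaluation stay the same.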
4. Satisfying Acceptable Quality
Repeat the above steps until you meet the requirements of the problem.
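As a sketch of this loop, the function below repeats a full round (collect, label, train, evaluate) until the score reaches a target or a round limit is hit. The name `annotate_until_acceptable`, the `run_round` callback, the target of 0.9, and the simulated scores are all assumptions for illustration; in practice each round would involve the real steps described above.

```python
def annotate_until_acceptable(run_round, target_accuracy, max_rounds=5):
    """Repeat annotation rounds until the model is good enough.

    run_round: callable taking the round number and returning the accuracy
    measured after that round of collecting, labeling, and training.
    Returns (reached_target, history of scores).
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        score = run_round(round_number)
        history.append(score)
        if score >= target_accuracy:
            return True, history
    return False, history


# Simulated scores standing in for real evaluation results.
simulated_scores = [0.71, 0.83, 0.92]
reached, history = annotate_until_acceptable(
    lambda n: simulated_scores[n - 1], target_accuracy=0.9, max_rounds=3
)
print(reached, history)
```

The round limit keeps the loop from running forever when the target cannot be reached with the available data.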
For more information, please refer to Annotation services
Lotus Quality Assurance (LQA)
Tel: (+84) 24-6660-7474