What is the best way to collect Datasets for Annotation?

Data is the foundation of every AI project, and there are different ways to prepare datasets, including collecting them from the internet or consulting an agency. So, what is the best way to get raw data for the AI data training process?

One suggested way to collect training and test data is to visit open labeled resources such as Google’s Open Images, mldata.org, or the many other websites that provide datasets for ML projects. These platforms supply a large volume of data (mostly in the form of images) to start your training process.
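Once a labeled dataset has been downloaded, it still has to be divided into training and test portions. The following is a minimal sketch of one way to do that, using only the Python standard library. The function name `train_test_split`, the 20% test fraction, the fixed seed, and the sample file names are all illustrative assumptions, not something prescribed by any of the platforms mentioned here.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (file name, label) pairs standing in for a downloaded dataset.
samples = [(f"img_{i:03d}.jpg", "cat" if i % 2 else "dog") for i in range(10)]
train, test = train_test_split(samples, test_fraction=0.2)
print(len(train), len(test))
```

Keeping the seed fixed means the same items land in the test set every time, which makes later evaluation results comparable.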

Depending on what kind of datasets you’re looking for, you can divide these sources into the following categories:

Open Dataset Aggregators
Public government Datasets for machine learning
Machine Learning Datasets for finance & economics
Image datasets for computer vision

For a high-quality machine learning / artificial intelligence project, the training datasets are the top priority, because they define the outcome of the project. To find qualified and suitable datasets, you can narrow your search using the categories below.

Open Dataset Aggregators

The most common thing that you might be looking for when working on machine learning / artificial intelligence is a source of free datasets. Open dataset aggregators let you browse a wide variety of niche-specific datasets for your data science projects. You can find them in:

1. Kaggle: A data science community with tools and resources, including externally contributed machine learning datasets of all kinds. Covering health, sports, food, travel, education, and more, Kaggle is one of the best places to look for quality training data.
2. Google Dataset Search: A search engine from Google that helps researchers locate freely available online data. It works similarly to Google Scholar, and it contains over 25 million datasets. Here you can find economic and financial data, as well as datasets uploaded by organizations like WHO, Statista, or Harvard.
3. OpenML: An online machine learning platform for sharing and organizing data, with more than 21,000 datasets. It is regularly updated, and it automatically versions and analyzes each dataset and annotates it with rich metadata to streamline analysis.

Public government Datasets

For machine learning projects concerning social matters, public government datasets are very important. You can find useful datasets in the following sources:

4. EU Open Data Portal: The point of access to public data published by the EU institutions, agencies, and other entities. It contains data related to economics, agriculture, education, employment, climate, finance, science, etc.
5. World Bank: Open data from the World Bank that you can access without registration. It contains population demographics, macroeconomic data, and key indicators for development, which makes it a great source for large-scale data analysis.

Machine Learning Datasets for finance & economics

The use of machine learning / artificial intelligence in finance & economics has long been very promising, with wide adoption in algorithmic trading, stock market prediction, portfolio management, and fraud detection. A large quantity of data is available for this field thanks to datasets built up over many years. You can find easily accessible datasets for finance & economics here:

6. Global Financial Development (GFD): An extensive dataset of financial system characteristics for 214 economies around the world. It contains annual data collected since 1960.
7. IMF Data: The International Monetary Fund publishes data on IMF lending, exchange rates, and other economic and financial indicators.

Image datasets for computer vision

Medical imaging and self-driving cars are becoming more popular these days. With high-quality visual training datasets, these technologies can be applied more effectively. You can find the sources here:

8. Visual Genome: A large and detailed dataset and knowledge base with captioning of over 100,000 images.
9. Google’s Open Images: A collection of over 9 million varied images with rich annotations. It contains image-level label annotations, object bounding boxes, object segmentation, and visual relationships across 6,000 categories. This large image database is a great source of data for any data science project.
10. YouTube-8M: A vast dataset of millions of YouTube video IDs with high-quality machine-generated annotations of more than 3,800 visual entities. This dataset comes with pre-computed audio-visual features from billions of frames and audio segments.
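Bounding-box annotations such as those mentioned for Open Images usually need a small conversion step before they can be drawn or fed to a model. The sketch below assumes the boxes arrive as CSV rows with normalized (0 to 1) `XMin`, `XMax`, `YMin`, `YMax` columns keyed by an `ImageID`; that layout, the sample rows, the placeholder label IDs, and the image sizes are assumptions for illustration, so check the actual file format against the dataset's documentation.

```python
import csv
import io

# Made-up rows in the assumed column layout; label IDs are placeholders.
SAMPLE_CSV = """ImageID,LabelName,XMin,XMax,YMin,YMax
img001,/m/example1,0.10,0.50,0.20,0.80
img002,/m/example2,0.00,1.00,0.25,0.75
"""


def to_pixel_box(row, width, height):
    """Convert normalized corners to integer pixel corners (x1, y1, x2, y2)."""
    return (
        round(float(row["XMin"]) * width),
        round(float(row["YMin"]) * height),
        round(float(row["XMax"]) * width),
        round(float(row["YMax"]) * height),
    )


def load_boxes(csv_text, image_sizes):
    """Return (image_id, label, pixel_box) tuples for every annotation row."""
    boxes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        width, height = image_sizes[row["ImageID"]]
        boxes.append(
            (row["ImageID"], row["LabelName"], to_pixel_box(row, width, height))
        )
    return boxes


# Hypothetical (width, height) of each image, in pixels.
image_sizes = {"img001": (640, 480), "img002": (1024, 768)}
boxes = load_boxes(SAMPLE_CSV, image_sizes)
for image_id, label, box in boxes:
    print(image_id, label, box)
```

Storing coordinates in normalized form keeps annotations valid when images are resized, which is why the pixel conversion is done only at the point of use.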

Finding suitable datasets for machine learning / AI is never easy. Besides the 4 categories mentioned above, there are also Natural Language Processing datasets, audio, speech, and music datasets, and data visualization datasets. You can check out other free sources of datasets for machine learning in V7’s 65+ Best Free Datasets for Machine Learning.

However, the downside is that the quality of these open sources is not guaranteed, so if your team accidentally gathers wrong data, your ML project will be affected badly, reducing the accuracy that end-users experience. Also, collecting data from unknown sources will cost you a great deal of time, as it requires a lot of manual work to check and clean.
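Part of that manual checking can be automated before any annotation starts. The following is a minimal sketch, assuming each collected item is available as an (item ID, raw bytes, label) record; the function name `audit_records`, the record layout, and the sample data are hypothetical. It flags byte-identical duplicates and labels that fall outside an agreed label set.

```python
import hashlib
from collections import Counter


def audit_records(records, allowed_labels):
    """Report duplicate content and unexpected labels in collected records."""
    first_seen = {}  # content hash -> ID of the first item with that content
    duplicates, bad_labels = [], []
    label_counts = Counter()
    for item_id, content, label in records:
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            duplicates.append((item_id, first_seen[digest]))
        else:
            first_seen[digest] = item_id
        if label in allowed_labels:
            label_counts[label] += 1
        else:
            bad_labels.append((item_id, label))
    return {
        "duplicates": duplicates,
        "bad_labels": bad_labels,
        "label_counts": dict(label_counts),
    }


# Hypothetical collected items: (item ID, raw file bytes, label).
records = [
    ("a.jpg", b"AAA", "cat"),
    ("b.jpg", b"BBB", "dog"),
    ("c.jpg", b"AAA", "cat"),  # same bytes as a.jpg
    ("d.jpg", b"CCC", ""),     # missing label
]
report = audit_records(records, allowed_labels={"cat", "dog"})
print(report)
```

A check like this only catches exact duplicates and obviously invalid labels; judging whether a label is actually correct still requires human review.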

So, the optimal strategy for getting high-quality data for labelling is to outsource to a professional vendor with extensive experience in providing data collection services for AI-based projects.

Lotus Quality Assurance is an expert in both data collection and annotation services. The datasets that Lotus Quality Assurance collects, including but not limited to images from reliable sources on the Internet and video and sound recorded for specific scenes, are delivered with the best quality and accuracy.