Machines learn a lot like people do: by observing the results of various inputs and building knowledge from those observations. If we start with insufficient or homogeneous data, we won’t learn effectively, and neither will a machine. Imagine growing up watching only two teams play baseball, where every player on both teams happens to be exactly 5’11”. Those two teams are your data set, so if you were asked to describe a baseball player, you’d likely assert that a baseball player must always be 5’11”. Of course, we know this to be untrue. A situation like this may feel unlikely for a person, but in Machine Learning it is a very real risk that must be considered.
The Critical Role of Diverse Data Sets
To get better results, we need better data to train on. A small or insufficiently diverse data set might not be enough to train the model properly. For example, if you were to train a model to determine whether a shape is a triangle, circle, or square, and all of your square examples were oriented vertically on a blue background, your trained model might confidently misclassify an angled square as a triangle, or a circle on a blue background as a square. The more you can trust your model, the more useful it will be.
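One common way to hedge against this kind of bias is data augmentation: synthetically varying the properties your examples happen to share, such as orientation and background color. Below is a minimal sketch, assuming PyTorch and torchvision are installed; the transform values are illustrative, not tuned.

```python
# A minimal augmentation sketch (assumes torchvision is installed) so a model
# can't latch onto "squares are always vertical" or "always on blue".
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomRotation(degrees=45),            # vary orientation
    T.ColorJitter(brightness=0.4, hue=0.3),  # vary color/background appearance
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```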
“More data beats clever algorithms, but better data beats more data.”
Peter Norvig
The Wrong Ways to Source Data
Before we talk about the right ways to get data for ML training, let’s talk about the wrong ways. Many people are tempted to scrape data as quickly as possible from public websites, either because it’s free and can target very specific parameters or because they are unaware of the alternatives. Before going down that path, take a few things into consideration:
- Would your scraping efforts create a burden for the servers of the sites being scraped?
- Are you likely to get reliable data and enough of it?
- Are there any legal or privacy concerns?
- Do those sites have the data already available via other means?
These are all potential quality, legal, or ethical points of friction when gathering large amounts of data for Machine Learning.
Thankfully, there are several legitimate ways to obtain large amounts of relevant data for training and testing. These can be free or expensive, easy or time-consuming, high quality or questionable, or anywhere in between. The method you choose will depend on your needs.
Free Data Sources
The first thing you should do when in need of a large data set is look for an existing one that meets your needs. Sites like these have thousands of free data sets available for use:
- Kaggle (kaggle.com)
- UCI Machine Learning Repository (archive.ics.uci.edu)
- data.gov
These data sets range from bare-minimum effort to carefully curated and documented, so you’ll want to take the time to search for the right fit and examine the data to be sure it has the parameters you’ll need.
In addition to the sites mentioned, you can try Google’s Dataset Search, or check domain-specific resources: your local government site for localized data, medical organizations for medical data, or eBay’s data API for consumer sales data. The perfect data set may not be freely available for download, but there’s a good chance it is, and finding it will save you a lot of time, money, or both.
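As a quick illustration of examining a free data set before committing to it, here is a minimal sketch assuming scikit-learn is installed; the “titanic” data set is just a convenient, well-known example hosted on OpenML.

```python
# A minimal sketch (assumes scikit-learn is installed) of fetching a public
# data set from OpenML and inspecting it before building anything on top of it.
from sklearn.datasets import fetch_openml

# "titanic" is simply a well-known example data set available on OpenML.
titanic = fetch_openml("titanic", version=1, as_frame=True)

print(titanic.frame.columns)       # confirm the parameters you need are present
print(titanic.frame.isna().sum())  # check for missing values before training
```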
Paid Data Sources
When the right free data set is not available, another option is to purchase one. Some companies specialize in curating large data sets that they make available for a fee. Pricing is often per record or per 1,000 records, but varies with the company, the data, and the number of records you need. If you choose this route, be sure to get sample data and know how much data you need and what it will cost, including any recurring data needs, before making a purchase. This method can be expensive, but it is a quick way to get started when free data is not an option.
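To make the budgeting concrete, here is a trivial sketch of the per-1,000-records math; the rate and volumes below are made-up placeholders, not real vendor prices.

```python
# Hypothetical per-1,000-record cost estimate; all numbers are placeholders,
# not real vendor pricing.
def estimate_cost(records: int, price_per_1000: float) -> float:
    return records / 1000 * price_per_1000

initial = estimate_cost(250_000, price_per_1000=5.00)  # one-time purchase
monthly = estimate_cost(10_000, price_per_1000=5.00)   # recurring refresh
print(f"Up front: ${initial:,.2f}; recurring: ${monthly:,.2f}/month")
```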
User-generated Data
One method many people don’t consider is having your users generate data for you. In fact, you’re already doing this for companies every day just by being online. Every time someone participates in the well-known “select all the images of a bicycle” style Google reCAPTCHA, all of the data from that interaction is recorded and used to further develop that tool. Social media interactions are accumulated and used to train algorithms to suggest people to follow or posts to view. Tons of data is generated every hour by just using websites and software.
Build data collection into your site or application to make the most of this approach. When using this method, be transparent with your users about what data is collected and how it is used. You will also need to carefully consider what counts as private information and take steps to anonymize anything that could create privacy problems for your users. Protecting your users’ private information is not only the right thing to do, but may also be legally required depending on where they live.
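As one small example of that anonymization step, here is a minimal sketch of pseudonymizing user identifiers with a salted hash before they enter a training set. The salt handling here is illustrative only; a real deployment needs a proper secret-management strategy.

```python
# A minimal sketch of pseudonymizing user identifiers before storing events
# for training. The salt-via-environment-variable setup is illustrative only.
import hashlib
import os

SALT = os.environ.get("ANON_SALT", "demo-salt-change-me")  # placeholder salt

def pseudonymize(user_id: str) -> str:
    """Replace a raw identifier with a stable, non-reversible token."""
    return hashlib.sha256((SALT + user_id).encode("utf-8")).hexdigest()

event = {"user": pseudonymize("alice@example.com"), "action": "followed_suggestion"}
```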
Creating Data Yourself
When needed, you could also create the data yourself, keeping in mind the time and effort involved. For example, if you wanted to create a coin classification model, you could ask friends, family, and colleagues to take photos of their spare change, then use a tool like Roboflow to manually label each coin in each image. This manual process, while tedious and slow, is a great way to be sure you have a quality data set you can trust. Similarly, you might rely on crowd-sourced efforts from Amazon Mechanical Turk (mturk.com) or a similar site to do the heavy lifting for you.
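If you go the do-it-yourself route, even a tiny script beats labeling in a spreadsheet by hand. Below is a bare-bones sketch that walks a folder of photos and records a label for each; the folder name and label set are placeholders for the coin example.

```python
# A bare-bones manual labeling pass: prompt for a label for each photo and
# write the results to a CSV. "coin_photos" and the label set are placeholders.
import csv
import pathlib

rows = []
for path in sorted(pathlib.Path("coin_photos").glob("*.jpg")):
    label = input(f"Label for {path.name} (penny/nickel/dime/quarter): ").strip()
    rows.append({"file": path.name, "label": label})

with open("coin_labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "label"])
    writer.writeheader()
    writer.writerows(rows)
```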
Web Scraping: A Last Resort
When all other options are unavailable or infeasible, scraping is a method that may work, but it should be approached with caution. As mentioned at the beginning of this article, scraping introduces several concerns. Beyond those, it’s not always a straightforward process and may even get you blocked from a service or website. Avoid this approach until the other options are exhausted, and if you do decide to scrape data, do so legally and ethically.
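If scraping really is the only path, at a minimum honor robots.txt and throttle your requests. Here is a minimal sketch, assuming the requests library is installed; example.com and the URL list are placeholders, not real targets.

```python
# A minimal "polite scraper" sketch: respect robots.txt and rate-limit.
# Assumes the requests library is installed; example.com is a placeholder.
import time
import urllib.robotparser

import requests

USER_AGENT = "MyResearchBot/1.0"  # identify yourself honestly

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

for url in ["https://example.com/page1", "https://example.com/page2"]:
    if not robots.can_fetch(USER_AGENT, url):
        continue  # the site disallows crawling this path; honor that
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    # ...parse response.text here...
    time.sleep(2)  # throttle so you don't burden the server
```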
Spending the time up front to source data the right way can save you a lot of time and headache later. You’ll have the right data, which helps you avoid doing the same work multiple times, and sourcing it appropriately helps you steer clear of legal and ethical pitfalls along the way.
Final Thoughts: Data’s Role in ML
In conclusion, sourcing data wisely is paramount for effective Machine Learning: it ensures a diverse, high-quality data set while steering you clear of legal and ethical pitfalls. The foundation of successful Machine Learning lies in the quality of your data.