Finding Open Datasets
The best way to build a skill is to practise it, and data analytics is no different. That means you need to find datasets to analyse, and luckily there are plenty of open datasets available. There are always more ways to find datasets, but here are a few places to get inspiration.
I’m looking for a particular dataset
Google has a Dataset Search, which is a great place to start.
I’m looking for inspiration
- Data in Brief is an open-access journal that publishes datasets along with their descriptions
- Kaggle hosts many datasets, including its competition datasets, which are great for benchmarking machine learning techniques; many have notebooks showing exploration of the data
- TidyTuesday publishes a dataset each week to explore, and you can look at other people’s explorations
- Jeremy Singer-Vine’s Data Is Plural shares interesting datasets every week
- Reddit’s Datasets subreddit sometimes has interesting data
- Governments often have open data portals, for example Australia, Korea, United Kingdom, EU, United States, Canada, and New Zealand. Also look for regional data portals; in Australia, for example, there is one for each state and territory (e.g. Victoria, Queensland)
I want data about my community
As well as running open data portals, governments run censuses and track macroeconomic and social indicators, agricultural and environmental data, and so on, any of which may or may not be on the portal. Statistics agencies, such as the Australian Bureau of Statistics (ABS) and the UK Office for National Statistics, often have a lot of useful aggregate information (although with the ABS it takes some skill to find it).
OpenStreetMap has a lot of geographical data, with varying levels of detail about structures. In Australia, the G-NAF contains address data that’s not in OpenStreetMap.
I want big data
- Papers with Code Datasets has many of the common benchmark machine learning datasets
- Google BigQuery Datasets has some large datasets that can be queried with BigQuery
- AWS Open Data Registry
- Common Crawl contains a ton of open web crawl data, and has good resources for navigating it
- The Internet Archive contains tons of resources, including the Wayback Machine for web crawl data
I want something special
You can always collect or build your own dataset. If you’ve got an actual problem you’re trying to solve, this is often the best way. It gives you experience not only in analysing a dataset, but also in collecting and processing data, both of which are very useful skills to understand and practise.
The web contains a ton of public data that can be processed into datasets. For example, I built a job ad dataset from Common Crawl. You could further annotate data like this to create your own dataset.
Another good method is to collect your own data and, if it doesn’t contain any potentially damaging information, share it as an open dataset. For example, in Victoria there’s a way to get your own energy usage data and analyse it. Or you could analyse your email data. Or you could stick some sensors in your garden and record measurements that link to your plants’ growth. Or you could run a survey on a topic of interest to you.
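To make the energy usage idea a little more concrete, here is a minimal sketch of a first analysis step: rolling interval readings up into daily totals. The file layout is an assumption for illustration only (a `timestamp` column in ISO 8601 format and a `kwh` column); a real export from your provider will probably use different column names and formats, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up sample in a hypothetical export format: one row per interval reading.
SAMPLE_CSV = """timestamp,kwh
2024-01-01T00:00:00,0.25
2024-01-01T00:30:00,0.50
2024-01-02T00:00:00,0.75
"""


def daily_totals(csv_file):
    """Sum the kWh readings for each calendar day.

    `csv_file` is any open text file (or file-like object) with the
    assumed `timestamp` and `kwh` columns.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        day = datetime.fromisoformat(row["timestamp"]).date()
        totals[day] += float(row["kwh"])
    return dict(totals)


if __name__ == "__main__":
    totals = daily_totals(io.StringIO(SAMPLE_CSV))
    for day, kwh in sorted(totals.items()):
        print(day.isoformat(), round(kwh, 3))
```

To use it on your own data, you would pass an opened file instead of the sample, for example `daily_totals(open("usage.csv", newline=""))`, where `usage.csv` is a placeholder name, and adjust the column names to match your export.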