Curating local data for local people. In Bath.

You Know Nothing Jon Snoxw

This blog post has the best Game of Thrones/weather/air quality joke you’ll read all day. It also has some notes on what I built at the BathHacked Air Quality hackday.

Going into the hackday I realised several things.

Firstly, there was a lot of data available. BANES council had made an archive of 13 years of air quality data available. I’d also discovered that the DEFRA Air Quality data archive had data for one sensor dating from 1997. That’s a lot of data to get to grips with. I decided that I was going to focus on trying to summarise the dataset rather than build anything on top of it.

Secondly, I also didn’t really know anything about air quality. To get prepared I produced some documentation for everyone attending the hackday that summarised the main data source, and provided some useful pointers. I also did some reading around on both the BANES and UK-AIR websites.

Air quality data analysis is a complex area. There are a lot of factors to take into account, including a variety of sources of pollutants, complex interactions between pollutants, and impacts from the prevailing weather conditions. BANES publish some summary reports but, while informative, they didn’t really give me a sense of where the pollution was coming from, or how bad it was at different times of the day or year.

I also discovered the Open Air project which provides an R package to support air quality analysis. It also comes with an amazing set of documentation: the manual is over 200 pages and includes a short introduction to R.

So I read the manual and went into the hackday with the goal of trying to answer two questions:

  1. Can we provide Bath citizens with more insight into the air quality for Bath?
  2. Can we provide the local council with new ways to generate meaningful visualisations and summary reports?

I didn’t get as far as I’d hoped, but I managed to do enough to create what I think is an interesting summary of the data.

Using R and openair I was able to quickly import, normalise and explore the data. In fact R makes it so easy to generate diagrams that I spent a lot of the day just playing with graphs.
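To give a flavour of that import, normalise and explore loop, here is a minimal sketch. It is written in Python rather than R and it is not the code from the hackday report: the column names (`date`, `no2`), the sample readings and the helper names `load_readings` and `mean_by_hour` are all made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample in a made-up layout: one row per reading.
# The real BANES and DEFRA files will have different columns and units.
SAMPLE_CSV = """date,no2
2014-01-06 08:00,61.0
2014-01-06 14:00,38.5
2014-01-07 08:00,57.0
2014-01-07 14:00,
2014-01-08 08:00,65.0
2014-01-08 14:00,41.5
"""


def load_readings(handle):
    """Import and normalise: parse timestamps, drop rows with no reading."""
    readings = []
    for row in csv.DictReader(handle):
        value = row["no2"].strip()
        if not value:
            continue  # missing reading: skip it rather than treat it as zero
        timestamp = datetime.strptime(row["date"], "%Y-%m-%d %H:%M")
        readings.append((timestamp, float(value)))
    return readings


def mean_by_hour(readings):
    """Explore: average the readings for each hour of the day."""
    by_hour = defaultdict(list)
    for timestamp, value in readings:
        by_hour[timestamp.hour].append(value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


readings = load_readings(io.StringIO(SAMPLE_CSV))
hourly = mean_by_hour(readings)
for hour, value in hourly.items():
    print(f"{hour:02d}:00  {value:.1f}")
```

Grouping by hour of the day is only one possible cut. The same pattern works for month or day of the week by changing the key used in `mean_by_hour`. In R, openair's `timeVariation` function draws this kind of breakdown as a plot.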

The judges also liked the results and I was lucky enough to win the Most Educational Project prize. The report is now featured on the BANES website.

I’ve also published the code on GitHub if you’d like to explore. The main report code could easily be customised to use an alternate DEFRA location if you want to try it on some data from your local area.

There’s more to explore here, not just around the data analysis, but also around the concept of having reproducible data analytics.

Reproducibility is an important part of scientific research and analysis. At least one driver of the growing adoption of open source and open data in the research community is to make science more reproducible: it should be possible for someone else to pick up your research, easily check the results and maybe go a step further.

I’ve not yet seen this idea extended to published analyses of open (statistical) data, but the concepts are the same. Reproducibility is another way to increase transparency. Open data has been shown to help people find data errors, but open source can also help people find and fix errors in an analysis itself.
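One small way to make a published analysis easier to check is to record exactly which data it ran against. The sketch below is an illustration of that idea, not something taken from the hackday report: the input bytes, the `analyse` and `build_manifest` helpers and the manifest field names are all hypothetical.

```python
import hashlib
import json
from statistics import mean

# Hypothetical input: in practice this would be the bytes of the published
# open data file. The values here are invented for the example.
RAW_DATA = b"no2\n61.0\n57.0\n65.0\n"


def analyse(raw):
    """The analysis itself: here just the mean of a single column."""
    lines = raw.decode("utf-8").splitlines()[1:]  # skip the header row
    return round(mean(float(line) for line in lines), 2)


def build_manifest(raw):
    """Pair the result with a fingerprint of the exact data it came from."""
    return {
        "input_sha256": hashlib.sha256(raw).hexdigest(),
        "mean_no2": analyse(raw),
    }


manifest = build_manifest(RAW_DATA)
print(json.dumps(manifest, indent=2))
```

Anyone re-running the code can compare their manifest with the published one. If the hash matches but the result differs, the analysis has changed. If the hash differs, the data has.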

I’ll certainly be playing more with R over the coming months. I’m sold on the ease with which it’s possible to really quickly explore a dataset.

The other hacks produced on the day were all really interesting. I recommend you read the rundown of the entries on the BathHacked blog.