Meet Data Battery: Inside Energy’s Data Repository

Here at Inside Energy, we love working with data that adds depth to the stories we cover about the human consequences of energy. We love diving into complicated datasets, figuring out what they mean and presenting them to you, our audience, in a way that’s fun and illuminating.

There’s a lot of work that goes on to bring you those graphs and statistics. Now, we invite you to find out what goes on behind the scenes. We’re making all of our data and code public, in what we’re calling our Data Battery. It’s a nerdy name, we know, but we like it because our Data Battery is where we store – and share – all the data and analysis that powers our reporting. We’re making it public, and as easy to access as possible, so it can power your work, too.

As we flip the switch on our Data Battery, we’d like to answer some questions you might have.

What is this thing?

The Data Battery is our data repository, and it’s a place for us to show our work. In it, you’ll find…

the raw datasets and code that we use in our data analysis, visualizations and research
links to the original sources we used to find our data
notes about how we found, cleaned and analyzed the datasets.

More importantly, it’s a place where we invite you to explore the energy datasets we work with. The story doesn’t stop with our analysis – that’s where you can take it further. You’ll be able to see the data that informed our stories – data like the number of coal workers employed in top coal-producing states, or the indicators we used to assess the financial health of oil and gas companies – download it, and do your own deep dives.

Why have a data hub?

Tons of reasons! But here are the three big ones:

As a non-profit news organization, our mission is to connect information with the people who need to know. The data that governments, organizations and companies provide is not always easy to access or understand. We can do the heavy lifting of cleaning up confusing datasets and wrangling difficult-to-access numbers, so that our readers and listeners can use them easily.

As journalists, it’s important that we have your trust. Transparency is an essential part of building trust with our audience – and when it comes to data journalism, the best way to be transparent is to show you exactly what data we used and how. Plus it’s good practice.

After a year of writing data-driven stories, we found that we needed a better way to organize and keep track of our own work. The Data Battery helps us when we need to revisit data we’ve used in the past and when people ask us for the data behind our stories. Now, we have a way to archive our work so that we – and you – can build off it.

Why do it on GitHub?

GitHub is free, and it’s widely used among the news nerd community. Great data-journalism outlets like FiveThirtyEight and The Texas Tribune use GitHub to share the data they use and the code behind their news apps. GitHub is built for sharing, so other users can easily take our data and adapt it in creative ways.

Documentation on GitHub is easy. We can use it to make guides – both to our data and analysis, and to our in-house GitHub tutorials – that are public and shareable. Plus, you can steal our data, code and guides!

It also has built-in features that make it easy for us to provide downloads of our data, without directing anyone through our GitHub page. That’s important, because as much as we love using GitHub, it can be an intimidating and confusing platform for people who aren’t familiar with it. That’s why we added our “Get the data” widget, which you may have noticed popping up under our data visualizations.

We use GitHub to host our data, but when you click on any of the download links, you get either an automatic download of a CSV or Excel file, or a link to a Google Spreadsheet. No need to visit GitHub at all – unless you want to read our nerd notes!
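If you would rather script a download than click, a minimal Python sketch might look like the one below. It relies on the direct-download ("raw") address GitHub serves for files in public repositories. The account, repository, branch and file names are placeholders made up for illustration, not actual Data Battery paths, and the sample rows are dummy values.

```python
import csv
import io
from urllib.request import urlopen


def raw_github_url(owner, repo, branch, path):
    """Build the direct-download URL GitHub serves for a file in a public repo."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


def parse_csv(text):
    """Turn CSV text into a list of dicts, one per row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url):
    """Download a CSV file and parse it (needs network access)."""
    with urlopen(url) as response:
        return parse_csv(response.read().decode("utf-8"))


# Placeholder names, for illustration only (not a real repository).
url = raw_github_url("example-org", "example-repo", "master", "data/example.csv")

# Dummy rows standing in for a downloaded file.
sample = "state,value\nAA,1\nBB,2\n"
rows = parse_csv(sample)
```

Calling `fetch_csv(url)` with the address of a real CSV file would return the same kind of list of rows, ready for your own analysis. Note that every value comes back as text, so numbers need converting before you do math with them.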

Who was our inspiration, or, who did we copy 😉 ?

One of the best examples of a data hub is FiveThirtyEight’s, and we’re not embarrassed to say that we heavily copied the format of their GitHub page.

We also like the way The Guardian provides a data download below many of their visualizations and encourages people to remix their data.

We also borrowed from ProPublica’s excellent use of GitHub to publish guides about the data it uses and apps it makes, as inspiration for our own data and code notes.

Our team is also inspired by the open data movement, and our data journalist Jordan Wirfs-Brock used to work on projects like Colorado Data Engine and NNIP. We understand the value of making data accessible to as many people as possible. Doing so is a challenge, but it’s one that’s worth the effort, in pursuit of both transparency and better engagement with our audience.

What can you, reader out there in the world, do with the data?

Of course, we hope you look at the stories we tell using data. But we also hope you do something with our Data Battery:

Download our data and make something new with it. For instance, in our story about oil and gas tax revenue, we made a graph showing severance tax revenues as a percentage of all tax collection for the top ten oil-producing states. On our Data Battery, we shared data for all 50 states. You might take that data to make a map of tax revenues, or to compare tax revenues in other states. Or to do something else cool and creative that we haven’t thought of!

Use our code. For example, for our worker fatalities analysis, we used Python to query the Bureau of Labor Statistics API. You can adapt that code for your own query.

Use our model to make your own data hub. Also feel free to use the tutorials we used to train our reporters to use GitHub.
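To give a flavor of what querying the Bureau of Labor Statistics API from Python can involve, here is a rough sketch. It is not the code from the worker fatalities analysis. It assumes version 2 of the BLS public API, which accepts a JSON request listing series IDs and a year range, and it assumes the response nests data points under `Results` and `series`. The series ID shown is a placeholder, so look up the real ID for the series you need on the BLS website.

```python
import json
from urllib.request import Request, urlopen

# Endpoint for version 2 of the BLS public API.
BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def build_payload(series_ids, start_year, end_year):
    """Assemble the JSON body for a time series request."""
    return {
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    }


def query_bls(series_ids, start_year, end_year):
    """POST the request to the API and return the decoded JSON (needs network access)."""
    body = json.dumps(build_payload(series_ids, start_year, end_year)).encode("utf-8")
    request = Request(BLS_URL, data=body, headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def extract_values(response):
    """Flatten a response into (series ID, year, period, value) tuples.

    Values are left as text because the API does not guarantee a number
    for every data point.
    """
    rows = []
    for series in response.get("Results", {}).get("series", []):
        for point in series.get("data", []):
            rows.append((series["seriesID"], point["year"], point["period"], point["value"]))
    return rows


# Placeholder series ID, for illustration only.
payload = build_payload(["EXAMPLE_SERIES_ID"], 2011, 2014)
```

With a real series ID, `extract_values(query_bls([series_id], 2011, 2014))` would give you a flat list of data points to chart or analyze. Keeping the request, the network call and the parsing in separate functions makes it easier to swap in your own series or to test the parsing without going online.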

How can you contribute?

If you do something with one of the datasets on our hub – like doing original analysis or making a new visualization – let us know. We’d love to see and share your work.

Or, if there’s data out there that you’d like us to do a story about, let us know. Suggest a dataset you’d like to see us liberate!

What’s next?

For us: We don’t yet have all of our data from the last year on GitHub. Over the coming months we’ll be adding our archival datasets and code. We also just finished training all our reporters on using GitHub to upload their data. We did this so that in the future, as our stories are published to the web, our data will be right there with each story.

This is a learning process. Using GitHub is new for most of us, and it can be scary. But we’re doing everything out in the open, so that you can learn from our training process and what we’re doing to integrate GitHub into our newsroom.

For you: Let us know how we’re doing. Is there a way we can make our work more transparent or user-friendly? We want to hear it. Feel free to send a note to Jordan Wirfs-Brock, jordanwirfs-brock@rmpbs.org.

Visit Data Battery: Inside Energy’s data repository

Inside Energy - Bringing energy reporting down to Earth

By Jordan Wirfs-Brock and Catherine Roberts | July 30, 2015

Inside Energy is a collaborative journalism initiative of partners across the US, supported by the Corporation for Public Broadcasting.