Inside Energy has been exploring the topic of workplace fatalities in the oil and gas industry. Emily Guerin asked why North Dakota’s industry is so dangerous, and Stephanie Joyce examined what’s behind Wyoming’s improving safety record. To do this reporting, we wrangled a lot of data. And workplace fatality data – specifically the data that goes into calculating workplace fatality rates – is quite possibly the most unruly data Inside Energy has wrangled to date – not because it’s difficult, but because the nature of the data makes it nearly impossible to capture the full story of how dangerous the oil and gas industry is, and how safety varies at a local level.
Here are some of the biggest challenges involved in analyzing workplace fatality data:
Data Challenge 1: When we refer to the “oil and gas” industry, what do we mean?
As it turns out, this is not a straightforward question. The federal government classifies workers in such a way that people who drill wells – and are at the very core of the oil and gas industry – aren’t included in the oil and gas category. To understand why, you need to understand something called NAICS.
NAICS…what’s a NAICS?
To make data tracking easier, the federal government created something called the North American Industry Classification system, or NAICS. NAICS is not a data source; it’s a classification system which ensures that different government agencies are referring to the same industries when they publish data about those industries.
NAICS assigns codes to industries and sub-industries. The length of a NAICS code (which ranges from two to six digits) tells you how specific that classification is.
When the Bureau of Labor Statistics (BLS) or the U.S. Census Bureau publishes data about workers, they use NAICS codes to organize the data. Here’s an example of how you would access BLS Census of Fatal Occupational Injuries data using BLS’s data query tool and NAICS codes to specify the industry:
(Note: Inside Energy didn’t use this web interface to access the fatality data – we used the BLS Public Data API. The process and underlying databases are the same, but the BLS API makes it easier to download a lot of data at once.)
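To give a sense of what an API query looks like, here is a minimal sketch of building the request body for the BLS Public Data API (version 2), which accepts a POST with a JSON payload listing series IDs and a year range. The series ID below is a placeholder, not a real CFOI series ID – real IDs have to be looked up in BLS’s series-ID documentation.

```python
import json

# The BLS Public Data API v2 endpoint (takes a JSON POST body).
API_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"

def build_bls_payload(series_ids, start_year, end_year, api_key=None):
    """Assemble the JSON body for a BLS timeseries query.

    The payload fields ("seriesid", "startyear", "endyear",
    "registrationkey") follow the documented v2 request format.
    """
    payload = {
        "seriesid": series_ids,
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    if api_key:  # a free registration key raises the daily query limit
        payload["registrationkey"] = api_key
    return json.dumps(payload)

# "EXAMPLE_SERIES_ID" is a placeholder, not an actual BLS series.
body = build_bls_payload(["EXAMPLE_SERIES_ID"], 2011, 2012)
```

The actual request would POST `body` to `API_URL` with a `Content-Type: application/json` header; the sketch stops short of the network call.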
Longer NAICS codes nest within their shorter parent codes. For example, mining is an industry (NAICS code 21), and within mining, oil and gas extraction is a sub-industry (NAICS code 211), along with “mining except oil and gas” (NAICS code 212) and “support activities for mining” (NAICS code 213). So data (like number of employees or number of fatalities) for NAICS 211, 212, and 213 should add up to the totals for NAICS 21.
But here’s where it gets tricky: A lot of oil and gas jobs – including some of the most dangerous ones, like drilling – are classified as a support activity, which means they are under NAICS 213. So to get data on all oil and gas jobs, we also need to look at NAICS codes 213111 (“drilling oil and gas wells”) and 213112 (“support activities for oil and gas exploration”).
The most accurate picture of the entire oil and gas industry comes from combining NAICS 211, 213111, and 213112. This means we have to add several values together to get “oil and gas” data – doable, but more work for analysts and data journalists.
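That aggregation step is simple in code. Here’s a sketch with invented employment counts, showing how an “oil and gas” total is assembled from the three codes above (and why NAICS 212 stays out):

```python
# The three NAICS codes that together cover the oil and gas industry.
OIL_AND_GAS_CODES = ["211", "213111", "213112"]

# Hypothetical employment counts by NAICS code -- the numbers are made up.
employment = {
    "211":    6000,   # oil and gas extraction
    "213111": 7500,   # drilling oil and gas wells
    "213112": 4500,   # support activities for oil and gas operations
    "212":    3000,   # mining except oil and gas (NOT oil and gas)
}

# Sum only the oil-and-gas codes; NAICS 212 is excluded.
oil_and_gas_total = sum(employment[code] for code in OIL_AND_GAS_CODES)
# 6000 + 7500 + 4500 = 18000
```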
But here’s where this complicates the analysis: some information is only published at the two-digit or three-digit NAICS level. Oil and gas stats at those levels don’t include support jobs like drilling, so they don’t capture the entire industry – and they aren’t comparable to stats that do, because the two aren’t looking at the same pool of workers.
Data Challenge 2: A fatality rate is really hard to calculate.
As you might expect, the number of fatalities varies as the industry grows and shrinks. So what matters more is the fatality rate – how many deaths per 100,000 workers – which tells us how relatively dangerous an industry is. The rate lets us look at the safety of an industry over time, or compare it to other industries.
To calculate a fatality rate we need two things: the number of fatalities (the numerator) and the number of workers (the denominator). Getting these numbers is surprisingly difficult.
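The formula itself is the easy part. A minimal sketch, with invented counts:

```python
def fatality_rate(deaths, workers, per=100_000):
    """Fatality rate expressed as deaths per 100,000 workers."""
    return deaths / workers * per

# Hypothetical example: 12 deaths among 16,000 workers.
rate = fatality_rate(12, 16_000)
# 12 / 16000 * 100000 = 75.0 deaths per 100,000 workers
```

Everything hard about the calculation lives in choosing defensible values for `deaths` and `workers`, which is what the rest of this section is about.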
There are several sources for employment numbers, including the Quarterly Census of Employment and Wages (from the Bureau of Labor Statistics) and the County Business Patterns (from the U.S. Census Bureau). The number of workers varies from day to day, week to week, and month to month. So some data sets (like QCEW) average a count taken four times a year to get an annual employee count. Others, like CBP, use a single, once-a-year snapshot of workers (generally sometime in March). These government sources count employees differently. For example: According to QCEW, North Dakota had 15,999 oil and gas workers in 2012; CBP reports 17,838.
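The two counting methods can be sketched side by side. The numbers below are invented, just to show mechanically why an averaged count and a snapshot count disagree even for the same workforce:

```python
# QCEW-style annual figure: average of quarterly employment counts.
quarterly_counts = [15000, 15800, 16400, 16800]   # Q1 through Q4
qcew_style_annual = sum(quarterly_counts) / len(quarterly_counts)
# (15000 + 15800 + 16400 + 16800) / 4 = 16000.0

# CBP-style figure: a single snapshot, generally taken in March --
# which here catches the workforce before its mid-year growth.
cbp_style_snapshot = 15200
```

In a fast-growing industry like North Dakota’s oil patch, a March snapshot and a four-quarter average will rarely match, which is one plausible reason the QCEW and CBP figures quoted above differ.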
We chose to use CBP data because it includes employment at the six-digit NAICS level – remember NAICS? – which lets us estimate workers for the entire oil and gas industry at the state level. When worker numbers do not meet publication standards – because the numbers are too low to adequately protect confidentiality – CBP does not publish them. However, this was very rare for the states and industries we were looking at.
In our analysis, we used fatality counts from the Bureau of Labor Statistics Census of Fatal Occupational Injuries, which publishes updated data each year. At the national level, these numbers are available – and reliable – to the six-digit NAICS code. But at the state level, BLS also has restrictions on what they publish: They only publish values that meet certain publication standards. If the number is too small, it could reveal confidential information about a specific person, or about a specific company. (For more information, view the BLS handbook of methods; publication standards and confidentiality concerns are addressed on page 21.)
This leads to some strange results. For instance, state fatality counts don’t add up to the national totals. It’s not because BLS is publishing incorrect data; it’s because the data isn’t comprehensive.
Data Challenge 3: Small numbers make everything more complicated
Workplace fatality counts are small (thankfully). That means even small changes in the number of deaths can lead to huge changes in the fatality rate.
To minimize the small-number effect, we calculated a fatality rate over a two-year period (2011 and 2012, together), averaging both the fatality count and the number of workers over those two years. (For more information on exactly how we did the math, see our methodology.) When we compared state rates to the national average, we used the national rate calculated using this method.
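The two-year averaging described above can be sketched as follows, with invented counts:

```python
# Hypothetical two-year counts for one state.
deaths  = {2011: 10,    2012: 14}
workers = {2011: 15000, 2012: 17000}

# Average both the numerator and the denominator over the two years,
# then compute deaths per 100,000 workers.
avg_deaths  = sum(deaths.values()) / 2     # 12.0
avg_workers = sum(workers.values()) / 2    # 16000.0
pooled_rate = avg_deaths / avg_workers * 100_000   # 75.0 per 100,000
```

With these made-up numbers, the single-year rates swing from about 66.7 (2011) to about 82.4 (2012) deaths per 100,000 workers, while the pooled rate of 75.0 smooths out that year-to-year volatility.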
What does this mean for calculating the fatality rate?
State-level fatality counts are often underestimates (especially for states with small fatality counts, like Wyoming and Kansas), which means our calculated fatality rates are also underestimates. The Census Bureau publishes a “noise level” – an estimated error range – along with the CBP employee estimates, typically five percent or less (meaning the actual number of employees could be up to five percent higher or lower than the estimate). Even assuming the largest amount of noise possible doesn’t change the results much: North Dakota, for example, has a fatality rate of 75 deaths per 100,000 workers, or a range of 73 to 76 deaths per 100,000 workers accounting for noise.
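Bounding a rate with this noise level is straightforward: if the worker count could be up to five percent higher, the rate could be correspondingly lower, and vice versa. A sketch with invented counts (North Dakota’s actual noise level was evidently tighter than the full five percent):

```python
def rate_with_noise(deaths, workers, noise=0.05, per=100_000):
    """Return (low, point, high) fatality rates given a fractional
    noise level on the worker estimate."""
    low   = deaths / (workers * (1 + noise)) * per  # more workers -> lower rate
    point = deaths / workers * per
    high  = deaths / (workers * (1 - noise)) * per  # fewer workers -> higher rate
    return low, point, high

# Hypothetical: 12 deaths among an estimated 16,000 workers, +/-5% noise.
low, point, high = rate_with_noise(12, 16_000)
# point = 75.0; the range is roughly 71.4 to 78.9 per 100,000
```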
Looking at two years of data reduces the effect of the noise in the employee estimates, as well as the under-estimation effect of the BLS fatality counts.
Additionally, we didn’t publish fatality rates for states with small oil and gas industries (like Indiana), because those fatality rates are less conclusive. We only published rates for states with an average of at least 5,000 workers and 20 active drilling rigs. By contrast, the National Institute for Occupational Safety and Health (NIOSH) doesn’t publish rates when the numerator is less than 10; when the numerator is less than 20, it publishes the rate with a note. In our fatality rate table, we indicated rates calculated with a numerator less than 10 in grey.
Data Challenge 4: So what should you do with tricky data?
Well, one thing we can do is check it against other data sources. There are many different methods for calculating a fatality rate, and we checked our method against several others. BLS publishes workplace fatality rates using a method that incorporates the number of hours worked, which means it accounts for overtime and part-time workers. (The hours-worked data isn’t consistently available at the state level, so we couldn’t use it to calculate state-level rates.) And NIOSH calculates national oil and gas fatality rates using QCEW data. Here are the national oil and gas workplace fatality rates in 2012 calculated three different ways – using QCEW employment data, using CBP employment data, and using BLS’s method (which looks at worker hours and doesn’t include NAICS 213111 and 213112): 25.2, 28.2, and 25.1 deaths per 100,000 workers, respectively. Using the method we used to calculate state-level fatality rates (a two-year average for 2011 and 2012), the rate is 27.1 deaths per 100,000 workers. It turns out that, at the national level at least, calculating the oil and gas fatality rate using different methods and different worker counts yields pretty similar results.
As we get more data – for example, when BLS releases the 2013 state workplace fatality counts, and the Census Bureau releases the 2013 employee counts – we’ll update the fatality rate to make it as accurate as possible.
As journalists, Inside Energy’s goal with doing this kind of analysis is to use data to figure out what questions to ask and what issues deserve attention. In this case, our analysis allowed reporter Emily Guerin to ask the governor of North Dakota why the state had an oil and gas fatality rate that was three times the national average. He responded: Let’s study it.
All data has limitations. Just because data has limitations doesn’t mean we shouldn’t use it. But we should do our best to understand those limitations, and to be transparent about them.