One of the key points I tried to make in Data Principles for Beginners is that if you want to work with data you first need to track the right information. Some data can never be recovered if you don’t track it up front. And some is just impossibly difficult to obtain after the fact.
I want to say that the example I used in that book, since I’m a writer, was how many hours it takes me to write each title I publish. This is crucial for me because it takes far less time to write a non-fiction book about Excel than it does to write a 120K-word YA fantasy novel. So if I earn the same amount on those two titles, my time is much better spent writing another non-fiction book than another YA fantasy novel, because I get the same return with far less time spent to get there.
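To make that comparison concrete, here is a minimal Python sketch of the return-per-hour calculation. The earnings and hours are made-up numbers for illustration only, not my actual figures, and `earnings_per_hour` is just a name I picked for this example.

```python
def earnings_per_hour(earnings, hours):
    """Return earnings divided by the hours spent writing the title."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return earnings / hours


# Hypothetical figures: same earnings, very different time investment.
titles = {
    "non-fiction Excel book": {"earnings": 1000.0, "hours": 50.0},
    "YA fantasy novel": {"earnings": 1000.0, "hours": 400.0},
}

rates = {
    name: earnings_per_hour(t["earnings"], t["hours"])
    for name, t in titles.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per hour")
```

None of this works unless the hours were tracked while the book was being written, which is the whole point.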
I bring this up today because this COVID-19 situation is a perfect example of how important data analysis is to understanding the situation. And many of the concepts I discussed in that book are playing out right now in real life.
For example, it looks like it may be important how those who analyze fatality data bucket age groups. Here, for example, is a chart from New York state:
Here is similar data from Colorado:
Note how Colorado groups anyone over the age of 80 into one bucket whereas New York splits out those over 90 into their own category? And note how in New York that seems to be important. I haven’t run a statistical analysis on those numbers to see if the difference in fatality rate between those two groups is material or not, but it looks like it might be.
Of course, then you need to figure out why that difference exists. Maybe there was a virus that circulated for those 90+ when they were children that has given them partial immunity. Or there’s some commonality among those who live to 90+ that makes them more resilient when dealing with this. Or maybe when you’re 90+ you only bother to go to the hospital for treatment of something like this if you’re generally more healthy, and if we were to account for those who died at home during the same period the difference would go away.
But there’s no way to see that difference if that data isn’t, first, collected and, second, used for analysis. This is why it’s often very important to chart data before you create your categories so you can visually see what you’re dealing with. (I believe in the book the example I used revolved around annual income categories for bank customers. If you’re dealing with high net worth individuals using a top category of $100,000+ isn’t going to work well.)
Now maybe what we’re seeing above is just a quirk in the New York data and if you were to separate out the 90+ age range from the 80-89 age range in Colorado there’d be no difference. But the key is to be able to do so if needed (which means setting the right ranges for your dataset) and then actually attempting to do so.
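Here is a rough Python sketch of how a top bucket can hide a difference. The counts are invented for illustration and are not the New York or Colorado numbers, and the function name is my own. It computes deaths divided by cases once with 90+ kept separate and once with everyone over 80 lumped together.

```python
from collections import defaultdict


def fatality_rate_by_bucket(rows, bucket_of):
    """Sum cases and deaths per bucket, then return deaths / cases for each."""
    cases = defaultdict(int)
    deaths = defaultdict(int)
    for age_group, n_cases, n_deaths in rows:
        bucket = bucket_of(age_group)
        cases[bucket] += n_cases
        deaths[bucket] += n_deaths
    return {bucket: deaths[bucket] / cases[bucket] for bucket in cases}


# Hypothetical (age group, cases, deaths) rows, NOT real state data.
rows = [
    ("70-79", 1000, 100),
    ("80-89", 800, 160),
    ("90+", 200, 20),
]

# Fine-grained: keep the age groups exactly as collected.
fine = fatality_rate_by_bucket(rows, lambda group: group)

# Coarse: merge everyone 80 and over into a single top bucket.
coarse = fatality_rate_by_bucket(
    rows, lambda group: "80+" if group in ("80-89", "90+") else group
)

print(fine)
print(coarse)
```

With these made-up numbers the 80-89 and 90+ rates are quite different, but the merged 80+ bucket shows only one blended rate. You can always collapse fine buckets into coarse ones later. You can’t go the other direction.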
(There’ve been articles about potential racial difference in outcomes as well. But without information on living situation, health care status, neighborhood pollution levels, income, etc. it’s hard to say whether it’s because of economic disadvantage, systemic racism, or something genetic. Same with the fact that more men than women seem to be dying. Without information on things like smoking history, which was one of the early suggestions that I think has since been disproven, you can’t parse out the actual cause for the differences.)
Another issue I’ve noted is the problem of comparing apples to oranges. I admire Johns Hopkins for what they’ve been doing with their dashboard but it also makes me want a strong drink. Here it is as of this morning:
What annoys me about it is the Total Confirmed numbers on the left-hand side cannot be readily compared to the Total Deaths numbers on the right-hand side. If you look at the bottom of the total confirmed numbers you’ll see Admin0, Admin1, Admin2. These used to be better labeled. What they do is allow you to toggle between a country-level view and a more granular level of data.
By default for confirmed cases you get country-level case data.
Problem is that the death values on the right-hand side are NOT country-level data. You can now see this clearly when you look at the fifth entry in the image above, which is not even for New York state but is instead for New York City. Scroll down further and you’ll see additional entries for New York state.
There is no easy way to find the total values for the U.S. nor for the most-impacted states. It’s very frustrating. And until CNN published their U.S. tracker and Stat News published theirs (and got it working so it’s current and not weirdly delayed) I was highly annoyed by this situation. Because the data was there but it was being presented in a very ineffective and perhaps even misleading manner. (Most people don’t dig into the data they’re shown, they just take what they see on the surface so it was easy to look at the death values and assume the U.S. wasn’t as high up on the list as it actually was.)
It should be as easy to put the same Admin0, Admin1, Admin2 category options on the death data as it was to put them on the confirmed cases data. And then the user could easily compare cases to deaths with just a glance.
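As a sketch of what I mean, here is a small Python example that rolls both cases and deaths up to the same level before comparing them. The record layout (country, state, locality), the counts, and the `roll_up` helper are all assumptions of mine for illustration. I don’t know how the dashboard actually stores its data.

```python
from collections import defaultdict


def roll_up(rows, level):
    """Sum counts by the first `level` parts of each location key.

    level=1 groups by country, level=2 by (country, state), and so on.
    """
    totals = defaultdict(int)
    for location, count in rows:
        totals[location[:level]] += count
    return dict(totals)


# Hypothetical records keyed by (country, state, locality); counts are made up.
cases = [
    (("US", "New York", "New York City"), 500),
    (("US", "New York", "Albany"), 50),
    (("US", "Colorado", "Denver"), 100),
    (("Italy", "", ""), 400),
]
deaths = [
    (("US", "New York", "New York City"), 25),
    (("US", "New York", "Albany"), 2),
    (("US", "Colorado", "Denver"), 3),
    (("Italy", "", ""), 40),
]

# Compare like with like: both lists at country level.
cases_by_country = roll_up(cases, 1)
deaths_by_country = roll_up(deaths, 1)

for country in cases_by_country:
    print(country[0], cases_by_country[country], deaths_by_country.get(country, 0))
```

Change the level to 2 for both lists and you get a state-by-state comparison instead. The key is that both sides use the same level.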
Of course, as I’ve discussed before, we’re not testing enough for this data to actually be a full picture of what’s happening anyway.
There are people who have died at home who were never tested so are not part of the fatality data. There are people who very clearly have had it who also were never tested. There are people who are going to die from something else because they will either choose to stay home rather than seek care or because they won’t be able to get the care they need to save their lives.
At some point in time someone with good data skills is going to have to go back and look at baseline fatality levels for a similar timeframe over say the last five years, adjust for the current year trend for the last six months or so before the virus hit, and then extrapolate the number of direct and indirect deaths caused by COVID-19 to give us a legitimate picture of the actual impact of the virus. (And of course if we’re going to give the virus blame for the indirect deaths due to lack of care we also need to give it credit for lower traffic fatalities, etc.)
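A very rough Python sketch of that calculation follows. It is a simplification under my own assumptions, not how an epidemiologist would necessarily do it: the baseline is a plain five-year average for the same period, the trend adjustment is a simple ratio, and every number is invented for illustration.

```python
def expected_deaths(prior_years_same_period, recent_current, recent_prior_avg):
    """Estimate the deaths we would have expected without the virus.

    prior_years_same_period: deaths in the same period for each prior year.
    recent_current: deaths in the months just before the virus hit, this year.
    recent_prior_avg: average deaths for those same months in prior years.
    """
    baseline = sum(prior_years_same_period) / len(prior_years_same_period)
    trend = recent_current / recent_prior_avg  # simple ratio adjustment
    return baseline * trend


# Hypothetical counts for one region, made up for illustration.
prior_five_years = [1000, 1020, 980, 1010, 990]
expected = expected_deaths(
    prior_five_years, recent_current=6300, recent_prior_avg=6000
)

observed = 1500        # all deaths recorded in the period, any cause
recorded_covid = 300   # deaths officially attributed to COVID-19

excess = observed - expected          # net direct plus indirect impact
unexplained = excess - recorded_covid  # untested direct deaths plus net indirect

print(f"expected {expected:.0f}, excess {excess:.0f}, unexplained {unexplained:.0f}")
```

Because the excess is a net figure, it already includes the credit for things like lower traffic fatalities. Separating the untested direct deaths from the indirect ones would take more data than this.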
Whoever does that will then have to probably back into total infection numbers once we have some idea of infection vs. fatality/hospitalization rates by region. If that’s even possible.
Of course, no good data, no good analysis. The key starting point to be able to do any of that is the data. Data is key. You have to collect the right information and in the right format. And then you have to use it effectively and ask the right questions. (Which is why one of the first chapters in that book was also about how you need subject matter experts who understand the data you’re working with, not just smart people who can run a regression analysis.)
Anyway. Data and how you use it matters.
For anyone looking for the sources I referenced above: