How they lie with numbers.

Let me preface this piece by saying that statistics as a method is not fundamentally flawed, but ask any machine learning engineer in your local cafe and they’ll be quick to tell you that the process of data collection and storage usually is. A new engineer is quick to go to Kaggle, download a dataset, perform a few cleaning operations, do a little analysis, classification and prediction, and update their LinkedIn headline with “Machine Learning Enthusiast”, “AI Expert” and “Data Scientist”. While I have few points of contention with the first two titles, the last one is questionable, since you’re working on what is essentially a DIY data cleaning project instead of true data processing, but I digress.

Data, especially in these “unprecedented times” of the COVID-19 pandemic, is far from the utopia that is Kaggle. Data in real life is messy, full of clutter you don’t want, and has flaws that we will talk about. I will also go over how a lot of statistics are not absolute truths. Darrell Huff’s book “How to Lie with Statistics” has 10 chapters barring the introduction and acknowledgements, but we will look at 3 concepts.

The Sample with the Built-in Bias

A few months ago, just as the lockdowns were about to start in India, an Islamic missionary group named Tablighi Jamaat held a congregation in South Delhi. This ended up becoming a coronavirus hotspot in India. Twitter trolls and journalists were all too eager to attack the entire Muslim community. Muslims were lynched, attacked and discriminated against.

For example, here is an article by the Economic Times that starts off with:

NEW DELHI: Over 95% of the coronavirus cases reported over the last two days in India have been found to have links with the Tablighi Jamaat congregation in Delhi.

But nowhere did the article mention the sampling bias in testing. Once the Tablighi Jamaat cluster was discovered, the Jamaatis (the members of the missionary group) were tested aggressively, and a lot of them tested positive. The fact that the group had an organised structure helped in contact tracing of spreaders. This level of testing was not performed on other groups of people. In some states, it was legally mandated for Jamaatis to get tested. Some journalists stated that the Jamaat was the single reason why India was in dire straits in its fight against COVID-19.

This is essentially a biased sample: you aggressively test one section of people and get more than half of them testing positive, but you haven’t tested any other cluster nearly as thoroughly. So you establish a correlation between the high number of cases and the Tablighi Jamaat, and that is the only correlation you present. Intellectual dishonesty aside, this is simply muddy journalism.
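The effect of uneven testing can be sketched with a few lines of Python. Everything here is hypothetical: the group names, populations, infection rate and testing fractions are invented for illustration, and the sketch assumes that tests within each group are a random sample of that group. Both groups are given the same true infection rate, so the only difference between them is how hard they are tested.

```python
# Hypothetical numbers chosen only for illustration; none come from real data.

def share_of_detected_cases(groups):
    """Return each group's share of *detected* cases.

    Assumes tests are a random sample within each group, so the expected
    number of detected cases is population * infection_rate * tested_fraction.
    """
    detected = {
        name: g["population"] * g["infection_rate"] * g["tested_fraction"]
        for name, g in groups.items()
    }
    total = sum(detected.values())
    return {name: count / total for name, count in detected.items()}


def share_of_true_cases(groups):
    """Return each group's share of all infections, detected or not."""
    infected = {
        name: g["population"] * g["infection_rate"] for name, g in groups.items()
    }
    total = sum(infected.values())
    return {name: count / total for name, count in infected.items()}


groups = {
    # Small cluster that is traced and tested aggressively.
    "traced_cluster": {
        "population": 10_000, "infection_rate": 0.05, "tested_fraction": 0.90,
    },
    # Everyone else: same infection rate, but barely tested.
    "everyone_else": {
        "population": 1_000_000, "infection_rate": 0.05, "tested_fraction": 0.0005,
    },
}

detected_shares = share_of_detected_cases(groups)
true_shares = share_of_true_cases(groups)

for name in groups:
    print(
        f"{name}: {detected_shares[name]:.1%} of detected cases, "
        f"{true_shares[name]:.1%} of actual infections"
    )
```

With these made-up inputs, the traced cluster accounts for roughly 95% of detected cases while holding about 1% of actual infections. The headline number reflects where the tests were done, not where the virus is.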

The Semi-Attached Figure

In his book, Huff writes:

If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anybody will notice the difference.

Researching for this article was incredibly difficult, not for lack of data or reporting, but because the blatant clickbait and bias in Google search results were an obstacle. Nonetheless, I came across some interesting articles that did exactly this, and at the forefront is CNN.

The article leads with the title

Black Lives Matter protests have not led to a spike in coronavirus cases, research says

Alright, that seems like Good News! However, I am not one to trust media websites, so I did a little digging.

A new study, published this month by the National Bureau of Economic Research, used data on protests from more than 300 of the largest US cities, and found no evidence that coronavirus cases grew in the weeks following the beginning of the protests.

Now, the following part of the article is an almost cartoonish, hilarious representation of the joke that modern media is.

In fact, researchers determined that social distancing behaviours actually went up after the protests — as people tried to avoid the protests altogether.

So the researchers state that people are afraid to go out, and thus the number of cases has dropped, because people who are scared are staying indoors.

“Our findings suggest that any direct decrease in social distancing among the subset of the population participating in the protests is more than offset by increasing social distancing behaviour among others who may choose to shelter-at-home and circumvent public places while the protests are underway”

In simpler words: any increase in cases due to the lack of social distancing at protests has been overshadowed by the decrease in cases due to people staying indoors because they are scared of the violence at the protests.

(Please note that I support the Black Lives Matter movement; I am only critiquing disingenuousness in reporting. As someone who is affected by an increased rate of COVID-19 spread, I am disgusted by the introduction of biased journalism into the reporting of two very grave phenomena: systemic racism and the spread of a life-threatening pandemic.)

Here is another article, by Healthline, discussing the correlation between the COVID-19 surge and BLM protests. Here the headline isn’t as misleading, because there is little data to draw any conclusions from.

“I have not seen any peer-reviewed research linking outdoor protests (or really any major outdoor events) to the surge here in Texas” said Rodney Rohde, PhD, an associate dean for research at Texas State’s College of Health Professions who focuses on public health microbiology.

I’ll break this down as well, but let me take a second to address one thing I have noticed in deceptive news articles: they bury the facts between lines of fluff, and resort to using the titles of people of authority (here, Rodney Rohde) as a substitute for factual, ethical reporting. Anyway, in this article, Rohde simply states that he has not seen any paper correlating outdoor protests with the surge in cases. The ridiculousness of acquiring such an isolated dataset aside, you cannot simply say that you haven’t found any papers of this kind and therefore the conclusion is not true. On the part of healthline.com, this is simple intellectual dishonesty.

Rohde goes on to say:

The COVID-19 spike in Texas is likely tied to the reopening, not the protests

One can be equally intellectually dishonest, feign ignorance and state that there is thus no correlation between the two, but there definitely is some degree of correlation.

Correlation is not a binary value. Here is the formula for the correlation between two datasets, x and y. Taking a correlation between datasets A and B and presenting it as a correlation between C and B is as bad as journalism gets, but are we surprised anymore?
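Assuming the formula referred to above is the standard Pearson correlation coefficient (an assumption, since the text does not name it), for n paired observations of x and y with means \bar{x} and \bar{y} it reads:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of r ranges continuously from -1 to 1, which is the sense in which correlation is not binary: "no correlation" means r is close to 0, not merely that nobody has published a paper reporting it.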

The outrage against people wanting the economy to safely open up was ridiculous, and the discrediting of any correlation between BLM protests and the coronavirus is insane.

The One Dimensional Picture

Alright, now let’s compare COVID-19 case numbers across countries for a second. At the time of writing, these are the numbers, from highest to lowest.

From worldometers.info/coronavirus/ on 15th Aug, 2020

As an aside, I always wonder how many stories are hidden in each of these numbers. Statistics is merciless through no fault of its own: stories, people, lives become just a number. I remember the first time I made my coronavirus tracking module; I was sad as fuck after deploying. Three months later, I have become numb to this.

So here’s how it works: the more people you test, the more will test positive. For example, some countries test only if you have symptoms. Some deaths are also marked as COVID-related despite not being COVID-related. Otherwise, please explain how smaller countries with no form of contact tracing, a mostly labour-based economy (and thus a relatively less strict lockdown) and little healthcare infrastructure have fewer cases, relatively, than some of the most successful countries.

A better measure of cases would be the number of cases relative to the number of tests done. Instead of measures such as the death rate, which is almost impossible to pin down concretely, better measures would be the rate of hospitalization and the rate of infection, which would help countries prepare faster. The death rate varies because there is no standard way of classifying a death as a result of COVID or of a comorbidity. The death rate can be calculated by anybody with a computer, but some data, such as the number of tests, is held by the government.
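To see why cases per test tells a different story than raw case counts, here is a minimal sketch. The two countries and all their figures are hypothetical, invented purely to show how the ranking can flip; `CountryReport` is an illustrative name, not part of any library.

```python
# Hypothetical figures for two made-up countries; not real data.
from dataclasses import dataclass


@dataclass
class CountryReport:
    name: str
    confirmed_cases: int
    tests_done: int

    @property
    def positivity_rate(self) -> float:
        """Confirmed cases per test performed."""
        return self.confirmed_cases / self.tests_done


reports = [
    # Tests widely, so it finds many cases.
    CountryReport("Country A", confirmed_cases=500_000, tests_done=10_000_000),
    # Tests very little, so it reports few cases.
    CountryReport("Country B", confirmed_cases=50_000, tests_done=200_000),
]

# Ranking by raw case count (the one-dimensional picture).
by_raw_cases = sorted(reports, key=lambda r: r.confirmed_cases, reverse=True)

# Ranking by cases relative to tests done.
by_positivity = sorted(reports, key=lambda r: r.positivity_rate, reverse=True)

for r in reports:
    print(f"{r.name}: {r.confirmed_cases:,} cases, {r.positivity_rate:.0%} of tests positive")
```

On raw counts Country A looks ten times worse, but one in four of Country B's tests comes back positive against one in twenty for Country A. The sketch only works if the number of tests is available, which is exactly the figure that is often held back.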

This essentially makes data a privilege to have, and lying with data becomes easier. Furthermore, the network effect of these media outlets is larger than that of an individual data researcher, so they manipulate according to their ideas and beliefs. In the cacophony of politics, we have lost the essence of community and of trying to save people. There are media outlets in India attacking minorities, and media outlets in the USA attacking blue-collar workers who want to get back to business safely (of course it’s a more nuanced discussion than that, but I digress). The Left does it, the Right does it, the Centre does it. Everyone does it, and we are left without solid data as proof. It is easy to lie with graphs and percentages; it is harder to lie with solid data.

Subscribe to August Radjoe (also: abhinavmir)