Is there something, or nothing?
Data is a window into something you actually care about. If you’re listening to a patient’s heart, the sound waves are data. But you really care about the heart valves you can’t see or touch directly.
So data is useful, but it can be misleading. This is unavoidable. The best way to avoid being misled is to think carefully about your data: what else might explain what I saw?
Here I’ll talk about one major alternative explanation: your variables aren’t linearly related, but you’re analysing them as if they were. We’ll walk through what this means from a bird’s-eye view and then work through a specific example.
Interactive Notebook (simulated data): https://colab.research.google.com/drive/1d4Ly9HPQTKr_WBS4wqmcsuHc0GxkFlrL?usp=sharing
At first there was nothing…
We’re going to “do an experiment”
Looks like changing the variable we’re controlling (the independent variable, X) doesn’t change the variable we’re watching (the dependent variable, Y).
If we run a correlation on this, we’ll find there isn’t a statistically significant one between X and Y. We can’t reject the null hypothesis that they’re unrelated.
In other words, they don’t look like they’re related.
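To make this concrete, here’s a minimal sketch of that experiment in Python. The actual generating function lives in the notebook linked above; the piecewise function below is a hypothetical stand-in, chosen only because it behaves the same way (flat at first, rising later):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical stand-in for the true generating function:
# flat at zero until x = 10, then curving upward.
def truth(x):
    return np.where(x > 10, (x - 10) ** 2, 0.0)

# "Our" experiment: we only sample X between 0 and 10,
# then add some static (noise) on top of the truth.
x = rng.uniform(0, 10, size=200)
y = truth(x) + rng.normal(0, 1, size=200)

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.3f}")
# The truth is flat on this range, so any correlation we see is pure noise.
```

Because the stand-in truth is exactly flat below x = 10, Y here is just static, and the correlation with X is whatever noise happens to produce.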
…then poof, relationship!
But let’s say your friend does the same experiment and tells you that X and Y are related. What?! How could they be so silly?
You then look at their experiment and see
Whoa, that sure doesn’t look like a flat line between X and Y. It looks like they’re related!
Turns out your friend ran the same experiment, but over a different range of X. And it’s pretty easy to see that Y ticks up as X goes beyond 10.
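Sketching the friend’s version of the experiment, again with a hypothetical stand-in for the notebook’s real generating function (flat until x = 10, then rising):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Hypothetical stand-in truth: flat at zero until x = 10, then rising.
def truth(x):
    return np.where(x > 10, (x - 10) ** 2, 0.0)

# The friend's experiment covers a wider range of X,
# so the upswing past x = 10 is in view.
x = rng.uniform(0, 20, size=200)
y = truth(x) + rng.normal(0, 1, size=200)

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.3g}")
# With the bend in view, the correlation is strong and significant.
```

Same code, same truth, different X range — and now the test comes back loudly significant.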
What’s the truth?
The great thing about simulating data is that you know what the truth is: you built it!
So let’s take a look at what generated our data under the hood: https://www.desmos.com/calculator/xwb1cexqep
The red line is the truth that generates the data, right before we add some static on top. The red line has this equation — it may look complicated, but it’s one of the least complicated things you’ll encounter.
What happened here is that we ran our experiment on a small range of X and assumed that that small range was representative of the total range.
Collecting more samples inside of that small range would not solve our problem. The problem is we assumed a linear relationship in the variables when the real relationship wasn’t linear.
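A quick sketch of why more samples don’t help (using the same kind of hypothetical stand-in truth, not the notebook’s exact function): even with 5,000 samples confined below x = 10, there’s still nothing to find, because the truth really is flat there.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Hypothetical stand-in truth: flat at zero until x = 10, then rising.
def truth(x):
    return np.where(x > 10, (x - 10) ** 2, 0.0)

# Crank the sample size way up, but keep X confined to [0, 10].
x = rng.uniform(0, 10, size=5000)
y = truth(x) + rng.normal(0, 1, size=5000)

r, p = pearsonr(x, y)
print(f"r = {r:.4f}")
# Still hovering near zero: sampling harder can't fix a badly chosen range.
```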
Back in the day, we’d be stuck trying to piece together the big picture from little patches of lines. But today, largely because of machine learning, we can avoid making the linearity assumption because computers can do more complex calculations than we could on paper.
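As a sketch of that idea: fit a straight line and a more flexible model to wide-range data and compare how well each one tracks the curve. The generating function below is again a hypothetical stand-in, and the flexible model here is just a higher-degree polynomial standing in for fancier machine-learning fits:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in truth: flat, then curving upward past x = 10.
def truth(x):
    return np.where(x > 10, (x - 10) ** 2, 0.0)

x = rng.uniform(0, 20, size=400)
y = truth(x) + rng.normal(0, 1, size=400)

# A rigid linear fit vs. a more flexible polynomial fit.
linear = np.polynomial.Polynomial.fit(x, y, deg=1)
flexible = np.polynomial.Polynomial.fit(x, y, deg=6)

mse_linear = np.mean((y - linear(x)) ** 2)
mse_flexible = np.mean((y - flexible(x)) ** 2)
print(f"linear MSE: {mse_linear:.1f}, flexible MSE: {mse_flexible:.1f}")
# The flexible fit hugs the bend; the straight line can't.
```

The point isn’t that degree-6 polynomials are the answer — it’s that once we stop insisting on a straight line, the model is free to find the bend.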
Being able to spot nonlinearities can be the difference between “this isn’t a promising treatment” and “miracle medicine”. Next we’ll dive into more sophisticated ways to see through data to the relationships we care about.