Three Simple Things About Regression That Every Data Scientist Should Know

Understanding these three things will improve how you go about linear and generalized linear modeling

Keith McNulty

--

I consider myself more of a mathematician than a data scientist. I can’t bring myself to execute methods blindly, with no understanding of what’s going on under the hood. I have to get deep into the math to trust the results. That’s a good thing because it’s very easy nowadays to just run models and go home.

A model is only as good as your understanding of it, and I worry that a lot of people are running models and just accepting the first thing that comes out of them. When it comes to regression modeling — one of the most common forms of modeling out there — you’ll be a better data scientist if you can understand a few simple things about how these models work and why they are set up the way they are.

1. You are predicting an average — not an actual value

When you run a regression model, usually you are finding a relationship between the input variables and some sort of mean value related to the outcome. Let’s look at linear regression. When we run a linear regression we are making two very important assumptions about our outcome variable y:

  1. That the possible values of y for any given input variables are distributed around a mean.
  2. That the mean of y has an additive relationship with the input variables. That is, to get the mean of y you add up some numbers that depend on each input variable.
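Written out, these two assumptions might look like the following. The symbols are my own shorthand rather than anything fixed: p input variables x_1 to x_p, coefficients β_0 to β_p, and an error term ε that averages to zero.

```latex
\begin{align}
  y &= \mu(x_1, \dots, x_p) + \varepsilon,
    \qquad \mathbb{E}[\varepsilon \mid x_1, \dots, x_p] = 0 \\
  \mu(x_1, \dots, x_p) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
\end{align}
```

The first line says that y scatters around a mean μ. The second says that the mean is built by adding up one term per input variable.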

When you use your model to make predictions, the predicted (or modeled) value of y for a given set of input values is an estimate of the mean of all the possible values y could take. Any individual observation can fall well above or below that mean. Therefore, in communicating the results of your model, you should always take care to make this uncertainty clear.
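To see this on a small scale, here is a minimal sketch using simulated data and only the Python standard library. The "true" intercept, slope and noise level are made-up values for illustration, and `fit_simple_ols` is a hypothetical helper, not a library function. The fitted prediction at a given input lands close to the average of the outcomes observed there, while the individual outcomes are spread widely around it.

```python
import random
import statistics

random.seed(42)

# Hypothetical "true" relationship, chosen only for illustration.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 2.0, 3.0, 1.5

# Simulate 200 noisy observations at each of five input values.
xs, ys = [], []
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    for _ in range(200):
        xs.append(x)
        ys.append(TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, NOISE_SD))


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_simple_ols(xs, ys)

# Compare the model's prediction at x0 with what was actually observed there.
x0 = 3.0
predicted = intercept + slope * x0
observed_at_x0 = [y for x, y in zip(xs, ys) if x == x0]
mean_at_x0 = statistics.fmean(observed_at_x0)

print(f"model prediction at x = {x0}: {predicted:.2f}")
print(f"average of observed y at x = {x0}: {mean_at_x0:.2f}")
print(f"observed y at x = {x0} ranges from "
      f"{min(observed_at_x0):.2f} to {max(observed_at_x0):.2f}")
```

The first two printed numbers should be close to each other, while the range on the last line shows how far a single observation can sit from the predicted mean.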

One way to do this is to use the prediction interval, which assumes a normal distribution of y around the modeled mean. Note that this is different from the confidence interval that your model often produces: that is simply an interval of uncertainty around the estimated mean, and so is usually much narrower than the prediction interval. In the chart below, I show a fitted linear regression for the day of the first cherry blossom bloom in Japan, related to the average temperature in March…
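The cherry blossom data is not reproduced here, so as a rough sketch of the difference between the two intervals, the following uses synthetic data and the textbook formulas for a one-variable linear regression. All the numbers are made up. To stay within the standard library, it uses a normal quantile as a large-sample stand-in for the t quantile that would normally be used. The `intervals` function is a hypothetical helper.

```python
import math
import random
import statistics

random.seed(1)

# Synthetic data (all values hypothetical): y falls by 0.8 per unit of x.
n = 500
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [5.0 - 0.8 * x + random.gauss(0, 2.0) for x in xs]

# Least-squares fit of a straight line.
x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar

# Residual standard error, with n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Assumption: normal quantile used in place of the t quantile (fine for large n).
z = statistics.NormalDist().inv_cdf(0.975)


def intervals(x0):
    """Return (fitted mean, 95% confidence interval, 95% prediction interval)."""
    fit = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)      # uncertainty in the estimated mean
    se_pred = s * math.sqrt(1 + leverage)  # adds the scatter of y itself
    ci = (fit - z * se_mean, fit + z * se_mean)
    pi = (fit - z * se_pred, fit + z * se_pred)
    return fit, ci, pi


fit, ci, pi = intervals(6.0)
print(f"fitted mean at x = 6: {fit:.2f}")
print(f"95% confidence interval: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% prediction interval: {pi[0]:.2f} to {pi[1]:.2f}")

# Share of the observed points that fall inside their own prediction interval.
inside = 0
for x, y in zip(xs, ys):
    _, _, (lo, hi) = intervals(x)
    inside += lo <= y <= hi
coverage = inside / n
print(f"share of observations inside the prediction interval: {coverage:.1%}")
```

The only difference between the two standard errors is the extra `1 +` under the square root, which accounts for the spread of individual values around the mean. That term does not shrink as the sample grows, which is why the prediction interval stays wide even when the mean is estimated very precisely.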

--

Pure and Applied Mathematician. LinkedIn Top Voice in Tech. Expert and Author in Data Science and Statistics. Find me on LinkedIn, Twitter or keithmcnulty.org