Three Simple Things About Regression That Every Data Scientist Should Know

Understanding these three things will improve how you go about linear and generalized linear modeling

Keith McNulty

--

I consider myself more of a mathematician than a data scientist. I can’t bring myself to execute methods blindly, with no understanding of what’s going on under the hood. I have to get deep into the math to trust the results. I think that’s a healthy habit, because it’s very easy nowadays to just run a model, take the output and go home.

A model is only as good as your understanding of it, and I worry that a lot of people are running models and just accepting the first thing that comes out of them. When it comes to regression modeling — one of the most common forms of modeling out there — you’ll be a better data scientist if you can understand a few simple things about how these models work and why they are set up the way they are.

1. You are predicting an average — not an actual value

When you run a regression model, you are usually finding a relationship between the input variables and some sort of mean value of the outcome. Let’s look at linear regression. When we run a linear regression, we are making two very important assumptions about our outcome variable y:

  1. That the possible values of y for any given input variables are distributed around a mean.
  2. That the mean of y has an…
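To make the "average, not an actual value" point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration and does not come from the article: the straight-line mean 2 + 3x, the normal noise with standard deviation 1.5, the four x values, and the variable names. It fits a line with NumPy and compares the fitted value at one x with the y values actually observed there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process (an assumption for this sketch):
# at each x, y is scattered around a mean of 2 + 3x with normal noise.
x = np.repeat(np.array([0.0, 1.0, 2.0, 3.0]), 500)
y = 2.0 + 3.0 * x + rng.normal(loc=0.0, scale=1.5, size=x.size)

# Ordinary least squares fit of a straight line.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(x, y, deg=1)

# Compare the model's prediction at x0 with the y values observed at x0.
x0 = 2.0
predicted = intercept + slope * x0
observed_at_x0 = y[x == x0]

print(f"prediction at x = {x0}: {predicted:.2f}")
print(f"mean of observed y at x = {x0}: {observed_at_x0.mean():.2f}")
print(f"std of observed y at x = {x0}: {observed_at_x0.std():.2f}")
```

In a run like this, the prediction should sit close to the average of the observed y values at x0, while the individual observations remain spread around it. The model is estimating where the mean is, not where any single observation will land.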

--

Keith McNulty

Pure and Applied Mathematician. LinkedIn Top Voice in Tech. Expert and Author in Data Science and Statistics. Find me on LinkedIn, Twitter or keithmcnulty.org