Five Massively Misused Words in Data Science

Watch your language if you want to have impact

Keith McNulty

Photo by Julien L on Unsplash

Data science is still a young discipline, which makes it an exciting field to work in. But that youth also creates problems. Today I want to talk about one that I deal with all the time: using the wrong language to describe data science results or concepts.

Here are five words that I commonly see misused, as well as an explanation of the typical misuses. Hopefully, this will help you become more aware of booby traps in the communication and implementation of data science results.

1. Predictive

OMG, people LOVE the word predictive, don’t they? Since around 2010, when it started to come into fashion, I don’t think I have heard a word get bandied about like the p-word. The biggest misuse I have seen is when it is used to describe any positive result for any variable in any model. Variable x is significant in a linear model, therefore variable x is predictive. That’s quite a jump to make.

A variable with a significant effect in a fitted statistical model has only been shown to describe the training sample. Even there, its effect might be so minuscule as to be practically irrelevant, so calling it ‘predictive’ can misrepresent reality. Describing a model or variable as predictive in real life requires testing it on new data. Only the other day I heard someone describe the results of a logistic regression model I ran as ‘predictive’ when I had not done any train-test split and wasn’t even trying to determine predictiveness. As a rule, never describe a variable or a model as predictive unless you have used a held-out test sample to verify a predictive effect.
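To make the rule concrete, here is a minimal sketch of a held-out check, using only the Python standard library. Everything in it is an illustrative assumption rather than anything from a real analysis: the simulated data, the weak true slope of 0.1, the 70/30 split, and the function names are all made up for the example. The model is fitted on the training rows only and then scored on both splits.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y, intercept, slope):
    """1 - SSE/SST for the given line, evaluated on the sample (x, y).

    On data the line was not fitted to, this can be negative.
    """
    mean_y = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def train_and_test_r_squared(n=200, true_slope=0.1, test_fraction=0.3, seed=42):
    """Simulate a weak effect, fit on training rows only, score both splits."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, 1) for xi in x]

    # Shuffle row indices and hold out a test sample the fit never sees.
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(n * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]

    intercept, slope = fit_simple_ols(x_train, y_train)
    return {
        "train_r2": r_squared(x_train, y_train, intercept, slope),
        "test_r2": r_squared(x_test, y_test, intercept, slope),
    }


if __name__ == "__main__":
    scores = train_and_test_r_squared()
    print(f"in-sample R-squared:     {scores['train_r2']:.3f}")
    print(f"out-of-sample R-squared: {scores['test_r2']:.3f}")
```

Only the out-of-sample number says anything about predictiveness. The in-sample number describes how well the line fits the data it was built from, which is a different claim.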

2. R-squared

R-squared as a generally accepted measure of model quality only exists for a very small class of linear, additive models. But I often hear people describing probabilistic models or classifiers as having ‘a high R-squared’. What does that even mean? Even for simple generalized linear models there are multiple ways of defining the overall model quality. There are at least five different types of “pseudo-R-squared” metrics…
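As a rough illustration of why ‘a high R-squared’ is ambiguous for a classifier, the sketch below computes three commonly cited pseudo-R-squared definitions (McFadden, Cox and Snell, and Nagelkerke) from the same set of predicted probabilities. These three are my choice for the example and are not necessarily the ones the text has in mind. The function names are hypothetical, and the code assumes binary 0/1 outcomes with both classes present and predicted probabilities strictly between 0 and 1.

```python
import math


def bernoulli_log_likelihood(y, p):
    """Log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1 - pi)
        for yi, pi in zip(y, p)
    )


def pseudo_r_squared(y, p):
    """Return three pseudo-R-squared values for the same predictions.

    The null model predicts the overall base rate for every row.
    Assumes y contains both 0s and 1s and every p is strictly in (0, 1).
    """
    n = len(y)
    base_rate = sum(y) / n
    ll_model = bernoulli_log_likelihood(y, p)
    ll_null = bernoulli_log_likelihood(y, [base_rate] * n)

    mcfadden = 1 - ll_model / ll_null
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1 - math.exp(2 * ll_null / n))

    return {
        "mcfadden": mcfadden,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
    }


if __name__ == "__main__":
    outcomes = [0, 0, 1, 1]
    predicted = [0.2, 0.3, 0.7, 0.8]  # made-up probabilities for illustration
    for name, value in pseudo_r_squared(outcomes, predicted).items():
        print(f"{name}: {value:.3f}")
```

The same predictions produce three different numbers, so quoting one of them as ‘the R-squared’ without naming the definition tells the listener very little.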

Keith McNulty

Pure and Applied Mathematician. LinkedIn Top Voice in Tech. Expert and Author in Data Science and Statistics. Find me on LinkedIn, Twitter or keithmcnulty.org