Three Common Hypothesis Tests All Data Scientists Should Know
Hypothesis testing is one of the most fundamental elements of inferential statistics. In modern languages like Python and R, these tests are easy to conduct, often with a single line of code. But it never fails to puzzle me how few people use them or understand how they work. In this article I want to use an example to show three common hypothesis tests and how they work under the hood, as well as how to run them in R and Python and how to interpret the results.
The general principles and process of hypothesis testing
Hypothesis testing exists because it is almost never the case that we can observe an entire population when trying to make a conclusion or inference about it. Almost always, we are trying to make that inference on the basis of a sample of data from that population.
Given that we only ever have a sample, we can never be 100% certain about the inference we want to make. We can be 90%, 95%, 99%, 99.999% certain, but never 100%.
Hypothesis testing is essentially about calculating how certain we can be about an inference based on our sample. The most common process for calculating this has several steps:
- Assume the inference is not true on the population — this is called the null hypothesis
- Calculate the statistic of the inference on the sample
- Understand the expected distribution of the sampling error around that statistic
- Use that distribution to calculate the probability of observing a sample statistic at least as extreme as yours if the null hypothesis were true. This probability is known as the p-value
- Use a chosen probability cutoff, known as alpha, to make a binary decision on whether to reject the null hypothesis or fail to reject it. The most commonly used value of alpha is 0.05. That is, we usually reject a null hypothesis if a sample statistic at least as extreme as ours would be expected less than 1 time in 20 were that hypothesis true.
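The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not one of the three tests discussed in this article: it assumes samples large enough that the sampling error of a difference in means is approximately normal (a z-test), the function name two_sample_z_test is hypothetical, and the data is synthetic rather than taken from the salespeople data set.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def two_sample_z_test(a, b, alpha=0.05):
    """Large-sample (normal approximation) test for a difference in means.

    Hypothetical helper written for illustration only.
    """
    # Step 1: the null hypothesis is that the population means are equal,
    # i.e. the true difference in means is zero.
    # Step 2: calculate the statistic of interest on the sample.
    diff = mean(a) - mean(b)
    # Step 3: the standard error describes the expected sampling error
    # around that statistic (assumed approximately normal here).
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = diff / se
    # Step 4: two-sided p-value, the probability of a difference at least
    # this extreme if the null hypothesis were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: compare against alpha to make the binary decision.
    reject_null = p_value < alpha
    return diff, p_value, reject_null


# Synthetic data for illustration only.
rng = random.Random(42)
group_a = [rng.gauss(500, 100) for _ in range(200)]
group_b = [rng.gauss(450, 100) for _ in range(200)]

diff, p_value, reject_null = two_sample_z_test(group_a, group_b)
print(f"difference in means: {diff:.2f}")
print(f"p-value: {p_value:.4g}")
print(f"reject null at alpha = 0.05: {reject_null}")
```

For small samples the normal approximation is too optimistic, and a t-distribution would normally be used for the sampling error instead.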
The salespeople data set
To illustrate some common hypothesis tests in this article I will use the salespeople data set, which can be obtained here. Let’s download it in R and take a quick look at the first few rows.
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"