120+ Data Scientist Interview Questions and Answers You Should Know in 2021

Original article was published by Terence S on Artificial Intelligence on Medium

Statistics, Probability, and Mathematics

Q: What is the p-value defined as?

The pvalue is the probability of obtaining the observed results of a test, assuming that the null hypothesis is correct; a smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

Q: What are covariance and correlation? How are they related?

Covariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other from its mean.

Correlation is a measurement of the relationship between two variables. It is the covariance of the two variables, normalized by the variance of each variable.

Q: What is the law of large numbers?

The Law of Large Numbers is a theory that states that as the number of trials increases, the average of the result will become closer to the expected value.

Eg. flipping heads from fair coin 100,000 times should be closer to 0.5 than 100 times.

Q: What is the Central Limit Theorem? Explain it. Why is it important?

The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.

The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.

Q: What is the Markov Property?

When modeling a stochastic process, one in which an agent makes random decisions over time, such an assumption is referred to as the Markov property.

Q: What is statistical power?

‘Statistical power’ refers to the power of a binary hypothesis, which is the probability that the test rejects the null hypothesis given that the alternative hypothesis is true.

Q: What are confounding variables?

A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.

Q: How does experimental data contrast with observational data?

Observational data comes from observational studies which are when you observe certain variables without intervening and try to determine if there is any correlation.

Experimental data comes from experimental studies (with intervention) which are when you control certain variables and hold them constant to determine if there is any causality.

Q: Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

Selection bias is the phenomenon of selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

  • Sampling bias: a biased sample caused by non-random sampling
  • Time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
  • Exposure: includes clinical susceptibility bias, protopathic bias, indication bias. Read more here.
  • Data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
  • Attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where those that ‘failed’ are only included
  • Observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you adding bias in the sense that you’re assuming that the data is not as spread out as it might actually be.

Q: What is the difference between interpolation and extrapolation and why does it matter?

Interpolation is when a prediction is made using inputs that lie within the set of observed values.

Extrapolation is when a prediction is made using an input that’s outside of the set of observed values.

It’s important to know the distinction because interpolations are generally more accurate than extrapolations.

Q: Give an example where the median is a better measure than the mean

When there are a number of outliers that positively or negatively skew the data.

Q: What is survivorship bias?

The phenomenon where only those that ‘survived’ a long process are included or excluded in an analysis, thus creating a biased sample.

A great example provided by Sreenivasan Chandrasekar is the following:

“We enroll for gym membership and attend for a few days. We see the same faces of many people who are fit, motivated and exercising everyday whenever we go to gym. After a few days we become depressed why we aren’t able to stick to our schedule and motivation more than a week when most of the people who we saw at gym could. What we didn’t see was that many of the people who had enrolled for gym membership had also stopped turning up for gym just after a week and we didn’t see them.”

Q: Walk me through the probability fundamentals

Eight rules of probability

  • Rule #1: For any event A, 0 ≤ P(A) ≤ 1; in other words, the probability of an event can range from 0 to 1.
  • Rule #2: The sum of the probabilities of all possible outcomes always equals 1.
  • Rule #3: P(not A) = 1 — P(A); This rule explains the relationship between the probability of an event and its complement event. A complement event is one that includes all possible outcomes that aren’t in A.
  • Rule #4: If A and B are disjoint events (mutually exclusive), then P(A or B) = P(A) + P(B); this is called the addition rule for disjoint events
  • Rule #5: P(A or B) = P(A) + P(B) — P(A and B); this is called the general addition rule.
  • Rule #6: If A and B are two independent events, then P(A and B) = P(A) * P(B); this is called the multiplication rule for independent events.
  • Rule #7: The conditional probability of event B given event A is P(B|A) = P(A and B) / P(A)
  • Rule #8: For any two events A and B, P(A and B) = P(A) * P(B|A); this is called the general multiplication rule

Counting Methods

Factorial Formula: n! = n x (n -1) x (n — 2) x … x 2 x 1
Use when the number of items is equal to the number of places available.
Eg. Find the total number of ways 5 people can sit in 5 empty seats.
= 5 x 4 x 3 x 2 x 1 = 120

Fundamental Counting Principle (multiplication)
This method should be used when repetitions are allowed and the number of ways to fill an open place is not affected by previous fills.
Eg. There are 3 types of breakfasts, 4 types of lunches, and 5 types of desserts. The total number of combinations is = 5 x 4 x 3 = 60

Permutations: P(n,r)= n! / (n−r)!
This method is used when replacements are not allowed and order of item ranking matters.
Eg. A code has 4 digits in a particular order and the digits range from 0 to 9. How many permutations are there if one digit can only be used once?
P(n,r) = 10!/(10–4)! = (10x9x8x7x6x5x4x3x2x1)/(6x5x4x3x2x1) = 5040

Combinations Formula: C(n,r)=(n!)/[(n−r)!r!]
This is used when replacements are not allowed and the order in which items are ranked does not matter.
Eg. To win the lottery, you must select the 5 correct numbers in any order from 1 to 52. What is the number of possible combinations?
C(n,r) = 52! / (52–5)!5! = 2,598,960

Q: What is root cause analysis? How can you identify a cause vs. a correlation? Give examples.

Root cause analysis is a method of problem-solving used for identifying the root cause(s) of a problem

You can identify a correlation using simple data analyses. You can then identify causation by conducting an experiment so that all other variables are isolated (ideally).

Q: You are at a Casino and have two dices to play with. You win $10 every time you roll a 5. If you play till you win and then stop, what is the expected payout?

Image created by Author

Let’s assume that it costs $5 every time you want to play.

There are 36 possible combinations with two dice. Of the 36 combinations, there are 4 combinations that result in rolling a five (see blue). This means that there is a 4/36 or 1/9 chance of rolling a 5.

A 1/9 chance of winning means you’ll lose eight times and win once (theoretically).

Therefore, your expected payout is equal to $10.00 * 1 — $5.00 * 9= -$35.00.

Q: Give me 3 types of statistical biases and explain each of them with an example.

  • Sampling bias refers to a biased sample caused by non-random sampling.
    To give an example, imagine that there are 10 people in a room and you ask if they prefer grapes or bananas. If you only surveyed the three females and concluded that the majority of people like grapes, you’d have demonstrated sampling bias.
Image created by Author, Icons provided by Freepik
  • Confirmation bias: the tendency to favour information that confirms one’s beliefs.
  • Survivorship bias: the phenomenon where only those that ‘survived’ a long process are included or excluded in an analysis, thus creating a biased sample.

Q: Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.

3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).

It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with some machine learning techniques with the assumption that the data is normally distributed.

Q: What is A/B testing? When is it used in practice?

A/B Testing is a statistical hypothesis testing meant for a randomized experiment with two variables, A and B. It is commonly used in product development and marketing.

Q: How do you control for biases?

There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.

Q: Given two fair dices, what is the probability of getting scores that sum to 4? to 8?

There are 4 combinations of rolling a 4 (1+3, 3+1, 2+2):
P(rolling a 4) = 3/36 = 1/12

There are combinations of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(rolling an 8) = 5/36

Q: Give examples of data that does not have a Gaussian distribution, nor log-normal.

  • Any type of categorical data won’t have a gaussian distribution or lognormal distribution.
  • Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.

Q: How do you assess the statistical significance of an insight?

You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the level of the significance (alpha) and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.

Q: The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?

Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean.

  • a 95% confidence interval implies a z score of 1.96
  • one standard deviation = sqrt(115) = 10.724

Therefore the confidence interval = 115+/- 21.45 = [93.55, 136.45]. Since 99 is within this confidence interval, we can assume that this change is not very noteworthy.

Q: How many ways can you draw 6 cards from a deck of 52 cards?


Q: If a variable has 3 different categories (vanilla, chocolate, strawberry), what is the minimum number of dummy variables required to represent it?

You would need 2 dummy variables to represent 3 different categories. For example:

  • Chocolate → x1=1, x2=0
  • Vanilla → x1=0, x2=1
  • Strawberry → x1=0, x2=0

Q: What is the difference between a boxplot and a histogram?

While boxplots and histograms are visualizations used to show the distribution of the data, they communicate information differently.

Histograms are bar charts that show the frequency of a numerical variable’s values and are used to approximate the probability distribution of the given variable. It allows you to quickly understand the shape of the distribution, the variation, and potential outliers.

Boxplots communicate different aspects of the distribution of data. While you can’t see the shape of the distribution through a box plot, you can gather other information like the quartiles, the range, and outliers. Boxplots are especially useful when you want to compare multiple charts at the same time because they take up less space than histograms.

Q: What is the meaning of ACF and PACF?

To understand ACF and PACF, you first need to know what autocorrelation or serial correlation is. Autocorrelation looks at the degree of similarity between a given time series and a lagged version of itself.

Therefore, the autocorrelation function (ACF) is a tool that is used to find patterns in the data, specifically in terms of correlations between points separated by various time lags. For example, ACF(0)=1 means that all data points are perfectly correlated with themselves and ACF(1)=0.9 means that the correlation between one point and the next one is 0.9.

The PACF is short for partial autocorrelation function. Quoting a text from StackExchange, “It can be thought as the correlation between two points that are separated by some number of periods n, but with the effect of the intervening correlations removed.” For example. If T1 is directly correlated with T2 and T2 is directly correlated with T3, it would appear that T1 is correlated with T3. PACF will remove the intervening correlation with T2.

Q: How would you design an experiment for a new feature we’re thinking about. What metrics would matter?

Image created by Author

I would conduct an A/B test to determine if the introduction of a new feature results in a statistically significant improvement in a given metric that we care about. The metric(s) chosen depends on the goal of the feature. For example, a feature may be introduced to increase conversion rates, or web traffic, or retention rates.

First I would formulate my null hypothesis (feature X will not improve metric A) and my alternative hypothesis (feature X will improve metric A).

Next, I would create my control and test group through random sampling. Because the t-test inherently considers the sample size, I’m not going to specify a necessary sample size, although the larger the better.

Once I collect my data, depending on the characteristics of my data, I’d then conduct a t-test, Welch’s t-test, chi-squared test, or a Bayesian A/B test to determine whether the differences between my control and test group are statistically significant.

Q: In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the proba­bility that you see at least one shooting star in the period of an hour?

The probability of not seeing any shooting star in 15 minutes:

= 1 — P( Seeing one shooting star )
= 1–0.2 = 0.8

The probability of not seeing any shooting star in the period of one hour:

= (0.8) ^ 4 = 0.4096

The probability of seeing at least one shooting star in the one hour:

= 1 — P( Not seeing any star )
= 1–0.4096 = 0.5904

Q: You randomly draw a coin from 100 coins — 1 unfair coin (head-head), 99 fair coins (head-tail) and roll it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?

This can be answered using the Bayes Theorem. The extended equation for the Bayes Theorem is the following:

Assume that the probability of picking the unfair coin is denoted as P(A) and the probability of flipping 10 heads in a row is denoted as P(B). Then P(B|A) is equal to 1, P(B∣¬A) is equal to 0.⁵¹⁰, and P(¬A) is equal to 0.99.

If you fill in the equation, then P(A|B) = 0.9118 or 91.18%.

Q: How can you generate a random number between 1–7 with only a die?

If you roll a die twice and consider the event of two rolls, there are 36 different outcomes. If we exclude the combination (6,6), there will be 35 possible outcomes. You can then assign 5 combinations to each number from 1 to 7.

Q: There’s a game where you are given two fair six-sided dice and asked to roll. If the sum of the values on the dice equals seven, then you win $21. However, you must pay $5 to play each time you roll both dice. Do you play this game?

The odds of rolling a 7 is 1/6.
This means that you are expected to pay $30 (5*6) to win $21.
Take these two numbers and the expected payout is -$9 (21–30).
Since the expected payout is negative, you should not play this game.

Q: We have two options for serving ads within Newsfeed. Option 1: 1 out of every 25 stories, one will be ad. Option 2: every story has a 4% chance of being an ad. For each option, what is the expected number of ads shown in 100 news stories?

The expected number of odds for both options is 4 out of 100.

For Option 1, 1/25 is equivalent to 4/100.
For Option 2, 4% of 100 is 4/100.

Q: If there are 8 marbles of equal weight and 1 marble that weighs a little bit more (for a total of 9 marbles), how many weighings are required to determine which marble is the heaviest?

Image created by author

Two weighings would be required (see part A and B above):

  1. You would split the nine marbles into three groups of three and weigh two of the groups. If the scale balances (alternative 1), you know that the heavy marble is in the third group of marbles. Otherwise, you’ll take the group that is weighed more heavily (alternative 2).
  2. Then you would exercise the same step, but you’d have three groups of one marble instead of three groups of three.

Q: The probability that item an item at location A is 0.6, and 0.8 at location B. What is the probability that item would be found on Amazon website?

We need to make some assumptions about this question before we can answer it. Let’s assume that there are two possible places to purchase a particular item on Amazon and the probability of finding it at location A is 0.6 and B is 0.8. The probability of finding the item on Amazon can be explained as so:

We can reword the above as P(A) = 0.6 and P(B) = 0.8. Furthermore, let’s assume that these are independent events, meaning that the probability of one event is not impacted by the other. We can then use the formula…

P(A or B) = P(A) + P(B) — P(A and B)
P(A or B) = 0.6 + 0.8 — (0.6*0.8)
P(A or B) = 0.92

Q: How do you prove that males are on average taller than females by knowing just gender height?

You can use hypothesis testing to prove that males are taller on average than females.

The null hypothesis would state that males and females are the same height on average, while the alternative hypothesis would state that the average height of males is greater than the average height of females.

Then you would collect a random sample of heights of males and females and use a t-test to determine if you reject the null or not.

Q: If a PM says that they want to double the number of ads in Newsfeed, how would you figure out if this is a good idea or not?

You can perform an A/B test by splitting the users into two groups: a control group with the normal number of ads and a test group with double the number of ads. Then you would choose the metric to define what a “good idea” is. For example, we can say that the null hypothesis is that doubling the number of ads will reduce the time spent on Facebook and the alternative hypothesis is that doubling the number of ads won’t have any impact on the time spent on Facebook. However, you can choose a different metric like the number of active users or the churn rate. Then you would conduct the test and determine the statistical significance of the test to reject or not reject the null.

Q: A box has 12 red cards and 12 black cards. Another box has 24 red cards and 24 black cards. You want to draw two cards at random from one of the two boxes, one card at a time. Which box has a higher probability of getting cards of the same color and why?

The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step.

Let’s say the first card you draw from each deck is a red Ace.

This means that in the deck with 12 reds and 12 blacks, there’s now 11 reds and 12 blacks. Therefore your odds of drawing another red are equal to 11/(11+12) or 11/23.

In the deck with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore your odds of drawing another red are equal to 23/(23+24) or 23/47.

Since 23/47 > 11/23, the second deck with more cards has a higher probability of getting the same two cards.

Q: How can you tell if a given coin is biased?

This isn’t a trick question. The answer is simply to perform a hypothesis test:

  1. The null hypothesis is that the coin is not biased and the probability of flipping heads should equal 50% (p=0.5). The alternative hypothesis is that the coin is biased and p != 0.5.
  2. Flip the coin 500 times.
  3. Calculate Z-score (if the sample is less than 30, you would calculate the t-statistics).
  4. Compare against alpha (two-tailed test so 0.05/2 = 0.025).
  5. If p-value > alpha, the null is not rejected and the coin is not biased.
    If p-value < alpha, the null is rejected and the coin is biased.

Q: Make an unfair coin fair

Since a coin flip is a binary outcome, you can make an unfair coin fair by flipping it twice. If you flip it twice, there are two outcomes that you can bet on: heads followed by tails or tails followed by heads.

P(heads) * P(tails) = P(tails) * P(heads)

This makes sense since each coin toss is an independent event. This means that if you get heads → heads or tails → tails, you would need to reflip the coin.

Q: You are given 40 cards with four different colors- 10 Green cards, 10 Red Cards, 10 Blue cards, and 10 Yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find out the probability that the cards picked are not of the same number and same color.

Since these events are not independent, we can use the rule:
P(A and B) = P(A) * P(B|A) ,which is also equal to
P(not A and not B) = P(not A) * P(not B | not A)

For example:

P(not 4 and not yellow) = P(not 4) * P(not yellow | not 4)
P(not 4 and not yellow) = (36/39) * (27/36)
P(not 4 and not yellow) = 0.692

Therefore, the probability that the cards picked are not the same number and the same color is 69.2%.

Q: Infection rates at a hospital above a 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.

Since we looking at the number of events (# of infections) occurring within a given timeframe, this is a Poisson distribution question.

The probability of observing k events in an interval

Null (H0): 1 infection per person-days
Alternative (H1): >1 infection per person-days

k (actual) = 10 infections
lambda (theoretical) = (1/100)*1787
p = 0.032372 or 3.2372% calculated using .poisson() in excel or ppois in R

Since p-value < alpha (assuming 5% level of significance), we reject the null and conclude that the hospital is below the standard.

Q: You roll a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?

Use the General Binomial Probability formula to answer this question:

General Binomial Probability Formula

p = 0.8
n = 5
k = 3,4,5

P(3 or more heads) = P(3 heads) + P(4 heads) + P(5 heads) = 0.94 or 94%

Q: Consider the number of people that show up at a bus station is Poisson with a mean of 2.5/h. What is the probability that at most three people show up in a four hour period?

x = 3
mean = 2.5*4 = 10

using Excel…

p = poisson.dist(3,10,true)
p = 0.010336

Q: An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e the probability he is HIV positive)?

Equation for Precision (PV)

Precision = Positive Predictive Value = PV
PV = (0.001*0.997)/[(0.001*0.997)+((1–0.001)*(1–0.985))]
PV = 0.0624 or 6.24%

Q: You are running for office and your pollster polled hundred people. Sixty of them claimed they will vote for you. Can you relax?

  • Assume that there’s only you and one other opponent.
  • Also, assume that we want a 95% confidence interval. This gives us a z-score of 1.96.
Confidence interval formula

p-hat = 60/100 = 0.6
z* = 1.96
n = 100
This gives us a confidence interval of [50.4,69.6]. Therefore, given a confidence interval of 95%, if you are okay with the worst scenario of tying then you can relax. Otherwise, you cannot relax until you got 61 out of 100 to claim yes.

Q: Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.

  • Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
  • a 95% confidence interval implies a z score of 1.96
  • one standard deviation = 10

Therefore the confidence interval = 100 +/- 19.6 = [964.8, 1435.2]

Q: The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?

  • Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
  • a 95% confidence interval implies a z score of 1.96
  • one standard deviation = sqrt(115) = 10.724

Therefore the confidence interval = 115+/- 21.45 = [93.55, 136.45]. Since 99 is within this confidence interval, we can assume that this change is not very noteworthy.

Q: Consider influenza epidemics for two-parent heterosexual families. Suppose that the probability is 17% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 12% while the probability that both the mother and father have contracted the disease is 6%. What is the probability that the mother has contracted influenza?

Using the General Addition Rule for probability:
P(mother or father) = P(mother) + P(father) — P(mother and father)
P(mother) = P(mother or father) + P(mother and father) — P(father)
P(mother) = 0.17 + 0.06–0.12
P(mother) = 0.11

Q: In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?

Confidence interval for a sample

Given a confidence level of 95% and degrees of freedom equal to 8, the t-score = 2.306

Confidence interval = 1100 +/- 2.306*(30/3)
Confidence interval = [1076.94, 1123.06]

Q: A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up — baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?

Upper bound = mean + t-score*(standard deviation/sqrt(sample size))
0 = -2 + 2.306*(s/3)
2 = 2.306 * s / 3
s = 2.601903
Therefore the standard deviation would have to be at least approximately 2.60 for the upper bound of the 95% T confidence interval to touch 0.