When interviewing for a data scientist position, the interviewer has to learn a lot about the candidate quickly. An ideal data scientist’s skill set spans math, statistics, programming, databases, and business expertise. The interviewer needs to probe each of these areas and also wants to find out whether the candidate is smart and gets things done on their own. The interviewer typically has only an hour to get a feel for how a candidate fares in each of those areas (followed a week or two later by a case study if the interviewer thinks the candidate has potential). As you can imagine, this makes for an intense interview.
When I first started interviewing candidates, I had no agenda. I would just talk to candidates for an hour and see where the conversation flowed. These interviews crashed and burned, so I made a standardized interview itinerary. Having a consistent agenda has been the single best thing for my interview process, so if you take anything from this article, take that. My agenda is:
Introduction
I give them an overview of our company and my honest pitch on why our analytics team is great. I ask them about themselves to get a feel for their background and ask them what they are looking for in a job. If I do this right, the candidate is excited for the rest of the interview.
Description of a Recent Problem
I want to hear about a project they’ve worked on recently. I ask them how the project started, how they determined it was worth the time and effort, what their process was, and what results they got. I also ask them what they learned from the project. I gain a lot from the answers to this question: whether they can tell a narrative, how the problem related to the bigger picture, and how they tackled the hard work of doing something.
Technical Questions
I check if they have skills in statistics, databases, programming, and interpreting data. The rest of the article will be about this, so get ready.
Ask Them for Questions
There are so many possible questions a candidate could ask that having none is a warning sign that they don’t think before they act. Not having questions could also be a sign that they don’t want the job (you can’t win ’em all).
An interview has one purpose: to see if this person will be successful in the role you’re offering. From a technical standpoint, that means checking they have the prerequisite knowledge for the job. But people are dynamic creatures who learn and grow, and if a person is missing knowledge they can go read Stack Overflow and pick it up. So a technical interview shouldn’t be a test of exactly how much they know on a topic from memory. My interview questions are guided by three principles:
No trick questions or tests of cleverness. No question should require a candidate to get to an “a-ha” during the interview. You should only test them on things they should feel comfortable answering with their existing knowledge. A brain teaser question like “suppose you have a stack of pancakes in a random size order, how would you take a spatula and order them in the minimum number of moves?” doesn’t relate to what doing data science is in practice. Further, many strong candidates may not be able to answer this in seconds, because they need time to think and process. I secretly think these sorts of questions generally are there to make the interviewer feel clever for knowing the answer.
Only test on the basics. Given a particular area, I only ask questions at an introductory level. For instance, if asking a question about machine learning models, I would only ask about linear and logistic regressions and avoid more advanced topics like Random Forests or boosting. The reason is that if they understand the basics, they should be able to pick up the advanced topics on the job. Further, as you get into more advanced topics, there is a higher likelihood that the candidate just never happened to deal with that topic. If they do know a lot of advanced material, that will likely be noticeable in how they answer the basic questions.
Keep questions a discussion. I try to avoid questions where there is a single right answer because the only information you get is if the candidate knows that exact answer or not. For instance, instead of asking “what is the difference between a left join and an inner join” I would ask “what are joins in SQL” and if they give a decent answer I would ask “what are some different types of joins?” The candidate may come up with a more interesting answer and I can probe into times they’ve had tricky joins to do.
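For readers who want a refresher on the join example above, here is a minimal sketch of the difference between an inner join and a left join, run through Python's built-in `sqlite3` module. The `students` and `scores` tables, their columns, and the rows are all made up for illustration.

```python
import sqlite3

# Hypothetical tables: three students, but only two of them have a score.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT);
    CREATE TABLE scores (student_id INTEGER, score INTEGER);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cam');
    INSERT INTO scores VALUES (1, 91), (2, 85);
""")

# INNER JOIN keeps only students that have a matching row in scores.
inner_rows = conn.execute("""
    SELECT s.name, sc.score
    FROM students AS s
    INNER JOIN scores AS sc ON sc.student_id = s.id
    ORDER BY s.id
""").fetchall()

# LEFT JOIN keeps every student; missing scores come back as NULL (None).
left_rows = conn.execute("""
    SELECT s.name, sc.score
    FROM students AS s
    LEFT JOIN scores AS sc ON sc.student_id = s.id
    ORDER BY s.id
""").fetchall()

print(inner_rows)
print(left_rows)
```

A candidate who can describe why the student without a score disappears from the first result but not the second has answered the "single right answer" version of the question without being asked it directly.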
For each area, I first ask them about their familiarity with the topic. If they say they don’t have much, I skip it. I want to avoid having the candidate feel overwhelmed or frustrated by that topic, as that could jeopardize the rest of the interview.
Without further ado, here is the set of technical questions I ask and why:
- How would you explain a linear regression to a business executive?
This question tests if they have a good mental model of what a linear regression is and if they can explain it in non-technical terms. The common way people mess it up is by being too technical: “suppose we have normally distributed errors in our dependent variables.” I’m looking for an answer that sounds something like “it’s a way of predicting a value as being proportional to some other values” along with a simple example.
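As one possible version of that simple example, the sketch below fits a straight line to a handful of made-up numbers (monthly ad spend versus units sold) using the standard least-squares formulas. The data and the `fit_line` helper are hypothetical and exist only to make the "proportional to some other value" idea concrete.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: ad spend in thousands of dollars, and units sold that month.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 14, 16, 18, 20]

slope, intercept = fit_line(ad_spend, units_sold)
print(f"Each extra $1k of ad spend goes with about {slope:.1f} more units sold.")
print(f"Predicted units sold at $6k of spend: {slope * 6 + intercept:.1f}")
```

The executive-friendly explanation is the first printed sentence: the model boils the relationship down to "one more of this tends to come with that much more of that."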
- What are some alternative models to a linear regression? Why are they better or worse?
If they understand data science, they should be able to explain other models and why they are better or worse (Random Forest, SVM, Neural Nets, whatever). I don’t care about the number of models they know, just that their explanations are well thought out.
- Given a table of student names, classes, and grades, write a SQL query to create a table that shows, for each class, the value of the highest grade in the class.
If the person has any familiarity with SQL, they should recognize that they need to use a GROUP BY over Class and MAX(Grade). If they mess up the syntax a bit, that’s fine; I’m confident they can pick it back up on the job. Even if they totally bomb the code but still mention aggregating by class, I consider it a pass (but I won’t ask them the harder follow-up question).
To me, this question feels comically easy, and yet half of the people who list SQL on their resume are unable to do it.
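The GROUP BY answer described above can be sketched as follows, using Python's built-in `sqlite3` module so the query can actually be run. The table name `grades`, the `Student` column, and the rows are assumptions made for illustration; only `Class` and `Grade` come from the question itself.

```python
import sqlite3

# Hypothetical table and rows; only the Class and Grade columns are from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (Student TEXT, Class TEXT, Grade INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [
        ("Ana", "Math", 91),
        ("Ben", "Math", 85),
        ("Cam", "History", 78),
        ("Dee", "History", 88),
    ],
)

# One output row per class, holding the highest grade in that class.
top_grades = conn.execute("""
    SELECT Class, MAX(Grade) AS TopGrade
    FROM grades
    GROUP BY Class
    ORDER BY Class
""").fetchall()

print(top_grades)
```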
- Suppose I had the same table as the previous question, but instead for each class, I want to find the name of the student who got the highest grade. Write a query to do that.
This question seems like it should be as easy as the previous one, but once you start working on it, it turns out to be more complicated. The solution requires either joining a temporary table or using a subquery. A particularly astute interviewee will notice the question doesn’t tell you what to do in the case of ties for the highest grade.
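One way to write the subquery version is sketched below, again through Python's `sqlite3` module. The table name `grades`, the `Student` column, and the rows are hypothetical. This sketch makes its own choice about ties, which the question leaves open: it returns every student who shares the top grade.

```python
import sqlite3

# Hypothetical table and rows; Math has a tie for the highest grade.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (Student TEXT, Class TEXT, Grade INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [
        ("Ana", "Math", 91),
        ("Ben", "Math", 91),
        ("Cy", "Math", 80),
        ("Dee", "History", 88),
        ("Eli", "History", 70),
    ],
)

# The subquery finds the highest grade per class; joining back to the
# original table recovers the student(s) who earned that grade.
top_students = conn.execute("""
    SELECT g.Class, g.Student, g.Grade
    FROM grades AS g
    JOIN (
        SELECT Class, MAX(Grade) AS TopGrade
        FROM grades
        GROUP BY Class
    ) AS best
      ON g.Class = best.Class AND g.Grade = best.TopGrade
    ORDER BY g.Class, g.Student
""").fetchall()

print(top_students)
```

If exactly one student per class were required, the query would need an extra tie-breaking rule, which is a good follow-up to discuss with the candidate.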
- In pseudo-code or whatever language you would like: write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.
Yes, the classic FizzBuzz, known as the question that any capable software developer should be able to answer. You can find lots of writing on it, including a great enterprise version. Since data scientists tend to be weaker programmers than software developers, this question is a good fit for a data science interview.
When a person is working on the problem, I listen to how they reason. If they make statements like “hmm, well I should probably look at each number so I’ll use a for loop” or “I’ll handle this by using an if-else statement. Oh! I guess I need to check for FizzBuzz first since both the other conditions apply,” that’s a great sign. Provided they stumble to any working answer, I consider this an easy pass.
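For reference, one working answer in Python might look like the sketch below. It is only one of many acceptable solutions; the `fizzbuzz` helper name is my own, and the order of the checks reflects the point about testing for FizzBuzz first.

```python
def fizzbuzz(n):
    """Return the FizzBuzz output for a single number."""
    # Check the combined case first, since multiples of 15 also
    # satisfy the separate "multiple of 3" and "multiple of 5" tests.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# Print the results for 1 through 100, one per line.
for i in range(1, 101):
    print(fizzbuzz(i))
```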