Just recently, links to an essay entitled "20 Questions to Detect Fake Data Scientists" were making the rounds in some of the forums that I frequent. Following the 20 questions suggested by the author was this quote: "A 'real' data scientist knows how to apply mathematics, statistics, how to build and validate models using proper experimental designs. Having IT skills without statistics skills makes you a data scientist as much as it makes you a surgeon to know how to build a scalpel." The question that immediately arises in my mind in response to this statement is "What are 'IT skills', exactly?"
Of course, there are many definitions of "data scientist". However, the best definition presented to date is probably the one from Josh Wills, Director of Data Science at Cloudera, a definition I continue to come across: "Person who is better at statistics than any software engineer and better at software engineering than any statistician." While the reality is likely a bit more complex, this is a clean definition. It emphasizes the need for the skills of both the software engineer and the statistician within a single individual, and it speaks to the relative scarcity of individuals who can fill this need.
For reader convenience, the original 20 questions from the essay cited above are reproduced below.
1. Explain what regularization is and why it is useful.
2. Which data scientists do you admire most? which startups?
3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression.
4. Explain what precision and recall are. How do they relate to the ROC curve?
5. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
6. What is root cause analysis?
7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
8. What is statistical power?
9. Explain what resampling methods are and why they are useful. Also explain their limitations.
10. Is it better to have too many false positives, or too many false negatives? Explain.
11. What is selection bias, why is it important and how can you avoid it?
12. Give an example of how you would use experimental design to answer a question about user behavior.
13. What is the difference between “long” and “wide” format data?
14. What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?
15. Explain Edward Tufte’s concept of “chart junk.”
16. How would you screen for outliers and what should you do if you find one?
17. How would you use either the extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
18. What is a recommendation engine? How does it work?
19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?
So why were no software engineering questions included in this list? The following is my counterpoint list of software engineering questions to detect fake data scientists. Remember, if all we are talking about is statistics (and related subjects), the skill set is essentially the same as a statistician's, and there is no point in defining a new breed of individual. To make it interesting (and to have some fun), I initially tried to formulate software engineering questions which are either similar in nature to the original questions or constrained to their particular subject areas (even for giveaway questions such as #2, questions not directly related to data science such as #6, redundant questions such as #10 and #19, and relatively easy questions such as #15 and #20), although I did end up cheating in a number of cases.
1. As software engineers, we often need to balance short-term and long-term needs, and to distinguish between needs and wants. Tell me about the philosophy you used with a client to address this subject, and how you approached implementation of related decisions.
2. What types of technology-minded meetups do you generally attend?
3. How do you test code which you have written?
4. What is your philosophy with regard to project planning? Tell me about a time when results did not meet your expectations, and the factors which led to these results.
5. When you choose to improve code written by a colleague (or an open source committer), what general guidelines do you follow? Tell me about a recent scenario in which you were able to improve previously written code not written by yourself. To which "-ities" did this improvement contribute?
6. Since we are having this discussion, you already realize that some level of hands-on programming is essential as a data scientist. How else do you spend your time on a given project?
7. Tell me about the development processes you have used on projects, and how the process steps change with regard to the type of project.
8. What is premature optimization?
9. With which database models have you worked? With which database products were these database models associated? What effect did interfacing with these database products have on your development work?
10. Let's go to the whiteboard and walk through the application architecture of one of your current projects. Don't worry about the notation you use.
11. Some developers describe themselves as "full-stack" developers. What does "full-stack" mean to you in the context of being a data scientist?
12. Tell me about your experiences with user interface design. What is the difference between UI and UX?
13. Many software developers do not value data because they take it as a given as being available, think of it as "the easy part", do not understand it, or do not wish to work on it. Tell me how you view data within the context of an organization, from the perspective of both the software engineer and statistician. Who do you expect to do the bulk of the data work prior to analyses?
14. Provide an example of a time when you made a technology choice for an application architecture you designed, which you determined was the wrong decision during implementation. How did you rectify this decision?
15. In thinking about the code base for a software application, what do you view as the equivalent of "chartjunk"? How have you prevented it in your code? (Note: the original data scientist question #15 incorrectly referred to "chart junk".)
16. Are there any areas of software engineering in which you feel you are especially adept? How do you balance these areas with other tasks that are needed, especially those which you do not particularly enjoy?
17. Let's go back to the whiteboard and talk about the data persistence layer in the application architecture you drew. What were your options? Now walk me through some portion of the data model with which you are intimately familiar, and discuss the change that it might need to undergo over time. To what degree does the chosen database model support change?
18. Risks are inherent in the software development process. Provide some examples of how you mitigated risk on a recent project.
19. Which programming languages do you currently use in your daily work? Why? How has your use of languages changed over the course of your career?
20. Tell me about the development tools you use on a regular basis. Do you generally prefer commercial or open source tooling? Why?
Are these fair questions on the subject of software engineering? I think so. Are these questions that I would really ask? Perhaps. My interviews tend to be practical in nature, and are intended to get a feel for candidate thought processes. While some of these questions are practical, others are theoretical, and should only be asked if the intention is to provide seeds for practical follow-up questions. In other words, the point is not whether the candidate can give a textbook answer they came across during coursework, but whether the candidate can project confidence that they are able to succeed in the workplace. All too often, these two areas of questioning are thought to be one and the same.
Since the "data scientist", for all intents and purposes, is a specialist, regardless of definition, not all of these generalized software engineering questions will apply in every scenario. And it can be effectively argued that some subset of these questions will rarely apply, partly because data science itself has specializations, and partly because the scope of responsibilities for a given individual may be broader or narrower depending on project team makeup and on whether the "data science" portion of the project is only a small area of a broader effort.
For example, projects purely focused on software development after go-live of a system might include team members who are solely "coders", and only want to "code", while other projects (such as new development) might consist of individuals with diverse skillsets who are able to provide broader services to the team. In researching the data space, it does not take long to discover that significant data work exists outside activities such as data exploration and data modeling. The questions that remain (in reference to the above software engineering questions #6, #13, and #16) are who is going to do this work, and whether it matters what role names are used for these individuals. Another blog post I came across recently urges "data scientists" to demand that prospective employers provide all data in ready-to-go format: cleaned, organized, and prepared for immediate analysis. Good luck.
Counterpoint: "A doctor's note for virulent consultants"