Counterpoint (7 Years Later) - 20 Questions to Detect Fake Data Scientists
Erik on Software anniversary post #500.
Almost exactly seven years ago, I published what was for a long time one of my most popular posts: a counterpoint to a KDNuggets essay entitled "20 Questions to Detect Fake Data Scientists". The original essay was making the rounds in some of the forums I frequented at the time, and it was rubbing me the wrong way. The author had suggested 20 questions intended to help separate the real from the fake when it comes to data science, and the commentary that accompanied the list didn't seem to add any clarity.
In particular, they had made the following statement: "A 'real' data scientist knows how to apply mathematics, statistics, how to build and validate models using proper experimental designs. Having IT skills without statistics skills makes you a data scientist as much as it makes you a surgeon to know how to build a scalpel." My response to this statement hasn't changed: "What are 'IT skills', exactly?"
While many definitions of "data scientist" still exist, I commented at the time that the best definition was probably the one which originated with Josh Wills, Director of Data Science at Cloudera: "Person who is better at statistics than any software engineer and better at software engineering than any statistician." This definition is clean, emphasizing a combined need for the skills of both the software engineer and the statistician within a single individual, while also speaking to the relative scarcity of individuals who can fill this need.
Interestingly, KDNuggets published a follow-up to the original essay about a month ago, but rather than taking on the challenge of revisiting the relevance of the editors' original questions and answers, the follow-up submitted the questions to ChatGPT, a chat-based interface which currently makes use of the GPT-3.5 large language model from OpenAI. If you're already familiar with my original post, you'll know that the intention of my (somewhat) tongue-in-cheek counterpoint was to tackle the software engineering side of data science. So of course I also felt compelled to submit to ChatGPT my counterpoint list of software engineering questions to detect fake data scientists. And what's this? The process arguably made the value of my questions more pronounced.