New book review for Machine Learning at Enterprise Scale: How Real Practitioners Handle Six Common Challenges, by Piero Cinquegrana and Matheen Raza, O'Reilly, 2019, reposted here:
Copy provided by Qubole.
This freely available O'Reilly text discusses six common challenges the authors have identified that data science teams must address as they scale up their machine learning operations and move to the cloud: (1) reconciling disparate interfaces, (2) resolving environment dependencies, (3) ensuring close collaboration among all machine learning stakeholders, (4) building (or renting) adequate machine learning infrastructure, (5) scaling to meet machine learning requirements, and (6) enabling smooth deployment of machine learning projects.
According to the authors, all of these challenges can be solved with cloud-native data platforms. Interestingly, the first problem defines "interfaces" as front-end development interfaces, not programmatic interfaces. The authors argue that because it is important both to let data science professionals use their preferred interfaces and to provide a common workspace in which the entire team can collaborate, a cloud-native approach is essential to supporting the entire machine learning workflow.
During their discussion of this first problem, the authors offer their own definitions of the two personas addressed in this report, data scientists and machine learning engineers, and then discuss the challenges that arise because these roles employ different tooling for tasks such as collaborating with peers, connecting to existing databases or data stores, and connecting to large-scale execution engines. While the authors present some good points, they seem a bit forgiving when it comes to the skills expected of data scientists.
That is probably because, while they comment that machine learning engineers typically have software development skills that data scientists lack, they also recommend that data scientists pick up some of these skills to help alleviate the disconnect between design and implementation. Keep in mind, however, that this follows from the definitions being used for these roles; the best definition of "data scientist" I have come across (from Josh Wills, Director of Data Science at Cloudera) is a "person who is better at statistics than any software engineer and better at software engineering than any statistician".
At the outset of their discussion of the second problem, the authors comment that "data scientists are expected to stand up their environments themselves, or wait until DevOps professionals have the time to spin up an environment for them. Either way, challenges arise - either the data scientist doesn't understand the production environment sufficiently to avoid migration issues, or the DevOps team is too busy and becomes a bottleneck in the process."
But why is this the case? Why are data scientists not expected to be practical, or to take reality into account? And why is there even a "DevOps team"? The very definition of "DevOps" involves a cross-functional skill set spanning development and operations, comprising tasks such as enabling automated deployments. If the two are separated, doesn't DevOps cease to exist? When development and operations are separated, developers cannot relate to deployment and execution in production, and operations cannot relate to the code that has been written.
Throughout their discussion of the six challenges, the authors delve into many related issues. For example, they further the discussion of roles while addressing the third problem: "You basically don't allow primadonna behavior where a data scientist decides that all he or she does is write algorithms and how the model runs and where it runs is not really their problem. Or a data engineer who says, 'I deal with the infrastructure and I don't really know what these algorithms are for, and I don't want to figure it out.'"
As a long-time practical, hands-on architect, I noticed that the role of architect is nowhere to be found in this book, although the term "architecture" is mentioned on several occasions. The reason for this architect-less philosophy can perhaps be summarized by what the authors mention during this part of their discussion: "Some vendors are trying to address the code sharing problem by creating concept data boxes (also called model or architecture boxes), which are sets of prebuilt architecture or models that you can apply to a new dataset, or which can get you started for building a new model."
The question that immediately comes to my mind, however, is to whom one should delegate responsibility for such an architecture. It comes as no surprise, given the backgrounds of the interviewed practitioners, that commercial and commercialized open source products from vendors such as H2O and Qubole are mentioned throughout this report, but is fully delegating this responsibility really the best approach? In my view, architects still retain this responsibility, and can work through many of the challenges faced by data scientists and machine learning engineers.
While not explicitly stated, it should be a given that cloud-native data platforms, frameworks, and the like still need to be selected, that the decision-making process doesn't end with that selection, and that the code still needs to be built out and maintained. These matters are important, especially given the fast-moving nature of this space, because the viability of such selections over any given time period has implications for enterprise cost and for the ability to attract and retain talent.
As an open source advocate, I appreciated one interviewee's comment during the discussion of the fifth challenge that, precisely because of scalability, going with open source solutions is quite important. "Data science people used to depend on MATLAB, which is still quite heavily used in engineering fields. It does have some nice features and some good packages that people can use, and everything's curated. So there are advantages. But after you move away from using something like MATLAB on your own computer to scaling it in a distributed system, like a cluster in the cloud, the licensing becomes prohibitive. Whereas with an open source product, you never need worry whether you are using 100 nodes or scaling to 1,000 nodes."
Of course, there are many more benefits to making use of open source. From my own experience, I know that careful selection of open source products, frameworks, and languages also leads to visibility into the code base, access to strong developer communities, and perhaps most important of all, innovation. While this book tends to encourage the use of monolithic data platform products, I've found that open source products, frameworks, and languages, alongside an agile architecture composed of a mix of open source and commercial components, provide the nimbleness that many firms need but don't realize at the outset.
Given my software development background, I especially appreciated a comment made by one of the interviewees during the discussion of the sixth challenge: "I started my career as a programmer, and the mindset when you're a programmer is that you want to get this thing to do exactly - I mean exactly - what you specified it to do, and if it veers from that even a little bit, that's a bug and you need to fix it.
"As a data scientist, on the other hand, you're creating general algorithms, and what you want them to do is to learn and to produce behavior that you hadn't anticipated." The authors provide a good survey of machine learning challenges and how some practitioners are tackling them. In the spirit of this last quote, keep in mind that this space is evolving quickly, and not always as anticipated. Choose your tooling and approaches wisely.