New Book Review: "Spark: The Definitive Guide"
New book review for Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia, O'Reilly, 2018:


Copy provided by O'Reilly.
Over the past few months, I've had the chance to work through three related books in the following order: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). If you are new to Apache Spark, these three texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, so you will need to supplement this material with web documentation and developer forums, as well as get hands-on with the tooling. Reading these books in reverse order of publication exposed me to the more current material first, but this was largely a coincidence. While I have high regard for this particular book, the others helped fill some of the gaps for me.
What distinguishes this Spark book from the others available, other than its recent publication date (which is not a trivial matter), is that notebooks are available for much of the content, allowing readers to follow along while issuing Spark commands. The only drawback for me is that the accompanying notebooks provided on GitHub do not contain any comments whatsoever. My mistake was to read through the material in the book first and then work through the notebooks. Because the notebooks provide no context, I ended up revisiting the pertinent content in the text. Sure, this may have helped solidify some of the material for me, but I wouldn't want anyone to make the same mistake. If you decide to work through the material with an electronic copy of the text, I suggest you copy and paste the surrounding discussions of code into your notebooks for future reference so that it is all in one place.