By Erik Gfesser in book review — Dec 22, 2018

New Book Review: "High Performance Spark"

New book review for High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark, by Holden Karau and Rachel Warren, O'Reilly, 2017:

The authors state in their preface that "this book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing", that they "expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work", and that they "want to help our readers ask questions such as 'How is my data distributed?', 'Is it skewed?', 'What is the range of values in a column?', and 'How do we expect a given value to group?' and then apply the answers to those questions to the logic of their Spark queries."

This book is the second of three related books that I've had the chance to work through over the past few months, in the following order: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). If you are new to Apache Spark, these three texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, so you will need to supplement this material with web documentation and developer forums, and get hands-on with the tooling. Reading these books in reverse order of publication date exposed me to the more current material first, but this was largely a coincidence.
