New Book Review: "Learning Spark (Second Edition)"

New book review for Learning Spark (Second Edition), by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee, O'Reilly, 2020:



Copy provided by Databricks at Spark + AI Summit 2020.

The foreword and preface to this book note that an update to the first edition, published in 2015, was long overdue. After all, the first edition uses Apache Spark 1.3.0, whereas this update uses Apache Spark 3.0.0-preview2 (the latest version available at the time of writing). For the most part, I successfully ran all notebook code out of the box using Databricks Runtime 7.6 ML (which includes Apache Spark 3.0.1 and Scala 2.12); the minor issues I did hit, along with my resolutions, are explained later in this review. I was also able to run all standalone PySpark applications from chapters 2 and 3 out of the box using Apache Spark 3.0.1 and Python 3.7.9. As explained, the approach taken is intended to be conducive to hands-on learning, with a focus on Spark's Structured APIs, so a few topics aren't covered: the older low-level Resilient Distributed Dataset (RDD) APIs, GraphX (Spark's API for graphs and graph-parallel computation), extending Spark's Catalyst optimizer, implementing your own catalog, and writing your own DataSource V2 data sinks and sources.
