My responses ended up being included in an article at TechTarget (May 7, 2021). Extent of verbatim quote highlighted in orange, paraphrased quote highlighted in gray. Above image from cited article.
The responses I provided to a media outlet on March 15, 2021:
Media: This article will compare Hadoop and Spark and detail how the two technologies differ and the uses for which each is suited, as well as how they aren't mutually exclusive and can be used together in some big data applications.
Media: What do you think are Hadoop's key strengths and weaknesses? What do you see as Spark's key strengths and weaknesses?
Gfesser: Before discussing the strengths and weaknesses of Apache Hadoop and Apache Spark, it helps to understand that the two overlap in the minds of some, that Hadoop itself is often discussed in terms of its components, and that Spark was traditionally part of what became known as the Hadoop ecosystem. In the minds of many, however, Hadoop is synonymous with Hadoop MapReduce, the framework that made batch processing of vast amounts of data possible by splitting it up and processing each chunk across distributed cluster nodes. At the same time, MapReduce processing tends to be slow, is limited to batch workloads, and can be challenging to manage.
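The split-and-process model that MapReduce popularized can be sketched in plain Python (no Hadoop involved, purely an illustration of the concept): a map phase emits key-value pairs for each chunk, a shuffle groups the pairs by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk of text
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key across chunks
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {key: sum(values) for key, values in groups.items()}

# Each "chunk" stands in for a split processed on a separate cluster node
chunks = ["big data big", "data processing"]
counts = reduce_phase(shuffle([map_phase(c) for c in chunks]))
print(counts)  # {'big': 2, 'data': 2, 'processing': 1}
```

In real Hadoop MapReduce the map and reduce tasks run on different nodes and the shuffle moves data over the network, which is a large part of why the model is durable but slow.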
From a high-level perspective, Spark can also process vast amounts of data by splitting it up and processing it across distributed cluster nodes, but it typically does so much more quickly because its processing is performed in memory. And because Spark processes data in memory, it can satisfy use cases that Hadoop cannot, such as data streaming. Somewhat confusingly, a component called Hadoop Streaming does not provide data streaming, but instead provides the ability to write MapReduce jobs in languages other than Java. Spark also supports multiple languages, but in a consistent manner that Hadoop does not provide, using APIs that resolve to common underlying code.
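The Hadoop Streaming contract is simply standard input and output: any executable that reads lines and writes tab-separated key-value pairs can serve as a mapper or reducer. A minimal word-count pair in Python, simulated locally, can illustrate the shape (the sort step between the two phases stands in for the shuffle that Hadoop itself performs):

```python
from itertools import groupby

def mapper(lines):
    # Mapper: emit "word\t1" for each word; in Hadoop Streaming this
    # would be written to stdout by a standalone script
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Reducer: Hadoop delivers lines sorted by key, so consecutive
    # lines with the same word can be summed using groupby
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the pipeline locally: map, sort (Hadoop's shuffle), reduce
mapped = sorted(mapper(["big data big", "data processing"]))
result = list(reducer(mapped))
print(result)  # ['big\t2', 'data\t2', 'processing\t1']
```

Note that this multi-language support is bolted on at the process boundary, which is quite different from Spark's approach of exposing consistent APIs in several languages that resolve to common underlying code.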
Media: How would you say Hadoop and Spark compare and contrast in terms of (please feel free to pick 3 of these): architecture, data processing capabilities, performance, scalability, applications and use cases, deployment and processing costs.
Gfesser: Performance: On average, benchmarks have demonstrated that Spark executes at a rate up to 100x faster than Hadoop. Additionally, the Databricks implementation of Spark is highly optimized, providing performance gains of as much as 50x over open source Apache Spark. Unlike Hadoop MapReduce, Spark makes use of an advanced directed acyclic graph (DAG) processing engine that optimizes the execution plan across all steps of the code to be executed. This DAG optimization, combined with Spark's ability to perform processing in memory, is what enables this speed.
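The DAG idea rests on lazy evaluation, which can be illustrated with a toy pipeline in plain Python (the class and method names are illustrative, not Spark's API): transformations only record steps in a plan, and nothing executes until an action is called, which is what gives an engine like Spark the complete plan to optimize before running it.

```python
class LazyPipeline:
    """Toy illustration of lazy evaluation: transformations build up a
    plan (a simple linear DAG here); nothing runs until an action."""

    def __init__(self, data, steps=None):
        self._data = data
        self._steps = steps or []  # the recorded plan, not yet executed

    def map(self, fn):
        # Transformation: record the step and return a new pipeline;
        # no data is processed yet
        return LazyPipeline(self._data, self._steps + [("map", fn)])

    def filter(self, pred):
        return LazyPipeline(self._data, self._steps + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded plan executed, in one pass
        out = list(self._data)
        for kind, fn in self._steps:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

plan = LazyPipeline(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; calling the action triggers execution
print(plan.collect())  # [0, 4, 16]
```

A real engine goes further than this sketch: because it sees the whole DAG before executing, it can reorder, fuse, and pipeline steps rather than replaying them literally.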
Applications and use cases: While Hadoop and Spark use cases overlap for batch processing, this isn't the case for others. For example, Spark enables streaming of data for use cases in which decisions need to be made quickly. Some refer to this type of processing as "real-time", but since there are so many interpretations of what this means I typically don't use this term, instead referring to this type of processing as non-scheduled and non-batch, and subsequently exploring what is really needed in terms of timing. While other frameworks also offer the ability to stream data, some faster than Spark, Spark enables doing so alongside other use cases using a common framework.
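One reason Spark can serve both batch and streaming with a common framework is that its classic streaming model is micro-batch: incoming records are grouped into small batches and processed with the same batch logic. A count-based sketch in plain Python (no Spark involved) shows the idea:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    # Group an unbounded iterator into small fixed-size batches,
    # the way a micro-batch engine turns a stream into batch jobs
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield batch

def process(batch):
    # The same batch logic runs on each micro-batch as it arrives
    return sum(batch)

events = iter([3, 1, 4, 1, 5, 9, 2, 6])  # stands in for an event stream
batch_totals = [process(b) for b in micro_batches(events, batch_size=3)]
print(batch_totals)  # [8, 15, 8]
```

Real engines batch by time window rather than count and add delivery guarantees, but the consequence is the same: latency is bounded below by the batch interval, which is why some purpose-built streaming frameworks can be faster than Spark for this one use case.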
See all of my responses to media queries here.
© 2009-2024 Erik Gfesser (Erik on Software)