15 Jun 2021

From MapReduce to Spark: Execution and Programming Models

MapReduce gives Hadoop a simple batch execution model: map tasks, a shuffle, an optional reduce, and durable output between jobs.¹ That model is easy to reason about when a job reads input, produces output, and stops. It becomes more expensive when a pipeline parses the same records, joins or aggregates them, feeds the result into another job, and then repeats that pattern for model training or reporting.

Spark changed which costs engineers had to manage. It can keep a dependent computation inside one application instead of forcing every intermediate result through replicated storage, record lineage so lost partitions can be recomputed, hold reused data resident under an explicit storage policy, reuse executor Java Virtual Machines (JVMs) across many tasks, and apply relational optimization when work is expressed as SQL or DataFrames. Each mechanism trades an old cost for a new tuning problem. For many iterative, multi-stage, and structured workloads, Spark is faster; the size of that advantage, and the exceptions to it, depend on the plan and hardware.

Those tradeoffs affected both runtime behavior and pipeline design. On the interface side, Spark exposes SQL, DataFrames, machine learning, and streaming APIs over the same engine; the optimizer sees the most when a program is expressed in a structured form.

Five common claims, narrowed

The shift accumulated a handful of slogans. They become useful only after being narrowed, because most conflate a stack, an engine, and a workload-specific performance result.

Common claim	More precise view
Spark replaced Hadoop.	Spark can replace MapReduce as the compute engine, while HDFS, YARN, and Hadoop libraries remain in use.
Spark is always faster.	Spark is often faster for iterative, multi-stage, and structured workloads, but the margin depends on the execution plan, data reuse, shuffle volume, storage, network, and CPU cost, and specific plans can favor MapReduce.
Spark computes in memory while MapReduce does not.	Both engines use memory and disk. Spark avoids some mandatory distributed-file-system boundaries and can explicitly persist reusable data in memory or on disk.
Hadoop or MapReduce is cheaper.	Cost depends on job duration, cluster utilization, storage and network use, operations, and migration effort, not memory price alone.
Hadoop or MapReduce is more stable.	Failure recovery, behavior under memory pressure, and operational maturity are different questions. Neither engine has a universal stability advantage.

The first claim is the place to start, because it mistakes a stack for an engine. In practice, a Hadoop stack usually means the Hadoop Distributed File System (HDFS) for storage, YARN (Yet Another Resource Negotiator) for resource management, and MapReduce for batch execution; Hadoop Common supplies shared libraries underneath them.

In a Hadoop-centered deployment, Spark can read from HDFS, write results to HDFS, and run executors in YARN containers. Replacing MapReduce therefore changes the execution engine without necessarily replacing storage or resource management. Spark 3.1 also supports its standalone manager, Mesos, and Kubernetes, along with storage systems other than HDFS. The remaining claims are about that engine. They resolve once “faster” is decomposed into the costs Spark can remove, reduce, or move.

Execution costs that changed

The performance comparison is a cost accounting exercise. A simple pipeline from extract-transform-load (ETL) steps to model training makes the accounting concrete: parsing raw records, filtering and joining them, preparing features, training iteratively, and writing final outputs. MapReduce can implement that pipeline, but each job boundary normally forces the next job to consume materialized output. Spark’s mechanisms mainly reduce or move those costs by changing where data is materialized, which results stay resident, how tasks are scheduled, how executor processes are reused, and when database-style optimization applies.

Mechanism	Cost changed	Helps when	Tradeoff
Multi-stage applications	Avoids writing every intermediate multi-job result as replicated HDFS data.	Pipelines have several dependent stages.	Fewer durable restart points inside an application.
Explicit persistence	Keeps selected reused partitions or parsed objects resident.	Iterative jobs or repeated scans revisit the same prepared data.	Cache policy, memory pressure, spill, and recomputation become tuning choices.
DAG scheduling	Pipelines narrow dependencies inside a stage.	Adjacent narrow transformations run without a shuffle.	Shuffles still materialize data and move it across the network.
Executor reuse	Avoids repeated task-attempt JVM startup in typical MapReduce-on-YARN execution.	Jobs have many small tasks within one application.	Long-lived processes make heap sizing and garbage collection more visible.
Structured query execution	Uses Catalyst, code generation, and adaptive planning for DataFrame and SQL plans.	Workloads can be expressed through structured APIs.	Arbitrary user functions remain opaque to semantic optimization.

Multi-stage applications

A MapReduce job has a fixed map, shuffle, and optional reduce shape. It can perform substantial work, including some joins, within that shape. When a pipeline requires multiple MapReduce jobs, however, one job normally writes its result to a distributed file system before the next job reads it. That durable boundary provides a useful recovery point, but it also adds serialization, replication, and I/O.

Spark can represent a multi-stage computation within one application. In the ETL-to-model pipeline, the filtered or joined data can feed later stages without first becoming a replicated HDFS dataset. Shuffle data is still materialized, commonly in local files, but an intermediate result does not have to cross a distributed file system boundary merely to feed the next stage.

Explicit persistence

Removing the boundary is one saving; keeping reused data resident is another. Data that will be reused can be explicitly persisted.² Spark supports memory-only, memory-and-disk, and disk-only persistence; caching is neither automatic for every dataset nor limited to memory.

This matters for iterative machine learning, graph algorithms, and exploratory analysis that repeatedly operate on the same prepared data. If the model-training step revisits the same feature table, persistence can save repeated parsing or recomputation. In one k-means study, most of the benefit came from preserving parsed objects rather than avoiding disk reads. Disk-only Spark persistence performed nearly as well as memory persistence in that experiment because both avoided repeated text parsing.³ The exact ratios belong to that 2015 setup; the general point is that persistence changed which cost was being paid.

If a result is consumed once, persistence does not help; it adds cache management without reuse. A dataset also does not have to fit in memory for Spark to process it: shuffle data and persisted partitions can spill to disk, while uncached partitions can be recomputed. Performance depends on parsing cost, storage bandwidth, garbage collection, and the selected persistence level.

DAG scheduling and shuffle boundaries

Materialization also occurs within a job, between steps, and the execution graph addresses it. Spark records transformations lazily and builds a directed acyclic graph (DAG) before an action runs. The scheduler groups operations connected by narrow dependencies (each parent partition feeds at most one child partition) into the same stage. Adjacent filters and projections can then run in one task without materializing every intermediate collection. A shuffle, where data has to be redistributed across partitions, creates a stage boundary.

This is scheduling and pipelining, not general semantic optimization. Spark cannot inspect the meaning of an arbitrary function passed to a Resilient Distributed Dataset (RDD) transformation. The scheduler can avoid unnecessary materialization between narrow operations, but it cannot freely rewrite opaque user code.

Shuffles remain expensive because they involve serialization, local storage, and network transfer. In the same 2015 setup, MapReduce was twice as fast on the sort workload because it overlapped shuffle transfers with the map stage and hid part of the network cost. The result demonstrated a workload-specific advantage of one execution plan rather than a general rule that MapReduce sorts faster.

Executor process reuse

Task startup is a separate cost. Hadoop MapReduce on YARN typically runs each task attempt in a separate JVM process. Spark acquires executor processes for an application and runs many tasks as threads inside each long-lived executor. For jobs divided into many short tasks, reusing executors avoids paying JVM startup cost once per task.

The same design introduces different operating concerns. A long-lived executor accumulates memory pressure across tasks, and poor sizing can increase garbage collection or operating-system paging. More executor memory therefore does not automatically make a Spark job faster.

Structured query execution

The costs so far are about moving and scheduling data. The last cost is per-record CPU, and reducing it is the job of Spark SQL rather than the DAG scheduler (the scheduler pipelines RDD and structured workloads alike). DataFrame and SQL expressions pass through the Catalyst optimizer, which can remove unused columns, push filters toward data sources, and select join strategies.⁴ Databricks’ 2015 DataFrames announcement described that API as a way to make Spark accessible to users who expect table-shaped data and optimizer support.

Database-style execution also invites comparison with massively parallel processing (MPP) databases. That comparison matters only up to a point. An MPP database is usually a SQL-first analytical system with tightly integrated storage, optimizer, and execution. Spark SQL borrows optimizer techniques from database systems, but Spark remains a general distributed compute engine that can run over external storage and mix SQL with procedural APIs. What changed is how much of a program the optimizer can see.

Later work on Project Tungsten, an initiative to improve Spark’s memory and CPU efficiency, moved Spark SQL closer to database execution techniques. Databricks’ 2016 Spark as a Compiler post described Spark as applying compiler ideas from modern query engines, and SPARK-12795 tracked whole-stage code generation, which fuses multiple compatible operators into one generated Java function to reduce virtual calls and intermediate row materialization.

Generated code does not help arbitrary RDD functions, and cost-based decisions depend on available table and column statistics. Spark 3.1 includes adaptive query execution (AQE), which can change parts of a plan using runtime statistics, but it is disabled by default. The broader design idea appeared in SPARK-9850, while Spark 3’s newer AQE implementation was tracked under SPARK-31412.

The binding resource

The mechanisms above mostly cut I/O and scheduling overhead, and reducing I/O does not make a job I/O-light so much as move the bottleneck elsewhere, often onto the CPU. A study of Spark SQL benchmarks and a production workload found computation to be the limiting resource more often than disk or network: eliminating disk access would have shortened completion time only modestly, and the network mattered less still.⁵ Those results depend on the workloads studied, so the binding resource still has to be measured for a given job, but they cut against the assumption that these frameworks are simply I/O-bound.

Interfaces that reshaped pipeline design

The programming-model change is separate from the runtime mechanisms. Beyond different scheduling rules, Spark changed the surface that engineers used to express pipelines.

Relational and procedural processing together

MapReduce exposes a low-level procedural interface. Higher-level systems such as Hive offer SQL over Hadoop, but Spark SQL integrates relational operations with ordinary Spark programs. A DataFrame can be created from files, Hive tables, external databases, or existing distributed collections, optimized as a relational plan, and then combined with procedural code or analytics libraries.

This reduces the need to divide a pipeline between a SQL system and a separate general-purpose compute framework. It also lets analysts use SQL while application developers use DataFrames or Datasets (the typed variant of the structured API, available in Java and Scala) over the same execution engine. Spark exposes its structured APIs through Java, Scala, Python, and R, and its interactive shells shorten the cycle between writing a transformation and inspecting its result.

One engine across workloads

The same consolidation extends past SQL. Spark combines its core engine with Spark SQL, machine-learning libraries, graph processing, and streaming in one project.⁶ Structured Streaming extends the DataFrame model to unbounded input, so many APIs and transformations are shared between batch and streaming computations.

The unification is not complete. Structured Streaming’s documentation lists unsupported operations and additional state, watermark, and output-mode constraints. The advantage is a common programming model and runtime across several classes of analytics; batch and streaming programs still cannot be swapped freely.

How Hive fits

Spark SQL’s relational layer also reshaped Spark’s relationship to Hive, the incumbent SQL system over Hadoop, in ways that are easy to confuse. Hive originally compiled queries into MapReduce jobs and later gained support for other execution engines. Two arrangements with similar names describe different architectures:

Spark on Hive (Spark SQL with Hive support): Spark SQL plans and executes the query while using Hive tables, metastore metadata, SerDes, or user-defined functions.
Hive on Spark: Hive remains the query planner and selects Spark as its execution engine instead of MapReduce.

The planner distinction matters because it decides which optimizer and execution features produce the query plan, and therefore which performance and SQL-compatibility characteristics apply. Hive’s move off MapReduce was not limited to Spark; it can also target Tez, and that choice of backend is largely separate from its SQL surface.

One part of Hive outlasted the question of execution engine. The Hive metastore became the de facto metadata catalog for the Hadoop ecosystem, and Spark SQL, like other query engines, reads table and partition metadata from it. Swapping MapReduce out of Hive’s execution path, or bypassing Hive’s planner with Spark SQL, leaves that shared catalog in place, another case of a Hadoop component persisting while the compute layer changes.

The default and its exceptions

For new analytics work on a Hadoop stack, the two engines are not symmetric options. Spark offers more mechanisms: a flexible execution graph, explicit persistence, long-lived executors, structured query optimization, and APIs beyond batch processing. A new pipeline usually satisfies at least one of the conditions under which those mechanisms pay: it revisits prepared data across stages, expresses much of its work through SQL or DataFrame operations, or spends a meaningful share of its runtime on shuffles, materialization, or task startup. That makes Spark the more common starting point for new Hadoop-stack analytics work.

The default still leaves several MapReduce-friendly cases, though each has to be shown rather than assumed:

A one-pass or shuffle-heavy job, where persistence and pipelining buy little and a particular MapReduce plan might match or beat Spark on comparable hardware.
A workload large and failure-prone enough that MapReduce’s durable, replicated output between jobs gives a recovery profile Spark’s lineage-based recovery does not, since rebuilding lost shuffle data can mean rerunning upstream tasks.
A cost calculation in which Spark’s speed advantage is too small to repay its higher resource footprint, tuning effort, or migration cost.

The stability claim from the opening table resolves the same way. MapReduce often confines memory failures to one short-lived task-attempt process, while a memory problem in a shared Spark executor can affect every task running in it; neither behavior is an advantage across all workloads.

Existing pipelines also encode tested business rules, operations, capacity assumptions, and recovery practices, so rewriting a working system carries correctness and operational risk. An installed base can persist because migration is expensive, not because MapReduce is usually the better choice for new work.

The result is coexistence around measured costs. Spark is useful when the workload benefits from fewer durable boundaries, reused prepared data or executors, structured query optimization, or shared APIs across analytics tasks. MapReduce remains defensible where it already runs reliably, or where a specific property of a job calls for its simpler durable batch model.

References

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating Systems Design and Implementation (OSDI 04), 2004. 137–150. ↩︎
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012. 15–28. ↩︎
Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan. 2015. Clash of the titans: MapReduce vs. Spark for large scale data analytics. Proceedings of the VLDB Endowment 8, 13 (2015), 2110–2121. ↩︎
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ‘15), 2015. 1383–1394. ↩︎
Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. 2015. Making sense of performance in data analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), 2015. 293–307. ↩︎
Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A unified engine for big data processing. Communications of the ACM 59, 11 (2016), 56–65. ↩︎

Boyang Yue