<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Distributed Systems on Boyang Yue</title><link>http://blog.boyangyue.com/tags/distributed-systems/</link><description>Recent content in Distributed Systems on Boyang Yue</description><generator>Hugo</generator><language>en-us</language><copyright>A Good Year Ahead</copyright><lastBuildDate>Tue, 15 Jun 2021 11:00:00 +0900</lastBuildDate><atom:link href="http://blog.boyangyue.com/tags/distributed-systems/index.xml" rel="self" type="application/rss+xml"/><item><title>From MapReduce to Spark: Execution and Programming Models</title><link>http://blog.boyangyue.com/2021/06/from-mapreduce-to-spark-execution-and-programming-models/</link><pubDate>Tue, 15 Jun 2021 11:00:00 +0900</pubDate><guid>http://blog.boyangyue.com/2021/06/from-mapreduce-to-spark-execution-and-programming-models/</guid><description>&lt;p&gt;MapReduce gives Hadoop a simple batch execution model: map tasks, a shuffle, an optional reduce, and durable output between jobs.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; That model is easy to reason about when a job reads input, produces output, and stops. It becomes more expensive when a pipeline parses the same records, joins or aggregates them, feeds the result into another job, and then repeats that pattern for model training or reporting.&lt;/p&gt;
&lt;p&gt;Spark changed which costs engineers had to manage. It can keep a dependent computation inside one application instead of forcing every intermediate result through replicated storage, record lineage so lost partitions can be recomputed, hold reused data resident under an explicit storage policy, reuse executor Java Virtual Machines (JVMs) across many tasks, and apply relational optimization when work is expressed as SQL or DataFrames. Each mechanism trades an old cost for a new tuning problem. For many iterative, multi-stage, and structured workloads, Spark is faster; the size of that advantage, and the exceptions to it, depend on the plan and hardware.&lt;/p&gt;</description></item></channel></rss>