<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Apache Spark on Boyang Yue</title><link>http://blog.boyangyue.com/tags/apache-spark/</link><description>Recent content in Apache Spark on Boyang Yue</description><generator>Hugo</generator><language>en-us</language><copyright>A Good Year Ahead</copyright><lastBuildDate>Sun, 18 Dec 2022 22:29:52 +0800</lastBuildDate><atom:link href="http://blog.boyangyue.com/tags/apache-spark/index.xml" rel="self" type="application/rss+xml"/><item><title>Efficient Similarity Search with FAISS</title><link>http://blog.boyangyue.com/2022/12/efficient-similarity-search-with-faiss/</link><pubDate>Sun, 18 Dec 2022 22:29:52 +0800</pubDate><guid>http://blog.boyangyue.com/2022/12/efficient-similarity-search-with-faiss/</guid><description>&lt;p&gt;FAISS (Facebook AI Similarity Search) is a high-performance library for similarity search and classification over high-dimensional vectors, such as embeddings produced by word2vec or Convolutional Neural Networks (CNNs). This article gives a brief introduction to using it efficiently, especially with PySpark.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;TL;DR: To obtain maximal performance gains, search with a batch of queries rather than a single one.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="getting-started-a-minimal-working-example"&gt;Getting Started: A Minimal Working Example&lt;/h2&gt;
&lt;p&gt;Before running the examples, make sure FAISS is installed. It&amp;rsquo;s highly recommended to &lt;a href="https://github.com/facebookresearch/faiss/blob/main/INSTALL.md" target="_blank" rel="nofollow noopener noreferrer"&gt;install FAISS via conda&lt;/a&gt; rather than PyPI to avoid &lt;a href="https://github.com/facebookresearch/faiss/issues/1545#issuecomment-735580878" target="_blank" rel="nofollow noopener noreferrer"&gt;potential compatibility issues&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>The Comprehensive Guide to Hive UDF</title><link>http://blog.boyangyue.com/2022/01/the-comprehensive-guide-to-hive-udf/</link><pubDate>Sun, 16 Jan 2022 21:46:37 +0800</pubDate><guid>http://blog.boyangyue.com/2022/01/the-comprehensive-guide-to-hive-udf/</guid><description>&lt;p&gt;One of the most essential features of Spark is its interaction with Hive, the data warehouse platform built on top of Hadoop. Naturally, Spark SQL supports the &lt;a href="https://spark.apache.org/docs/latest/sql-ref-functions-udf-hive.html" target="_blank" rel="nofollow noopener noreferrer"&gt;integration of Hive UDFs, UDAFs, and UDTFs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At first glance, delving into Hive UDFs might seem unnecessary in the Spark context, considering the extensive functionality provided by &lt;a href="https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html" target="_blank" rel="nofollow noopener noreferrer"&gt;Spark UDFs&lt;/a&gt;. Nevertheless, Hive UDFs can prove indispensable in particular scenarios, such as &lt;strong&gt;building pure SQL environments&lt;/strong&gt; or &lt;strong&gt;optimizing performance&lt;/strong&gt;. Despite the abundance of Spark tutorials, there is a dearth of practical guides on working with Hive UDFs, which is why this article was written.&lt;/p&gt;</description></item><item><title>From MapReduce to Spark: Execution and Programming Models</title><link>http://blog.boyangyue.com/2021/06/from-mapreduce-to-spark-execution-and-programming-models/</link><pubDate>Tue, 15 Jun 2021 11:00:00 +0900</pubDate><guid>http://blog.boyangyue.com/2021/06/from-mapreduce-to-spark-execution-and-programming-models/</guid><description>&lt;p&gt;MapReduce gives Hadoop a simple batch execution model: map tasks, a shuffle, an optional reduce, and durable output between jobs.&lt;sup id="fnref:1"&gt;&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref"&gt;1&lt;/a&gt;&lt;/sup&gt; That model is easy to reason about when a job reads input, produces output, and stops. It becomes more expensive when a pipeline parses the same records, joins or aggregates them, feeds the result into another job, and then repeats that pattern for model training or reporting.&lt;/p&gt;
&lt;p&gt;Spark changed which costs engineers had to manage. It can keep a dependent computation inside one application instead of forcing every intermediate result through replicated storage, record lineage so lost partitions can be recomputed, hold reused data resident under an explicit storage policy, reuse executor Java Virtual Machines (JVMs) across many tasks, and apply relational optimization when work is expressed as SQL or DataFrames. Each mechanism trades an old cost for a new tuning problem. For many iterative, multi-stage, and structured workloads, Spark is faster; the size of that advantage, and the exceptions to it, depend on the plan and hardware.&lt;/p&gt;</description></item></channel></rss>