Apache Spark on Boyang Yue

Efficient Similarity Search with FAISS

Sun, 18 Dec 2022 22:29:52 +0800

FAISS (Facebook AI Similarity Search) is a high-performance library that speeds up similarity search and classification over high-dimensional vectors, such as embeddings produced by word2vec or Convolutional Neural Networks (CNNs). This article briefly introduces how to use it efficiently, especially with PySpark.

TL;DR: To obtain maximal performance gains, search with a batch of queries rather than a single one.

Getting Started: A Minimal Working Example

Before running the examples, install FAISS. Installing it via conda rather than PyPI is strongly recommended, to avoid potential compatibility issues.

The Comprehensive Guide to Hive UDF

Sun, 16 Jan 2022 21:46:37 +0800

One of Spark's most essential features is its integration with Hive, the data warehouse platform built on top of Hadoop. Naturally, Spark SQL supports Hive UDFs, UDAFs, and UDTFs.

At first glance, learning Hive UDFs might seem unnecessary in a Spark context, given how much Spark UDFs already provide. Nevertheless, Hive UDFs can be indispensable in particular scenarios, such as building pure-SQL environments or optimizing performance. Despite the abundance of Spark tutorials, practical guides on working with Hive UDFs are scarce, which is why this article was written.

From MapReduce to Spark: Execution and Programming Models

Tue, 15 Jun 2021 11:00:00 +0900

MapReduce gives Hadoop a simple batch execution model: map tasks, a shuffle, an optional reduce, and durable output between jobs.¹ That model is easy to reason about when a job reads input, produces output, and stops. It becomes more expensive when a pipeline parses the same records, joins or aggregates them, feeds the result into another job, and then repeats that pattern for model training or reporting.
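
The map, shuffle, and reduce steps can be sketched in plain Python. This is a single-process toy, not Hadoop's API: the function names and the word-count task are assumptions chosen for illustration, and the "durable output" is just a returned dictionary.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group values by key, as the shuffle does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """One job: read input, map, shuffle, reduce, return output, stop."""
    return reduce_phase(shuffle_phase(map_phase(records)))


lines = ["spark and hadoop", "hadoop and hive"]
counts = run_job(lines)
print(counts)
```

In a real pipeline, a follow-up job would read `counts` back from storage and run its own map, shuffle, and reduce, which is where the repeated I/O described above comes from.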

Spark changed which costs engineers had to manage. It can keep a dependent computation inside one application instead of forcing every intermediate result through replicated storage, record lineage so lost partitions can be recomputed, hold reused data resident under an explicit storage policy, reuse executor Java Virtual Machines (JVMs) across many tasks, and apply relational optimization when work is expressed as SQL or DataFrames. Each mechanism trades an old cost for a new tuning problem. For many iterative, multi-stage, and structured workloads, Spark is faster; the size of that advantage, and the exceptions to it, depend on the plan and hardware.
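
Two of these mechanisms, lineage-based recomputation and explicit persistence, can be modelled with a small toy class. This is not Spark's implementation or API: the `Dataset` class, its methods, and the `computations` counter are hypothetical names invented for this sketch.

```python
class Dataset:
    """Toy dataset whose partitions are derived lazily from a parent via `fn`."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source        # list of partitions, for a root dataset only
        self.parent = parent        # lineage: the dataset this one derives from
        self.fn = fn                # lineage: the per-element transformation
        self.persisted = False      # storage policy chosen by the caller
        self.cache = {}             # partition index -> materialized data
        self.computations = 0       # how many times a partition was (re)built

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def persist(self):
        self.persisted = True
        return self

    def partition(self, i):
        if i in self.cache:
            return self.cache[i]                 # resident copy, no recomputation
        if self.parent is None:
            data = list(self.source[i])          # read from the original input
        else:
            data = [self.fn(x) for x in self.parent.partition(i)]
        self.computations += 1
        if self.persisted:
            self.cache[i] = data
        return data

    def evict(self, i):
        """Simulate losing a cached partition, e.g. when an executor dies."""
        self.cache.pop(i, None)


base = Dataset(source=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x).persist()

first = squared.partition(0)       # computed once, then cached
again = squared.partition(0)       # served from the cache
squared.evict(0)                   # the cached copy is lost
recovered = squared.partition(0)   # rebuilt from lineage, not from a replica
```

The trade-off is visible in the counter: persistence avoids the second computation, while eviction costs one recomputation of that partition instead of requiring a replicated copy.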