Spark RDD vs Spark SQL Performance comparison using Spark Java APIs

Resilient Distributed Dataset (RDD) is the core abstraction of the Spark framework, while Spark SQL (a Spark module for structured data processing) gives Spark more information about the structure of both the data and the computation being performed, and uses that additional information to apply further optimizations. Up until Spark 1.6, RDDs used to perform …
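As a rough illustration of the difference, here is a minimal Java sketch that runs the same filter-and-count first through the RDD API and then through the Dataset/Spark SQL API. The input file name (events.csv) and its "level" column are hypothetical assumptions; the point is only that the Dataset version exposes the query structure to Spark's Catalyst optimizer, while the RDD version hands Spark opaque lambdas it cannot optimize.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RddVsSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdd-vs-sql-sketch")
                .master("local[*]")
                .getOrCreate();

        // RDD API: the map and filter are opaque Java functions, so Spark
        // must execute them exactly as written with no query optimization.
        JavaRDD<String> lines = spark.read()
                .textFile("events.csv")   // hypothetical input file
                .javaRDD();
        long rddCount = lines
                .map(line -> line.split(","))
                .filter(cols -> cols.length > 1 && cols[1].equals("ERROR"))
                .count();

        // Dataset/Spark SQL API: the schema and the filter expression are
        // visible to the Catalyst optimizer, which can push the predicate
        // down and prune unused columns before the scan.
        Dataset<Row> events = spark.read()
                .option("header", "true")
                .csv("events.csv");       // same hypothetical file
        long sqlCount = events
                .filter(events.col("level").equalTo("ERROR"))
                .count();

        System.out.println("RDD count: " + rddCount + ", SQL count: " + sqlCount);
        spark.stop();
    }
}
```

With identical inputs both counts should match; the difference shows up in the query plans and, typically, in runtime once the data is large enough for column pruning and predicate pushdown to matter.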

Processing and Serving Data with Apache Spark

Written by: Edward Yeh, Principal Big Data Consultant

So what is Apache Spark and why do we care? Spark is a fast and general-purpose cluster computing system used for large-scale data processing of both structured and unstructured data. The project was initially developed by the AMPlab at UC Berkeley and has now evolved …