Resilient Distributed Dataset (RDD) is the main abstraction of the Spark framework, while Spark SQL (a Spark module for structured data processing) gives Spark more information about the structure of both the data and the computation being performed, and uses this extra information to perform additional optimizations. Up until Spark 1.6, RDDs tended to perform better than their Spark SQL counterpart, the DataFrame; however, the Spark 2.1 upgrades have made Spark SQL considerably more efficient. This article compares the performance of "group by" type operations for some use cases using the Spark 2.1 DataFrame/Dataset and the RDD. We will also look at how Spark 1.6 DataFrame performance compares with the newer versions of RDD and DataFrame.

Data Source

The above link points to a data set of 5000+ movies from the IMDB website. There are 2,399 unique director names and thousands of actors/actresses. Each movie has details about the director, IMDB score, number of critic reviews, number of users who voted, Facebook likes in various categories, budget, gross revenue, etc. Refer to the above link for details. Since this data set is pretty small, we replicated the same data several times over using a Linux command to grow the number of rows past 1 million.
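The replication idea can be sketched in a few lines. The article used a Linux command; the hedged Python sketch below illustrates the same trick (keep the single header line, repeat only the data rows); the file contents and repetition count are assumptions for illustration:

```python
# Hypothetical sketch: grow a small CSV by repeating its data rows while
# keeping the single header line. The sample contents and the repeat count
# are stand-ins, not the article's actual data or command.
sample = "movie_title,director_name\nAvatar,James Cameron\nSpectre,Sam Mendes\n"
lines = sample.splitlines(keepends=True)
header, rows = lines[0], lines[1:]

# Header appears once; the data rows are appended repeatedly.
replicated = header + "".join(rows) * 5

print(replicated.count("\n"))  # 1 header + 5 * 2 data rows = 11 lines
```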


Test Environment

All tests are performed on a 16GB-memory MacBook in local mode with all default settings. The Spark version is 2.1 or 1.6 (as specified).


Use Cases

There are two types of use cases – multi-step and simple. The purpose of the multi-step use case is to test performance when optimization is done internally by Spark across several tasks, including shuffle operations. Since multi-step performance is directly affected by how the steps are coded (especially for RDDs), a simple use case will also be used to compare a single operation on the same data set.

Multi-Step Use Case

The goal of this use case is to find the top 10 movie directors of all time using the following business rules:

  • Select color movies where the number of critic reviews is greater than 100.
  • Find the IMDB score per director, weighted over the number of critics, the number of users who voted, and the number of users who reviewed, across all of the director's movies.
  • Find the average percentage profit per director, based on each movie's gross revenue and budget.
  • Find the average director Facebook likes across all of the director's movies.
  • Find the average total cast Facebook likes across all of the director's movies.
  • Find the average movie Facebook likes per director.
  • Find the maximum in each category, i.e. maximum weighted IMDB score, maximum percentage profit, maximum director Facebook likes, maximum total cast Facebook likes, and maximum movie Facebook likes.
  • Using these maximum values, convert each director's figure in each category to a score on a scale of 100.
  • Sum the five scores per director to get a total score (maximum possible is 500), then sort by total score and select the top 10.
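The rules above can be sketched end to end in plain Python. This is not the article's Spark code – the column names, toy rows, and the exact weighting formula are assumptions – but it shows the shape of the multi-step aggregation:

```python
# Hypothetical plain-Python sketch of the scoring rules -- field names,
# toy data, and the weighting formula are assumptions for illustration.
from collections import defaultdict

movies = [  # toy rows standing in for the IMDB CSV
    {"director": "A", "color": "Color", "num_critic": 150, "imdb": 8.0,
     "num_voted": 1000, "num_reviews": 200, "gross": 200.0, "budget": 100.0,
     "director_likes": 500, "cast_likes": 4000, "movie_likes": 9000},
    {"director": "B", "color": "Color", "num_critic": 120, "imdb": 6.0,
     "num_voted": 400, "num_reviews": 80, "gross": 120.0, "budget": 100.0,
     "director_likes": 100, "cast_likes": 1000, "movie_likes": 2000},
]

# Rule 1: color movies with more than 100 critic reviews.
kept = [m for m in movies if m["color"] == "Color" and m["num_critic"] > 100]

per_dir = defaultdict(list)
for m in kept:
    per_dir[m["director"]].append(m)

def avg(xs):
    return sum(xs) / len(xs)

stats = {}
for d, ms in per_dir.items():
    # Rule 2: IMDB score weighted by critics, votes, and reviews
    # (one plausible weighting; the article does not spell out the formula).
    w = [m["num_critic"] + m["num_voted"] + m["num_reviews"] for m in ms]
    weighted_imdb = sum(m["imdb"] * wi for m, wi in zip(ms, w)) / sum(w)
    stats[d] = {
        "imdb": weighted_imdb,
        # Rule 3: average percentage profit.
        "profit": avg([(m["gross"] - m["budget"]) / m["budget"] * 100 for m in ms]),
        # Rules 4-6: average likes in each category.
        "director_likes": avg([m["director_likes"] for m in ms]),
        "cast_likes": avg([m["cast_likes"] for m in ms]),
        "movie_likes": avg([m["movie_likes"] for m in ms]),
    }

# Rules 7-8: scale each category by its maximum to a score out of 100.
cats = ["imdb", "profit", "director_likes", "cast_likes", "movie_likes"]
maxima = {c: max(s[c] for s in stats.values()) for c in cats}
totals = {d: sum(s[c] / maxima[c] * 100 for c in cats) for d, s in stats.items()}

# Rule 9: sort by total score (max possible 500) and take the top 10.
top10 = sorted(totals.items(), key=lambda kv: -kv[1])[:10]
print(top10[0][0])  # director "A" leads in every category here
```

In Spark, each of these steps becomes a transformation (filter, map, aggregate-by-key, join against the maxima), which is exactly where the optimizer and shuffle behavior get exercised.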

Simple Use Case

Find the total number of faces appearing on the poster for each color type of movie (color, black & white, or unknown). The goal of this use case is to use the "group by" operation only once.
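The single "group by" can be sketched in plain Python. The field names and toy rows below are assumptions; in Spark this is one groupBy/reduceByKey over (color, face count) pairs:

```python
# Hypothetical sketch: sum poster face counts per color type.
# Plain Python stands in for the Spark "group by"; field values assumed.
from collections import Counter

rows = [("Color", 2), ("Color", 0), ("Black and White", 1), ("", 3)]

faces = Counter()
for color, n in rows:
    faces[color or "unknown"] += n  # empty color string treated as unknown

print(dict(faces))  # {'Color': 2, 'Black and White': 1, 'unknown': 3}
```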


Multi-step Use Case

RDD Option

Refer to the complete code at the GitHub repo below.


Dataset Option

Refer to the complete code at the GitHub repo below. In this code, we use a Dataset of Row objects, which is technically a DataFrame but benefits from Spark 2.1's Dataset optimizations.


Simple Use Case

RDD Option

Refer to the complete code at the GitHub repo below.


Dataset Option

Refer to the complete code at the GitHub repo below.


DataFrame Option

Spark 1.6 is used for this option to compare against the optimizations in the Spark 2.1 Dataset, and also to compare an older version of the DataFrame with the newer RDD. Refer to the complete code, along with pom.xml, at the GitHub repo below.



The following screenshot of the Spark History Server shows the result for each option –


If you look at the stages and tasks for each option in the Spark web UI, the Dataset option shows more optimization of the input size. Another observation: if we don't use the persist option in the RDD multi-step case, the duration becomes significantly larger –

The following charts show a side-by-side comparison of the time taken by each option –

As shown above, the Dataset option is more than 6 times faster for the multi-step use case and about 14 times faster for the simple use case. The DataFrame API of Spark 1.6 is slower than the RDD, as expected.




Although RDDs used to perform better than Spark SQL's DataFrame (formerly SchemaRDD) API before 2.1, the current version of Spark (2.1) has made significant improvements to Dataset process optimization for use cases where data can easily be converted into Datasets. The RDD option can be optimized further, however, by using a generic aggregation such as "combine by key" instead of "reduce by key", which requires the output to be the same type as the input. "Combine by key" does not require the output to be the same type as the input, so you can use custom functions implementing Spark's function APIs to combine multiple operations in one pass. Nonetheless, this optimization is not expected to outperform Datasets. Besides better performance, the Dataset is also much easier to code thanks to its SQL compliance.
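The distinction can be illustrated without Spark. reduceByKey's function must return the same type as its inputs, whereas combineByKey takes three functions (createCombiner, mergeValue, mergeCombiners) and may accumulate a different type – e.g. a (sum, count) pair for computing an average in one pass. A hedged, non-Spark sketch of combineByKey semantics:

```python
# Hypothetical sketch of Spark's combineByKey semantics in plain Python --
# note the accumulator type ((sum, count)) differs from the value type (float),
# which reduceByKey would not allow.
def combine_by_key(pairs, create, merge_value, merge_combiners, partitions=2):
    # Simulate per-partition combining followed by a merge across partitions.
    chunks = [pairs[i::partitions] for i in range(partitions)]
    partials = []
    for chunk in chunks:
        acc = {}
        for k, v in chunk:
            acc[k] = merge_value(acc[k], v) if k in acc else create(v)
        partials.append(acc)
    out = {}
    for acc in partials:
        for k, c in acc.items():
            out[k] = merge_combiners(out[k], c) if k in out else c
    return out

scores = [("A", 8.0), ("B", 6.0), ("A", 6.0)]
sums = combine_by_key(
    scores,
    create=lambda v: (v, 1),                        # value -> (sum, count)
    merge_value=lambda c, v: (c[0] + v, c[1] + 1),  # fold a value into a combiner
    merge_combiners=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {k: s / n for k, (s, n) in sums.items()}
print(averages)  # {'A': 7.0, 'B': 6.0}
```

With reduceByKey, computing an average would need a prior map to (value, 1) pairs; combineByKey folds that step into the aggregation itself.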



