RDD in Spark

RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. It is the primary data abstraction in Apache Spark and Spark Core.

In this article, we will cover what RDDs are, their advantages and disadvantages, how to create them, and an overview of the transformations and actions that operate on them.

  • Understanding RDD
  • Advantages Of RDD Data Structure
  • Downsides Of RDD
  • Getting Started With RDD
  • Operations On RDDs

Understanding RDD

RDDs (Resilient Distributed Datasets) are fault-tolerant, immutable distributed collections of objects. Immutable means that once we create an RDD, we cannot change it in place. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

Note: The getNumPartitions method on an RDD returns its number of partitions.
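
As a minimal sketch (assuming a SparkSession named spark has already been created, as shown later in this article), the partition count can be inspected like this:

// Sketch: inspect how many partitions an RDD has (assumes an existing SparkSession `spark`)
val rdd = spark.sparkContext.parallelize(1 to 100, 4) // request 4 partitions
println(rdd.getNumPartitions)                          // prints 4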

An RDD is a collection of objects similar to a collection in Scala or Java, with the difference that an RDD is computed across several JVMs (Java Virtual Machines) running on multiple physical servers, also called nodes, in a cluster, whereas a Scala collection lives on a single JVM.

RDDs also abstract away the partitioning and distribution of the data, which is designed to run computations in parallel on several nodes. This means that while applying transformations on an RDD, most of the time we don't have to worry about parallelism and can work on the RDD as a single dataset.

Note: An RDD can have a name and a unique identifier (id).
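
A small sketch of this (again assuming an existing SparkSession named spark): the name can be set with setName, and the id is assigned by the SparkContext.

// Sketch: inspecting an RDD's name and id (assumes an existing SparkSession `spark`)
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
rdd.setName("sample-rdd")   // assign a human-readable name
println(rdd.name)           // prints: sample-rdd
println(rdd.id)             // prints the unique id assigned by the SparkContext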

Advantages Of RDD Data Structure

RDD is the basic data structure in Spark, and Spark runs on RDDs. The RDD data model enables distributed processing that would otherwise be difficult, and it has many advantages that made it popular in the big data and distributed processing space. A few of them are listed below.

  • RDDs made in-memory processing possible, addressing a drawback of the Hadoop MapReduce data model.

  • RDDs are immutable, which reduces operational cost at the distributed level.

  • RDDs are fault-tolerant, which means that in case of a processing node failure, an RDD can be regenerated from its previous step (its lineage) on another node.

  • Lazy evaluation - the Spark engine plans the computation before it executes, and an RDD is materialized only when its necessary dependencies have been processed (see the sketch after this list).

  • As RDDs are immutable, they can be easily partitioned and processed in parallel in a distributed environment.
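
Here is a minimal sketch of the lazy-evaluation point above (assuming a SparkSession named spark is available): the map and filter calls only record the plan; nothing runs until an action is invoked.

// Sketch of lazy evaluation (assumes an existing SparkSession `spark`)
val numbers = spark.sparkContext.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing is computed yet
val doubled = evens.map(_ * 2)             // transformation: still nothing is computed
println(doubled.count())                   // action: the whole chain executes here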

Downsides Of RDD

RDDs are immutable data structures, as mentioned in the previous sections. This creates a few limitations in Spark applications.

  • Spark RDDs are not suitable for applications that make fine-grained updates to shared state, such as storage systems. For such applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases. The goal of RDDs is to provide an efficient programming model for batch analytics.

  • An RDD can belong to only one SparkContext.

Getting Started With RDD

RDDs are immutable distributed datasets. Viewed another way, they are immutable JVM objects, with attributes and methods associated with them like any other Java object.

RDD Creation

RDDs are created primarily in two ways:

  • Parallelizing an existing collection
  • Referencing a dataset in an external storage system such as S3, GCS, or HDFS

Before we look into examples, let's initialize a SparkSession using the builder pattern defined in the SparkSession class. While initializing, we provide the master and application name, as shown below in Scala and then PySpark.

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("sde.whiletrue.live")
    .getOrCreate()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local[1]") \
        .appName("sde.whiletrue.live") \
        .getOrCreate()


Create RDD Using parallelize()

The example below shows the parallelize method in Scala and PySpark.

// Create RDD from parallelize (Scala)
import org.apache.spark.rdd.RDD

val dataset = Range(0, 1000)
val rdd: RDD[Int] = spark.sparkContext.parallelize(dataset)

# Create RDD from parallelize (PySpark)
dataset = range(1000)
rdd = spark.sparkContext.parallelize(dataset)

Create RDD From Flat Files

In practice it is more common to load the dataset from a file in storage. The examples below show the creation of an RDD from flat files.

Example: Using the textFile() method we can read a text (.txt) file into an RDD (Scala).

// Create RDD from an external data source
val rdd = spark.sparkContext.textFile("/path/textFile.txt")

Example: The wholeTextFiles() method returns a PairRDD with the key being the file path and the value being the file content (Scala).

// Reads the entire file into an RDD as a single record.
val rdd = spark.sparkContext.wholeTextFiles("/path/textFile.txt")

Example: Using the textFile() method we can read a text (.txt) file into an RDD (PySpark).

# Create RDD from an external data source
rdd = spark.sparkContext.textFile("/path/textFile.txt")

Example: The wholeTextFiles() method returns a PairRDD with the key being the file path and the value being the file content (PySpark).

# Reads the entire file into an RDD as a single record.
rdd = spark.sparkContext.wholeTextFiles("/path/textFile.txt")

Create An Empty RDD

Using the emptyRDD() method on sparkContext, we can create an RDD with no data. This method creates an empty RDD with no partitions.

// Creates empty RDD with no partitions (Scala)

// creates org.apache.spark.rdd.RDD[Nothing] = EmptyRDD[0]
val rdd = spark.sparkContext.emptyRDD

// creates org.apache.spark.rdd.RDD[String] = EmptyRDD[1]
val rddString = spark.sparkContext.emptyRDD[String]

# Creates empty RDD with no partitions (PySpark)
rdd = spark.sparkContext.emptyRDD()

Create An Empty RDD With Partitions

Sometimes we may need to write an empty RDD to files by partition. In this case, you should create an empty RDD with partitions.

// Create empty RDD with partitions (Scala)
val rdd = spark.sparkContext.parallelize(Seq.empty[String])

# Create empty RDD with partitions (PySpark)
rdd = spark.sparkContext.parallelize(list())
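
If a specific partition count is needed, parallelize also accepts the number of slices as a second argument; a small sketch in Scala:

// Sketch: empty RDD with an explicit number of partitions
val emptyWithPartitions = spark.sparkContext.parallelize(Seq.empty[String], 10)
println(emptyWithPartitions.getNumPartitions)   // prints 10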

Operations On RDDs

In this section we will discuss the basic operations on RDDs, such as map, filter, and persist. These operations can be broadly categorized as transformations and actions.

RDD Transformations

Transformations are lazy operations; instead of updating an RDD in place, they return a new RDD.
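
As a minimal sketch (assuming a SparkSession named spark is available), each transformation below produces a new RDD while the source RDD is left untouched:

// Sketch: transformations return new RDDs; the source RDD is never modified
val words     = spark.sparkContext.parallelize(Seq("spark rdd", "lazy eval"))
val tokens    = words.flatMap(_.split(" "))    // new RDD of individual words
val longOnes  = tokens.filter(_.length > 3)    // another new RDD
val upperCase = longOnes.map(_.toUpperCase)    // the original `words` RDD is untouched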

Below is a list of commonly used transformations in Spark.

General Transformations

  • map
  • filter
  • flatMap
  • mapPartitions
  • mapPartitionsWithIndex
  • groupBy
  • sortBy

Statistical Transformations

  • sample
  • randomSplit

Relational Transformations

  • union
  • intersection
  • subtract
  • distinct
  • cartesian
  • zip

Data Structural Transformations

  • keyBy
  • zipWithIndex
  • zipWithUniqueID
  • zipPartitions
  • coalesce
  • repartition
  • repartitionAndSortWithinPartitions
  • pipe

RDD Actions

Actions are operations that trigger computation and return non-RDD values, which are sent back to the driver program.
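
A short sketch of a few common actions (assuming a SparkSession named spark is available):

// Sketch: actions trigger execution and return plain values to the driver
val nums  = spark.sparkContext.parallelize(1 to 10)
val total = nums.reduce(_ + _)   // returns an Int: 55
val first = nums.first()         // returns the first element: 1
val all   = nums.collect()       // returns an Array[Int] on the driver
println(s"total=$total, first=$first, collected=${all.length} elements")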

Below is a list of commonly used actions in Spark.

General Actions

  • reduce
  • collect
  • aggregate
  • fold
  • first
  • take
  • foreach
  • top
  • treeAggregate
  • treeReduce
  • foreachPartition
  • collectAsMap

Statistical Actions

  • count
  • takeSample
  • max
  • min
  • sum
  • histogram
  • mean
  • variance
  • stdev
  • sampleVariance
  • countApprox
  • countApproxDistinct

Relational Actions

  • takeOrdered

IO Actions

  • saveAsTextFile
  • saveAsSequenceFile
  • saveAsObjectFile
  • saveAsHadoopDataset
  • saveAsHadoopFile
  • saveAsNewAPIHadoopDataset
  • saveAsNewAPIHadoopFile
