Spark Context
SparkContext was the entry point for a Spark application prior to Spark 2.x. SparkSession was then introduced as a unified entry point that wraps SparkContext, SQLContext, StreamingContext, and HiveContext. SparkContext is still used even after the Spark 2.x release.
Read about SparkSession
Understanding SparkContext
SparkContext enables us to interact with the Spark cluster and work with RDDs. Creating a SparkContext is the first step in using RDDs and connecting to the Spark cluster.
Following are the key points about SparkContext:
- It acts as the entry point to the Spark cluster.
- SparkContext APIs enable us to work with RDDs and the cluster.
- Using SparkContext, we can programmatically create accumulators and broadcast variables in Spark (see the sketch after this list).
- We can have only one active SparkContext per Spark application. To create a new SparkContext, we need to stop the active one.
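As a quick illustration of the accumulator and broadcast APIs mentioned above, here is a minimal Scala sketch; it assumes an existing SparkContext named sc (for example, the one provided by spark-shell), and the variable names and sample data are only illustrative.

val acc = sc.longAccumulator("rowCount")            // accumulator created on the driver
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))  // read-only variable shared with the cluster

sc.parallelize(1 to 5).foreach { n =>
  acc.add(1)            // tasks can only add to the accumulator
  lookup.value.get(n)   // tasks read the broadcast value
}
println(acc.value)      // the driver reads the final accumulator value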
Working With SparkContext
As pointed out before, SparkContext exposes APIs to interact with the cluster and RDDs. In the following sections, we will discuss:
- Using SparkContext In Spark Shell
- Starting SparkContext In Spark Application
- Stopping SparkContext
- Create Sample RDD Using SparkContext
SparkContext In spark-shell
The Spark shell creates a SparkContext object by default when we start the spark-shell program. We can access it by the name sc from the CLI or from utilities such as notebooks and Azure Databricks environments. You can try the following after you run spark-shell in the CLI.
scala> sc
>>> sc
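For instance, you can inspect the default context directly from the Scala shell; the exact values depend on how spark-shell was launched.

scala> sc.appName
scala> sc.master
scala> sc.version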
Start SparkContext In Spark Application
Since the release of Spark 2.x, we use SparkSession as the entry point to a Spark application and to communicate with Spark objects. SparkSession exposes almost the same methods and APIs for interacting with the Spark cluster. In fact, behind the scenes, SparkSession creates and uses a SparkContext object.
Note: A Spark application can have only one SparkContext instance at a time. Since Spark 2.x, initializing a SparkSession internally initializes the SparkContext object.
The following examples show the usage of SparkContext in Spark applications. Examples are given in Scala, Python (PySpark), and Java.
import org.apache.spark.sql.SparkSession

object MySpark extends App {
  // create a SparkSession
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("sde.whiletrue.live")
    .getOrCreate()

  // access the underlying SparkContext and its app name
  val sc = spark.sparkContext
  val appName = spark.sparkContext.appName
}
# Create SparkSession from builder
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('sde.whiletrue.live') \
    .getOrCreate()

# access the underlying SparkContext, app name, and application ID
sc = spark.sparkContext
app_name = spark.sparkContext.appName
app_id = spark.sparkContext.applicationId
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

public class MyJavaSparkApp {
    public static void main(String[] args) {
        // create a SparkSession
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("sparkInJava")
                .getOrCreate();

        // access the underlying SparkContext and print the app name
        SparkContext sc = spark.sparkContext();
        System.out.println("App Name: " + sc.appName());
    }
}
Stop SparkContext
To stop the active SparkContext, use the stop() method. The below examples show stop() usage.
// Scala
spark.sparkContext.stop()

# Python (PySpark)
spark.sparkContext.stop()

// Java
spark.sparkContext().stop()
As discussed, we can have only one SparkContext per JVM. If you want to create another SparkContext object, you need to stop the active one using the stop() method and then create a new SparkContext instance.
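As a rough Scala sketch of that flow (the master URL and app name here are only placeholders), stopping the current context and creating a new one could look like this:

import org.apache.spark.{SparkConf, SparkContext}

// stop the currently active context
spark.sparkContext.stop()

// build a fresh SparkContext with its own configuration
val conf = new SparkConf().setMaster("local[1]").setAppName("new-context")
val newSc = new SparkContext(conf)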
Create Sample RDD Using SparkContext
The below examples show the creation of a sample RDD using SparkContext.
// Scala
val rdd = spark.sparkContext.range(1, 10)
rdd.collect().foreach(print)

# Python (PySpark)
rdd = spark.sparkContext.range(1, 10)
rdd.collect()

// Java
RDD<Object> data = sc.range(1, 1000000, 1, 12);
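Besides range, a local collection can also be turned into an RDD with parallelize; a minimal Scala sketch (the sample data is illustrative):

// create an RDD from a local Scala collection
val fruits = spark.sparkContext.parallelize(Seq("apple", "banana", "cherry"))
println(fruits.count())   // 3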
Frequently Used SparkContext Methods
Method | Description |
---|---|
longAccumulator() | Creates an accumulator variable of the long data type. Only the driver can read an accumulator's value. |
doubleAccumulator() | Creates an accumulator variable of the double data type. Only the driver can read an accumulator's value. |
applicationId | Returns the unique ID of the Spark application. |
appName | Returns the app name that was given when creating the SparkContext. |
broadcast | Broadcasts a read-only variable to the entire cluster; the broadcast value cannot be changed after it is created. |
emptyRDD | Creates an empty RDD. |
getPersistentRDDs | Returns all persisted RDDs. |
getOrCreate() | Creates a new SparkContext or returns the existing one. |
hadoopFile | Returns an RDD of a Hadoop file. |
master() | Returns the master URL that was set while creating the SparkContext. |
newAPIHadoopFile | Creates an RDD for a Hadoop file with the new-API InputFormat. |
sequenceFile | Gets an RDD for a Hadoop SequenceFile with the given key and value types. |
setLogLevel | Changes the log level (for example DEBUG, INFO, WARN, ERROR, or FATAL). |
textFile | Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of strings. |
union | Unions two RDDs. |
wholeTextFiles | Reads the text files in a folder from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of the file. |
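To tie a few of these methods together, here is a small Scala sketch; it assumes an active SparkSession named spark, and the input path is a hypothetical placeholder.

val sc = spark.sparkContext

sc.setLogLevel("WARN")                      // reduce log noise
println(sc.applicationId)                   // unique ID of this application
println(sc.appName)                         // app name set at creation time

val empty = sc.emptyRDD[String]             // empty RDD of the given type
val lines = sc.textFile("/tmp/input.txt")   // hypothetical input path
val combined = lines.union(empty)           // union of two RDDs
println(combined.count())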