Spark Context

SparkContext was the entry point for a Spark application prior to Spark 2.x. SparkSession was then introduced as a unified entry point that wraps SparkContext, SQLContext, StreamingContext, and HiveContext. SparkContext is still available and widely used even after the Spark 2.x release.

Read about SparkSession

Understanding SparkContext

SparkContext lets us interact with a Spark cluster and work with RDDs. Creating a SparkContext is the first step in using RDDs and connecting to the cluster.

The following are the key points about SparkContext:

  • It acts as the entry point to the Spark cluster

  • SparkContext APIs enable us to work with RDDs and the cluster

  • Using SparkContext, we can programmatically create accumulators and broadcast variables in Spark (see the sketch after this list)

  • We can have only one active SparkContext per Spark application. To create a new SparkContext, we first need to stop the active one.
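
For example, the sketch below (a minimal, illustrative program; the object and variable names are placeholders) uses SparkContext to create a long accumulator and a broadcast variable:

import org.apache.spark.sql.SparkSession

object SharedVariablesSketch extends App {
  // create a session and grab its SparkContext
  val spark = SparkSession.builder()
      .master("local[1]")
      .appName("shared-variables-sketch")
      .getOrCreate()
  val sc = spark.sparkContext

  // accumulator: tasks add to it, the driver reads the final value
  val errorCount = sc.longAccumulator("errorCount")

  // broadcast variable: read-only lookup shipped once to every executor
  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

  sc.parallelize(Seq("a", "b", "x")).foreach { key =>
    if (!lookup.value.contains(key)) errorCount.add(1)
  }

  println(s"Unknown keys: ${errorCount.value}") // 1
  spark.stop()
}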

[Figure: Spark Context architecture (Source: spark.apache.org)]

Working With SparkContext

As pointed out earlier, SparkContext exposes APIs to interact with the cluster and RDDs.

In the following sections, we will discuss how to use SparkContext in spark-shell and in Spark applications, how to stop it, and how to create RDDs with it.

SparkContext In spark-shell

Spark shell creates a SparkContext object by default when we start the spark-shell program. We can access it by the name sc from the CLI, as well as from utilities like notebooks and Azure Databricks environments.

You can try the following after you run spark-shell in the CLI.

scala> sc
>>> sc
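
You can also inspect the pre-built context directly from the Scala shell; a quick sketch (the exact output depends on your installation):

scala> sc.version
scala> sc.appName
scala> sc.master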

Start SparkContext In Spark Application

Since the release of Spark 2.x, SparkSession has been the entry point to a Spark application and the way to communicate with Spark objects. SparkSession exposes almost the same methods and APIs to interact with the Spark cluster.

In fact, behind the scenes, SparkSession creates and uses a SparkContext object.

Note: A Spark application can have only one active SparkContext instance at a time.

Since Spark 2.x, initializing a SparkSession internally initializes a SparkContext object.
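
One way to see this (a minimal sketch, assuming a local master; the object and app names are placeholders) is to check that the context held by the session is the same active instance that SparkContext.getOrCreate() returns:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object SameContextSketch extends App {
  val spark = SparkSession.builder()
      .master("local[1]")
      .appName("same-context-sketch")
      .getOrCreate()

  // the session wraps the active SparkContext
  assert(spark.sparkContext eq SparkContext.getOrCreate())

  spark.stop()
}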

The following examples show the usage of SparkContext in Spark applications, in Scala, Python (PySpark), and Java.

import org.apache.spark.sql.SparkSession

object MySpark extends App {
  // create spark session
  val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sde.whiletrue.live")
      .getOrCreate()

  // access the underlying SparkContext from the session
  val sc = spark.sparkContext
  val appName = spark.sparkContext.appName
}
# Create SparkSession from builder
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]")\
                    .appName('sde.whiletrue.live')\
                    .getOrCreate()

sc = spark.sparkContext
app_name = spark.sparkContext.appName
app_id = spark.sparkContext.applicationId  
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

public class MyJavaSparkApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
      .master("local[1]")
      .appName("sparkInJava")
      .getOrCreate();

    SparkContext sc = spark.sparkContext();

    System.out.println("App Name: " + sc.appName());
  }
}

Stop SparkContext

To stop the active SparkContext, use the stop() method.

The examples below show stop() usage.

spark.sparkContext.stop()     // Scala
spark.sparkContext.stop()     # PySpark
spark.sparkContext().stop();  // Java

As discussed, we can have only one active SparkContext per JVM. If we want to create another SparkContext object, we need to stop the active one using the stop() method and then create a new instance.
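
A minimal sketch of this pattern in Scala (assuming a local master; the app names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object RestartContextSketch extends App {
  val first = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[1]").setAppName("first-context"))

  // stop the active context before creating another one
  first.stop()

  val second = new SparkContext(
    new SparkConf().setMaster("local[1]").setAppName("second-context"))

  println(second.appName) // second-context
  second.stop()
}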

Create Sample RDD Using SparkContext

The examples below show the creation of a sample RDD using SparkContext.

// Scala
val rdd = spark.sparkContext.range(1, 10)
rdd.collect().foreach(print)

# PySpark
rdd = spark.sparkContext.range(1, 10)
rdd.collect()

// Java
RDD<Object> data = sc.range(1, 1000000, 1, 12);
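
SparkContext can also build an RDD from an in-memory collection. A small Scala sketch, assuming an existing SparkSession named spark as in the examples above:

// parallelize an in-memory collection into an RDD and apply a transformation
val squares = spark.sparkContext
  .parallelize(Seq(1, 2, 3, 4, 5))
  .map(n => n * n)

println(squares.collect().mkString(", ")) // 1, 4, 9, 16, 25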

Frequently Used SparkContext Methods

  • longAccumulator() - Creates an accumulator variable of a long data type. Only the driver can access accumulator values.
  • doubleAccumulator() - Creates an accumulator variable of a double data type. Only the driver can access accumulator values.
  • applicationId - Returns a unique ID of the Spark application.
  • appName - Returns the app name that was given when creating the SparkContext.
  • broadcast - Broadcasts a read-only variable to the entire cluster. A variable can be broadcast to a Spark cluster only once.
  • emptyRDD - Creates an empty RDD.
  • getPersistentRDDs - Returns all persisted RDDs.
  • getOrCreate() - Creates a SparkContext or returns the existing one.
  • hadoopFile - Returns an RDD of a Hadoop file.
  • master() - Returns the master that was set while creating the SparkContext.
  • newAPIHadoopFile - Creates an RDD for a Hadoop file with a new-API InputFormat.
  • sequenceFile - Gets an RDD for a Hadoop SequenceFile with given key and value types.
  • setLogLevel - Changes the log level to debug, info, warn, fatal, or error.
  • textFile - Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD.
  • union - Unions two RDDs.
  • wholeTextFiles - Reads text files in a folder from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of each tuple is the file name and the second is the contents of the file.
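
A brief Scala sketch exercising a few of these methods (assuming a local master; the object and accumulator names are placeholders):

import org.apache.spark.sql.SparkSession

object ContextMethodsSketch extends App {
  val spark = SparkSession.builder()
      .master("local[1]")
      .appName("context-methods-sketch")
      .getOrCreate()
  val sc = spark.sparkContext

  sc.setLogLevel("WARN")        // reduce log noise

  println(sc.applicationId)     // unique application ID
  println(sc.appName)           // context-methods-sketch
  println(sc.master)            // local[1]

  val rowCount = sc.longAccumulator("rowCount")
  sc.parallelize(1 to 3).foreach(_ => rowCount.add(1))
  println(rowCount.value)       // 3

  println(sc.emptyRDD[String].isEmpty()) // true

  spark.stop()
}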
