Spark Context
SparkContext was the entry point for a Spark application prior to Spark 2.x. SparkSession was then introduced as a unified entry point that wraps SparkContext, SQLContext, StreamingContext, and HiveContext. SparkContext is still used even after the Spark 2.x release.
Read about SparkSession
Understanding SparkContext
SparkContext enables us to interact with the Spark cluster and work with RDDs. Creating a SparkContext is the first step in using RDDs and connecting to the Spark cluster.
Following are the key points about SparkContext:
- It acts as the entry point to the Spark cluster.
- SparkContext APIs enable us to work with RDDs and the cluster.
- Using SparkContext, we can programmatically create accumulators and broadcast variables in Spark (see the sketch after this list).
- We can have only one active SparkContext per Spark application. To create a new SparkContext, we need to stop the active one.
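As a quick illustration of the accumulator and broadcast APIs mentioned above, here is a minimal Scala sketch; it assumes an existing SparkContext named sc (for example, the one provided by spark-shell), and the variable names and sample data are only illustrative.

val acc = sc.longAccumulator("rowCount")            // accumulator created on the driver
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))  // read-only variable shared with the cluster

sc.parallelize(1 to 5).foreach { n =>
  acc.add(1)            // tasks can only add to the accumulator
  lookup.value.get(n)   // tasks read the broadcast value
}
println(acc.value)      // the driver reads the final accumulator value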
Working With SparkContext
As pointed out before, SparkContext exposes APIs to interact with the cluster and RDDs. In the following sections, we will discuss:
- Using SparkContext In Spark Shell
- Starting SparkContext In Spark Application
- Stopping SparkContext
- Create Sample RDD Using SparkContext
SparkContext In spark-shell
The Spark shell creates a SparkContext object by default when we start the spark-shell program. We can access it by the name sc from the CLI or from utilities such as notebooks and Azure Databricks environments. You can try the following after you run spark-shell in the CLI.
scala> sc
>>> sc
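For instance, you can inspect the default context directly from the Scala shell; the exact values depend on how spark-shell was launched.

scala> sc.appName
scala> sc.master
scala> sc.version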
Start SparkContext In Spark Application
Since the release of Spark 2.x, we use SparkSession as the entry point to a Spark application and to communicate with Spark objects. SparkSession exposes almost the same methods and APIs for interacting with the Spark cluster. In fact, behind the scenes, SparkSession creates and uses a SparkContext object.
Note: A Spark application can have only one SparkContext instance at a time. Since Spark 2.x, initializing a SparkSession internally initializes the SparkContext object.
The following examples show the usage of SparkContext in Spark applications. Examples are given in Scala, Python (PySpark), and Java.
import org.apache.spark.sql.SparkSession

object MySpark extends App {
  // create a SparkSession
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("sde.whiletrue.live")
    .getOrCreate()

  // access the underlying SparkContext and its app name
  val sc = spark.sparkContext
  val appName = spark.sparkContext.appName
}
# Create SparkSession from builder
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('sde.whiletrue.live') \
    .getOrCreate()

# access the underlying SparkContext, app name, and application ID
sc = spark.sparkContext
app_name = spark.sparkContext.appName
app_id = spark.sparkContext.applicationId
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

public class MyJavaSparkApp {
    public static void main(String[] args) {
        // create a SparkSession
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("sparkInJava")
                .getOrCreate();

        // access the underlying SparkContext and print the app name
        SparkContext sc = spark.sparkContext();
        System.out.println("App Name: " + sc.appName());
    }
}
Stop SparkContext
To stop the active SparkContext, use the stop() method. The below examples show stop() usage.
// Scala
spark.sparkContext.stop()

# Python (PySpark)
spark.sparkContext.stop()

// Java
spark.sparkContext().stop()
As discussed, we can have only one SparkContext per JVM. If you want to create another SparkContext object, you need to stop the active one using the stop() method and then create a new SparkContext instance.
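As a rough Scala sketch of that flow (the master URL and app name here are only placeholders), stopping the current context and creating a new one could look like this:

import org.apache.spark.{SparkConf, SparkContext}

// stop the currently active context
spark.sparkContext.stop()

// build a fresh SparkContext with its own configuration
val conf = new SparkConf().setMaster("local[1]").setAppName("new-context")
val newSc = new SparkContext(conf)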
Create Sample RDD Using SparkContext
The below examples show the creation of a sample RDD using SparkContext.
// Scala
val rdd = spark.sparkContext.range(1, 10)
rdd.collect().foreach(print)

# Python (PySpark)
rdd = spark.sparkContext.range(1, 10)
rdd.collect()

// Java
RDD<Object> data = sc.range(1, 1000000, 1, 12);
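Besides range, a local collection can also be turned into an RDD with parallelize; a minimal Scala sketch (the sample data is illustrative):

// create an RDD from a local Scala collection
val fruits = spark.sparkContext.parallelize(Seq("apple", "banana", "cherry"))
println(fruits.count())   // 3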
Frequently Used SparkContext Methods
Method | Description |
---|---|
longAccumulator() | Creates an accumulator variable of the long data type. Only the driver can read an accumulator's value. |
doubleAccumulator() | Creates an accumulator variable of the double data type. Only the driver can read an accumulator's value. |
applicationId | Returns the unique ID of the Spark application. |
appName | Returns the app name that was given when creating the SparkContext. |
broadcast | Broadcasts a read-only variable to the entire cluster; the broadcast value cannot be changed after it is created. |
emptyRDD | Creates an empty RDD. |
getPersistentRDDs | Returns all persisted RDDs. |
getOrCreate() | Creates a new SparkContext or returns the existing one. |
hadoopFile | Returns an RDD of a Hadoop file. |
master() | Returns the master URL that was set while creating the SparkContext. |
newAPIHadoopFile | Creates an RDD for a Hadoop file with the new-API InputFormat. |
sequenceFile | Gets an RDD for a Hadoop SequenceFile with the given key and value types. |
setLogLevel | Changes the log level (for example DEBUG, INFO, WARN, ERROR, or FATAL). |
textFile | Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of strings. |
union | Unions two RDDs. |
wholeTextFiles | Reads the text files in a folder from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of the file. |
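To tie a few of these methods together, here is a small Scala sketch; it assumes an active SparkSession named spark, and the input path is a hypothetical placeholder.

val sc = spark.sparkContext

sc.setLogLevel("WARN")                      // reduce log noise
println(sc.applicationId)                   // unique ID of this application
println(sc.appName)                         // app name set at creation time

val empty = sc.emptyRDD[String]             // empty RDD of the given type
val lines = sc.textFile("/tmp/input.txt")   // hypothetical input path
val combined = lines.union(empty)           // union of two RDDs
println(combined.count())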