bigdata

Sharding vs Partitioning

Sharding and partitioning are both techniques used in database management to break up large databases into smaller, more manageable parts. However, there are some key differences between the two approaches.

Posted May 2, 2023 by Rohith and Anusha ‐ 2 min read

⌖ quick-references bigdata blog

Spark Architecture

Apache Spark is a unified, open-source, distributed data processing engine for big data. In this article, we discuss the Spark architecture, its distributed nature, and how it processes high-volume data.

Posted August 4, 2022 by Rohith ‐ 7 min read

⌖ apache spark bigdata architecture transformations distributed-system actions rdd

Spark Memory Management

The main feature of Apache Spark is its ability to run computations in memory, so memory management plays a very important role in the whole system. In this article, we dive into Spark memory management.

Posted August 9, 2022 by Rohith ‐ 11 min read

⌖ apache spark bigdata architecture memory jvm yarn heap off-heap distributed-system gc

Spark Session

SparkSession is the entry point for Spark applications to create RDDs, DataFrames, and Datasets.

Posted August 22, 2022 by Rohith ‐ 8 min read

⌖ apache spark bigdata distributed-system spark-session

Spark Context

SparkContext was the entry point for Spark applications prior to Spark 2.x. SparkSession was then introduced as a common entry point for SparkContext, SQLContext, StreamingContext, and HiveContext. SparkContext is still in use after the Spark 2.x release.

Posted August 23, 2022 by Rohith ‐ 4 min read

⌖ apache spark bigdata distributed-system