In the world of Big Data, Spark is a big name. Spark's in-memory computation overcomes a shortcoming of Hadoop's MapReduce architecture, where sharing data between jobs is slow because of replication, serialization, and disk I/O. Many Hadoop applications reportedly spend more than 90 percent of their time on HDFS read and write operations. But Spark's in-memory computation comes at a cost: it needs a large amount of RAM, which raises infrastructure costs.
What are RDDs in Spark?
An RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. It is a read-only, partitioned collection of records. It can contain any type of Python, Scala, or Java objects, including user-defined classes. This makes RDDs usable from several languages, although Spark itself is written in Scala. RDDs can hold structured, semi-structured, and unstructured data.
How to create an RDD?
There are two ways to create an RDD:
- Parallelizing an existing collection in your driver program
> val rdd1 = spark.sparkContext.parallelize(1 to 10)
- Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat
> val rdd2 = spark.sparkContext.textFile("file_path/data.txt")
Features of RDD
In-memory Computation - Data is kept in random access memory (RAM) rather than on disk, as it is in Hadoop MapReduce. This reduces the time needed to process, compute, and analyze the data, but requires a large amount of RAM to hold big datasets. This increases infrastructure cost, since RAM is much costlier than disk storage.
Lazy Evaluation - With lazy evaluation, execution does not start until we trigger it. RDD transformations are lazy: Spark only records which operations have been called on the RDD. The recorded operations are executed when an action is called on the RDD.
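To see the laziness in practice, here is a minimal sketch that counts how often the function passed to `map` actually runs, before and after an action. The object name `LazyEvaluationSketch`, the accumulator name, and the local master setting are illustrative assumptions, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are for illustration only.
object LazyEvaluationSketch {
  // Returns (map calls seen before the action, map calls seen after it).
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    // The accumulator counts how often the function passed to map runs.
    val calls = sc.longAccumulator("mapCalls")
    val doubled = sc.parallelize(1 to 10).map { n => calls.add(1); n * 2 }
    val before = calls.sum // the transformation has only been recorded
    doubled.count()        // action: this triggers the actual job
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("lazy-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

In a local run where no task is retried, the first number stays at zero and the second equals the number of elements, because the `map` function only runs once `count()` is called.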
Distributed Environment - RDDs are divided into logical partitions. These partitions may be computed on different nodes of the cluster, which enables parallel processing. Working on many partitions at the same time increases the speed of the operation.
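The partitioning can be inspected directly. The sketch below asks for four partitions and uses `glom()` to group the elements by the partition that holds them. The object name `PartitionSketch` and the choice of four partitions are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are for illustration only.
object PartitionSketch {
  // Returns one array per partition, holding that partition's elements.
  def run(spark: SparkSession): Array[Array[Int]] = {
    // numSlices asks Spark to split the collection into 4 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 10, numSlices = 4)
    // glom() turns each partition into a single array.
    rdd.glom().collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()
    run(spark).zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }
    spark.stop()
  }
}
```

On a real cluster each of these partitions can be processed by a different executor; in local mode they are processed by different threads.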
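Fault Tolerance - RDDs are highly resilient because Spark records the lineage of each RDD, that is, the chain of transformations used to build it. If an executor node fails and a partition is lost, Spark recomputes that partition from its lineage on another node. Replicating the data across nodes is also possible, but only as an optional storage level.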
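Spark keeps, for every RDD, a record of the transformations that produced it, and this record is what allows lost partitions to be rebuilt. As a small sketch, `toDebugString` prints that lineage. The object name `LineageSketch` and the sample words are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are for illustration only.
object LineageSketch {
  // Returns a textual description of how the final RDD is derived.
  def run(spark: SparkSession): String = {
    val words = spark.sparkContext.parallelize(Seq("spark", "rdd", "spark", "hadoop"))
    // Two transformations: map to pairs, then sum the counts per word.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    // toDebugString lists this RDD and the parent RDDs it depends on.
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("lineage-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

The printed lineage names the parent RDDs back to the original parallelized collection, which is the information Spark would use to recompute a lost partition.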
Persistence - This feature makes RDDs reusable by saving the intermediate result of an operation, so that it does not have to be recomputed. It also provides the flexibility to select a storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, or MEMORY_AND_DISK_SER.
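A minimal sketch of persisting an RDD with an explicit storage level might look like the following. The object name `PersistenceSketch`, the sample data, and the choice of `MEMORY_AND_DISK` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; names are for illustration only.
object PersistenceSketch {
  // Returns the storage level that was set on the RDD.
  def run(spark: SparkSession): StorageLevel = {
    val squares = spark.sparkContext.parallelize(1 to 1000).map(n => n.toLong * n)
    // Mark the RDD for caching: memory first, spilling to disk if needed.
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    squares.count() // first action computes the partitions and caches them
    squares.sum()   // later actions can reuse the cached partitions
    val level = squares.getStorageLevel
    squares.unpersist() // release the cached data when it is no longer needed
    level
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("persist-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

Note that `persist` is itself lazy: nothing is cached until the first action runs.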
Coarse-Grained Operations - We can choose from a set of operations such as map, filter, or groupBy. Each operation is applied to all the elements of the dataset rather than to individual records.
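As a short sketch of such operations, the example below applies `filter`, `map`, and `groupBy` to a small RDD; each call describes one function that Spark applies to every element. The object name `CoarseGrainedSketch` and the sample data are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are for illustration only.
object CoarseGrainedSketch {
  // Returns (squares of the even numbers, numbers grouped by remainder mod 3).
  def run(spark: SparkSession): (List[Int], Map[Int, List[Int]]) = {
    val numbers = spark.sparkContext.parallelize(1 to 10)
    // filter and map each apply one function to every element.
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)
    // groupBy assigns every element to a key computed from it.
    val byRemainder = numbers.groupBy(_ % 3).mapValues(_.toList.sorted)
    (evenSquares.collect().toList, byRemainder.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("coarse-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```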
Other features of RDDs include immutability (an RDD cannot be changed once it has been created; a transformation always produces a new RDD), partitioning, and automatic recovery of lost data.
Operations on RDD
RDDs support two main kinds of operations:
- Transformations
- Actions
As mentioned earlier, RDDs are immutable and read-only, so how can transformations be applied to them?
Well, when we run a transformation, it is applied to all the elements of the RDD and creates a new RDD instead of modifying the original one. This is done mainly for fault tolerance and optimization. Actions are then applied to these new RDDs to produce the results.
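The following sketch illustrates this behaviour: a transformation returns a new RDD, the original RDD keeps its contents, and an action such as `collect()` returns the results to the driver. The object name `ImmutabilitySketch` and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names are for illustration only.
object ImmutabilitySketch {
  // Returns (contents of the original RDD, contents of the transformed RDD).
  def run(spark: SparkSession): (List[Int], List[Int]) = {
    val original = spark.sparkContext.parallelize(Seq(1, 2, 3))
    // Transformation: builds a new RDD and leaves `original` untouched.
    val incremented = original.map(_ + 1)
    // Actions: collect() runs the job and returns the elements to the driver.
    (original.collect().toList, incremented.collect().toList)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("immutability-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```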
Here we come to the end of this post, which covered the basics of RDDs: their features, the two ways to create them, and the operations that can be performed on them.