Data skewness is a common challenge for organizations that process large volumes of data with Spark-based EL+T (Extract, Load, and Transform) systems.
In this blog, we will explore the concept of data skewness in Spark and understand the significant impact it can have on the performance of EL+T systems.
Understanding Data Skewness
Data skewness occurs when data is distributed unevenly across the partitions of a workflow. In other words, certain partitions or subsets of data end up containing significantly more records than others. This is particularly problematic with large datasets, because the work assigned to the oversized partitions cannot be spread across the rest of the cluster.
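To make the idea concrete, here is a minimal plain-Python sketch (not Spark code) that assigns records to partitions by key and measures how uneven the result is. The `partition_of` helper, the CRC32-based hashing, and the sample keys are all assumptions made for illustration; Spark's own partitioner uses a different hash function.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Hypothetical stand-in for a hash partitioner (deterministic across runs)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Made-up data: 1,000 records, 950 of which share a single key.
keys = ["hot"] * 950 + [f"key-{i}" for i in range(50)]
sizes = partition_sizes(keys, num_partitions=8)
print("records per partition:", sizes)
print("skew ratio:", round(skew_ratio(sizes), 2))
```

Because every record with the same key hashes to the same partition, the partition holding the "hot" key receives at least 950 of the 1,000 records, no matter how the other keys are spread.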
The Impact of Data Skewness
Data skewness can have several adverse effects on EL+T systems:
- Longer Processing Times: When some partitions have much more data than others, tasks processing those partitions take longer to complete. This imbalance in processing times can significantly slow down the entire workflow.
- Performance Degradation: A few slow tasks delay every downstream step that depends on them, making it challenging to meet processing deadlines and service-level agreements.
- Resource Inefficiency: Skewed data can result in uneven resource consumption, leading to inefficiencies in resource allocation and utilization.
- Increased Costs: Prolonged processing times and inefficient resource usage can result in higher computation costs, impacting the organization’s budget.
- Environmental Impact: Inefficient data processing consumes more energy and resources, contributing to a higher environmental footprint.
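The first and third effects can be approximated with simple arithmetic. The sketch below is a rough model, not a measurement: it assumes every task runs in parallel on its own slot, that all tasks process rows at the same constant rate, and that the row counts and the 50,000 rows-per-second figure are invented for illustration.

```python
def stage_time(task_rows, rows_per_second):
    """Wall-clock time of a stage whose tasks run in parallel: the slowest task decides."""
    return max(task_rows) / rows_per_second


def utilization(task_rows):
    """Fraction of reserved slot-time spent on useful work (1.0 = no idle slots)."""
    return sum(task_rows) / (len(task_rows) * max(task_rows))


# Two hypothetical stages with the same total of 1,000,000 rows across 4 tasks.
balanced = [250_000, 250_000, 250_000, 250_000]
skewed = [910_000, 30_000, 30_000, 30_000]

for name, rows in (("balanced", balanced), ("skewed", skewed)):
    print(name, stage_time(rows, rows_per_second=50_000), round(utilization(rows), 2))
```

Under these assumptions the skewed stage takes several times longer than the balanced one for the same amount of data, and most of its slots sit idle while the one large task finishes.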
Let’s consider a real-world scenario to illustrate the impact of data skewness. Suppose a car insurance company is processing a massive dataset containing approximately 325 million rows. This dataset includes various details such as claim numbers, customer IDs, service center IDs, geographical information, service dates, and more.
Here are some key characteristics of this dataset:
- Multiple records for individual car owners.
- Car owners visiting one or more service centers over time.
- Approximately 95% of records having null values for referring service center ID.
- Inconsistencies in state, city, and zip data for service centers.
- Varied state and zip values for the same car owner, reflecting changes in residence or incomplete data.
Now, imagine a transformation that uses a window function, partitioned by referring service center ID, to identify the most recent non-empty state value for each referring service center. Because about 95% of the records have a null or empty referring service center ID, all of those records fall into the same window partition. The transformation therefore operates on highly skewed data, leading to the issues mentioned earlier.
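The logic of such a transformation can be sketched in plain Python (again, not Spark code). The field names `referring_center_id`, `state`, and `service_date`, and the five sample rows, are hypothetical; the source does not give the actual schema. The point is that the null ID behaves like any other grouping key, so every null row ends up in one group.

```python
from collections import Counter, defaultdict


def latest_state_by_center(records):
    """For each referring center, return the most recent non-empty state (or None)."""
    groups = defaultdict(list)
    for rec in records:
        # None is a valid grouping key, so all null IDs share one group.
        groups[rec["referring_center_id"]].append(rec)

    result = {}
    for center, rows in groups.items():
        # ISO-8601 date strings sort chronologically.
        ordered = sorted(rows, key=lambda r: r["service_date"])
        states = [r["state"] for r in ordered if r["state"]]
        result[center] = states[-1] if states else None
    return result


def group_sizes(records):
    """Number of rows each grouping key contributes."""
    return Counter(rec["referring_center_id"] for rec in records)


# Hypothetical sample rows.
records = [
    {"referring_center_id": None, "state": "TX", "service_date": "2023-01-05"},
    {"referring_center_id": None, "state": "", "service_date": "2023-02-11"},
    {"referring_center_id": None, "state": "CA", "service_date": "2023-03-02"},
    {"referring_center_id": "SC-1", "state": "NY", "service_date": "2023-01-20"},
    {"referring_center_id": "SC-1", "state": None, "service_date": "2023-04-01"},
]

print(latest_state_by_center(records))
print(group_sizes(records))
```

In the real dataset the group keyed by the null ID would hold roughly 95% of the 325 million rows, so the sort inside that one group dominates the cost of the whole transformation.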
Computing Skewness Using Spark Engine
In Spark, computing over skewed data at this scale can mean handling billions of rows distributed across many nodes in a cluster. This gives rise to challenges related to scalability, performance, and resource optimization.
Spark employs a distributed computing model that divides data into smaller chunks and processes these partitions concurrently across multiple cluster nodes. The number and size of these partitions can be adjusted to align with the dataset’s characteristics and the available resources.
Each Spark task processes one partition, so the number of records a single task handles depends on the partition size and on the memory and processing capacity of the node it runs on. Depending on the computation's complexity and the resources available, one task may handle millions of rows, and with heavily skewed data potentially far more.
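A back-of-the-envelope calculation shows what this means for the car insurance example. The 325 million rows and the 95% null share come from the scenario above; the partition counts of 200 and 2,000 are assumptions chosen only to illustrate the arithmetic, and the skewed figure is a lower bound that assumes all null-ID rows land in a single partition.

```python
TOTAL_ROWS = 325_000_000  # from the example dataset
NULL_PERCENT = 95         # share of rows with a null referring service center ID


def rows_per_task_even(total_rows, num_partitions):
    """Rows per task if records were spread perfectly evenly (ceiling division)."""
    return -(-total_rows // num_partitions)


def largest_task_when_keyed(total_rows, null_percent):
    """Lower bound on the biggest task when all null-ID rows share one partition."""
    return total_rows * null_percent // 100


for num_partitions in (200, 2_000):  # assumed partition counts
    even = rows_per_task_even(TOTAL_ROWS, num_partitions)
    largest = largest_task_when_keyed(TOTAL_ROWS, NULL_PERCENT)
    print(f"{num_partitions} partitions: even share {even:,} rows, "
          f"largest task at least {largest:,} rows")
```

Raising the partition count shrinks the even share, but the largest task stays the same size, because rows that share a key cannot be split across partitions by key-based partitioning alone.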
In our next blog, we will explore Spark Engine’s execution process and how it addresses data skewness through logical plan optimization.