Reading and Writing Deeply Partitioned Files in Apache Spark
In large-scale data engineering and analytics, files are often stored in deeply partitioned directories to improve performance and manageability. This structure is common in data lakes where data is organised hierarchically by time (e.g., year, month, day).
A deeply partitioned structure enables Apache Spark to efficiently prune irrelevant partitions, resulting in faster queries and improved resource utilisation. In this article, we will explore how to read and write deeply partitioned files using Apache Spark in Java.
1. Setting Up an Apache Spark Example Project
Let’s start by including the following Maven dependencies in the project.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.13</artifactId>
<version>3.5.7</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.13</artifactId>
<version>3.5.7</version>
<scope>provided</scope>
</dependency>
These dependencies ensure that the Java program can interact with the Spark runtime, handle distributed data processing, and perform SQL-style operations on structured datasets. The spark-core library provides essential functionalities for Spark applications, while spark-sql adds support for working with DataFrames and executing SQL queries.
Note: The folder structure assumes the source data is already organised into partition folders following a standard date-time hierarchy (year, month, day, hour), similar to how partitioned datasets are typically structured in distributed storage systems.
2. Java Program
This code example represents the core logic for reading and writing deeply partitioned files using Apache Spark in Java. It demonstrates how Spark can automatically traverse a nested directory structure, extract partition metadata, and reorganise the data into a new partitioned format.
public class DeepPartitionExample {
public static void main(String[] args) {
// Initialize Spark Session
SparkSession spark = SparkSession.builder()
.appName("DeepPartitionExampleApp")
.master("local[*]") // for local testing
.getOrCreate();
String inputPath = "data/sales_partitioned_source/customers";
// Read deeply partitioned CSV files
var inputDf = spark.read()
.format("csv")
.schema("CustomerId STRING, Region STRING, PurchaseAmount DOUBLE")
.option("header", "true")
.option("pathGlobFilter", "*.csv")
.option("recursiveFileLookup", "true")
.load(inputPath);
// Extract partition columns from file paths
var processedDf = inputDf
.withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
.withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
.withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1));
// Write the DataFrame with partitions
String outputPath = "data/sales_partitioned_output/customers";
processedDf.write()
.format("csv")
.option("header", "true")
.mode("overwrite")
.partitionBy("year", "month", "day")
.save(outputPath);
spark.stop();
}
}
Here’s a breakdown of each part.
Initializing the Spark Session
The program begins by creating a SparkSession, which serves as the entry point for all Spark functionalities. The .master("local[*]") configuration indicates that the program will run locally using all available CPU cores, making it suitable for local testing and development.
SparkSession spark = SparkSession.builder()
.appName("DeepPartitionExampleApp")
.master("local[*]")
.getOrCreate();
Reading Deeply Partitioned Files
Next, the code reads a collection of CSV files stored under a deeply partitioned directory structure. The inputPath variable points to the base directory containing subfolders such as year=2024/month=09/day=28/. The .read() operation configures the input format as CSV and explicitly defines the schema.
The following read options are particularly important:
option("pathGlobFilter", "*.csv")ensures that only CSV files are processed.option("recursiveFileLookup", "true")enables Spark to look through all nested subdirectories and read every CSV file under the given path.
String inputPath = "data/sales_partitioned_source/customers";
var inputDf = spark.read()
.format("csv")
.schema("CustomerId STRING, Region STRING, PurchaseAmount DOUBLE")
.option("header", "true")
.option("pathGlobFilter", "*.csv")
.option("recursiveFileLookup", "true")
.load(inputPath);
This capability makes it possible to process deeply partitioned datasets without manually enumerating each subdirectory.
Extracting Partition Columns
After reading the files, the code uses Spark SQL functions such as regexp_extract and input_file_name() to extract partition information embedded in the file paths.
var processedDf = inputDf
.withColumn("year", regexp_extract(input_file_name(), "year=(\\d{4})", 1))
.withColumn("month", regexp_extract(input_file_name(), "month=(\\d{2})", 1))
.withColumn("day", regexp_extract(input_file_name(), "day=(\\d{2})", 1));
This transformation step adds new columns to the DataFrame for each partition value. It enables Spark to use these columns in operations such as filtering, aggregation, or repartitioning, turning directory information into actual data fields that can be queried and processed.
Writing the Processed Data with New Partitions
Once we’ve extracted the year, month, and day columns, we can instruct Spark to write the data back in a partitioned layout.
processedDf.write()
.format("csv")
.option("header", "true")
.mode("overwrite")
.partitionBy("year", "month", "day")
.save(outputPath);
Here:
.format("csv")writes the data in CSV format..mode("overwrite")clears any existing output directory before writing..partitionBy("year", "month", "day")organizes the files in partitioned folders.
After writing the results, we call spark.stop(). This gracefully shuts down the Spark context, releasing all allocated resources. This step is important in long-running pipelines or scheduled jobs.
3. Conclusion
This article explored a Java example that demonstrates how to read and write deeply partitioned files using Apache Spark. It covered how to set up a Spark session, read files recursively from nested directories, extract partition information using regular expressions, write partitioned outputs, and organize data for efficient processing.




