Batch Processing with DataLoader
When building applications that process large amounts of data, one of the biggest performance challenges comes from inefficient data fetching. If every request to retrieve data from a database or an external service is handled individually, the system can quickly become slow and resource-heavy. DataLoader is a utility that addresses this challenge by grouping multiple data requests into a single batch and caching results within the same processing cycle. This approach reduces redundant queries, lowers network overhead, and ensures smoother and more efficient data pipelines.
In this article, we will explore how DataLoader helps optimize data fetching in Java applications, particularly in batch processing scenarios.
1. What is DataLoader?
DataLoader is a utility that batches and caches data-fetching tasks to avoid the “N+1 problem”. Instead of making a separate call for each request, DataLoader queues the keys passed to load() and resolves them together when dispatch() is called. The original JavaScript implementation batches within a single tick of the event loop; Java has no built-in event loop, so the Java port relies on this explicit dispatch step. Batching cuts down the number of queries or API calls, while caching prevents re-fetching the same data multiple times.
For example, imagine needing to fetch user details for a list of orders. Without DataLoader, the system might call the database once for each user, leading to dozens of queries. With DataLoader, all user IDs can be collected and fetched in a single query, then efficiently distributed back to the requesting code.
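The difference between the two access patterns can be sketched without any database at all. The example below is a minimal illustration that uses only the JDK; the class NPlusOneDemo, the in-memory USERS map, and the queryCount counter are hypothetical stand-ins for a real table and real round-trips.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NPlusOneDemo {
    // Hypothetical in-memory stand-in for a users table.
    static final Map<Integer, String> USERS = Map.of(1, "Thomas", 2, "Barry", 3, "Charlie");

    // Counts simulated database round-trips.
    static int queryCount = 0;

    // One round-trip per ID: the N+1 pattern.
    static String findOne(int id) {
        queryCount++;
        return USERS.get(id);
    }

    // One round-trip for the whole list: the batched pattern.
    static List<String> findMany(List<Integer> ids) {
        queryCount++;
        List<String> names = new ArrayList<>();
        for (int id : ids) {
            names.add(USERS.get(id));
        }
        return names;
    }

    public static void main(String[] args) {
        List<Integer> orderUserIds = List.of(1, 2, 3);

        queryCount = 0;
        for (int id : orderUserIds) {
            findOne(id);
        }
        System.out.println("Individual lookups: " + queryCount + " queries");

        queryCount = 0;
        findMany(orderUserIds);
        System.out.println("Batched lookup: " + queryCount + " query");
    }
}
```

With three IDs the individual path counts three round-trips and the batched path counts one. The gap grows linearly with the number of orders.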
1.1 Why Use DataLoader for Batch Processing?
Batch processing scenarios often involve processing large volumes of related data. A naïve implementation might lead to repetitive queries that scale poorly. DataLoader helps by:
- Batching: Combines multiple requests into one bulk request.
- Caching: Stores results temporarily to avoid redundant fetching within the same request lifecycle.
- Efficiency: Reduces database load and network round-trips.
- Consistency: Matches each result to the key that requested it, provided the batch function returns values in the same order as the keys it received.
By incorporating DataLoader into batch processing, applications can achieve both better performance and scalability, especially in systems dealing with complex relationships or high-throughput data processing.
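To make the batching, caching, and ordering ideas concrete, here is a deliberately simplified loader written with the JDK only. MiniLoader is a hypothetical class for illustration; it is not how java-dataloader is implemented, it is not thread-safe, and it assumes the batch function returns exactly one value per key in key order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical, simplified loader; NOT the java-dataloader implementation.
public class MiniLoader<K, V> {
    private final Function<List<K>, List<V>> batchFunction;
    // One future per distinct key; this map doubles as the cache.
    private final Map<K, CompletableFuture<V>> cache = new LinkedHashMap<>();
    // Keys requested since the last dispatch.
    private final List<K> pending = new ArrayList<>();

    public MiniLoader(Function<List<K>, List<V>> batchFunction) {
        this.batchFunction = batchFunction;
    }

    // Queues the key and returns a future; repeated keys share one future.
    public CompletableFuture<V> load(K key) {
        return cache.computeIfAbsent(key, k -> {
            pending.add(k);
            return new CompletableFuture<>();
        });
    }

    // Sends all queued keys to the batch function in a single call.
    public void dispatch() {
        if (pending.isEmpty()) {
            return;
        }
        List<K> keys = List.copyOf(pending);
        pending.clear();
        List<V> values = batchFunction.apply(keys);
        // Results are matched to keys by position, so order matters.
        for (int i = 0; i < keys.size(); i++) {
            cache.get(keys.get(i)).complete(values.get(i));
        }
    }

    public static void main(String[] args) {
        MiniLoader<Integer, String> loader = new MiniLoader<Integer, String>(keys -> {
            System.out.println("Batch call with keys: " + keys);
            return keys.stream().map(k -> "user-" + k).toList();
        });
        CompletableFuture<String> first = loader.load(1);
        CompletableFuture<String> second = loader.load(2);
        CompletableFuture<String> repeat = loader.load(1); // served from the cache
        loader.dispatch(); // one batch call for keys [1, 2]
        System.out.println(first.join() + ", " + second.join() + ", " + repeat.join());
    }
}
```

Three load calls result in one call to the batch function with two distinct keys. The positional matching in dispatch() is the reason a batch function has to return values in the same order as its keys.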
2. Setting Up a Java Project with DataLoader
To use DataLoader in Java, we can rely on the java-dataloader library from the graphql-java project, whose classes live in the org.dataloader package. Let’s set up a Maven-based Java project that demonstrates its use in a batch-processing scenario. Here are the pom.xml dependencies for our project:
<dependency>
    <groupId>com.graphql-java</groupId>
    <artifactId>java-dataloader</artifactId>
    <version>5.0.2</version>
</dependency>
<dependency>
    <groupId>com.h2database</groupId>
    <artifactId>h2</artifactId>
    <version>2.3.232</version>
    <scope>runtime</scope>
</dependency>
This configuration sets up a Maven project with an in-memory H2 database that requires no external setup. The java-dataloader dependency adds batching and caching support.
3. Setting Up the Database
We will now configure our in-memory H2 database by defining three components: a model class (User), a database initializer (DatabaseInitializer), and a repository (UserRepository).
Defining the User Model
First, we create a simple User entity that represents the data we will store and fetch. This class contains two fields: id and firstname.
public class User {
    private final int id;
    private final String firstname;

    public User(int id, String firstname) {
        this.id = id;
        this.firstname = firstname;
    }

    public int getId() {
        return id;
    }

    public String getFirstname() {
        return firstname;
    }
}
In this code, the final keyword is used for the fields id and firstname. Declaring a field as final means its value cannot be reassigned once it is initialized. Because both fields are final and the class exposes no setters, a User object cannot change after it is created. Immutability is a common best practice for model classes in batch processing and caching scenarios because it helps maintain consistency, avoids accidental modifications, and makes objects safe to share across threads.
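On Java 16 or later, the same immutable shape can also be written as a record. The sketch below is an optional alternative, not part of the example project; RecordDemo is a hypothetical wrapper class, and the accessor is named firstname() rather than getFirstname(), so the rest of the article's code would need adjusting if it were adopted.

```java
public class RecordDemo {
    // Compact immutable carrier: final fields, accessors, equals, hashCode, toString.
    record User(int id, String firstname) {}

    public static void main(String[] args) {
        User user = new User(1, "Thomas");
        System.out.println(user.firstname());
        // Records compare by value, not by reference.
        System.out.println(user.equals(new User(1, "Thomas")));
    }
}
```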
Initializing the Database
We now create a helper class that sets up the H2 database schema and inserts sample data.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class DatabaseInitializer {
    public static Connection initialize() {
        try {
            // In-memory database named "testdb"; kept alive until the JVM exits.
            Connection connection = DriverManager.getConnection(
                    "jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1", "sa", "");
            try (Statement stmt = connection.createStatement()) {
                stmt.execute("CREATE TABLE users (id INT PRIMARY KEY, firstname VARCHAR(50))");
                stmt.execute("INSERT INTO users (id, firstname) VALUES " +
                        "(1, 'Thomas'), " +
                        "(2, 'Barry'), " +
                        "(3, 'Charlie'), " +
                        "(4, 'Diana'), " +
                        "(5, 'Eve')");
            }
            return connection;
        } catch (SQLException e) {
            throw new RuntimeException("Failed to initialize H2 database", e);
        }
    }
}
This class creates a users table and populates it with five records. The DB_CLOSE_DELAY=-1 setting ensures the in-memory database remains active until the JVM shuts down.
4. Implementing the User Repository
The repository is responsible for fetching data from the database. In our case, it supports batch loading by fetching multiple users at once using their IDs.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class UserRepository {
    private final Connection connection;

    public UserRepository(Connection connection) {
        this.connection = connection;
    }

    public CompletableFuture<List<User>> getUsersByIds(List<Integer> ids) {
        System.out.println("Fetching users for IDs: " + ids);
        // Build one "?" placeholder per ID for the IN clause.
        String placeholders = ids.stream().map(id -> "?").collect(Collectors.joining(","));
        String sql = "SELECT id, firstname FROM users WHERE id IN (" + placeholders + ")";
        Map<Integer, User> resultMap = new HashMap<>();
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setInt(i + 1, ids.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    int id = rs.getInt("id");
                    String firstname = rs.getString("firstname");
                    resultMap.put(id, new User(id, firstname));
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException("Error fetching users", e);
        }
        // Return users in the order of the requested IDs; missing IDs get a placeholder.
        List<User> users = ids.stream()
                .map(id -> resultMap.getOrDefault(id, new User(id, "Unknown")))
                .toList();
        // The query has already run; wrap the result in a completed future.
        return CompletableFuture.completedFuture(users);
    }
}
This repository fetches multiple users in a single query using IN (?) placeholders, which avoids making repeated database calls for each ID. The results are mapped into immutable User objects, keeping the model consistent and safe.
The method returns a CompletableFuture<List<User>> because DataLoader’s batch loading functions must return a CompletionStage. Note that in this example the query itself runs synchronously on the calling thread, and the finished result is wrapped with CompletableFuture.completedFuture. This keeps the example simple, but it does not move the work to a separate thread.
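To move a blocking query off the calling thread, one option is CompletableFuture.supplyAsync with a dedicated executor. The sketch below uses only the JDK; AsyncFetchDemo, blockingQuery, and fetchAsync are hypothetical names, and blockingQuery stands in for the JDBC call. A real version would also need to consider that a single JDBC Connection is generally not meant to be used by several threads at once.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncFetchDemo {
    // Hypothetical stand-in for a blocking JDBC query.
    static List<String> blockingQuery(List<Integer> ids) {
        return ids.stream().map(id -> "user-" + id).toList();
    }

    // Runs the blocking query on the supplied executor instead of the caller's thread.
    static CompletableFuture<List<String>> fetchAsync(List<Integer> ids, ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> blockingQuery(ids), executor);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // join() blocks here only so the demo can print the result.
            List<String> users = fetchAsync(List.of(1, 2, 3), executor).join();
            System.out.println(users);
        } finally {
            executor.shutdown();
        }
    }
}
```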
5. Creating the DataLoader
The purpose of the DataLoader is to optimize how we fetch data by batching multiple requests into a single database call and caching results within the same execution cycle. This means that if several parts of the application request the same user, the DataLoader will avoid duplicate queries and return the cached result instead.
import org.dataloader.BatchLoader;
import org.dataloader.DataLoader;
import org.dataloader.DataLoaderFactory;

public class UserDataLoader {
    private final UserRepository userRepository;

    public UserDataLoader(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    public DataLoader<Integer, User> create() {
        BatchLoader<Integer, User> batchLoader = ids
                -> userRepository.getUsersByIds(ids).thenApply(users -> {
                    // Preserve order by matching users to IDs
                    return ids.stream()
                            .map(id -> users.stream()
                                    .filter(user -> user.getId() == id)
                                    .findFirst()
                                    .orElse(null))
                            .toList();
                });
        return DataLoaderFactory.newDataLoader(batchLoader);
    }
}
In this class, the UserDataLoader is constructed with a UserRepository to perform database lookups. Inside the create() method, a BatchLoader<Integer, User> is defined. This function receives a list of user IDs and delegates the call to userRepository.getUsersByIds(ids), which executes one batched query. Because the repository returns a CompletableFuture, further processing can be chained onto the result with thenApply.
The call to thenApply ensures that the results are mapped back in the exact order of the requested IDs. This is important because database queries may not guarantee ordering by default. By iterating over the incoming IDs and matching them with the fetched users, the mapping preserves consistency between inputs and outputs.
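The matching step scans the fetched list once per ID, which is fine for small batches. For larger batches, a common alternative is to index the results by ID first. The sketch below is one possible variant using only the JDK; OrderByKeysDemo, alignToKeys, and the nested User record are hypothetical and stand in for the article's User class.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class OrderByKeysDemo {
    // Hypothetical stand-in for the article's User model.
    record User(int id, String firstname) {}

    // Returns one entry per requested ID, in request order; missing IDs map to null.
    static List<User> alignToKeys(List<Integer> ids, List<User> fetched) {
        // Index fetched rows by ID; if an ID appears twice, keep the first row.
        Map<Integer, User> byId = fetched.stream()
                .collect(Collectors.toMap(User::id, Function.identity(), (first, second) -> first));
        return ids.stream().map(byId::get).toList();
    }

    public static void main(String[] args) {
        List<User> fetched = List.of(new User(1, "Thomas"), new User(3, "Charlie"));
        // Requested order differs from fetch order, and ID 9 does not exist.
        System.out.println(alignToKeys(List.of(3, 1, 9), fetched));
    }
}
```

The output list always has the same size as the key list, which is the contract a positional batch function needs to honor.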
Finally, the DataLoaderFactory.newDataLoader(batchLoader) method creates the actual DataLoader instance. Once created, this DataLoader can be used across the application to batch multiple user lookups into single queries and to reuse cached results within the same execution cycle. This significantly improves performance in batch processing workflows.
6. Running the Application
Finally, we bring everything together in the application entry point.
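import java.sql.Connection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.dataloader.DataLoader;

public class DataloaderBatchProcessing {
    public static void main(String[] args) {
        Connection connection = DatabaseInitializer.initialize();
        UserRepository repository = new UserRepository(connection);
        UserDataLoader userDataLoader = new UserDataLoader(repository);
        DataLoader<Integer, User> dataLoader = userDataLoader.create();

        // IDs 1 and 2 appear twice to show that repeated keys are not fetched again.
        List<Integer> userIds = List.of(1, 2, 3, 4, 5, 1, 2);

        // load() only queues each key and returns a future.
        List<CompletableFuture<User>> futures = userIds.stream()
                .map(dataLoader::load)
                .toList();

        // dispatch() triggers the batch function for all queued keys.
        dataLoader.dispatch();

        futures.forEach(future -> future.thenAccept(user
                -> System.out.println("Loaded user: " + user.getFirstname())
        ));
    }
}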
The application simulates processing user IDs for a set of orders. All user requests are batched into a single SQL query, then results are resolved asynchronously.
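In a batch job it is often convenient to wait for every lookup to finish and then work with the results as one list. A minimal JDK-only sketch of that pattern follows; JoinAllDemo and allAsList are hypothetical names, and the completed futures in main stand in for the futures returned by load().

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class JoinAllDemo {
    // Combines many futures into one future holding all results in input order.
    static <T> CompletableFuture<List<T>> allAsList(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture<?>[0]))
                // Every future is complete at this point, so join() does not block.
                .thenApply(ignored -> futures.stream().map(CompletableFuture::join).toList());
    }

    public static void main(String[] args) {
        List<CompletableFuture<String>> futures = List.of(
                CompletableFuture.completedFuture("Thomas"),
                CompletableFuture.completedFuture("Barry"));
        System.out.println(allAsList(futures).join());
    }
}
```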
Sample Output
Running the program produces the following console output:
Fetching users for IDs: [1, 2, 3, 4, 5]
Loaded user: Thomas
Loaded user: Barry
Loaded user: Charlie
Loaded user: Diana
Loaded user: Eve
Loaded user: Thomas
Loaded user: Barry
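The log shows a single fetch for IDs [1, 2, 3, 4, 5], even though seven lookups were requested. DataLoader combined all keys into one batch, and its cache served the repeated requests for IDs 1 and 2 without sending them to the repository again.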
7. Conclusion
In this article, we demonstrated how to use DataLoader with an in-memory H2 database in a Java application. By batching multiple requests into one SQL query and caching results within the same cycle, DataLoader avoids redundant work. The benefit is greatest in batch processing workflows that repeatedly look up the same related data.

