dplyr Full Join in R

The dplyr package in R programming language provides functions to combine datasets using various types of joins. A full join includes all rows from both data frames, adding NA where there is no match.

Purpose of using Full Join in R

To merge two data frames and keep all rows from both, we use the full_join() function from the dplyr package in R. This function is useful when we want to combine data based on a common key, while retaining unmatched rows from both data frames.

Syntax:

full_join(x, y, by = NULL)

x, y: The data frames to be joined.
by: Variables to join by. If not specified, the function will attempt to join all variables with common names.

1. Basic Full Join

We perform a full join on two data frames that contain employee details and their salaries.

data.frame(): Creates a new data frame.
EmployeeID: Common column used to match rows in both data frames.
Name, Department, Salary: Non-matching columns that are included in the output.
full_join(): Combines both data frames, keeping all rows from both.
by = "EmployeeID": Specifies the column used to match rows during the join.

install.packages("dplyr")
library(dplyr)

df1 <- data.frame(EmployeeID = c(1, 2, 3),
                  Name = c("Alice", "Ben", "Clara"),
                  Department = c("HR", "IT", "Finance"))

df2 <- data.frame(EmployeeID = c(1, 2, 4),
                  Salary = c(50000, 60000, 55000))

full_join(df1, df2, by = "EmployeeID")

Output:

All employee records from both data frames are included. The row with EmployeeID four appears even though it has no corresponding match in the first data frame. Missing values are filled with NA.

2. Full Join with Multiple Variables

We use multiple columns to match and merge sales and expense data for various months and years.

Month, Year: Used as join keys from both data frames.
Sales, Expenses: Unique columns that are retained after the join.
by = c("Month", "Year"): Joins the data frames using both Month and Year columns.

install.packages("dplyr")
library(dplyr)

sales <- data.frame(Month = c("Jan", "Feb", "Mar"),
                    Year = c(2023, 2023, 2023),
                    Sales = c(10000, 12000, 15000))

expenses <- data.frame(Month = c("Jan", "Feb", "Apr"),
                       Year = c(2023, 2023, 2023),
                       Expenses = c(5000, 6000, 5500))

full_join(sales, expenses, by = c("Month", "Year"))

Output:

All rows from both datasets are retained. March has no matching expense record and April has no matching sales record, so NA values are inserted accordingly.

3. Handling Missing Values with Full Join

We handle scenarios where missing values are present by replacing them with zeros after the full join.

ID: Join key used in both data frames.
Value1, Value2: Numeric columns with missing values (NA).
mutate(): Used to modify or create columns.
across(): Applies a function across all specified columns.
everything(): Targets all columns in the data frame.
ifelse(is.na(x), 0, x): Replaces all NA values with 0.

install.packages("dplyr")
library(dplyr)

df1 <- data.frame(ID = c(1, 2, 3),
                  Value1 = c(10, NA, 30))

df2 <- data.frame(ID = c(2, 3, 4),
                  Value2 = c(20, 30, 40))

merged_df <- full_join(df1, df2, by = "ID")

merged_df_filled <- merged_df %>%
  mutate(across(.cols = everything(), .fns = function(x) ifelse(is.na(x), 0, x)))

print(merged_df_filled)

Output:

The full join merges all records and introduces NA where there is no match. We replace all NA values with zero using the mutate and across functions from the dplyr package.

Purpose of using Full Join in R

1. Basic Full Join

2. Full Join with Multiple Variables

3. Handling Missing Values with Full Join

Explore