Systematic Sampling in R

Systematic sampling is a probability sampling method in which elements are selected from an ordered list of the population at a fixed interval. After a random start among the first k elements, every kth element is selected, where k is the sampling interval (the population size divided by the desired sample size).

For example, if a teacher wanted to sample 100 students from a school with 1000 students, the sampling interval would be 1000 / 100 = 10. The teacher would pick a random starting position between 1 and 10 in a list sorted by student ID number, and then select every 10th student from that position onward.
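
As a rough sketch of this example in R, assuming the students are identified by the hypothetical IDs 1 to 1000 and using an arbitrary seed so the run is reproducible:

```r
# Hypothetical roster: student IDs 1..1000, already sorted
student_ids <- 1:1000
n <- 100                          # desired sample size
k <- length(student_ids) / n      # sampling interval: 1000 / 100 = 10

set.seed(1)                       # assumption: seed chosen only for reproducibility
r <- sample(seq_len(k), 1)        # random start between 1 and k

# Every kth ID, beginning at the random start
chosen <- student_ids[seq(from = r, by = k, length.out = n)]

length(chosen)         # 100
unique(diff(chosen))   # 10
```

Whatever start is drawn, the sample always contains 100 students spaced exactly 10 positions apart.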

Systematic Sampling in R

In R, we can perform systematic sampling by selecting elements at fixed intervals from a vector or the rows of a data frame.

1. Performing Systematic Sampling on a Numeric Vector

We generate a numeric population and extract a systematic sample by calculating a fixed interval.

ceiling: Rounds numbers up to the nearest integer.
length: Returns the number of elements in a vector.
seq: Generates a sequence of values at regular intervals.
[ ]: Extracts the elements of a vector at the calculated positions.

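population <- 1:100
sample_size <- 10

# Sampling interval k: population size divided by sample size, rounded up
interval <- ceiling(length(population) / sample_size)

# Positions 1, 11, 21, ...; the start is fixed at 1 here so the output is reproducible
positions <- seq(from = 1, by = interval, length.out = sample_size)
systematic_sample <- population[positions]

print(systematic_sample)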

Output:

[1] 1 11 21 31 41 51 61 71 81 91

The output is a systematic sample of 10 elements, taken from the population at an interval of 10 and starting at position 1.

1 is the first element of the systematic sample. It corresponds to the first element of the population.
11 is the second element of the systematic sample. It corresponds to the element in the population that is 10 positions away from the first element.
Similarly, 21, 31, 41, 51, 61, 71, 81 and 91 are the subsequent elements of the systematic sample, each obtained by adding the sampling interval to the previous selected element.
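
The example above always starts at position 1. With a random start there are only as many possible samples as the size of the interval. The following sketch, which reuses the same population and sample size and is meant only as an illustration, lists all of them as columns of a matrix:

```r
population <- 1:100
sample_size <- 10
interval <- ceiling(length(population) / sample_size)   # 10

# One column per possible start (1..interval); each column is one systematic sample
all_samples <- sapply(seq_len(interval), function(r) {
  population[seq(from = r, by = interval, length.out = sample_size)]
})

dim(all_samples)      # 10 10
all_samples[, 1]      # 1 11 21 31 41 51 61 71 81 91
```

Each element of the population appears in exactly one column, so a random start gives every element the same chance of being selected.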

2. Loading and Sampling the mtcars Dataset Systematically

We use the built-in mtcars dataset and select rows at regular intervals based on a randomly selected starting point.

data: Loads built-in datasets in R.
nrow: Returns the number of rows in a data frame.
ceiling: Rounds the sampling interval up to the nearest integer.
sample: Picks a random starting index.
seq: Creates a sequence of indices for sampling.

data(mtcars)

sample_size <- 5
# Interval: 32 rows / 5, rounded up to 7
sampling_interval <- ceiling(nrow(mtcars) / sample_size)

# Random start between 1 and the interval
random_start <- sample(1:sampling_interval, 1)

# Every 7th row from the random start, stopping at the last row
systematic_sample_indices <- seq(from = random_start, to = nrow(mtcars), by = sampling_interval)
systematic_sample <- mtcars[systematic_sample_indices, ]

print(systematic_sample)

The printed rows vary from run to run because the starting row is chosen at random.
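
The number of rows returned also depends on the start, because the 32 rows of mtcars are not a multiple of the interval of 7. A quick sketch, assuming the same sample size of 5, counts the rows for each possible start:

```r
N <- nrow(mtcars)      # 32 rows in the built-in dataset
n <- 5                 # requested sample size
k <- ceiling(N / n)    # interval of 7

# Number of selected rows for each possible start 1..k
rows_per_start <- sapply(seq_len(k), function(r) {
  length(seq(from = r, to = N, by = k))
})

rows_per_start   # 5 5 5 5 4 4 4
```

Starts 1 to 4 yield the requested 5 rows, while starts 5 to 7 run past the end of the data after 4 rows.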

3. Creating and Sampling a Custom Data Frame

We generate a dataset of student names and exam scores and use a user-defined function to perform systematic sampling.

set.seed: Ensures reproducibility of random outputs.
replicate: Repeats an expression multiple times.
sample: Draws the random letters for the names and selects a random starting index.
do.call: Calls a function with a list of arguments; here it passes the letter vectors to paste0.
paste0: Concatenates character vectors without separator.
rnorm: Generates values from a normal distribution.
round: Rounds numbers to a specified number of decimal places.
seq: Creates the sample index based on the interval.
dim: Returns dimensions of a data frame.

set.seed(123)

# Build n random five-letter names by pasting five random letter vectors together
randomNames <- function(n = 5000) {
  do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
}

students <- data.frame(first_name = randomNames(500),
                       exam_score = round(rnorm(500, mean = 75, sd = 5), 1))

head(students)

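obtain_sys <- function(N, n) {
  k <- ceiling(N / n)                 # sampling interval
  r <- sample(1:k, 1)                 # random start between 1 and k
  idx <- seq(r, r + k * (n - 1), k)   # n indices spaced k apart
  idx[idx <= N]                       # drop indices past the last row when N is not a multiple of n
}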

sys_sample_students <- students[obtain_sys(nrow(students), 10), ]

head(sys_sample_students)
dim(sys_sample_students)

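The first head() call prints the first six rows of the full dataset, the second prints the first six sampled rows, and dim() reports 10 rows and 2 columns, confirming that 10 students were selected.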

Uses of Systematic Sampling

Large Populations: When dealing with a large population, it can be challenging and expensive to conduct a simple random sample. Systematic sampling provides a more practical and efficient way to obtain a representative sample by selecting every kth element.
Efficiency: Systematic sampling is often simpler to carry out than simple random sampling because only the starting point is chosen at random; every later selection follows from the interval. This makes it a suitable choice when time and budget constraints are significant considerations.
Homogeneous Population: If the population is relatively homogeneous and there is no significant order or pattern in the data, systematic sampling can give representative results.
Regular Data Collection: In situations where data is collected at regular intervals, systematic sampling can align with the natural order of the data collection process. This can simplify the sampling procedure and make it more practical.
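
To illustrate the point about populations without an ordering pattern, the sketch below uses a hypothetical population of 1000 simulated exam scores (the seed and distribution are assumptions) and computes the sample mean for every possible start:

```r
set.seed(42)                               # assumption: arbitrary seed for reproducibility
scores <- rnorm(1000, mean = 75, sd = 5)   # hypothetical population in random order
n <- 100
k <- length(scores) / n                    # interval of 10

# Sample mean for each of the k possible starts
sample_means <- sapply(seq_len(k), function(r) {
  mean(scores[seq(from = r, by = k, length.out = n)])
})

range(sample_means)                           # spread across the possible samples
all.equal(mean(sample_means), mean(scores))   # TRUE
```

Because the population size is an exact multiple of the interval, the k possible samples split the population into equal parts, so their means average out to the population mean.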

Limitations of Systematic Sampling

Bias Risk: Systematic sampling may introduce bias if there's a hidden pattern or periodicity in the population aligned with the sampling interval.
Skewed Representation: If the sampling interval coincides with a recurring characteristic of the list, some groups are underrepresented and others overrepresented.
Dependency on Ordering: The effectiveness relies on the order of elements; specific arrangements may affect representativeness.
Sensitivity to Outliers: Outliers can have a significant impact, especially if they are consistently spaced based on the sampling interval.
Inapplicability for Unordered Populations: Not suitable when no complete, ordered list of the population exists, because positions and the interval cannot be defined.
Complexity in Unequal Probability: Adjusting for unequal probabilities can add complexity, potentially negating the simplicity of systematic sampling.
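
The periodicity problem can be seen in a small sketch. It assumes hypothetical daily sales over 52 weeks with a weekly pattern (100 on weekdays, 200 on weekend days) and a sampling interval equal to the length of that pattern:

```r
# Hypothetical daily sales: 100 on days 1-5 of each week, 200 on days 6-7
weekly_pattern <- c(100, 100, 100, 100, 100, 200, 200)
sales <- rep(weekly_pattern, times = 52)   # 364 days

k <- 7   # sampling interval equal to the period of the pattern

# Mean of the systematic sample for each possible start day 1..7
sample_means <- sapply(seq_len(k), function(r) {
  mean(sales[seq(from = r, to = length(sales), by = k)])
})

mean(sales)     # about 128.57
sample_means    # 100 100 100 100 100 200 200
```

Every sample lands on the same day of the week, so no single sample comes close to the population mean. Choosing an interval that is not a multiple of the period, or shuffling the list first, avoids this alignment.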
