Check if a File Is a PDF File in Java
Determining whether a given file is really a PDF is a common task in document processing systems, upload validation, and data ingestion pipelines. This article walks through several ways to check in Java whether a file is a valid PDF.
1. Introduction to the Problem
Checking for the .pdf file extension alone is not reliable, because any file can be renamed to carry that extension. A more accurate approach inspects the content itself, using one or both of the following:
- The PDF file signature (magic number) at the start of the file.
- A content-aware library that parses the file and validates its PDF structure.
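To make the first point concrete, here is a minimal sketch using only the standard library. The class name `ExtensionVsSignature` and its two helper methods are illustrative names chosen for this example, not part of any library. It writes a plain-text file with a `.pdf` extension and runs both kinds of check against it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Locale;

public class ExtensionVsSignature {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Naive check: trusts the file name only.
    public static boolean hasPdfExtension(Path file) {
        return file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Content check: compares the first five bytes with the PDF magic number.
    public static boolean hasPdfSignature(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes(int) requires Java 11+; it returns a shorter array for short files.
            byte[] header = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(header, PDF_MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        // A text file that merely claims to be a PDF through its name.
        Path fake = Files.createTempFile("not-really", ".pdf");
        try {
            Files.write(fake, "just some text".getBytes(StandardCharsets.US_ASCII));
            System.out.println("Extension check: " + hasPdfExtension(fake)); // true
            System.out.println("Signature check: " + hasPdfSignature(fake)); // false
        } finally {
            Files.deleteIfExists(fake);
        }
    }
}
```

The extension check accepts the file while the signature check rejects it, which is why the rest of this article looks at content rather than names.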
2. Code Example Using Various Approaches
This section demonstrates multiple methods to determine if a file is a valid PDF using Java. We’ll explore techniques involving file signature inspection, and libraries such as Apache Tika, Apache PDFBox, and iText. But first, let’s start by adding the required dependencies.
2.1 Adding Dependencies
To use the libraries discussed in this article, include the following Maven dependencies in your project’s pom.xml file, replacing latest_jar_version with a concrete version. Note that the example below calls PDDocument.load, which is the PDFBox 2.x API (PDFBox 3.x moved loading to Loader.loadPDF), and that the com.itextpdf:itextpdf artifact is iText 5.
<!-- Apache Tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>latest_jar_version</version>
</dependency>
<!-- Apache PDFBox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>latest_jar_version</version>
</dependency>
<!-- iText PDF -->
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>latest_jar_version</version>
</dependency>
2.2 Code Example
In this section, we provide a comprehensive Java example that showcases how to determine if a file is a PDF using different techniques. The class PDFDetectionDemo includes implementations for checking based on file signature, Apache Tika, Apache PDFBox, and iText. Each method highlights a unique way of validating PDF content with varying levels of accuracy and depth.
// PDFDetectionDemo.java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.tika.Tika;
import com.itextpdf.text.pdf.PdfReader;

public class PDFDetectionDemo {

    // Method 1: File signature check
    public static boolean isPDFBySignature(String filePath) {
        try (FileInputStream fis = new FileInputStream(filePath)) {
            byte[] header = new byte[5];
            // readNBytes (Java 9+) keeps reading until the buffer is full or the
            // stream ends; a single read() call may legally return fewer bytes.
            if (fis.readNBytes(header, 0, header.length) == header.length) {
                // Decode with an explicit charset instead of the platform default.
                String signature = new String(header, StandardCharsets.US_ASCII);
                return "%PDF-".equals(signature);
            }
        } catch (IOException e) {
            System.err.println("Signature check failed: " + e.getMessage());
        }
        return false;
    }

    // Method 2: Apache Tika (MIME type detection)
    public static boolean isPDFByTika(String filePath) {
        Tika tika = new Tika();
        try {
            String mimeType = tika.detect(new File(filePath));
            return "application/pdf".equalsIgnoreCase(mimeType);
        } catch (Exception e) {
            System.err.println("Tika check failed: " + e.getMessage());
            return false;
        }
    }

    // Method 3: Apache PDFBox (full parse)
    // PDDocument.load(File) is the PDFBox 2.x API; PDFBox 3.x replaced it
    // with org.apache.pdfbox.Loader.loadPDF(File).
    public static boolean isPDFByPDFBox(String filePath) {
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            return true;
        } catch (IOException e) {
            System.err.println("PDFBox check failed: " + e.getMessage());
            return false;
        }
    }

    // Method 4: iText 5 (full parse)
    public static boolean isPDFByIText(String filePath) {
        try {
            PdfReader reader = new PdfReader(filePath);
            reader.close();
            return true;
        } catch (Exception e) {
            System.err.println("iText check failed: " + e.getMessage());
            return false;
        }
    }

    // Main method to test all approaches
    public static void main(String[] args) {
        String filePath = "sample.pdf"; // replace with your file path
        System.out.println("File: " + filePath);
        System.out.println("-----------------------------");
        System.out.println("1. Signature Check: " + isPDFBySignature(filePath));
        System.out.println("2. Apache Tika: " + isPDFByTika(filePath));
        System.out.println("3. Apache PDFBox: " + isPDFByPDFBox(filePath));
        System.out.println("4. iText Library: " + isPDFByIText(filePath));
    }
}
2.2.1 Code Explanation
The PDFDetectionDemo class provides four methods for determining whether a file is a valid PDF. The first, isPDFBySignature, reads the first five bytes of the file and compares them with “%PDF-”, the standard signature at the start of a PDF file. The second, isPDFByTika, uses Apache Tika to detect the file’s MIME type and checks that it is “application/pdf”. The third, isPDFByPDFBox, attempts to load the file with Apache PDFBox, which throws an IOException if it cannot parse the file as a PDF. The fourth, isPDFByIText, does the same with iText’s PdfReader and returns true if the reader can be constructed. Every method catches its own exceptions, logs the message to standard error, and returns false, so a missing or unreadable file is reported as “not a PDF” rather than crashing the program. The main method runs all four checks on a sample file path and prints the results, making it easy to compare how the approaches behave on the same input.
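The signature idea can be pushed a little further without adding a dependency. The following is a minimal sketch, not a validator: besides the “%PDF-” header it also looks for the “%%EOF” marker that ends a PDF file. The class name `PdfMarkerCheck`, the method `hasPdfMarkers`, and the 1024-byte search window are assumptions made for this example; the window mirrors the tolerance described in Adobe's implementation notes for its own readers, and other tools may be stricter or looser.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PdfMarkerCheck {

    // Number of bytes scanned at each end of the file (an assumed tolerance).
    private static final int WINDOW = 1024;

    // Reads `length` bytes starting at `offset` and decodes them as ISO-8859-1,
    // which maps every byte to exactly one char, so binary data decodes safely.
    private static String readWindow(RandomAccessFile raf, long offset, int length)
            throws IOException {
        byte[] buffer = new byte[length];
        raf.seek(offset);
        raf.readFully(buffer);
        return new String(buffer, StandardCharsets.ISO_8859_1);
    }

    // True if "%PDF-" occurs in the first window and "%%EOF" in the last window.
    public static boolean hasPdfMarkers(String filePath) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(filePath, "r")) {
            long size = raf.length();
            int window = (int) Math.min(WINDOW, size);
            String head = readWindow(raf, 0, window);
            String tail = readWindow(raf, size - window, window);
            return head.contains("%PDF-") && tail.contains("%%EOF");
        }
    }

    public static void main(String[] args) throws IOException {
        // A spoofed file: it has both markers but no real PDF structure in between.
        Path spoof = Files.createTempFile("spoof", ".bin");
        try {
            Files.write(spoof, "%PDF-1.4\nnot a real document\n%%EOF\n"
                    .getBytes(StandardCharsets.ISO_8859_1));
            System.out.println("Markers present: " + hasPdfMarkers(spoof.toString()));
        } finally {
            Files.deleteIfExists(spoof);
        }
    }
}
```

The spoofed file in `main` passes this check even though no PDF reader could open it. Marker checks of this kind are useful only as a cheap pre-filter, and a parsing library is still needed to confirm that the structure between the markers is sound.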
2.2.2 Code Output
The following output illustrates the result of running the PDFDetectionDemo class against a sample PDF file named sample.pdf. Each method is executed sequentially, and the output reflects whether the respective approach was able to successfully identify the file as a valid PDF. A true result indicates a positive match, confirming the file’s format and readability by the chosen library or technique.
File: sample.pdf
-----------------------------
1. Signature Check: true
2. Apache Tika: true
3. Apache PDFBox: true
4. iText Library: true
3. Comparison Table
The following table compares four different approaches for identifying whether a file is a valid PDF. Each method has unique characteristics in terms of accuracy, speed, external dependencies, and the ability to detect corrupted PDFs. This can help developers choose the right solution based on the specific requirements of their applications.
| Approach | Accuracy | Speed | Library Required | Detects Corrupt PDFs | Comments |
|---|---|---|---|---|---|
| File Signature: Checks if file starts with “%PDF-” magic number | Medium | Fast | No | No | Quick and lightweight method, but can be spoofed since it only checks the file header. |
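| Apache Tika: Detects the MIME type from the file’s leading bytes and file name | High | Fast | Yes | No | Reliable for identifying PDFs among many other formats; great for content ingestion pipelines and preprocessing. Detection alone does not parse the document, so a damaged file with an intact header can still be reported as application/pdf. |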
| Apache PDFBox: Attempts to parse the file as a PDF document | Very High | Medium | Yes | Yes | Excellent for verifying PDF structure and detecting malformed or corrupted files. |
| iText: Reads PDF structure and validates syntax | Very High | Medium | Yes | Yes | Highly robust; suitable for advanced PDF manipulation. iText 5 and later are AGPL-licensed, so closed-source commercial use requires a commercial license. |
4. Conclusion
Choosing the right method depends on your use case. Use the file signature check for quick, low-cost screening where spoofing is not a concern. Use Apache Tika when you need fast format identification across many file types. Prefer PDFBox or iText when structural integrity matters and corrupted files must be rejected. For the most robust systems, combine the two levels: run Tika for MIME detection first, then pass only the files it identifies as PDFs to PDFBox or iText for full validation.
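The layered approach can be sketched as a cheap first stage followed by an expensive second stage that runs only when the first one passes. To keep the example free of dependencies, the quick stage below is a plain signature check standing in for Tika detection, and the deep stage is a pluggable `Predicate<Path>`. The class `LayeredPdfValidator` and its method names are hypothetical and chosen for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.function.Predicate;

public class LayeredPdfValidator {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Expensive check supplied by the caller, e.g. a parse with PDFBox or iText.
    private final Predicate<Path> deepCheck;

    public LayeredPdfValidator(Predicate<Path> deepCheck) {
        this.deepCheck = deepCheck;
    }

    // Stage 1: cheap signature check; unreadable files count as "not a PDF".
    public static boolean quickCheck(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return Arrays.equals(in.readNBytes(PDF_MAGIC.length), PDF_MAGIC); // Java 11+
        } catch (IOException e) {
            return false;
        }
    }

    // Stage 2 runs only if stage 1 passes, thanks to short-circuit evaluation.
    public boolean isValidPdf(Path file) {
        return quickCheck(file) && deepCheck.test(file);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in deep check for demonstration only; it accepts everything
        // and merely reports that it was called.
        LayeredPdfValidator validator = new LayeredPdfValidator(path -> {
            System.out.println("deep check invoked for " + path.getFileName());
            return true;
        });

        Path file = Files.createTempFile("upload", ".pdf");
        try {
            Files.write(file, "plain text".getBytes(StandardCharsets.US_ASCII));
            System.out.println("Text file accepted: " + validator.isValidPdf(file));

            Files.write(file, "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII));
            System.out.println("Header-only file accepted: " + validator.isValidPdf(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

In a real project the lambda would wrap a library call, for example one of the isPDFByPDFBox or isPDFByIText methods shown earlier. The benefit of the ordering is that obviously wrong uploads are rejected after reading five bytes, and the parser only runs on plausible candidates.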

