Check if a File Is a PDF File in Java

Yatin Batra, June 12th, 2025 (Last Updated: June 11th, 2025)


In Java, determining whether a given file is a valid PDF is a common task in document processing systems, upload validation, and data ingestion pipelines. This article shows several ways to perform that check.

1. Introduction to the Problem

Checking the .pdf file extension alone is not reliable, because a file can be renamed to carry any extension. A more accurate approach involves one or both of the following:

Checking the PDF file signature (magic number).
Using content-aware libraries that parse and validate the PDF structure.

2. Code Example Using Various Approaches

This section demonstrates multiple methods to determine if a file is a valid PDF using Java. We’ll explore techniques involving file signature inspection, and libraries such as Apache Tika, Apache PDFBox, and iText. But first, let’s start by adding the required dependencies.

2.1 Adding Dependencies

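To use the libraries discussed in this article, include the following Maven dependencies in your project's pom.xml file, replacing latest_jar_version with a concrete release of each library. Note that the code below calls PDDocument.load, which is available in PDFBox 2.x; PDFBox 3.x replaced it with Loader.loadPDF.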

<!-- Apache Tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>latest_jar_version</version>
</dependency>

<!-- Apache PDFBox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>latest_jar_version</version>
</dependency>

<!-- iText PDF -->
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>latest_jar_version</version>
</dependency>

2.2 Code Example

In this section, we provide a comprehensive Java example that showcases how to determine if a file is a PDF using different techniques. The class PDFDetectionDemo includes implementations for checking based on file signature, Apache Tika, Apache PDFBox, and iText. Each method highlights a unique way of validating PDF content with varying levels of accuracy and depth.

//PDFDetectionDemo.java

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.tika.Tika;

import com.itextpdf.text.pdf.PdfReader;

public class PDFDetectionDemo {

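    // Method 1: File Signature Check
    public static boolean isPDFBySignature(String filePath) {
        try (FileInputStream fis = new FileInputStream(filePath)) {
            byte[] header = new byte[5];
            // readNBytes (Java 9+) keeps reading until 5 bytes arrive or the stream ends;
            // a single read() call may legally return fewer bytes.
            if (fis.readNBytes(header, 0, header.length) == header.length) {
                // Decode as ASCII explicitly instead of relying on the platform default charset.
                String signature = new String(header, java.nio.charset.StandardCharsets.US_ASCII);
                return "%PDF-".equals(signature);
            }
        } catch (IOException e) {
            System.err.println("Signature check failed: " + e.getMessage());
        }
        return false;
    }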

    // Method 2: Apache Tika
    public static boolean isPDFByTika(String filePath) {
        Tika tika = new Tika();
        try {
            String mimeType = tika.detect(new File(filePath));
            return "application/pdf".equalsIgnoreCase(mimeType);
        } catch (Exception e) {
            System.err.println("Tika check failed: " + e.getMessage());
            return false;
        }
    }

    // Method 3: Apache PDFBox
    public static boolean isPDFByPDFBox(String filePath) {
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            return true;
        } catch (IOException e) {
            System.err.println("PDFBox check failed: " + e.getMessage());
            return false;
        }
    }

    // Method 4: iText
    public static boolean isPDFByIText(String filePath) {
        try {
            PdfReader reader = new PdfReader(filePath);
            reader.close();
            return true;
        } catch (Exception e) {
            System.err.println("iText check failed: " + e.getMessage());
            return false;
        }
    }

    // Main method to test all approaches
    public static void main(String[] args) {
        String filePath = "sample.pdf"; // replace with your file path

        System.out.println("File: " + filePath);
        System.out.println("-----------------------------");

        System.out.println("1. Signature Check: " + isPDFBySignature(filePath));
        System.out.println("2. Apache Tika:      " + isPDFByTika(filePath));
        System.out.println("3. Apache PDFBox:    " + isPDFByPDFBox(filePath));
        System.out.println("4. iText Library:    " + isPDFByIText(filePath));
    }
}

2.2.1 Code Explanation

The PDFDetectionDemo class provides four different methods to determine whether a file is a valid PDF. The first method, isPDFBySignature, reads the first five bytes of the file to check for the “%PDF-” header, which is the standard signature of a PDF file. The second method, isPDFByTika, uses Apache Tika to detect the file’s MIME type and confirm if it is “application/pdf”. The third method, isPDFByPDFBox, attempts to load the file using Apache PDFBox, which throws an exception if the file is not a valid or readable PDF. Similarly, the fourth method, isPDFByIText, tries to read the file using iText’s PdfReader, returning true if successful. The main method runs all four checks on a sample file path and prints the results to the console, making it easy to compare how each method performs under different scenarios.

2.2.2 Code Output

The following output illustrates the result of running the PDFDetectionDemo class against a sample PDF file named sample.pdf. Each method is executed sequentially, and the output reflects whether the respective approach was able to successfully identify the file as a valid PDF. A true result indicates a positive match, confirming the file’s format and readability by the chosen library or technique.

File: sample.pdf
-----------------------------
1. Signature Check: true
2. Apache Tika:      true
3. Apache PDFBox:    true
4. iText Library:    true

3. Comparison Table

The following table compares four different approaches for identifying whether a file is a valid PDF. Each method has unique characteristics in terms of accuracy, speed, external dependencies, and the ability to detect corrupted PDFs. This can help developers choose the right solution based on the specific requirements of their applications.

Approach	Accuracy	Speed	Library Required	Detects Corrupt PDFs	Comments
File Signature: Checks if file starts with “%PDF-” magic number	Medium	Fast	No	No	Quick and lightweight method, but can be spoofed since it only checks the file header.
Apache Tika: Detects the MIME type from the file's leading bytes and name	High	Fast	Yes	No	Reliable for identifying PDFs by MIME type without parsing the whole document; great for content ingestion pipelines and preprocessing.
Apache PDFBox: Attempts to parse the file as a PDF document	Very High	Medium	Yes	Yes	Excellent for verifying PDF structure and detecting malformed or corrupted files.
iText: Reads PDF structure and validates syntax	Very High	Medium	Yes	Yes	Highly robust; suitable for advanced PDF manipulation, but commercial use may require a license.
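
To make the "can be spoofed" remark in the table concrete, here is a minimal, dependency-free sketch. The class and method names (SpoofedHeaderDemo, hasPdfSignature, passesSignatureCheck) are illustrative and not part of the example above; the sketch assumes Java 11 or later. It writes plain text that merely begins with "%PDF-" to a temporary file and shows that a header-only check accepts it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class SpoofedHeaderDemo {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the file's first five bytes are "%PDF-".
    public static boolean hasPdfSignature(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = in.readNBytes(PDF_MAGIC.length); // Java 11+
            return Arrays.equals(header, PDF_MAGIC);
        } catch (IOException e) {
            return false; // unreadable or missing files are treated as "not a PDF"
        }
    }

    // Hypothetical helper: writes the text to a temporary .pdf file and
    // reports whether that file passes the header-only check.
    public static boolean passesSignatureCheck(String content) {
        try {
            Path tmp = Files.createTempFile("spoof-demo", ".pdf");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.US_ASCII));
                return hasPdfSignature(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Plain text with a forged header is accepted by the header-only check.
        System.out.println(passesSignatureCheck("%PDF-1.7 this is plain text, not a PDF"));
        // Plain text without the header is rejected, even with a .pdf extension.
        System.out.println(passesSignatureCheck("just a text file renamed to .pdf"));
    }
}
```

A parsing library such as PDFBox or iText would be expected to reject the first file, since nothing after the header forms a PDF structure.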

4. Conclusion

Choosing the right method depends on your use case. Use the file signature check for quick, lightweight validation; use Apache Tika when you want a good balance of speed and accuracy; prefer PDFBox or iText when structural integrity and corruption checks are crucial. For the most robust systems, combining Tika for MIME detection with PDFBox or iText for validation provides the best of both worlds.
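
The layered idea can be sketched as follows. To keep the sketch free of external dependencies, a plain signature check stands in for the Tika detection step, and the expensive validator is passed in as a Predicate; the names LayeredPdfCheck, isValidPdf, and deepValidator are hypothetical. In a real project, the predicate might wrap isPDFByPDFBox or isPDFByIText from the example above. Java 11 or later is assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.function.Predicate;

public class LayeredPdfCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Cheap first stage: compares the first five bytes with "%PDF-".
    public static boolean hasPdfSignature(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return Arrays.equals(in.readNBytes(PDF_MAGIC.length), PDF_MAGIC); // Java 11+
        } catch (IOException e) {
            return false; // unreadable or missing files are treated as "not a PDF"
        }
    }

    // Runs the cheap check first; the expensive validator is only
    // consulted for files that already look like PDFs.
    public static boolean isValidPdf(Path file, Predicate<Path> deepValidator) {
        return hasPdfSignature(file) && deepValidator.test(file);
    }

    public static void main(String[] args) {
        Path file = Path.of(args.length > 0 ? args[0] : "sample.pdf");
        // Stand-in validator that accepts everything (an assumption for this sketch);
        // replace it with a call into a parsing library for real validation.
        Predicate<Path> deepValidator = p -> true;
        System.out.println("Valid PDF: " + isValidPdf(file, deepValidator));
    }
}
```

Because && short-circuits, files that fail the cheap check never reach the parser, which keeps the common rejection path fast.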
