Check if a File Is a PDF File in Java

Determining whether a given file is a valid PDF is a common task in document processing systems, upload validation, and data ingestion pipelines. This article walks through several ways to perform that check in Java.

1. Introduction to the Problem

Checking for the .pdf file extension alone is not reliable, because a file can be renamed to carry any extension. A more accurate approach inspects the content:

  • Check the PDF file signature (magic number) at the start of the file.
  • Use a content-aware library that parses and validates the PDF structure.
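To make the difference concrete, here is a minimal sketch that contrasts a name-only check with a look at the leading bytes. The class and method names (ExtensionVsSignature, hasPdfExtension, startsWithPdfMagic) are hypothetical and chosen only for this illustration; the demo writes a plain-text file with a .pdf extension to a temporary location.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class, used only to illustrate the point above.
final class ExtensionVsSignature {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Name-only check: trusts whatever the file happens to be called.
    static boolean hasPdfExtension(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Content check: compares the leading bytes with the "%PDF-" magic number.
    static boolean startsWithPdfMagic(byte[] leadingBytes) {
        return leadingBytes.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(leadingBytes, PDF_MAGIC.length), PDF_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // A plain-text file that merely carries a .pdf extension.
        Path fake = Files.createTempFile("not-really-a-pdf", ".pdf");
        try {
            Files.write(fake, "just plain text".getBytes(StandardCharsets.US_ASCII));
            byte[] content = Files.readAllBytes(fake); // fine here: the file is tiny

            System.out.println("Extension says PDF: "
                    + hasPdfExtension(fake.getFileName().toString())); // true
            System.out.println("Signature says PDF: "
                    + startsWithPdfMagic(content));                    // false
        } finally {
            Files.deleteIfExists(fake);
        }
    }
}
```

The extension check accepts the file while the signature check rejects it, which is why the rest of the article looks at content rather than names.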

2. Code Example Using Various Approaches

This section demonstrates multiple methods to determine if a file is a valid PDF using Java. We’ll explore techniques involving file signature inspection, and libraries such as Apache Tika, Apache PDFBox, and iText. But first, let’s start by adding the required dependencies.

2.1 Adding Dependencies

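To use the libraries discussed in this article, include the following Maven dependencies in your project's pom.xml file, replacing latest_jar_version with a concrete version. Note that the example code below uses the PDFBox 2.x API (PDDocument.load) and the iText 5 API (com.itextpdf.text.pdf.PdfReader, shipped in the itextpdf artifact), so pick versions from those release lines.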

<!-- Apache Tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>latest_jar_version</version>
</dependency>

<!-- Apache PDFBox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>latest_jar_version</version>
</dependency>

<!-- iText PDF -->
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>latest_jar_version</version>
</dependency>

2.2 Code Example

In this section, we provide a comprehensive Java example that showcases how to determine if a file is a PDF using different techniques. The class PDFDetectionDemo includes implementations for checking based on file signature, Apache Tika, Apache PDFBox, and iText. Each method highlights a unique way of validating PDF content with varying levels of accuracy and depth.

//PDFDetectionDemo.java

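import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.tika.Tika;

import com.itextpdf.text.pdf.PdfReader;

public class PDFDetectionDemo {

    // Method 1: File Signature Check
    public static boolean isPDFBySignature(String filePath) {
        try (FileInputStream fis = new FileInputStream(filePath)) {
            byte[] header = new byte[5];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until the header is full
            while (total < header.length) {
                int n = fis.read(header, total, header.length - total);
                if (n < 0) {
                    return false; // file is shorter than the signature
                }
                total += n;
            }
            // Decode as ASCII explicitly instead of relying on the platform default charset
            return "%PDF-".equals(new String(header, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            System.err.println("Signature check failed: " + e.getMessage());
            return false;
        }
    }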

    // Method 2: Apache Tika
    public static boolean isPDFByTika(String filePath) {
        Tika tika = new Tika();
        try {
            String mimeType = tika.detect(new File(filePath));
            return "application/pdf".equalsIgnoreCase(mimeType);
        } catch (Exception e) {
            System.err.println("Tika check failed: " + e.getMessage());
            return false;
        }
    }

    // Method 3: Apache PDFBox (2.x API; PDFBox 3.x replaces PDDocument.load with Loader.loadPDF)
    public static boolean isPDFByPDFBox(String filePath) {
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            return true;
        } catch (IOException e) {
            System.err.println("PDFBox check failed: " + e.getMessage());
            return false;
        }
    }

    // Method 4: iText (iText 5 API from the itextpdf artifact)
    public static boolean isPDFByIText(String filePath) {
        try {
            PdfReader reader = new PdfReader(filePath);
            reader.close();
            return true;
        } catch (Exception e) {
            System.err.println("iText check failed: " + e.getMessage());
            return false;
        }
    }

    // Main method to test all approaches
    public static void main(String[] args) {
        String filePath = "sample.pdf"; // replace with your file path

        System.out.println("File: " + filePath);
        System.out.println("-----------------------------");

        System.out.println("1. Signature Check: " + isPDFBySignature(filePath));
        System.out.println("2. Apache Tika:      " + isPDFByTika(filePath));
        System.out.println("3. Apache PDFBox:    " + isPDFByPDFBox(filePath));
        System.out.println("4. iText Library:    " + isPDFByIText(filePath));
    }
}

2.2.1 Code Explanation

The PDFDetectionDemo class provides four different methods to determine whether a file is a valid PDF:

  • isPDFBySignature reads the first five bytes of the file and compares them with “%PDF-”, the standard signature at the start of a PDF file.
  • isPDFByTika uses Apache Tika to detect the file’s MIME type and checks whether it is “application/pdf”. Tika's detect(File) bases its answer on the document content and, where available, the file extension.
  • isPDFByPDFBox attempts to load the file with Apache PDFBox, which throws an IOException if the file cannot be parsed as a PDF.
  • isPDFByIText tries to open the file with iText’s PdfReader and returns true if that succeeds.

The main method runs all four checks on a sample file path and prints the results to the console, which makes it easy to compare how each method behaves on different files.
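The isPDFBySignature method insists on “%PDF-” at byte offset 0. Some PDF consumers are more forgiving: the implementation notes in Adobe's PDF Reference describe Acrobat viewers as accepting the header anywhere within the first 1024 bytes and the “%%EOF” marker anywhere within the last 1024 bytes. The following is a minimal sketch of such a lenient marker scan, not a structural validation. The class and method names (LenientPdfMarkers, indexOf, looksLikePdf) are hypothetical, and the sketch assumes the file content is already in memory as a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a lenient marker scan, assuming the whole file is in memory.
final class LenientPdfMarkers {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TRAILER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    private static final int WINDOW = 1024; // search window size in bytes

    // Returns the first index where marker lies entirely inside [from, to), or -1.
    static int indexOf(byte[] data, byte[] marker, int from, int to) {
        int end = Math.min(to, data.length);
        for (int i = Math.max(from, 0); i + marker.length <= end; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i;
            }
        }
        return -1;
    }

    // True if "%PDF-" appears somewhere in the first WINDOW bytes.
    static boolean hasHeaderNearStart(byte[] data) {
        return indexOf(data, HEADER, 0, WINDOW) >= 0;
    }

    // True if "%%EOF" appears somewhere in the last WINDOW bytes.
    static boolean hasEofNearEnd(byte[] data) {
        return indexOf(data, TRAILER, data.length - WINDOW, data.length) >= 0;
    }

    static boolean looksLikePdf(byte[] data) {
        return hasHeaderNearStart(data) && hasEofNearEnd(data);
    }

    public static void main(String[] args) {
        // Leading junk before the header would fail a strict offset-0 check.
        byte[] sample = "garbage\n%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(sample)); // prints "true"
    }
}
```

For real files, the byte array could come from Files.readAllBytes, although for large files it would be better to read only the first and last 1024 bytes. Like the strict signature check, this only looks for markers and says nothing about whether the content between them is well formed.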

2.2.2 Code Output

The following output shows the result of running the PDFDetectionDemo class against a valid PDF file named sample.pdf. The methods run in sequence, and a true result means that the respective approach recognized the file as a PDF.

File: sample.pdf
-----------------------------
1. Signature Check: true
2. Apache Tika:      true
3. Apache PDFBox:    true
4. iText Library:    true

3. Comparison Table

The following table compares four different approaches for identifying whether a file is a valid PDF. Each method has unique characteristics in terms of accuracy, speed, external dependencies, and the ability to detect corrupted PDFs. This can help developers choose the right solution based on the specific requirements of their applications.

| Approach | Accuracy | Speed | Library Required | Detects Corrupt PDFs | Comments |
| File Signature: Checks if file starts with “%PDF-” magic number | Medium | Fast | No | No | Quick and lightweight method, but can be spoofed since it only checks the file header. |
| Apache Tika: Uses content analysis to detect MIME type | High | Fast | Yes | Partially | Reliable for detecting PDFs by MIME type; great for content ingestion pipelines and preprocessing. |
| Apache PDFBox: Attempts to parse the file as a PDF document | Very High | Medium | Yes | Yes | Excellent for verifying PDF structure and detecting malformed or corrupted files. |
| iText: Reads PDF structure and validates syntax | Very High | Medium | Yes | Yes | Highly robust; suitable for advanced PDF manipulation, but commercial use may require a license. |

4. Conclusion

Choosing the right method depends on your use case. Use the file signature check for quick, lightweight validation. Apache Tika offers a good balance of speed and accuracy. Prefer PDFBox or iText when structural integrity and corruption checks are crucial. For the most robust systems, combining Tika for MIME detection with PDFBox or iText for validation provides the best of both worlds.
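One way to combine a cheap check with a thorough one is to run them in order and stop at the first failure. The sketch below is an illustration under assumptions rather than a definitive implementation: LayeredPdfCheck, isValidPdf, and hasPdfMagic are hypothetical names, and both checks are passed in as Predicate<Path> so that the sketch compiles without any third-party jar.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.function.Predicate;

// Hypothetical helper that chains a cheap check and an expensive one.
final class LayeredPdfCheck {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Runs the cheap check first; the expensive check only runs when it passes.
    static boolean isValidPdf(Path file, Predicate<Path> quickCheck, Predicate<Path> deepCheck) {
        return quickCheck.test(file) && deepCheck.test(file);
    }

    // A dependency-free quick check: does the file start with "%PDF-"?
    static boolean hasPdfMagic(Path file) {
        byte[] head = new byte[MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    return false; // shorter than the signature
                }
                total += n;
            }
            return Arrays.equals(head, MAGIC);
        } catch (IOException e) {
            return false; // unreadable or missing files are treated as "not a PDF"
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get(args.length > 0 ? args[0] : "sample.pdf");

        // Stand-in for a real structural check (an assumption for this sketch).
        Predicate<Path> deepCheck = p -> {
            System.out.println("deep check invoked for " + p);
            return true;
        };

        System.out.println(isValidPdf(file, LayeredPdfCheck::hasPdfMagic, deepCheck));
    }
}
```

With the libraries from this article on the classpath, the two predicates could be wired to the existing methods, for example p -> PDFDetectionDemo.isPDFByTika(p.toString()) as the quick check and p -> PDFDetectionDemo.isPDFByPDFBox(p.toString()) as the deep check. Because && short-circuits, files that fail the quick check never reach the parser.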

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, Frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8).