
Recipe 1: Bulk PDF Metadata Processing & Catalogue with Exiftool by Phil Harvey

Introduction

PDF metadata is an important tool for taxonomists, web editors, and website owners who optimize PDFs for SEO, web accessibility, and general cataloguing, because it makes PDFs easier to find and classify. This recipe creates a PHP script that catalogues the metadata of any number of PDF files hosted on a website.

Script Prerequisites

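The main prerequisite is exiftool, which you can download at https://exiftool.org/. The script also calls curl to download each PDF, so curl and the PHP command-line interpreter must be available on your system as well.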

Note: On a Windows computer you may want to rename the “exiftool(-k).exe” file to “exiftool.exe” as this will remove the annoying “Press Enter” message every time you run the command.

The URLs of the PDF files you wish to catalogue need to be listed, one per line, in the “filelist.txt” text file as shown below. This file is used as the input by the program we will create.

The screenshot above is from the filelist.txt file in my Notepad++ text editor.
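
As an illustration of the expected input, here is a minimal sketch of loading such a list in PHP: one URL per line, with blank lines skipped and anything that is not an http(s) URL discarded. The helper name loadUrlList is hypothetical and not part of the recipe's script, and the URLs in the usage example are placeholders.

```php
<?php
// Hypothetical helper (not part of the original recipe): keep only the
// lines that look like http(s) URLs.
function loadUrlList(array $rawLines): array {
    $urls = array();
    foreach ($rawLines as $line) {
        $candidate = trim($line);
        // Skip blank lines left by a trailing newline or stray spacing
        if ($candidate === '') {
            continue;
        }
        // FILTER_VALIDATE_URL rejects strings without a scheme and host
        if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
            continue;
        }
        // Accept only web URLs, since the PDFs are fetched with curl
        $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
        if ($scheme === 'http' || $scheme === 'https') {
            $urls[] = $candidate;
        }
    }
    return $urls;
}

// Usage with placeholder data; in practice pass file('filelist.txt')
$sample = array(
    "https://example.com/docs/a.pdf\n",
    "\n",
    "not a url\n",
    "  https://example.com/docs/b.pdf  \n",
);
print_r(loadUrlList($sample));
```

Filtering the list up front means a stray blank line or a pasted note in filelist.txt does not trigger a pointless download attempt.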

The Script

Next, we move on to the PHP script, shown in the following code block. You can save it under any name you like; I have called it “processBulkPDFs.php”.

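<?php

//Main loop: process each URL listed in filelist.txt
foreach(file('filelist.txt') as $line) {
   //get the URL for this iteration, skipping blank lines
   $filenameURL=trim($line);
   if($filenameURL == "") continue;

   //construct the shell curl command; escapeshellarg() quotes the URL so
   //characters such as & or spaces cannot break the command
   //(note: on Windows escapeshellarg() replaces % with a space, so
   //percent-encoded URLs may need different handling there)
   $commandToRun = "curl -s " . escapeshellarg($filenameURL) . " --output undertest.pdf";

   //run the command to download the PDF
   echo shell_exec($commandToRun);

   //analyze the PDF with exiftool
   $commandToRun = "exiftool undertest.pdf";

   //capture the exiftool output, one "Tag : value" pair per line
   $shelloutput = shell_exec($commandToRun);

   $lines = explode("\n", (string)$shelloutput);
   $currentHash = processLines($lines);

   //do something with $currentHash
   prettyPrint($currentHash, $filenameURL);
}


//turn the exiftool output lines into a tag => value array
function processLines($lines){
   $overallHash = array();
   foreach($lines as $line){
      //split on the first colon only, so values that contain colons
      //(timestamps, URLs) are kept whole
      $lineArray = explode(':', $line, 2);
      //skip blank lines and lines without a "Tag : value" separator
      if(count($lineArray) == 2 && trim($lineArray[0]) != "")
         $overallHash[trim($lineArray[0])]=trim($lineArray[1]);
   }
   return $overallHash;
}

//print one "tag, value" line per metadata entry
function prettyPrint($myarr, $filenameURL){
    echo "\nSummary for: $filenameURL\n";
    foreach ($myarr as $key => $value) {
       echo $key . ", " . $value . "\n";
    }
    echo "\n";
}
?>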
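
exiftool prints each tag as a name, a colon, and a value, and some values (timestamps, URLs) contain colons themselves, so where each line is split matters. The sketch below illustrates the difference in isolation. The helper name splitTagLine is hypothetical, and the sample timestamp is made up for the demonstration.

```php
<?php
// Hypothetical helper: split one "Tag Name : value" line into a
// [name, value] pair, or return null when the line has no usable tag.
function splitTagLine(string $line): ?array {
    // A limit of 2 makes explode() split on the first colon only,
    // leaving any later colons inside the value
    $parts = explode(':', $line, 2);
    if (count($parts) < 2 || trim($parts[0]) === '') {
        return null;
    }
    return array(trim($parts[0]), trim($parts[1]));
}

// Made-up sample line in exiftool's default output layout
$line = "Create Date                     : 2021:11:05 10:15:30";

// Splitting on every colon keeps only the text before the second colon
$naive = explode(':', $line);
echo trim($naive[1]), "\n";            // prints "2021"

// Splitting on the first colon keeps the whole value
$pair = splitTagLine($line);
echo $pair[0], " => ", $pair[1], "\n"; // prints "Create Date => 2021:11:05 10:15:30"
```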

Before running the script make sure that:

  • exiftool.exe is in the same folder as your PHP script above on your Windows computer (you can download it here: https://exiftool.org/)
  • filelist.txt, containing the list of PDF URLs you want to process, is in the same folder

My folder looked like this with all of the prerequisites.
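
If you would like to verify these prerequisites before a long run, a small pre-flight check along the following lines may help. This is only a sketch: the function name findMissingPrerequisites is hypothetical, and it assumes that exiftool -ver and curl --version print something when the tools can be found from the current folder.

```php
<?php
// Hypothetical pre-flight check: returns a list of problems, empty if none.
function findMissingPrerequisites(string $listFile, array $tools): array {
    $problems = array();
    if (!is_readable($listFile)) {
        $problems[] = "Cannot read $listFile";
    }
    foreach ($tools as $name => $versionCommand) {
        // shell_exec() gives no usable output when the command cannot be
        // run or prints nothing to standard output
        $output = shell_exec($versionCommand);
        if (trim((string) $output) === '') {
            $problems[] = "$name does not appear to be installed or reachable from this folder";
        }
    }
    return $problems;
}

// Usage: check the input file and the two command-line tools
$problems = findMissingPrerequisites('filelist.txt', array(
    'exiftool' => 'exiftool -ver',
    'curl'     => 'curl --version',
));
foreach ($problems as $problem) {
    echo "Problem: $problem\n";
}
```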

Running the Script

To run the script, cd to the folder containing all of the files above and run the following command (assuming you saved your script under the same name as I did):

php processBulkPDFs.php

You should get output similar to the following (the date values in this sample are shortened to the year):

Summary for: https://www.trailblazerlearning.com/wp-content/uploads/2021/11/PHP-Programming-Reference-Intermediate.pdf
ExifTool Version Number, 12.35
File Name, undertest.pdf
Directory, .
File Size, 555 KiB
File Modification Date/Time, 2021
File Access Date/Time, 2021
File Creation Date/Time, 2021
File Permissions, -rw-rw-rw-
File Type, PDF
File Type Extension, pdf
MIME Type, application/pdf
PDF Version, 1.7
Linearized, No
Author, aws
Create Date, 2021
Modify Date, 2021
Producer, Microsoft
Title, Microsoft Word - PHP Programming  Cheat Sheet.docx
Page Count, 1

There you have it! You can modify the script to save its results to a CSV or JSON file. You can also redirect the output of the script to a text file for later ingestion into a spreadsheet; just give the file a CSV extension. If you are on a Mac, get the exiftool build for your system and the script should work with little or no modification.
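
As one possible way to do the CSV or JSON export, the sketch below writes one row per PDF with a fixed set of tag columns, and then serializes the same data as JSON. The function name writeCatalogueCsv, the column choice, and the sample data are all hypothetical. The catalogue array is assumed to map each URL to the tag array built for it. exiftool also offers its own -csv and -json output options, which may be worth checking in its documentation.

```php
<?php
// Hypothetical helper: write one CSV row per PDF, using a fixed set of
// tag names as columns. Tags missing from a PDF become empty cells.
function writeCatalogueCsv($handle, array $columns, array $catalogue): void {
    // Header row: the URL followed by the chosen tag names
    fputcsv($handle, array_merge(array('URL'), $columns), ',', '"', '\\');
    foreach ($catalogue as $url => $tags) {
        $row = array($url);
        foreach ($columns as $column) {
            $row[] = isset($tags[$column]) ? $tags[$column] : '';
        }
        // fputcsv() quotes values that contain commas or quotes
        fputcsv($handle, $row, ',', '"', '\\');
    }
}

// Placeholder data standing in for the per-PDF tag arrays
$catalogue = array(
    'https://example.com/docs/a.pdf' => array('Title' => 'Sample, with comma', 'Page Count' => '1'),
    'https://example.com/docs/b.pdf' => array('Title' => 'Another sample'),
);

// Write the CSV to standard output; pass a file handle to save it instead
$out = fopen('php://output', 'w');
writeCatalogueCsv($out, array('Title', 'Author', 'Page Count'), $catalogue);
fclose($out);

// The same array can be serialized as JSON
echo json_encode($catalogue, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";
```

Using fputcsv rather than joining values with commas by hand keeps titles that contain commas in a single spreadsheet cell.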

 

You can download recipe 1 using this link.

 

Happy programming.
