Recipe 1: Bulk PDF Metadata Processing & Catalogue with Exiftool by Phil Harvey
Introduction
PDF metadata is an important tool for taxonomists, web editors, and website owners who optimize PDFs for SEO, web accessibility, and cataloguing, because it makes PDFs easier to find and classify. This recipe builds a PHP script that catalogues the metadata of any number of PDF files hosted on the web, given a list of their URLs.
Script Prerequisites
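The main prerequisite is exiftool, which you can download at https://exiftool.org/. The script also calls curl to download each PDF and is run with the PHP command-line interpreter, so both need to be installed as well.
Note: On a Windows computer, you may want to rename the “exiftool(-k).exe” file to “exiftool.exe”, as this removes the annoying “Press Enter” message that otherwise appears every time you run the command.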
The URLs of the PDF files you wish to catalogue need to be listed in the “filelist.txt” text file, one URL per line, as shown below. The program we will create uses this file as its input.
The screenshot above is from the filelist.txt file in my Notepad++ text editor.
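As a rough sketch of how such a list can be read, the snippet below loads one URL per line and drops blank lines. The helper name readUrlList is hypothetical and is not part of the recipe's script; it relies only on flags of PHP's built-in file() function.

```php
<?php
// Hypothetical helper (not part of the recipe's script): read a URL list,
// one URL per line, ignoring blank lines and surrounding whitespace.
function readUrlList(string $path): array
{
    // FILE_IGNORE_NEW_LINES strips the line ending from each entry;
    // FILE_SKIP_EMPTY_LINES drops lines that are completely empty.
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    // Trim stray spaces or "\r" characters, then drop anything left empty.
    $urls = array_map('trim', $lines);
    $urls = array_filter($urls, function ($url) {
        return $url !== '';
    });
    return array_values($urls);
}
```

Calling readUrlList('filelist.txt') would then return an array of URLs that can be looped over in the same way as the main loop of the script.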
The Script
Next, we move on to the PHP script, shown in the following code block. You can save the script under any name you like; I have called it “processBulkPDFs.php”.
<?php
// Main loop: process each URL listed in filelist.txt
foreach (file('filelist.txt') as $line) {
    // Get the URL for this iteration; skip blank lines
    $filenameURL = trim($line);
    if ($filenameURL === '') {
        continue;
    }

    // Construct the shell curl command; the URL is quoted so that
    // characters such as & are not interpreted by the shell
    $commandToRun = "curl -s \"$filenameURL\" --output undertest.pdf";
    // Run the command to download the PDF
    echo shell_exec($commandToRun);

    // Analyze the PDF with exiftool
    $commandToRun = "exiftool undertest.pdf";
    // Store the exiftool output in an array, one entry per line
    $shelloutput = (string) shell_exec($commandToRun);
    $lines = explode("\n", $shelloutput);
    $currentHash = processLines($lines);

    // Do something with $currentHash
    prettyPrint($currentHash, $filenameURL);
}

// Turn exiftool's "Tag Name : value" lines into a tag => value array
function processLines($lines) {
    $overallHash = array();
    foreach ($lines as $line) {
        // Note: this splits on every ':', so a value that itself
        // contains a colon is cut off at that colon
        $lineArray = explode(':', $line);
        // Skip blank lines and lines without a ':' separator
        if (count($lineArray) > 1 && trim($lineArray[0]) != "") {
            $overallHash[trim($lineArray[0])] = $lineArray[1];
        }
    }
    return $overallHash;
}

// Print the metadata for one PDF as "tag,value" lines
function prettyPrint($myarr, $filenameURL) {
    echo "\nSummary for: $filenameURL\n";
    foreach ($myarr as $key => $value) {
        echo $key . "," . $value . "\n";
    }
    echo "\n";
}
?>
Before running the script make sure that:
- exiftool.exe is in the same folder as your PHP script above on your Windows computer (you can download it here: https://exiftool.org/)
- the filelist.txt text file containing the list of PDF URLs you want to process is in the same folder
My folder looked like this with all of the prerequisites.
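The checklist above can also be verified from PHP before the main script runs. The following is only a sketch under the assumption that both files sit next to the script; the function name missingPrerequisites is hypothetical and not part of the recipe.

```php
<?php
// Hypothetical helper: return the names of required files that are
// not present in the given folder.
function missingPrerequisites(string $dir, array $required): array
{
    $missing = [];
    foreach ($required as $name) {
        // is_file() is false for missing paths and for directories.
        if (!is_file($dir . DIRECTORY_SEPARATOR . $name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}
```

On Windows, for example, missingPrerequisites(__DIR__, ['exiftool.exe', 'filelist.txt']) returns an empty array when everything is in place, and the names of the missing files otherwise.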
Running the Script
To run the script, cd to the folder containing all of the files above and run the following command (assuming you saved your script under the same name as I did):
php processBulkPDFs.php
You should get output similar to the following:
Summary for: https://www.trailblazerlearning.com/wp-content/uploads/2021/11/PHP-Programming-Reference-Intermediate.pdf
ExifTool Version Number, 12.35
File Name, undertest.pdf
Directory, .
File Size, 555 KiB
File Modification Date/Time, 2021
File Access Date/Time, 2021
File Creation Date/Time, 2021
File Permissions, -rw-rw-rw-
File Type, PDF
File Type Extension, pdf
MIME Type, application/pdf
PDF Version, 1.7
Linearized, No
Author, aws
Create Date, 2021
Modify Date, 2021
Producer, Microsoft
Title, Microsoft Word - PHP Programming Cheat Sheet.docx
Page Count, 1
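Notice that the date fields in this output show only the year. This appears to be a side effect of splitting each exiftool line on every colon: exiftool writes dates in a colon-separated form, so only the text before the value's first colon survives. One possible way around this, sketched below under the assumption that every line follows the "Tag Name : value" layout, is to pass a limit to explode() so that the line is split on the first colon only. The function name parseExifLine is hypothetical.

```php
<?php
// Hypothetical helper: split one exiftool output line into [tag, value],
// breaking on the first colon only so colons inside the value survive.
function parseExifLine(string $line): ?array
{
    // A limit of 2 yields at most two parts: the tag and the full value.
    $parts = explode(':', $line, 2);
    // Return null for blank lines and lines without a ':' separator.
    if (count($parts) < 2 || trim($parts[0]) === '') {
        return null;
    }
    return [trim($parts[0]), trim($parts[1])];
}
```

With a made-up line such as "Create Date : 2021:11:05 10:00:00", this returns the tag "Create Date" together with the complete value "2021:11:05 10:00:00" rather than just "2021".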
There you have it! You can modify the script to save its results to a CSV or JSON file. You can also redirect the output of the script to a text file for later ingestion into a spreadsheet; just give the file a .csv extension. If you are on a Mac, get the exiftool build for your system and the script should work with little or no modification.
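As one possible starting point for the CSV and JSON variants, the sketch below writes the array returned by processLines to a CSV file and converts it to a JSON string. The helper names appendToCsv and toJson, and the url, tag, value column layout, are assumptions made for illustration; fputcsv() and json_encode() are standard PHP functions.

```php
<?php
// Hypothetical helper: append one PDF's metadata to a CSV file as
// url,tag,value rows. fputcsv() quotes values that contain commas.
function appendToCsv(string $csvPath, string $url, array $metadata): void
{
    $handle = fopen($csvPath, 'a');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath for writing");
    }
    foreach ($metadata as $tag => $value) {
        // Separator, enclosure, and escape are passed explicitly.
        fputcsv($handle, [$url, $tag, trim((string) $value)], ',', '"', '\\');
    }
    fclose($handle);
}

// Hypothetical helper: the same data as a JSON string.
function toJson(string $url, array $metadata): string
{
    $json = json_encode(
        ['url' => $url, 'metadata' => $metadata],
        JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
    );
    if ($json === false) {
        throw new RuntimeException(json_last_error_msg());
    }
    return $json;
}
```

Inside the main loop, a call such as appendToCsv('catalogue.csv', $filenameURL, $currentHash) could stand in for prettyPrint; the file name catalogue.csv is just an example. Using fputcsv() rather than joining values with commas by hand keeps titles that contain commas in a single spreadsheet cell.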
You can download recipe 1 using this link.
Happy programming.