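I had been using "pdfinfo.exe" from the xpdfbin-win package, along with cpdf.exe, to check PDF files for corruption, but I did not want to depend on a binary unless it was necessary.
I read that newer PDF formats carry a readable XML data catalog near the end of the file, so I opened a PDF in plain Windows NOTEPAD.exe, scrolled past the unreadable data, and found several readable keys at the end. I only needed one key, but chose to use both CreationDate and ModDate.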
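To make that observation concrete, here is a minimal sketch that reads one file as raw text and reports whether a key appears in it. The function name Test-PdfKey and the file name sample.pdf are made up for this example and are not part of the script further down. A missing key only suggests damage: these keys are optional, and some valid PDFs may store them in compressed form.

```powershell
# Hypothetical helper: returns $true when the key text appears in the raw file content.
function Test-PdfKey
{
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [string]$Key = "CreationDate"
    )
    # -Raw returns the whole file as one string ($null for an empty or unreadable file)
    $text = Get-Content -LiteralPath $Path -Raw -ErrorAction SilentlyContinue
    if ([string]::IsNullOrEmpty($text)) { return $false }
    # Ordinal comparison is a plain character match; IndexOf returns -1 when not found
    return ($text.IndexOf($Key, [StringComparison]::Ordinal) -ge 0)
}

# Example call (sample.pdf is a placeholder file name)
Test-PdfKey -Path ".\sample.pdf"
```

Calling it a second time with -Key "ModDate" would check the other key.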
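The following PowerShell (PS) script checks ALL PDF files in the current directory and writes the status of each one to a text file (!RESULTS.log). It took about 2 minutes to run against 35,000 PDF files. I tried to add comments for those who are new to PS. Hope this saves someone some time. There is probably a better way to do this, but it works flawlessly for my purposes and handles errors silently. If you see errors on screen, you may need to define the following at the beginning: $ErrorActionPreference = "SilentlyContinue".
Copy the following into a text file and name it appropriately (e.g. CheckPDF.ps1), or open PS, browse to the directory containing the PDF files to check, and paste it into the console.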
#
# PowerShell v4.0
#
# Get all PDF files (not folders) in current directory; @() keeps a single match as an array
#
$items = @(Get-ChildItem -File | Where-Object {$_.Extension -eq ".pdf"})
$logFile = "!RESULTS.log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -nonewline -foregroundcolor Yellow $msg
foreach ($item in $items)
{
    try
    {
        #
        # Read raw PDF data; -LiteralPath copes with names containing [ or ],
        # and -ErrorAction Stop sends read failures to the catch block
        #
        $pdfText = Get-Content -LiteralPath $item.FullName -Raw -ErrorAction Stop
        #
        # Find string (near end of PDF file); IndexOf returns -1 if it is missing
        #
        $ptr1 = $pdfText.IndexOf("CreationDate", [StringComparison]::Ordinal)
        $ptr2 = $pdfText.IndexOf("ModDate", [StringComparison]::Ordinal)
        #
        # Grab raw dates from file - will ERR if ptr is -1 or the file ends too soon
        #
        $cDate = $pdfText.Substring($ptr1, 37)
        $mDate = $pdfText.Substring($ptr2, 31)
    }
    catch
    {
        #
        # Append filename and bad status to logfile and increment a counter
        # catch block is also where you would rename, move, or delete bad files.
        #
        "*** $($item.Name) is Broken ***" >> $logFile
        $badCounter += 1
        continue
    }
    #
    # Append filename and good status to logfile
    #
    Write-Output "$($item.Name) - OK" >> $logFile
    #
    # Increment a counter
    #
    $goodCounter += 1
}
#
# Calculate total
#
$totalCounter = $badCounter + $goodCounter
#
# Append 3 blank lines to end of logfile
#
1..3 | %{ Write-Output "" >> $logFile }
#
# Append statistics to end of logfile
#
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"
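The comment in the catch block mentions renaming, moving, or deleting bad files. As a rough sketch of the "move" option only: the function name Move-BrokenPdf and the folder name _broken are assumptions for this example, not something the script above defines.

```powershell
# Hypothetical helper: moves one file into a quarantine folder, creating the folder if needed.
function Move-BrokenPdf
{
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [string]$Destination = ".\_broken"
    )
    # Create the quarantine folder on first use
    if (-not (Test-Path -LiteralPath $Destination))
    {
        New-Item -ItemType Directory -Path $Destination | Out-Null
    }
    # Move-Item reports an error if a file with the same name already exists there
    Move-Item -LiteralPath $Path -Destination $Destination
}
```

Inside the catch block, a call such as Move-BrokenPdf -Path $item.FullName placed before the continue statement would set each broken file aside while leaving the log entries unchanged.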