我有成千上万的文档,其中一些已扫描。因此,我需要一个脚本来测试属于目录的所有PDF文件。有没有简单的方法可以做到这一点?
- 大多数PDF是报告。因此,他们有很多文字。
它们是非常不同的,但是由于不稳定的OCR处理与扫描相结合,因此如下所述的扫描对象可以找到一些文本。
在下面的评论中,由于Sudodus提出的提案似乎非常有趣。查看已扫描的PDF与未扫描的PDF之间的区别:
已扫描:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
未扫描:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
每页的图像数量要大得多(每页大约一张)!
pdf
文件包含图像(在文本旁边或整个页面中插入到文档中,称为“扫描的pdf”),则该文件通常(也许总是)包含字符串/Image/
,可以在命令行中找到该字符串grep --color -a 'Image' filename.pdf
。这会将仅包含文本的文件与包含图像的文件(整页图像以及带有小徽标和中型插图图片的文本页)分开。