我如何将pdfminer用作库


72

我正在尝试使用pdfminer从pdf获取文本数据。我可以使用pdfminer命令行工具pdf2txt.py将数据成功提取到.txt文件中。我目前正在执行此操作,然后使用python脚本清理.txt文件。我想将pdf提取过程合并到脚本中,从而节省了一步。

当我找到此链接时我以为自己正在研究某些东西,但是任何解决方案都没有成功。也许那里列出的功能需要再次更新,因为我使用的是pdfminer的较新版本。

我也尝试了此处显示的功能,但是也没有用。

我尝试的另一种方法是使用调用脚本内的脚本os.system。这也是不成功的。

我正在使用Python版本2.7.1和pdfminer版本20110227。


你试过这个吗?
拉菲·凯特勒

我确实查看了api,谢谢,但是不幸的是,我的python技能不够强,无法获得工作功能。
jmeich 2011年

您可能会发现对类似问题的解决方案很有用:stackoverflow.com/a/61857301/7483211
Cornelius Roemer

Answers:


72

这是我最终制作的对我有用的清理版本。以下只是给定了文件名的PDF字符串。我希望这可以节省一些时间。

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    str = retstr.getvalue()
    retstr.close()
    return str

该解决方案在2013年11月API更改之前一直有效。


23
仅供参考,这不再适用于pdfminer的当前版本-我收到“ ImportError:无法导入名称process_pdf”
scubbo 2013年

1
这是一个
可行的

2
process_pdf仅由PDFPageInterpreter取代
Ando Jurai

1
StringIO也消失了,您可以改为使用“ from io import StringIO”
moli

78

这是与最新版本一起使用的新解决方案:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

5
大!终于不使用运行的解决方案process_pdf:)我想,与simple1.pdf官方git仓库
迈克尔

4
这适用于version pdfminer 20131113。谢谢。
AfromanJ 2014年

@czw。如何将其与内置选项一起使用?有没有办法指定要转换的页码?
杰森2014年

2
如果要获取html输出,只需将HTMLConverter添加到import和函数中,而不是TextConverter放HTMLConverter。就那么简单。
OWADVL 2014年

1
伟大的一个小小的修改:不要使用str作为变量名。str是python函数。
ViennaMike 2015年

13

我知道回答您自己的问题很不好,但是我想我可能已经明白了这一点,我不希望其他人浪费时间寻找解决我的问题的方法。

我在问题中发布的链接之一中遵循了建议,并重新调整了pdfminer随附的当前pdf2txt.py脚本的用途。如果对其他人有用的话,这里是该功能。感谢用户skyl发布了该答案,我要做的就是进行一些更改以使其能够与当前版本的pdfminer一起使用。

此函数采用pdf格式,并在相同名称的相同目录中创建一个.txt文件。

def convert_pdf(path, outtype='txt', opts={}):
import sys
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
import getopt

outfile = path[:-3] + outtype
outdir = '/'.join(path.split('/')[:-1])

# debug option
debug = 0
# input option
password = ''
pagenos = set()
maxpages = 0
# output option
# ?outfile = None
# ?outtype = None
outdir = None
#layoutmode = 'normal'
codec = 'utf-8'
pageno = 1
scale = 1
showpageno = True
laparams = LAParams()
for (k, v) in opts:
    if k == '-d': debug += 1
    elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
    elif k == '-m': maxpages = int(v)
    elif k == '-P': password = v
    elif k == '-o': outfile = v
    elif k == '-n': laparams = None
    elif k == '-A': laparams.all_texts = True
    elif k == '-V': laparams.detect_vertical = True
    elif k == '-M': laparams.char_margin = float(v)
    elif k == '-L': laparams.line_margin = float(v)
    elif k == '-W': laparams.word_margin = float(v)
    elif k == '-F': laparams.boxes_flow = float(v)
    elif k == '-Y': layoutmode = v
    elif k == '-O': outdir = v
    elif k == '-t': outtype = v
    elif k == '-c': codec = v
    elif k == '-s': scale = float(v)
#
#PDFDocument.debug = debug
#PDFParser.debug = debug
CMapDB.debug = debug
PDFResourceManager.debug = debug
PDFPageInterpreter.debug = debug
PDFDevice.debug = debug
#
rsrcmgr = PDFResourceManager()

outtype = 'text'

if outfile:
    outfp = file(outfile, 'w')
else:
    outfp = sys.stdout
device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)


fp = file(path, 'rb')
process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password,
                check_extractable=True)
fp.close()
device.close()
outfp.close()
return

14
可以回答您自己的问题,不用担心。
oers 2011年

1
我还重新调整了代码的用途。它可能会更苗条
Jesvin Jose 2011年

2
由于您提到的原因,非常希望回答您自己的问题
disc0dancer 2012年

12

使用最新版本的pdfminer(截至2014年9月)对我有用:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import unicodedata, codecs
from io import StringIO

def getPDFText(pdfFilenamePath):
    retstr = StringIO()
    parser = PDFParser(open(pdfFilenamePath,'r'))
    try:
        document = PDFDocument(parser)
    except Exception as e:
        print(pdfFilenamePath,'is not a readable pdf')
        return ''
    if document.is_extractable:
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr,retstr, codec='ascii' , laparams = LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        return retstr.getvalue()
    else:
        print(pdfFilenamePath,"Warning: could not extract text from pdf file.")
        return ''

if __name__ == '__main__':
    words = getPDFText(path)

11

这是我的解决方案

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import os 

def convert_pdf_to_txt(path, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(path, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

例如,您只想阅读pdf文件的前3页:

text = convert('../Data/EN-FINAL Table 9.pdf', pages=[0,1,2])

pdfminer.six == 20160614

的Python:3.x


1
真好!这对我有用。Python 3.6.2 pdfminer.six
PJATX

Python 3.6 / pdfminer.six / Windows 7下的最佳解决方案
Peter

4

对non-process_pdf答案的以下修改可直接从URL字符串名称中提取文本,并适用于版本20140328和Python 2.7:

from urllib2 import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    scrape = urlopen(url).read()
    fp = StringIO(scrape)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

4

这是pdfminer.six运行python 3.6的答案。pdfminer.high_level如果您只是想从一个简单的PDF文件中提取原始文本,则它使用的模块可以抽象出很多底层细节。

import pdfminer
import io

def extract_raw_text(pdf_filename):
    output = io.StringIO()
    laparams = pdfminer.layout.LAParams() # Using the defaults seems to work fine

    with open(pdf_filename, "rb") as pdffile:
        pdfminer.high_level.extract_text_to_fp(pdffile, output, laparams=laparams)

    return output.getvalue()

3

如果您正在通过urllib2处理抓取的数据,请尝试以下操作(在此处进行开发和说明):

def pdf_to_text(scraped_pdf_data): 
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf 
    from pdfminer.pdfdevice import PDFDevice 
    from pdfminer.converter import TextConverter 
    from pdfminer.layout import LAParams 

    import StringIO 
    fp = StringIO.StringIO() 
    fp.write(scraped_pdf_data) 
    fp.seek(0) 
    outfp = StringIO.StringIO() 

    rsrcmgr = PDFResourceManager() 
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams()) 
    process_pdf(rsrcmgr, device, fp) 
    device.close() 

    t = outfp.getvalue() 
    outfp.close() 
    fp.close() 
    return t

像其他答案一样,此处的代码改编了PDFMiner本身提供的pdf2txt实用程序。你可以由此也转换为HTML或XML -只是子HTMLConverterXMLConverterTextConverter无处不在上面。


3

以下代码对我来说适用于最新版本的PDFMiner,它采用pdf路径并以.txt格式返回文本。

PS:这是以上答案的修改。

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path, outtype='txt'):
    outfile = path[:-3] + outtype
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()
    if outfile:
        outfp = file(outfile, 'w')
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return

3

以防万一仍然有人需要它,使其与请求和python 3.4一起使用。感谢@bahmait在上面的回答:)

import requests

from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def pdf_to_text(url=None):
    text = None
    pdf = requests.get(url)

    if pdf.ok:
        fp = StringIO(str(pdf.content, 'utf-8'))
        outfp = StringIO()

        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
        process_pdf(rsrcmgr, device, fp)
        device.close()

        text = outfp.getvalue()
        outfp.close()
        fp.close()
    return text


if __name__ == "__main__":
    hello_world_text = pdf_to_text("https://bytebucket.org/hsoft/pdfminer3k/raw/28edfc91caed830674ca0b928f42571f7dee6091/samples/simple1.pdf")
    no_pdf = pdf_to_text('http://www.google.com/404')
    print(hello_world_text)
    print(no_pdf)

2
你有试过这个家伙吗?[link] pypi.python.org/pypi/pdfminer3k ...看起来好像不再真正支持了,但是它给了我一个很好的起点。最后,对于我的特定用例,我使用了他们的pdf2txt模块,混合了pypdf2中的几种方法。
djmm187

2

这是我最终制作的对我有用的清理版本。以下只是给定了文件名的PDF字符串。我希望这可以节省一些时间。

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    str = retstr.getvalue()
    retstr.close()
    return str

有人可以说:在pdf文件要放置的特定位置吗?


1

仅当有人仍然需要它时:如何使用PDFMiner从PDF打印HTML:

import sys
import getopt
from Core.Interfaces.IReader import IReader
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from cStringIO import StringIO

class PdfReader(object):
def __init__(self):
    pass

def readText(self,path, outtype='text', opts={}):
    outfile = path[:-3] + outtype
    outdir = '/'.join(path.split('/')[:-1])
    # debug option
    debug = 0
    # input option
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    # ?outfile = None
    # ?outtype = None
    outdir = None
    #layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    showpageno = True
    laparams = LAParams()
    for (k, v) in opts:
        if k == '-d': debug += 1
        elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
        elif k == '-m': maxpages = int(v)
        elif k == '-P': password = v
        elif k == '-o': outfile = v
        elif k == '-n': laparams = None
        elif k == '-A': laparams.all_texts = True
        elif k == '-V': laparams.detect_vertical = True
        elif k == '-M': laparams.char_margin = float(v)
        elif k == '-L': laparams.line_margin = float(v)
        elif k == '-W': laparams.word_margin = float(v)
        elif k == '-F': laparams.boxes_flow = float(v)
        elif k == '-Y': layoutmode = v
        elif k == '-O': outdir = v
        elif k == '-t': outtype = v
        elif k == '-c': codec = v
        elif k == '-s': scale = float(v)

    print laparams
    #
    #PDFDocument.debug = debug
    #PDFParser.debug = debug
    CMapDB.debug = debug
    PDFResourceManager.debug = debug
    PDFPageInterpreter.debug = debug
    PDFDevice.debug = debug
    #
    rsrcmgr = PDFResourceManager()

    #outtype = 'text'

    outfp = StringIO()

    device = HTMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)


    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password,
                    check_extractable=True)
    fp.close()
    device.close()
    print outfp.getvalue()
    outfp.close()

    return



reader = PdfReader()
opt = map(None,['-W','-L','-t'],[0.5,0.4,'html'])
reader.readText("/test_data/test.pdf","html",opt)

1

这个在python 3中为我工作。它需要PDFMiner.six包

pip install pdfminer.six

代码如下(与每个人相同的代码,但有较小的修复):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from six import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()

    return str.replace("\\n","\n")

0

完全公开,我是pdfminer.six的维护者之一。

如今,有多种API可以根据您的需要从PDF中提取文本。在幕后,所有这些API都使用相同的逻辑来解析和分析布局。

(所有示例均假设您的PDF文件称为example.pdf

命令行

如果您只想提取一次文本,则可以使用命令行工具pdf2txt.py:

$ pdf2txt.py example.pdf

高级API

如果要使用Python提取文本,则可以使用高级api。如果要以编程方式从许多PDF提取文本,则此方法是首选解决方案。

from pdfminer.high_level import extract_text

text = extract_text('example.pdf')

可组合的API

还有一个可组合的api,在处理生成的对象时具有很大的灵活性。例如,您可以使用它来实现自己的布局算法。其他答案中建议使用此方法,但仅在需要自定义pdfminer.six行为方式时,才建议使用此方法。

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

-1

以下代码段能够使用最新版本的pdfminer(截至2016年3月23日)从pdf文档中提取纯文本。希望这可以帮助。

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(path, 'rb')

    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,        password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    print text
    return text

convert_pdf_to_txt(<path_of_the_pdf_file>)
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.