如何在Python脚本中运行Scrapy

Question 1

我是Scrapy的新手，我正在寻找一种从Python脚本运行它的方法。我找到2个资料来解释这一点：

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

我不知道应该在哪里放置我的Spider代码以及如何从main函数中调用它。请帮忙。这是示例代码：

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

谢谢。

Question 2

所有其他答案均参考Scrapyv0.x。根据更新的文档，Scrapy 1.0要求：

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Question 3

尽管我还没有尝试过，但是我认为答案可以在scrapy文档中找到。要直接引用它：

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

根据我的收集，这是库中的一项新开发，使一些较早的在线方法（例如问题中的方法）过时了。

Question 4

在scrapy 0.19.x中，您应该这样做：

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent

注意这些行

settings = get_project_settings()
crawler = Crawler(settings)

没有它，您的蜘蛛将不会使用您的设置，也不会保存这些项目。花了我一段时间才能弄清楚为什么文档中的示例没有保存我的物品。我发送了一个拉取请求以修复doc示例。

要做的另一件事就是直接从脚本中调用命令

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name

从我在这里的第一个答案复制了这个答案：https : //stackoverflow.com/a/19060485/1402286

Question 5

只需我们可以使用

from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1,arg2=val2)
process.start()

在Spider__init__函数内部将这些参数与全局范围一起使用。

Question 6

当需要在一个python脚本中运行多个搜寻器时，必须谨慎处理反应堆停止，因为反应堆只能停止一次且无法重新启动。

但是，我在做专案时发现

os.system("scrapy crawl yourspider")

是最简单的。这将使我免于处理各种信号，特别是当我有多个蜘蛛时。

如果需要考虑性能，则可以使用多处理并行运行蜘蛛，例如：

def _crawl(spider_name=None):
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None

def run_crawler():

    spider_names = ['spider1', 'spider2', 'spider2']

    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)

Question 7

这是对 Scrapy在使用crawlerprocess运行时会引发错误

和 https://github.com/scrapy/scrapy/issues/1904#issuecomment-205331087

首先创建通常的蜘蛛以成功运行命令行。它应该运行并导出数据或图像或文件非常重要

一旦结束，就像在程序中粘贴在蜘蛛类定义上方和__name __下面以调用设置一样。

它将获得一些必要的设置，这些设置“从scrapy.utils.project import get_project_settings”无法完成，这是许多人推荐的

上下两部分都应该在一起。只有一个不跑。Spider将在scrapy.cfg文件夹中运行，而不在其他任何文件夹中运行

主持人可能会显示树形图以供参考

#Tree
[enter image description here][1]

#spider.py
import sys
sys.path.append(r'D:\ivana\flow') #folder where scrapy.cfg is located

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from flow import settings as my_settings

#----------------Typical Spider Program starts here-----------------------------

          spider class definition here

#----------------Typical Spider Program ends here-------------------------------

if __name__ == "__main__":

    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)

    process = CrawlerProcess(settings=crawler_settings)
    process.crawl(FlowSpider) # it is for class FlowSpider(scrapy.Spider):
    process.start(stop_after_crawl=True)

Question 8

# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute


def gen_argv(s):
    sys.argv = s.split()


if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()

将此代码放在可以从命令行运行的路径中scrapy crawl abc_spider。（使用Scrapy == 0.24.6测试）

Question 9

如果要运行简单的爬网，只需运行以下命令即可轻松实现：

刮y的爬行。还有另一个选项可将结果导出以某些格式存储，例如：Json，xml，csv。

scrapy搜寻-o result.csv或result.json或result.xml。

您可能想尝试一下