8

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. This is the example code:

    # This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
    #
    # The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart …

74 python web-scraping web-crawler scrapy

10

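Web scraping with Java

I can't find any good Java-based web scraping API. The site I need to scrape doesn't provide an API either; I want to iterate over all of its pages by pageID and extract the HTML title and other content from each page's DOM tree. Is there any way to do this other than web scraping?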

72 java web-scraping frameworks
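
The task described in the question above (walk a range of page IDs, then pull the title out of each page's markup) is language independent. A minimal sketch follows, written in Python rather than Java and using only the standard library. The URL template, the ID range, and the names `TitleParser`, `extract_title` and `page_urls` are hypothetical; downloading each URL is left out so the parsing step can be tried on any HTML string.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks).strip()


def extract_title(html):
    """Return the text of the first <title>, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def page_urls(template, first_id, last_id):
    """Yield one URL per page ID; `template` is a hypothetical URL pattern."""
    for page_id in range(first_id, last_id + 1):
        yield template.format(page_id)
```

Each URL produced by `page_urls` would be fetched with whatever HTTP client is available and the response body passed to `extract_title`. A full DOM library would be a better fit once more than the title is needed.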

7

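Web scraping in a Google Chrome extension (JavaScript + Chrome APIs)

What are the best options for scraping a page that is not currently open in a tab from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also acceptable.

The important thing is to mask the scraping so that it behaves like a normal web request, with no indications of AJAX or XMLHttpRequest such as X-Requested-With: XMLHttpRequest or Origin.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in a WebKit- or Chrome-specific API that can be used to make a normal web request and get the results for manipulation?

    var pageContent = getPageContent(url); // TODO: Implement
    var items = $(pageContent).find('.item');
    // Display items with further selections

Bonus points if this can work from a local file on disk, for initial debugging. If that is the only thing standing in the way of a solution, disregard the bonus points.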

72 javascript google-chrome google-chrome-extension xmlhttprequest web-scraping

8

How can I scrape faster

The job here is to scrape an API of a site, starting from https://xxx.xxx.xxx/xxx/1.json up to https://xxx.xxx.xxx/xxx/1417749.json, and write the results to MongoDB. For that I have the following code:

    import json
    import time

    import pymongo
    import requests

    client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
    db = client["thread1"]
    com = db["threadcol"]
    start_time = time.time()
    write_log = open("logging.log", "a")
    min = 1
    max = 1417749
    for n in range(min, max):
        response = requests.get("https://xx.xxx.xxx/{}.json".format(str(n)))
        if response.status_code == 200:
            parsed = json.loads(response.text)
            inserted = com.insert_one(parsed)
            write_log.write(str(n) + "\t" + str(inserted) + "\n")
            print(str(n) + …

16 python mongodb web-scraping pymongo
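
The loop in the question above issues one request at a time, so most of the run is spent waiting on the network. One common way to speed up such a job is to overlap the requests with a thread pool. The sketch below is an assumption about what could help, not a measured answer: it uses only the standard library (`urllib` instead of `requests`), the names `fetch_json` and `fetch_all` are hypothetical, `BASE_URL` reuses the placeholder address from the question, and a worker count of 32 is an arbitrary starting point.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder address taken from the question; replace with the real endpoint.
BASE_URL = "https://xxx.xxx.xxx/xxx/{}.json"


def fetch_json(n):
    """Fetch document n and return (n, parsed), or (n, None) on any failure."""
    try:
        with urllib.request.urlopen(BASE_URL.format(n), timeout=10) as resp:
            return n, json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers network and HTTP errors, ValueError covers bad JSON.
        return n, None


def fetch_all(ids, fetch=fetch_json, max_workers=32):
    """Run `fetch` over `ids` on a thread pool, yielding successful results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for n, doc in pool.map(fetch, ids):
            if doc is not None:
                yield n, doc
```

A caller would loop over `fetch_all(range(1, 1417750))` and write each document to the collection. Note that `range(min, max)` in the question stops one short of 1417749. `pool.map` queues every ID up front, so for a range this large it may be kinder to memory to feed the IDs in slices, and collecting documents into batches for pymongo's `insert_many` may cut the database round trips as well.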

3

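Click an item in an autocomplete list with VBA and HTML

I've created an automation that lets me enter details on a website (I can't share the site because it is internal). My code below only works up to the point where it enters text into the "received from" field. However, this "received from" field has an autocomplete list, and I need to select an item from it in order to populate other fields such as TIN and "Address". The autocomplete list is very similar to the one at https://jqueryui.com/autocomplete/ or the one at http://demos.codexworld.com/autocomplete-textbox-using-jquery-php-mysql/

Below is my code:

    Sub Automate_IE_Enter_Data()
        'This will load a webpage in IE
        Dim i As Long
        Dim Url As String
        Dim IE As InternetExplorer
        Dim objElement As Object
        Dim objCollection As Object
        Dim HWNDSrc As Long
        Dim wsTemplate As Worksheet
        Dim objEvent As Object
        Dim li_arr …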

13 html excel vba web-scraping autocomplete

1

Can't make my script auto-generate some values to be used in the payload

I've created a script that fetches html elements from a target page by sending two https requests one after the other. The script does the job flawlessly. However, I had to copy four values from chrome dev tools to fill in the four keys of payload in order to send the final http request that reaches the target page.

This is the starting link, and the steps to reach the target page are as follows. Click the Find Hotel button (there is no need to change the dates if the check-out date is, by default, at least one day later than the check-in date). Tick the box shown in the image below, then press the Book Now button above it. It should now take you to the target page automatically. Once you reach the target page, titled Enter Guest Details, parse the html elements from there.

I've tried with (a working one):

    import requests
    from bs4 import BeautifulSoup

    url = 'https://booking.discoverqatar.qa/SearchHandler.aspx?'
    second_url = 'https://booking.discoverqatar.qa/PassengerDetails.aspx?'

    params = {
        'Module':'H','txtCity':'','hdnCity':'2947','txtHotel':'','hdnHotel':'',
        'fromDate':'05/11/2019','toDate':'07/11/2019','selZone':'','minSelPrice':'',
        'maxSelPrice':'','roomConfiguration':'2|0|','noOfRooms':'1',
        'hotelStandardArray':'63,60,54,50,52,51','CallFrom':'','DllNationality':'-1',
        'HdnNoOfRooms':'-1','SourceXid':'MTEzNzg=','mdx':''
    }

    payload = {
        'CallFrom':'MToxNjozOCBQTXxCMkN8MToxNjozOCBQTQ==',
        'Btype':'MToxNjozOCBQTXxBfDE6MTY6MzggUE0=',
        'PaxConfig':'MToxNjozOCBQTXwyfDB8MnwwfHwxOjE2OjM4IFBN',
        'usid':'MToxNjozOCBQTXxoZW54dmkzcWVnc3J3cXpld2lsa2ZwMm18MToxNjozOCBQTQ=='
    }

    with requests.Session() as s:
        r = s.get(url,params=params,headers={"User-agent":"Mozilla/5.0"})
        res = s.get(second_url,params=payload,headers={
            "User-agent":"Mozilla/5.0",
            "Referer":r.url …

10 python python-3.x web-scraping
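
The four hard-coded payload values in the question above are Base64 strings, so a useful first step is to decode them and look at what they contain. The sketch below does that with the standard library. The helper names `decode_field` and `encode_field` are hypothetical, and the idea that the site would accept values rebuilt this way is an untested assumption.

```python
import base64

# The four values copied from the question.
payload = {
    "CallFrom": "MToxNjozOCBQTXxCMkN8MToxNjozOCBQTQ==",
    "Btype": "MToxNjozOCBQTXxBfDE6MTY6MzggUE0=",
    "PaxConfig": "MToxNjozOCBQTXwyfDB8MnwwfHwxOjE2OjM4IFBN",
    "usid": "MToxNjozOCBQTXxoZW54dmkzcWVnc3J3cXpld2lsa2ZwMm18MToxNjozOCBQTQ==",
}


def decode_field(value):
    """Decode one Base64 payload value to text."""
    return base64.b64decode(value).decode("utf-8")


def encode_field(timestamp, inner):
    """Rebuild a value in the observed 'timestamp|inner|timestamp' layout.

    The layout is inferred from the decoded samples only; it is an assumption
    that the server checks nothing beyond this.
    """
    raw = "{0}|{1}|{0}".format(timestamp, inner)
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    for key, value in payload.items():
        print(key, "->", decode_field(value))
```

Decoded, `CallFrom` reads `1:16:38 PM|B2C|1:16:38 PM`, and the other three follow the same pattern of a clock time wrapped around an inner value. The inner part of `usid` resembles a session identifier, which is a guess; if so, it would have to come from the live session rather than be generated locally.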

2

Unable to parse my username to make sure I'm logged in to a website

I've written a script in python to log in to a website and parse the username to make sure I've really been able to log in. The approach I've tried below seems to get me there. However, to succeed I had to use hardcoded cookies, taken from chrome dev tools, within the script.

I've tried with:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://secure.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.imdb.com%2Fap-signin-handler&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=imdb_pro_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl9wcm9fdXMiLCJyZWRpcmVjdFRvIjoiaHR0cHM6Ly9wcm8uaW1kYi5jb20vIn0&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0'
    signin = 'https://secure.imdb.com/ap/signin'
    mainurl = 'https://pro.imdb.com/'

    with requests.Session() as s:
        res = s.get(url,headers={"User-agent":"Mozilla/5.0"})
        soup = BeautifulSoup(res.text,"lxml")
        payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
        payload['email'] = 'some username'
        payload['password'] = 'some password'
        s.post(signin,data=payload,headers={
            "User-agent":"Mozilla/5.0",
            "Cookie": 'adblk=adblk_yes; ubid-main=130-2884709-6520735; _msuuid_518k2z41603=95C56F3B-E3C1-40E5-A47B-C4F7BAF2FF5D; …

9 python python-3.x web-scraping
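
The script in the question above builds its login payload from every named input on the sign-in page, so that hidden fields are sent back along with the credentials. A minimal sketch of that one step follows, using only the standard library in place of BeautifulSoup. The names `InputCollector` and `collect_form_fields` are hypothetical, and the sketch does not address the hard-coded cookie itself.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Gathers name/value pairs from every <input> that has a name attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # A missing or valueless attribute becomes an empty string.
            self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html):
    """Return a dict of form fields, mirroring the dict comprehension in the question."""
    collector = InputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields
```

The returned dict can then receive the `email` and `password` entries before it is posted. On the cookie side, a `requests.Session` keeps cookies that earlier responses set through their headers, so a hand-copied `Cookie` header may be unnecessary for those; cookies that the page sets through JavaScript would not be captured this way.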

Questions tagged «web-scraping»