Creating Your First Spider (2): Python Scrapy Tutorial for Version 1.51 and Later

Posted on: August 26, 2020 / December 7, 2022
Categories: Python, Scrapy

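Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.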

This is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

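    def start_requests(self):
        # URLs the spider starts crawling from.
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # Each downloaded response is passed to self.parse.
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # For '.../page/1/' the second-to-last segment is the page number.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, so the file is opened in binary mode.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)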
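
The way parse() derives the file name can be tried on its own, without Scrapy. The following is only a small sketch of the string handling; the helper name filename_for is hypothetical and exists just for this illustration.

```python
def filename_for(url):
    """Mirror the file-name logic used in parse() above."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


print(filename_for('http://quotes.toscrape.com/page/1/'))  # quotes-1.html
print(filename_for('http://quotes.toscrape.com/page/2/'))  # quotes-2.html
```

Note that this relies on the trailing slash in the URL: without it, index -2 would be the segment 'page' rather than the page number.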

As you can see, our spider subclasses scrapy.Spider and defines some attributes and methods:

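name: identifies the spider. It must be unique within a project, that is, you cannot set the same name for different spiders.
start_requests(): must return an iterable of requests (you can return a list of requests or write a generator function) from which the spider will begin to crawl. Subsequent requests will be generated successively from these initial requests.
parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.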
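
The two options mentioned for start_requests(), returning a list or writing a generator function, can be compared in a minimal sketch. To keep it runnable without Scrapy installed, it uses a hypothetical FakeRequest stand-in instead of scrapy.Request, and the function names are made up for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for scrapy.Request so the snippet runs without Scrapy.
FakeRequest = namedtuple("FakeRequest", ["url"])

URLS = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]


def start_requests_as_list():
    # Build every request up front and return them in a list.
    return [FakeRequest(url=url) for url in URLS]


def start_requests_as_generator():
    # Produce requests lazily, one per iteration, as the spider above does.
    for url in URLS:
        yield FakeRequest(url=url)


# Both results are iterables, so a consumer can loop over either the same way.
for source in (start_requests_as_list, start_requests_as_generator):
    print([request.url for request in source()])
```

Either form satisfies the "iterable of requests" requirement; the generator simply avoids building the whole list before the first request is handed over.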

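The parse() method usually parses the response, extracting the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request objects) from them.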
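
The pattern described above (extract data as dicts, then find new URLs to follow) can be sketched with the standard library alone. This is not Scrapy's selector API: html.parser and urljoin are swapped in so that the example runs without Scrapy, and the sample markup, the QuoteAndLinkParser class, and the parse(page_url, html) signature are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup, invented for this illustration only.
SAMPLE_HTML = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
<li class="next"><a href="/page/2/">Next</a></li>
"""


class QuoteAndLinkParser(HTMLParser):
    """Collect the text of span.text elements and the href of every link."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self.links = []
        self._in_text_span = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "text":
            self._in_text_span = True
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_text_span = False

    def handle_data(self, data):
        if self._in_text_span:
            self.quotes.append(data)


def parse(page_url, html):
    """Yield one dict per extracted quote, then one absolute URL per link."""
    parser = QuoteAndLinkParser()
    parser.feed(html)
    for text in parser.quotes:
        yield {"text": text}
    for href in parser.links:
        # Resolve relative links against the URL of the current page.
        yield urljoin(page_url, href)


results = list(parse('http://quotes.toscrape.com/page/1/', SAMPLE_HTML))
print(results)
```

In a real spider, each discovered URL would be wrapped in a scrapy.Request with a callback, as start_requests() does in the spider above, rather than yielded as a plain string.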
