Running spiders with crawl arguments (6): latest Python Scrapy tutorial, version 1.51 and above

Posted on: August 28, 2020 (December 7, 2022)
Categories: Python, scrapy
Tags: crawl, def, HTTP, http_pass, http_user, humor, None, python, quotes, Scrapy, Scrapy tutorial, self, Spider, spider arguments, start, start_urls, tag, tag=humor, url, user_agent, yield, arguments, basic concepts, crawler, spider, configuration file

You can provide command-line arguments to your spider by using the -a option when running it:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

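These arguments are passed to the Spider's __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument is available as self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL from the argument: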

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

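If you pass the tag=humor argument to this spider, you will notice that it only visits URLs under the humor tag, such as http://quotes.toscrape.com/tag/humor.

The rest of this post covers handling spider arguments in more detail.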
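The URL-building step in start_requests can be isolated as a plain function to see which URL the spider would request for a given -a tag=... value. This is only a minimal sketch: build_start_url and BASE_URL are hypothetical names introduced for illustration and are not part of Scrapy.

```python
BASE_URL = 'http://quotes.toscrape.com/'


def build_start_url(tag=None):
    """Return the first URL to crawl, mirroring the logic in start_requests."""
    if tag is not None:
        # A tag was supplied with -a tag=..., so restrict the crawl to it.
        return BASE_URL + 'tag/' + tag
    # No tag supplied: start from the site root.
    return BASE_URL


print(build_start_url())
print(build_start_url('humor'))
```

Running the spider without -a tag=... corresponds to the first call, and running it with -a tag=humor corresponds to the second.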

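Spiders can receive arguments that modify their behaviour. Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the crawl command using the -a option. For example: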

scrapy crawl myspider -a category=electronics

Spiders can access arguments in their __init__ methods:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
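        # ...

The default __init__ method takes any spider arguments and copies them to the spider as attributes. The example above can therefore also be written as follows: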

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
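
Because the attribute only exists when the argument was actually passed, reading self.category directly fails with AttributeError if the spider is run without -a category=.... The sketch below uses a stand-in class (ArgHolder, a hypothetical name, not Scrapy's Spider) that copies keyword arguments onto the instance, to illustrate why getattr with a default, as in the quotes example, is the safer way to read an optional argument.

```python
class ArgHolder:
    """Hypothetical stand-in: copies keyword arguments onto the instance."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


with_arg = ArgHolder(category='electronics')
without_arg = ArgHolder()

# Argument present: plain attribute access works.
print(with_arg.category)

# Argument missing: getattr with a default avoids AttributeError.
print(getattr(without_arg, 'category', None))
```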

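Keep in mind that spider arguments are only strings. The spider does not parse them for you. If you want to set the start_urls attribute from the command line, you have to parse the value into a list yourself, using something like ast.literal_eval or json.loads, and then set it as an attribute. Otherwise, the spider iterates over the start_urls string (a very common Python pitfall) and treats each character as a separate URL.

A valid use case is to set the HTTP authentication credentials used by HttpAuthMiddleware or the user agent used by UserAgentMiddleware: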
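The pitfall and the fix can be seen without running a crawl. The following is a minimal sketch that assumes the URLs are passed as a JSON list in a single string; parse_start_urls is a hypothetical helper, not a Scrapy function.

```python
import json


def parse_start_urls(raw):
    """Turn a JSON list passed as a string into a list of URL strings."""
    urls = json.loads(raw)
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError('start_urls must be a JSON list of strings')
    return urls


raw = '["http://www.example.com/a", "http://www.example.com/b"]'

# Iterating over the raw string yields single characters, not URLs.
print(list(raw)[:3])

# Parsing first yields the two intended URLs.
print(parse_start_urls(raw))
```

Inside a spider's __init__, the parsed list could then be assigned to self.start_urls before any requests are generated.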

scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.
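As a rough illustration, the form body for such a request could be assembled as below. This sketch assumes a Scrapyd instance on its default local port and relies on Scrapyd forwarding extra form fields to the spider as arguments; myproject and build_schedule_body are hypothetical names, and the code only builds the body without sending anything.

```python
from urllib.parse import urlencode

# Assumed local Scrapyd endpoint; adjust host and port to your deployment.
SCRAPYD_SCHEDULE_URL = 'http://localhost:6800/schedule.json'


def build_schedule_body(project, spider, **spider_args):
    """Encode the form body for a schedule.json POST request."""
    fields = {'project': project, 'spider': spider}
    # Every extra keyword is sent as an additional form field.
    fields.update(spider_args)
    return urlencode(fields)


body = build_schedule_body('myproject', 'myspider', category='electronics')
print(body)
```

The resulting string would be sent as the POST data to SCRAPYD_SCHEDULE_URL, which plays the same role as -a category=electronics on the command line.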
