Scraping Web Pages and Extracting Data (5): Python Scrapy Tutorial for Version 1.51 and Later

Posted on: August 27, 2020; December 7, 2022
Categories: Python, scrapy
Tags: author, css, extract, first, href, HTTP, Page, python, quote, quotes, Request, Response, Scrapy, Scrapy Selectors, scrapy.Request, Scrapy tutorial, title, XPath, installing Scrapy, shortcuts, extraction, crawler, examples, spider, selectors

Extracting data

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

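Note

Remember to always enclose the URL in quotes when running the Scrapy shell from the command line; otherwise URLs that contain arguments (i.e. the & character) will not work.

On Windows, use double quotes instead: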

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements with CSS on the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

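The result of running response.css('title') is a list-like object called SelectorList. It represents a list of Selector objects that wrap XML/HTML elements and let you run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do: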

>>> response.css('title::text').extract()
['Quotes to Scrape']

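There are two things to note here. One is that we added ::text to the CSS query, which means we want to select only the text nodes directly inside the <title> element. If we don't specify ::text, we get the full title element, including its tags: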

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

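The other thing is that the result of calling .extract() is a list, because we are dealing with an instance of SelectorList. When you know you only want the first result, as in this case, you can do: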

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

As an alternative, you could write:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

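However, using .extract_first() avoids an IndexError and returns None when no element matches the selection.

There is a lesson here: for most scraping code, you want it to be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you still get at least some data.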
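
The contrast between indexing and .extract_first() can be seen with a plain Python list standing in for the result of a selector that matched nothing. This is only an analogy under that assumption: the helper name first_or_none is hypothetical and is not part of Scrapy.

```python
# A plain list stands in for the result of a selection that matched nothing.
matches = []


def first_or_none(values):
    """Return the first value, or None when there is none (hypothetical helper)."""
    return next(iter(values), None)


# Indexing an empty result raises IndexError ...
try:
    matches[0]
    indexing_failed = False
except IndexError:
    indexing_failed = True

print(indexing_failed)                      # True
# ... while the "first or None" style keeps the scraping code running.
print(first_or_none(matches))               # None
print(first_or_none(["Quotes to Scrape"]))  # Quotes to Scrape
```

Code that receives None can then decide whether to skip the field or fall back to a default, instead of aborting the whole page.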

Besides the extract() and extract_first() methods, you can also use the re() method to extract data with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
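
The three results above follow ordinary regular-expression rules: without capture groups each whole match is returned, and with capture groups each group becomes its own entry. The sketch below reproduces them with the standard library's re module on the plain title string. It approximates the flat list that re() returns and is not Scrapy's own implementation; the helper name flat_matches is hypothetical.

```python
import re


def flat_matches(pattern, text):
    """Return matches as a flat list of strings (hypothetical helper).

    With capture groups, each group becomes its own list entry;
    without groups, each whole match is one entry.
    """
    results = []
    for match in re.finditer(pattern, text):
        groups = match.groups()
        results.extend(groups if groups else [match.group(0)])
    return results


title = "Quotes to Scrape"
print(flat_matches(r"Quotes.*", title))        # ['Quotes to Scrape']
print(flat_matches(r"Q\w+", title))            # ['Quotes']
print(flat_matches(r"(\w+) to (\w+)", title))  # ['Quotes', 'Scrape']
```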

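To find the right CSS selectors, you may find it useful to open the response page from the shell in your web browser using view(response). You can use your browser's developer tools or extensions such as Firebug (see the sections on using Firebug for scraping and using Firefox for scraping).

Selector Gadget is also a nice tool for quickly finding the CSS selector for visually selected elements, and it works in many browsers.

XPath: a brief introduction

Besides CSS, Scrapy selectors also support XPath expressions: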

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

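XPath expressions are very powerful and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see this if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. With XPath you can select things like "the link that contains the text Next Page". This makes XPath a very good fit for scraping, and we encourage you to learn it even if you already know how to construct CSS selectors; it will make scraping much easier.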
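
Selecting by content can be tried without Scrapy using the small XPath subset in the standard library's xml.etree.ElementTree. The sketch below assumes Python 3.7 or later (for the [.='text'] predicate) and uses a simplified, hypothetical pager snippet rather than the real page markup. Scrapy selectors accept full XPath 1.0, where such a query would more typically be written with contains(), for example //a[contains(., "Next")]/@href.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical pager markup (well-formed so ElementTree can parse it).
PAGER = (
    '<ul class="pager">'
    '<li class="previous"><a href="/page/1/">Previous</a></li>'
    '<li class="next"><a href="/page/3/">Next</a></li>'
    "</ul>"
)

root = ET.fromstring(PAGER)

# Select the <a> element by its text content, not by its position or class.
link = root.find(".//a[.='Next']")
print(link.get("href"))  # /page/3/
```

A CSS selector can express "the anchor inside li.next", but not "the anchor whose text is Next"; the latter is what the content predicate adds.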

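We won't cover much of XPath here, but you can read more about using XPath with Scrapy selectors in the Scrapy documentation. To learn more about XPath itself, we recommend a tutorial that teaches XPath through examples and a tutorial on "how to think in XPath".

Extracting quotes and authors

Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.

Each quote on http://quotes.toscrape.com is represented by HTML elements that look like this: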

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's open the Scrapy shell and play around a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

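Each of the selectors returned by the query above lets us run further queries over its sub-elements. Let's assign the first selector to a variable so that we can run our CSS selectors directly on a particular quote: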

>>> quote = response.css("div.quote")[0]

Now, let's extract the title, author and tags from that quote using the quote object we just created:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each piece, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

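Extracting data in our spider

Let's get back to our spider. Until now it doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the Python yield keyword in the callback, as shown below: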

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

If you run this spider, it will output the extracted data along with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
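
Because the callback is an ordinary Python generator, the yield pattern can also be tried offline. The sketch below is a stand-in under stated assumptions: it parses the quote markup shown earlier with the standard library's xml.etree.ElementTree instead of Scrapy selectors, wraps that markup in a <body> element for the example, and uses the hypothetical function name parse_quotes.

```python
import xml.etree.ElementTree as ET

# The quote markup from earlier, wrapped in <body> so it has a parent element.
PAGE = """<body>
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
</body>"""


def parse_quotes(root):
    """Yield one dict per quote element, mirroring the spider callback."""
    for quote in root.findall(".//div[@class='quote']"):
        raw_text = quote.findtext("span[@class='text']")
        yield {
            # Collapse the line break inside the quote into a single space.
            "text": " ".join(raw_text.split()),
            "author": quote.findtext(".//small[@class='author']"),
            "tags": [a.text for a in quote.findall(".//a[@class='tag']")],
        }


items = list(parse_quotes(ET.fromstring(PAGE)))
print(items[0]["author"])  # Albert Einstein
print(items[0]["tags"])    # ['change', 'deep-thoughts', 'thinking', 'world']
```

Each yield hands one item back to the caller and then resumes the loop, which is why a single callback can produce many items per page.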

Storing the scraped data

The simplest way to store the scraped data is to use Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

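This will generate a quotes.json file containing all scraped items, serialized as JSON.

For historical reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you will end up with a broken JSON file.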
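
Why appending breaks the file can be checked with the standard library's json module. The sketch below assumes each run writes one JSON array; the exact bytes Scrapy writes may differ, and only the fact that two arrays end up back to back matters here.

```python
import json

# Each run serializes its items as one JSON array.
first_run = json.dumps([{"author": "Albert Einstein"}])
second_run = json.dumps([{"author": "J.K. Rowling"}])

# Appending the second export to the same file puts two arrays back to back.
appended = first_run + second_run

try:
    json.loads(appended)
    is_valid = True
except json.JSONDecodeError:
    # A JSON document may contain only one top-level value.
    is_valid = False

print(is_valid)  # False
```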

You can also use other formats, such as JSON Lines:

scrapy crawl quotes -o quotes.jl

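The JSON Lines format is useful because it is stream-like: you can easily append new records to it, so it does not have the JSON problem described above when you run the command twice. Also, because each record is on a separate line, you can process big files without having to fit everything in memory, and tools such as JQ can help you do that on the command line.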
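
Reading such a file one record at a time might look like the sketch below. It uses io.StringIO as a stand-in for quotes.jl after two runs appended to it; the helper name iter_records and the sample record are assumptions for illustration.

```python
import io
import json


def iter_records(lines):
    """Yield one parsed record per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# One run's output; appending a second run simply adds more lines.
run = '{"author": "Albert Einstein", "tags": ["change"]}\n'
jl_file = io.StringIO(run + run)

# Only one line is held in memory at a time while iterating.
authors = [record["author"] for record in iter_records(jl_file)]
print(authors)  # ['Albert Einstein', 'Albert Einstein']
```

The file stays parseable after the second run, but note that it now contains duplicate records.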

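In small projects (like the one in this tutorial), that should be enough. However, if you want to do more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any item pipeline if you just want to store the scraped items.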
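
As a rough idea of what an Item Pipeline looks like: it is a plain Python class with a process_item(self, item, spider) method that returns the item. The class below is a hypothetical example (the name StripQuoteMarksPipeline and its cleaning rule are assumptions, not part of the tutorial project) and is called by hand here so it runs without Scrapy.

```python
class StripQuoteMarksPipeline:
    """Hypothetical pipeline that removes the curly quote marks around 'text'."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text:
            # Drop a leading “ and a trailing ” if present.
            item["text"] = text.strip("“”")
        return item


# Called by hand for illustration; Scrapy would call it for every scraped item.
pipeline = StripQuoteMarksPipeline()
item = {
    "text": "“It is better to be hated for what you are "
            "than to be loved for what you are not.”",
    "author": "André Gide",
    "tags": ["life", "love"],
}
cleaned = pipeline.process_item(item, spider=None)
print(cleaned["text"])
```

To have Scrapy run such a class, it would be listed in the ITEM_PIPELINES setting in settings.py, for example {'tutorial.pipelines.StripQuoteMarksPipeline': 300}, assuming the project package is named tutorial as the path above suggests.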

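Following links

Let's say that, instead of just scraping the first two pages of http://quotes.toscrape.com, you want the quotes from all the pages on the website.

Now that you know how to extract data from pages, let's see how to follow links from them.

The first step is to extract the link to the page we want to follow. Examining our page, we can see that there is a link to the next page with the following markup: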

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

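This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this: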

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Now let's see our spider modified to recursively follow the link to the next page, extracting data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

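Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with the urljoin() method (since the links can be relative), and yields a new request to the next page. It registers itself as the callback to handle the data extraction for that page, which keeps the crawl going through all the pages.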
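
How a relative link becomes an absolute URL can be illustrated with the standard library's urllib.parse.urljoin, which resolves a link against the URL of the page it was found on. Treating it as equivalent to response.urljoin() is an assumption made for this sketch; it ignores details such as a <base> tag in the page.

```python
from urllib.parse import urljoin

# URL of the page on which the link was found.
page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative link, as in the pager markup.
print(urljoin(page_url, "/page/2/"))  # http://quotes.toscrape.com/page/2/

# A path-relative link resolves against the current path.
print(urljoin(page_url, "../2/"))     # http://quotes.toscrape.com/page/2/

# An absolute URL is returned unchanged.
print(urljoin(page_url, "http://quotes.toscrape.com/page/3/"))
```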

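What you see here is Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when the request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following the link to the next page until it doesn't find one; this is handy for crawling blogs, forums and other sites with pagination.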
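
The loop can be simulated in a few lines of plain Python. Everything in the sketch below is a stand-in: SITE is a hypothetical in-memory site with placeholder quotes, a yielded URL string plays the role of a Request, and crawl() is a toy scheduler. Scrapy's real engine also runs requests concurrently and filters duplicate requests, which this sketch does not attempt.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory site: URL -> (quotes on the page, href of "next" or None).
SITE = {
    "http://quotes.toscrape.com/page/1/": (["quote A", "quote B"], "/page/2/"),
    "http://quotes.toscrape.com/page/2/": (["quote C"], "/page/3/"),
    "http://quotes.toscrape.com/page/3/": (["quote D"], None),
}


def parse(url):
    """Yield scraped items, then the next URL to visit if there is one."""
    quotes, next_href = SITE[url]
    for quote in quotes:
        yield {"text": quote}
    if next_href is not None:
        yield urljoin(url, next_href)  # plays the role of a Request


def crawl(start_url):
    """Toy scheduler: run the callback until no pending requests remain."""
    pending = deque([start_url])
    items = []
    while pending:
        for result in parse(pending.popleft()):
            if isinstance(result, dict):
                items.append(result)    # a scraped item
            else:
                pending.append(result)  # a follow-up request
    return items


items = crawl("http://quotes.toscrape.com/page/1/")
print(len(items))  # 4
```

The crawl stops on its own at the last page because parse() yields no further URL there, just as the spider stops when li.next is missing.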

A shortcut for creating Requests

As a shortcut for creating Request objects, you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

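Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attribute: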

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

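Note

response.follow(response.css('li.next a')) is not valid, because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like the one in the example above, or response.follow(response.css('li.next a')[0]), is fine.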
