A Detailed Guide to Using XPath Selectors for Spider Scraping: Python scrapy.Spider (15), Latest Scrapy Tutorial for Version 1.51 and Above

Posted on: September 1, 2020; December 8, 2022
Categories: Python, scrapy
Tags: css, div, EXSLT, extract, href, html, image, itemprop, My, org, python, response.css, response.xpath, Scrapy, scrapy shell, Scrapy tutorial, sel, Selector, span, Spider, TextResponse, XPath, regular expressions, crawler, spider, selector

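Constructing selectors

Scrapy selectors are instances of the Selector class, constructed by passing either text or a TextResponse object. The selector automatically chooses the best parsing rules (XML or HTML) based on the input type: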

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

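Constructing from a response (the encoding argument is needed when the body is a text string rather than bytes):

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')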
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector on the .selector attribute. It is perfectly fine to use this shortcut whenever possible:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

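Using selectors

To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and an example page hosted on the Scrapy documentation server: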

https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here is its HTML code:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, let's open the shell:

scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

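Then, after the shell loads, the response is available as the response shell variable, with its selector attached in the response.selector attribute.

Since we are dealing with HTML, the selector automatically uses an HTML parser.

So, looking at the HTML code of that page, let's construct an XPath that selects the text inside the title tag: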

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

Querying responses with XPath and CSS is so common that responses include two convenience shortcuts, response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

To actually extract the textual data, you must call the selector's .extract() method, as follows:

>>> response.xpath('//title/text()').extract()
[u'Example website']

If you only want the first matched element, call the selector's .extract_first() method:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

It returns None if no element was found:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

A default return value can be passed as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

Note that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

>>> response.css('title::text').extract()
[u'Example website']

Now we are going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

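Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: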

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

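Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() or .css() methods, .re() returns a list of unicode strings, so you cannot construct nested .re() calls.

Here is an example that extracts the image names from the HTML code above: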

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

There is also a helper named .re_first(), the counterpart of .extract_first() for .re(). Use it to extract only the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'

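Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath is absolute to the document, not relative to the Selector you are calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements: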

>>> divs = response.xpath('//div')

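At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())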

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

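For more details about relative XPaths, see the Location Paths section in the XPath specification.

Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders such as ?, which are then substituted with values passed along with the query.

Here is an example that matches an element based on its "id" attribute value without hard-coding it (as was done previously):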

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()
u'Name: My image 1 '

Here is another example, which finds the "id" attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'

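All variable references must have a bound value when calling .xpath() (otherwise you will get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.

parsel, the library that powers Scrapy selectors, has more details and examples on XPath variables.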

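Using EXSLT extensions

Being built on top of lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions: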

Prefix	Namespace	Usage
re	http://exslt.org/regular-expressions	regular expressions
set	http://exslt.org/sets	set manipulation

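Regular expressions

The test() function, for example, can prove quite useful when XPath's starts-with() or contains() are not sufficient.

Example selecting links in list items whose "class" attribute ends with a digit: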

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>

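Warning

The C library libxslt does not natively support EXSLT regular expressions, so lxml's implementation uses hooks into Python's re module. As a result, using regexp functions in your XPath expressions may add a small performance penalty.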

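Set operations

These can be handy for excluding parts of a document tree before extracting text elements, for example.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and their corresponding itemprops: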

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print("current scope: %s" % scope.xpath('@itemtype').extract())
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print("    properties: %s" % props.extract())
...     print("")

current scope: [u'http://schema.org/Product']
    properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']
    properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']
    properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

>>>

Here we first iterate over the itemscope elements, and for each one we look for all itemprop elements and exclude those that are themselves inside another itemscope.

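Some XPath tips

Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub's blog. If you are not yet familiar with XPath, you may want to take a look at this XPath tutorial first.

Using text nodes in a condition

When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements, that is, a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function such as contains() or starts-with(), the result is the text of the first element only.

Example: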

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Converting a node-set to a string:

>>> sel.xpath('//a//text()').extract() # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract() # convert it to string
[u'Click here to go to the ']

A single node converted to a string, however, puts together its own text plus the text of all its descendants:

>>> sel.xpath("//a[1]").extract() # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract() # convert it to string
[u'Click here to go to the Next Page']

So, using the .//text() node-set does not select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]

But using . to refer to the node itself works:

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

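Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes that occur first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example: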

>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()

This gets every <li> element that is the first <li> under its parent, whatever that parent is:

>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']

This gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
[u'<li>1</li>']

This gets every <li> element that is the first <li> under a <ul> parent:

>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']

This gets the first <li> element under a <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
[u'<li>1</li>']

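When querying by class, consider using CSS

Because an element can have multiple CSS classes, the XPath way to select elements by class is rather verbose: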

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

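If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that, you may end up with more elements than you want, if they have a different class name that shares the string someclass.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can select by class using CSS and then switch to XPath when needed: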

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that follow.
