Using Selenium and Chrome in Scrapy

Web crawling

2019-8-26

This post combines Scrapy, Selenium, and headless Chrome to crawl pages that need JavaScript rendering, using JD's search results for "手机" (phone) as the example.

Page analysis

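The search for "手机" returns 100 pages of results in total. Watching a page load shows that the results do not arrive all at once: new results appear only after the mouse wheel has scrolled down a certain distance, which is implemented through JavaScript rendering.
We can simulate scrolling to the very bottom with Selenium's execute_script("window.scrollTo(0, document.body.scrollHeight);").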
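A single scroll may not be enough if each scroll triggers another batch of results. The following is a minimal sketch, not the code used later in this post: it repeats the scroll call above until the page height stops growing. The helper name scroll_to_bottom and the pause and max_rounds parameters are assumptions made for illustration; driver is any object that offers a Selenium-style execute_script() method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is any object with a Selenium-style execute_script() method.
    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so this is the bottom
        last_height = new_height
    return rounds
```

With a real browser this would be called as scroll_to_bottom(browser) right after browser.get(url). The pause value is a guess and may need tuning for a slow connection.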

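Now look at the pagination. When "next page" takes us to page 2, the page value in the URL is 3; clicking "next page" again to reach page 3 gives page=5. From this we can infer that the URL parameter and the displayed page number are related by page = 2*real_page - 1, which gives us the URL of every results page.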
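The mapping can be checked with a few lines of Python. This is a small sketch of the rule inferred above (displayed page 2 has page=3, displayed page 3 has page=5); the function names url_page and search_urls are made up for illustration, and the URL template is the one used by the spider in this post.

```python
from urllib.parse import quote

SEARCH_URL = "https://search.jd.com/Search?keyword={keyword}&page={page}&enc=utf-8"


def url_page(display_page):
    """Map a 1-based displayed page number to the `page` URL parameter."""
    if display_page < 1:
        raise ValueError("display_page must be >= 1")
    return 2 * display_page - 1


def search_urls(keyword, total_pages):
    """Build the search URL for every displayed page from 1 to total_pages."""
    encoded = quote(keyword)  # percent-encode the keyword as UTF-8
    return [
        SEARCH_URL.format(keyword=encoded, page=url_page(n))
        for n in range(1, total_pages + 1)
    ]


if __name__ == "__main__":
    for url in search_urls("手机", 3):
        print(url)
```

Displayed pages 1, 2, 3 map to page=1, 3, 5, which matches the observations and the 2*i + 1 expression used in the spider.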

Implementation

Only the key source code is shown here. settings.py and the other files are omitted; see my GitHub for the details.

# search.py
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
import time
class SearchSpider(scrapy.Spider):
    name = 'search'
    search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
    start_urls = ['https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8']

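    def __init__(self, *args, **kwargs):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        # Selenium 3 style call; newer Selenium 4 releases drop executable_path in favor of a Service object.
        self.browser = webdriver.Chrome(options=chrome_options, executable_path='/usr/local/bin/chromedriver')
        super(SearchSpider, self).__init__(*args, **kwargs)

    def closed(self, reason):
        self.browser.quit()         # remember to shut down the browser and the chromedriver process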

    def parse(self, response):
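        # The pager on the first results page reports the total number of displayed pages.
        total_page = response.css('span.p-skip em b::text').extract_first()
        if total_page:
            for i in range(int(total_page)):
                # Displayed page i + 1 corresponds to the URL parameter page = 2*i + 1.
                next_page_url = self.search_page_url_pattern.format(page=2*i + 1)
                yield scrapy.Request(next_page_url, callback=self.parse_page)
                time.sleep(1)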

    def parse_page(self, response):
        phone_info_list = response.css('div.p-name a')
        for item in phone_info_list:
            phone_name = item.css('a::attr(title)').extract_first()
            phone_href = item.css('a::attr(href)').extract_first()

            yield dict(name=phone_name, href=phone_href)

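The webdriver is created in the spider itself, so a new browser does not have to be opened for every request.
closed() must shut the browser down.
parse() first reads the total number of pages, then generates the page URLs according to the rule above and continues crawling.
parse_page() extracts the wanted fields according to the page structure; nothing more needs to be said about it.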

#middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse

class JdDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
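        # Load the page in the spider's shared browser, scroll so the lazily
        # loaded results are requested, then return the rendered HTML to Scrapy.
        spider.browser.get(request.url)
        for i in range(5):
            spider.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding='utf-8', request=request)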

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

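Here we rely on how downloader middleware works: in process_request() the webdriver loads the page and simulates scrolling to get the full page source, and the method then returns a Response object directly. By Scrapy's rules, once process_request() returns a Response, the process_request() methods of the remaining downloader middlewares and the actual download are skipped, and that response is handed back.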
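The middleware only takes effect once it is enabled in settings.py, which this post does not show. A minimal sketch of that fragment follows. The package name jd is an assumption (it is not given in the post) and has to be replaced with the real project package; 543 is simply the order value found in Scrapy's generated project template.

```python
# settings.py (fragment)
# "jd" is an assumed project package name; use your own project's module path.
DOWNLOADER_MIDDLEWARES = {
    "jd.middlewares.JdDownloaderMiddleware": 543,
}
```

Because process_request() returns the response itself, Scrapy's built-in downloader is never reached for these requests.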

Run scrapy crawl search -o result.csv --nolog to get the crawl results.

Summary

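This post showed how to combine Selenium, headless Chrome, and Scrapy to crawl information from dynamic pages. With this approach, pages that need dynamic rendering are no longer out of reach.
With dynamic pages handled, the next problem is crawl scale, so the next step is to learn how to use scrapy-redis for distributed crawling.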

Author: 喵帕斯0_0