Python抓取Google Trends(谷歌指数)

网络爬虫

浏览数:354

2019-11-1

Pyppeteer暴力抓取Google trends:

import re
import time
import asyncio
from lxml import etree
from pyppeteer import launch

async def main():
    # headless参数设为False,则变成有头模式
    browser = await launch(
        # headless=False
    )
    page = await browser.newPage()
    await page.setViewport(viewport={'width':1280, 'height':800})
    await page.setJavaScriptEnabled(enabled=True)
    await page.goto('https://trends.google.com/trends/?geo=US')
    await page.type(selector='input#input-254', text='bitcoin')
    await asyncio.sleep(1) # 等待网页加载出来,懒得用条件判断了
    await page.keyboard.press('Enter')
    await asyncio.sleep(2)
    # print(await page.title())
    await page.goto('https://trends.google.com/trends/explore?date=now%207-d&q=bitcoin')
    await asyncio.sleep(2)
    content_text = await page.content()
    # print(content_text)
    res = re.findall(r'<table>.*</table>?', content_text, flags=0)[0]
    # print(res)
    tree = etree.HTML(res)
    values = tree.xpath('//table/tbody/tr')
    for item in values:
        timeformat = item.xpath('./td[1]/text()')[0].replace('\u202a','').replace('\u202c','')
        # print(timeformat)
        timeArray = time.strptime(str(time.localtime().tm_year) + ' ' + timeformat, "%Y %b %d at %H:%M %p")
        timestamp = int(time.mktime(timeArray))
        print(timestamp) # 时间戳
        score = item.xpath('./td[2]/text()')[0]
        print(score) # 分数

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

获得的是每相隔一小时的结果:

1551060000
68
1551063600
67
1551067200
66
1551027600
73
1551031200
72
1551034800
72
1551038400
68

GitHub上的pytrends项目(https://github.com/GeneralMills/pytrends)也可以用来抓取,但是获取分数的请求url年久失修,不能获取到数据,其它比如获取相关词是好的。

作者:SeanCheney