Parsing JSON with jsonpath

Web crawling

2019-11-1

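Using jsonpath can greatly reduce the amount of code needed to pull values out of nested JSON.

To let people write paths into JSON the same way XPath is written for XML, Stefan Goessner created jsonpath (https://goessner.net/articles/JsonPath/).

jsonpath also has a Python implementation, jsonpath-rw (https://github.com/kennknowles/python-jsonpath-rw).

Installation: pip install jsonpath-rw

Basic usage: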

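from jsonpath_rw import jsonpath, parse

# Compile the path once; the parsed expression can be reused on any data.
jsonpath_expr = parse('foo[*].baz')
print(jsonpath_expr)

# The value of every match: [1, 2]
print([match.value for match in jsonpath_expr.find({'foo': [{'baz': 1}, {'baz': 2}]})])

# The full path of every match, rendered as a string
print([str(match.full_path) for match in jsonpath_expr.find({'foo': [{'baz': 1}, {'baz': 2}]})])

# With auto_id_field set, 'id' becomes a virtual field derived from each object's path
jsonpath.auto_id_field = 'id'
print([match.value for match in parse('foo[*].id').find({'foo': [{'id': 'bizzle'}, {'baz': 3}]})])

# The `parent` operator steps back up from a match to the object that contains it
print([match.value for match in parse('a.*.b.`parent`.c').find({'a': {'x': {'b': 1, 'c': 'number one'}, 'y': {'b': 2, 'c': 'number two'}}})])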

Another example, collecting every author field:

from jsonpath_rw import parse

# Named `data` rather than `dict` so the built-in dict type is not shadowed.
data = { "store": {
    "book": [
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}

# '$..author' means: starting at the root, match 'author' at any depth.
jsonpath_expr = parse('$..author')

res = jsonpath_expr.find(data)

print([match.value for match in res])
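
To make it clearer what the recursive-descent expression `$..author` saves you from writing, here is a rough plain-Python sketch of the same lookup. The helper name `find_all` and the trimmed-down sample are made up for illustration; this is not how jsonpath-rw is implemented, only the kind of hand-written traversal it replaces.

```python
def find_all(node, key):
    """Return every value stored under `key` at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        # Check this object first, then descend into its values.
        if key in node:
            found.append(node[key])
        for value in node.values():
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    # Scalars (str, int, float, None) contain nothing to search.
    return found


# Trimmed-down, illustrative version of the store document.
store = {
    "store": {
        "book": [
            {"author": "Nigel Rees", "price": 8.95},
            {"author": "Evelyn Waugh", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

authors = find_all(store, "author")
prices = find_all(store, "price")
print(authors)
print(prices)
```

With jsonpath the whole helper collapses into the single path string, and switching to another field only means changing that string.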

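Parsing 36Kr's newsflash API: you only need to know the names of the fields you ultimately want, rather than spelling out the full chain of dictionary lookups, which saves a good deal of work: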

import requests
import json
from jsonpath_rw import parse


header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

url = 'https://36kr.com/api/newsflash?&per_page=20'
response = requests.get(url, headers=header, timeout=5)

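# Named `data` rather than `dict` so the built-in dict type is not shadowed.
data = json.loads(response.text)

# Recursive descent ($..) combined with a field union: match any of the
# three named fields at any depth of the response.
jsonpath_expr = parse('$..title, description, published_at')

res = jsonpath_expr.find(data)

print([match.value for match in res])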
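
One thing to keep in mind: the union expression above returns a single flat list of values, so nothing in the printed result says which value belongs to which news item. If the fields need to stay grouped per item, one option is a small plain-Python walk like the sketch below. The helper `collect_records` and the sample payload are hypothetical; the real layout of the 36Kr response is not shown in this article, so the nesting here is only an assumption for the sake of a runnable example.

```python
def collect_records(node, fields):
    """Walk nested dicts/lists; return one dict per object that has any wanted field."""
    records = []
    if isinstance(node, dict):
        # Keep only the wanted fields that this object actually contains.
        record = {name: node[name] for name in fields if name in node}
        if record:
            records.append(record)
        for value in node.values():
            records.extend(collect_records(value, fields))
    elif isinstance(node, list):
        for item in node:
            records.extend(collect_records(item, fields))
    return records


# Hypothetical payload; the field names come from the path expression above,
# the surrounding structure is assumed.
sample = {
    "data": {
        "items": [
            {"id": 1, "title": "First flash", "description": "Short text A",
             "published_at": "2019-11-01 09:00:00"},
            {"id": 2, "title": "Second flash", "description": "Short text B",
             "published_at": "2019-11-01 09:05:00"},
        ]
    }
}

records = collect_records(sample, ("title", "description", "published_at"))
for record in records:
    print(record)
```

Each record keeps a title together with its own description and timestamp, at the cost of giving up the one-line path expression.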

For the full jsonpath syntax, see:
https://github.com/kennknowles/python-jsonpath-rw

Author: SeanCheney