Douban Crawlers in Practice - Python Edition
Douban login, no-CAPTCHA version:
import requests

# starturl = "https://www.douban.com/accounts/login"
loginurl = "https://accounts.douban.com/login"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
# Form fields expected by the Douban login endpoint; replace the placeholders
# with your own account and password
formdata = {
    'source': 'None',
    'redir': 'https://shanghai.douban.com/',
    'form_email': 'yourAccount',
    'form_password': 'password',
    'login': '登录',  # the submit button's value; the server expects this literal text
}
s = requests.Session()
s.headers.update(headers)
resp = s.post(loginurl, data=formdata)
# Save the returned page so you can inspect whether the login succeeded
with open('douban.html', 'wb') as f:
    f.write(resp.text.encode('utf-8'))
print(resp.status_code)
print(resp.cookies)
s.close()
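Once a login succeeds, the session's cookies are what keep you logged in, so it is worth persisting them to disk and reusing them on later runs instead of repeating the login POST. Below is a minimal sketch of that idea; the cookie name `dbcl2` and the file path are illustrative assumptions, not part of the original article.

```python
import pickle
import requests

def save_cookies(session, path):
    # Persist the session's cookie jar so later runs can skip the login POST
    with open(path, 'wb') as f:
        pickle.dump(session.cookies, f)

def load_cookies(session, path):
    # Merge previously saved cookies into a fresh session
    with open(path, 'rb') as f:
        session.cookies.update(pickle.load(f))

# Simulate a logged-in session with a hypothetical auth cookie
s = requests.Session()
s.cookies.set('dbcl2', 'example-token', domain='.douban.com')
save_cookies(s, 'douban_cookies.pkl')

# A new session restored from disk carries the same cookie
s2 = requests.Session()
load_cookies(s2, 'douban_cookies.pkl')
print(s2.cookies.get('dbcl2'))  # example-token
```

In a real run you would call `save_cookies(s, ...)` right after the login POST above, and `load_cookies` at the start of subsequent scripts.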
Douban Top 250 movie crawler
import requests
from bs4 import BeautifulSoup

def getContent(item):
    # Extract link, titles, rating, and one-line quote from a <div class="info"> block
    content = []
    content.append(item.find('a')['href'])
    film = item.find_all('span', {'class': 'title'})
    film[0] = film[0].string
    if len(film) > 1:
        # Strip the non-breaking space and slash that separate the two titles
        film[1] = film[1].string.replace(u'\xa0', '').replace(r'/', '')
    else:
        film.append('no foreign-language title')
    content.append(film)
    content.append(item.find('span', {'class': 'rating_num'}).string)
    content.append(item.find('span', {'class': '', 'property': ''}).string)
    return content

starturl = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
params = {'start': 0}
s = requests.Session()
s.headers.update(headers)
curpage = 0
with open('doubanfilm.txt', 'w', encoding='utf-8') as f:
    while curpage < 250:
        # Douban pages the list 25 films at a time via the `start` query parameter
        params['start'] = curpage
        resp = s.get(starturl, params=params)
        bs = BeautifulSoup(resp.text, 'html.parser')
        for item in bs.find_all('div', {'class': 'info'}):
            f.write(str(getContent(item)) + '\n')
        curpage += 25
print('done')
s.close()
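The extraction logic above can be exercised offline against a fixed HTML fragment, which makes it easy to check the title-cleaning rules without hitting Douban. The snippet below is a hypothetical, simplified entry modeled on the Top 250 markup (the `inq` class for the quote span is an assumption about the live page, not taken from the original code):

```python
from bs4 import BeautifulSoup

# Hypothetical minimal markup mirroring one Top 250 entry
html = '''
<div class="info">
  <a href="https://movie.douban.com/subject/1292052/"></a>
  <span class="title">肖申克的救赎</span>
  <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
  <span class="rating_num">9.7</span>
  <span class="inq">希望让人自由。</span>
</div>
'''

def parse_info(item):
    # Same cleaning rules as getContent, returned as a dict for readability
    titles = [t.string for t in item.find_all('span', {'class': 'title'})]
    foreign = None
    if len(titles) > 1:
        # Drop the non-breaking space and slash that prefix the foreign title
        foreign = titles[1].replace('\xa0', '').replace('/', '')
    return {
        'link': item.find('a')['href'],
        'title': titles[0],
        'foreign_title': foreign,
        'rating': item.find('span', {'class': 'rating_num'}).string,
    }

item = BeautifulSoup(html, 'html.parser').find('div', {'class': 'info'})
row = parse_info(item)
print(row['foreign_title'])  # The Shawshank Redemption
```

Returning a dict rather than a positional list also makes the saved file easier to consume later, e.g. with `json.dumps(row, ensure_ascii=False)` per line.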
Python, life is brighter because of you!
Original article: https://www.jianshu.com/p/b301757e799c