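Scraping Douban Movie Top250 Information

This case study, scraping the Douban Movie Top250 list, is an exercise in two ways of parsing information other than regular expressions: XPath and PyQuery.

URL to scrape: https://movie.douban.com/top250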

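Analysis:

URL analysis: each page holds 25 entries, and there are 10 pages in total. The start parameter is the offset of the first entry on the page.

Page 1: https://movie.douban.com/top250?start=0
Page 2: https://movie.douban.com/top250?start=25
Page 3: https://movie.douban.com/top250?start=50
...
Result: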
for i in range(10):
    url = "https://movie.douban.com/top250?start="+str(i*25)

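Page source analysis: the information for each movie is placed inside a <div class="item"> ... </div> element, which is the node that both parsers below select.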
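To make that structure concrete, here is a minimal sketch of the XPath approach run against a made-up fragment. The SAMPLE_HTML string, its values, and the parse_items helper are assumptions for illustration only; the real Douban markup contains more elements and attributes, so the expressions may need adjusting against the live page.

```python
from lxml import etree

# Hypothetical, heavily simplified stand-in for one entry of the Top250 page.
SAMPLE_HTML = """
<ol>
  <li>
    <div class="item">
      <div class="pic">
        <em class="">1</em>
        <img width="100" alt="Sample Movie" src="https://example.com/poster.jpg">
      </div>
      <div class="info">
        <div class="hd"><span class="title">Sample Movie</span></div>
        <div class="bd">
          <p class="">Director: Jane Doe</p>
          <div class="star"><span class="rating_num">9.0</span></div>
        </div>
      </div>
    </div>
  </li>
</ol>
"""

def parse_items(content):
    """Yield one dict per <div class="item"> block found in the HTML."""
    root = etree.HTML(content)
    for item in root.xpath('//div[@class="item"]'):
        # Each xpath() call returns a list, so take the first match.
        yield {
            'index': item.xpath('.//em/text()')[0],
            'image': item.xpath('.//img/@src')[0],
            'title': item.xpath('.//span[@class="title"]/text()')[0],
            'score': item.xpath('.//span[@class="rating_num"]/text()')[0],
        }

movies = list(parse_items(SAMPLE_HTML))
print(movies)
```

Because every lookup starts with ".//", it is evaluated relative to the current item node, so fields from different movies cannot get mixed up.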

The full implementation is as follows:

from requests.exceptions import RequestException
from lxml import etree
from pyquery import PyQuery as pq
import requests
import re,time,json

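def getPage(url):
    '''Fetch the page at the given url; return its HTML text, or None on failure'''
    try:
        # Define the request headers (the value must not repeat the "User-Agent:" name)
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
        }
        # Send the request; the timeout keeps a stalled connection from hanging the script
        res = requests.get(url,headers=headers,timeout=10)
        # Check the response status and return the page content
        if res.status_code == 200:
            return res.text
        else:
            return None
    except RequestException:
        return None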

def parsePage(content):
    '''Parse the fetched page content and yield one dict of fields per movie'''
    # =========Parse with pyquery==================
    # Parse the HTML document
    doc = pq(content)
    # Select every <div class="item"> node, one per movie
    items = doc("div.item")
    # Iterate over the nodes, pack the fields into a dict and yield it
    for item in items.items():
        yield {
            'index':item.find("div.pic em").text(),
            'image':item.find("div.pic img").attr('src'),
            'title':item.find("div.hd span.title").text(),
            'actor':item.find("div.bd p:eq(0)").text(),
            'score':item.find("div.bd div.star span.rating_num").text(),
        }

    '''
    # =======Parse with xpath (alternative to the pyquery block above)=======
    # Parse the HTML document and get the root node object
    html = etree.HTML(content)
    # Select every <div class="item"> node, one per movie
    items = html.xpath('//div[@class="item"]')
    # Iterate over the nodes, pack the fields into a dict and yield it
    for item in items:
        yield {
            'index':item.xpath('.//div/em[@class=""]/text()')[0],
            'image':item.xpath('.//img[@width="100"]/@src')[0],
            'title':item.xpath('.//span[@class="title"]/text()')[0],
            'actor':item.xpath('.//p[@class=""]/text()')[0],
            'score':item.xpath('.//span[@class="rating_num"]/text()')[0],
            #'time':item[4].strip()[5:],
        }
    '''

def writeFile(content):
    '''Append one record to the result file as a line of JSON'''
    with open("./result.txt",'a',encoding='utf-8') as f:
        # json.dumps escapes non-ASCII characters by default; pass
        # ensure_ascii=False to write the Chinese text as-is
        f.write(json.dumps(content,ensure_ascii=False) + "\n")

def main(offset):
    ''' Main function: drives the fetch, parse and write steps for one page '''
    url = 'https://movie.douban.com/top250?start=' + str(offset)
    html = getPage(url)
    # If the page was fetched, parse it and write out each record
    if html:
        for item in parsePage(html):
            writeFile(item)

# When run as the main program, call main once per page, pausing between requests
if __name__ == '__main__':
    for i in range(10):
        main(offset=i*25)
        time.sleep(1)
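The note on ensure_ascii in writeFile can be checked in isolation. The sketch below compares the two json.dumps outputs and then reads a one-record-per-line file back in. The record values and the read_results helper are hypothetical and not part of the original script, and a temporary file is used so that a real result.txt is not touched.

```python
import json
import os
import tempfile

# Hypothetical record, shaped like the dictionaries produced by parsePage.
record = {'index': '1', 'title': '示例电影', 'score': '9.0'}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is

def read_results(path):
    """Load a file with one JSON object per line back into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Round trip through a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'result.txt')
    with open(path, 'a', encoding='utf-8') as f:
        f.write(readable + "\n")
    loaded = read_results(path)

print(escaped)    # non-ASCII characters appear as \uXXXX escapes
print(readable)   # non-ASCII characters appear unchanged
print(loaded == [record])
```

Both forms decode to the same data with json.loads; ensure_ascii=False only makes the file readable to a person opening it in an editor.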

Reviewing editor: 符乾江
