[Notes] Python Web Scraping: Advanced Use of BeautifulSoup @地瓜大的飛翔旅程

Sections

Course name
Related articles
Code

These are my notes on the Hahow course Python 網頁爬蟲入門實戰, following its book edition (Python：網路爬蟲與資料分析入門實戰). This post covers Chapter 2.

Course name

Python 網頁爬蟲入門實戰: https://bit.ly/2U6wElg
For scraping beginners, this is a pretty good companion resource. If you need more, you can read it alongside the book 「Python：網路爬蟲與資料分析入門實戰」.

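Related articles

[Notes] Python Scraping in Practice: PTT Beauty Board and Image Downloads

[Notes] Python Scraping: PTT Gossiping Board

[Notes] Python Scraping: A First Look at BeautifulSoup

Code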

import requests
from bs4 import BeautifulSoup

res = requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/blog/blog.html')
soup = BeautifulSoup(res.text, 'html5lib')
# Print the text inside each card on the page
cards = soup.select('.col-md-4')
for card in cards:
    # strip() removes the leading and trailing whitespace around each piece of text
    print(card.h6.text.strip(), card.h4.a.text.strip(), card.p.text.strip())
    # .stripped_strings is a generator that yields every text string under the card,
    # already stripped, so it has to be iterated (here via list()) to see the values
    print(list(card.stripped_strings))
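
To see the difference between strip() and .stripped_strings without hitting the network, here is a small sketch on an inline snippet. The HTML below is made up for illustration (it only loosely imitates a card from the course page), and I use the built-in 'html.parser' instead of 'html5lib' so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup, not copied from the course page.
html = """
<div class="col-md-4">
  <h6>  2017-12-01  </h6>
  <h4><a href="#">  Sample post title </a></h4>
  <p>
    A short summary.
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="col-md-4")

# .text keeps the surrounding whitespace; strip() removes it.
raw = card.h6.text
clean = card.h6.text.strip()

# .stripped_strings yields every non-empty string under the tag, already stripped.
parts = list(card.stripped_strings)

print(repr(raw))
print(repr(clean))
print(parts)
```

With .stripped_strings you get all the text in one pass, but you lose track of which tag each string came from, so the per-tag version is still handy when the fields need to stay separate.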

import requests
from bs4 import BeautifulSoup

# Compute the average course price
res = requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/table/table.html')
soup = BeautifulSoup(res.text, 'html5lib')
prices = []
rows = soup.select('.table tbody tr')
for row in rows:
    # Take the text of the third <td> in each row as the price
    price = row.find_all('td')[2].text
    # int() converts the string to a number; append it to the prices list
    prices.append(int(price))

average_price = sum(prices) / len(prices)
# '%.3f' % value formats the number rounded to three decimal places
print('Average price of all courses:', '%.3f' % average_price)
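
The conversion and formatting steps can be tried on their own. This is a minimal sketch with made-up price strings (not the real values from the table), showing that int() tolerates surrounding whitespace and that the '%' style and an f-string give the same result.

```python
# Hypothetical price strings, as they might come out of the table cells' .text
price_texts = ["1490", " 1890 ", "2501"]

# int() ignores leading and trailing whitespace in the string
prices = [int(text) for text in price_texts]
average_price = sum(prices) / len(prices)

# Two equivalent ways to format with three decimal places
formatted_percent = "%.3f" % average_price
formatted_fstring = f"{average_price:.3f}"

print(formatted_percent)
print(formatted_fstring)
```

Note that int() would raise a ValueError on text such as "1,490" or "NT$1490", so a real table may need some cleanup before the conversion.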

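import re

import requests
from bs4 import BeautifulSoup

res = requests.get('http://blog.castman.net/py-scraping-analysis-book/ch2/blog/blog.html')
soup = BeautifulSoup(res.text, 'html5lib')

# Use a regex to find every heading tag (h1 to h6) and print its text
#for title in soup.find_all(re.compile(r'^h[1-6]$')):
#    print(title.text.strip())

# Use a regex to find every image whose src ends in .png
# For working out the pattern, an online tool such as Regex Online can help
# (a raw string keeps '\.' from being read as a string escape)
imgs = soup.find_all('img', {'src': re.compile(r'\.png$')})
for link in imgs:
    print(link)
    # .get() returns None instead of raising KeyError when there is no class attribute
    print(link.get('class'))
    print(link['src'])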
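
The two patterns can also be checked with the re module alone, which is a quick way to confirm what they will and will not match before handing them to find_all. The src values and tag names below are invented for the test.

```python
import re

# Matches strings that end in ".png" (the dot is escaped so it is literal)
png_pattern = re.compile(r"\.png$")

# Hypothetical src values
srcs = ["images/logo.png", "images/photo.jpg", "images/png-guide.html", "icons/star.png"]
png_srcs = [src for src in srcs if png_pattern.search(src)]
print(png_srcs)

# Matches exactly the tag names h1 through h6
heading_pattern = re.compile(r"^h[1-6]$")

# Hypothetical tag names
tag_names = ["h1", "h4", "h6", "h7", "header", "th"]
headings = [name for name in tag_names if heading_pattern.search(name)]
print(headings)
```

I use search() here because, as far as I understand the BeautifulSoup documentation, that is how it applies a compiled regex when filtering, which is why the anchors ^ and $ matter.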