I had been learning Python web scraping on and off from the web and from books, but anti-scraping techniques keep getting more sophisticated, so I decided to take the course 「超新手也能用 Python 爬蟲打造貨比千家的比價網站」 to see whether it could clear up my questions. This note uses Selenium to scrape an e-commerce site.
Course Information
[Link]: https://hiskio.com/courses/527/lectures/27147
Scope of this post: Chapter 3 (scraping dynamic websites whose data is generated by front-end JavaScript)
Note: this series is my own digestion of the corresponding course content. In other words, it does not necessarily cover everything in the course, and it may bring in other resources for explanation.
Notes
1. WebDriverWait speeds up the automated scraper: instead of sleeping for a fixed interval, the script continues as soon as the target element appears (see the first sketch after this list)
2. Locating elements with soup.select and CSS selectors, compared with find_all, avoids some parsing mistakes (see the second sketch after this list)
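To make note 1 concrete, here is a minimal sketch, not taken from the course; the URL and class name are placeholder assumptions. A fixed time.sleep always blocks for the full duration, while WebDriverWait returns as soon as the condition is met:

# Minimal sketch for note 1 (example.com and "some-class" are assumed placeholders).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get("https://example.com")

# Fixed wait: always blocks for the full 8 seconds, even if the page is ready in 1.
time.sleep(8)

# WebDriverWait: polls until the element is visible and returns
# immediately once it is, waiting 8 seconds only in the worst case.
WebDriverWait(browser, 8).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "some-class"))
)
browser.quit()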
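For note 2, a small sketch with made-up HTML shows how one CSS selector passed to soup.select expresses the same path as a nested find_all chain in a single call:

# Minimal sketch for note 2 (the HTML snippet is invented for illustration).
from bs4 import BeautifulSoup

html = '<div class="item"><a href="/p/1">A</a></div><div class="item"><a href="/p/2">B</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all: two nested steps, easy to get wrong as the structure grows.
links_find_all = [a for div in soup.find_all("div", class_="item")
                  for a in div.find_all("a")]

# select: one CSS selector expresses the same path in a single call.
links_select = soup.select("div.item a")

assert [a["href"] for a in links_find_all] == [a["href"] for a in links_select]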
# Prerequisites
# 1. Download the latest version of Python 3.
# 2. Create a virtual environment:
#    python -m venv <folder_name>
# 3. Activate the virtual environment (Windows Git Bash):
#    source <folder_name>/Scripts/activate

import time

import selenium
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Print the Selenium version.
print(selenium.__version__)

# Start Chrome with a desktop user agent so the request looks less like a bot.
# (Selenium 3 style: executable_path points at the chromedriver binary;
# the deprecated chrome_options= keyword is replaced by options=.)
options = webdriver.ChromeOptions()
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"')
browser = webdriver.Chrome(executable_path='./chromedriver', options=options)

# Load the Shopee Mall search results for "iphone 13".
browser.get("https://shopee.tw/mall/search?keyword=iphone%2013")

# Scroll down step by step so the lazy-loaded items get rendered.
for y in range(0, 10000, 500):
    browser.execute_script(f"window.scrollTo(0, {y})")
    time.sleep(0.5)

source_raw = browser.page_source

# Parse the raw HTML string into a BeautifulSoup tree.
soup = BeautifulSoup(source_raw, "html.parser")

products = []

# Visit every product link found on the results page.
for item in soup.select('.shopee-search-item-result__item a'):
    link = f"https://shopee.tw{item['href']}"
    browser.get(link)
    # Wait up to 8 seconds for the product-name element to become visible,
    # instead of sleeping for a fixed interval.
    WebDriverWait(browser, 8).until(
        EC.visibility_of_element_located((By.CLASS_NAME, 'attM6y'))
    )
    detail_soup = BeautifulSoup(browser.page_source, "html.parser")
    product = {
        'url': link,
        'name': detail_soup.select('.product-briefing .attM6y span')[0].text,
        'price': detail_soup.select('.product-briefing .Ybrg9j')[0].text,
    }
    products.append(product)

print('all products on page 1:', products)

# Quit the browser.
browser.quit()
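The driver setup above uses the Selenium 3 call signature. In Selenium 4 the executable_path and chrome_options keywords were deprecated and later removed, so if print(selenium.__version__) shows 4.x, the equivalent setup would look like the following sketch (assuming the same ./chromedriver binary next to the script):

# Selenium 4 style driver setup (a sketch; same driver path and
# user agent as in the course code above).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"')
browser = webdriver.Chrome(service=Service('./chromedriver'), options=options)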