I had been learning Python web scraping on and off from the web and from books, but anti-scraping techniques keep getting more sophisticated, so I decided to take the course 「超新手也能用 Python 爬蟲打造貨比千家的比價網站」 to see whether it could clear up my questions. This note uses Selenium to scrape an e-commerce site.
Course Information
[Link]: https://hiskio.com/courses/527/lectures/27147
Scope of this post: Chapter 3 (scraping dynamic websites whose data is generated by front-end JavaScript)
Note: this series is my own digestion of the corresponding course content. In other words, it does not necessarily cover everything in the course, and it may bring in other resources for explanation.
Notes
1. WebDriverWait speeds up the automated scraper: instead of sleeping for a fixed interval, the script continues as soon as the target element appears (see the first sketch after this list)
2. Locating elements with soup.select and CSS selectors, compared with find_all, avoids some parsing mistakes (see the second sketch after this list)
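To make note 1 concrete, here is a minimal sketch, not taken from the course; the URL and class name are placeholder assumptions. A fixed time.sleep always blocks for the full duration, while WebDriverWait returns as soon as the condition is met:

# Minimal sketch for note 1 (example.com and "some-class" are assumed placeholders).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get("https://example.com")

# Fixed wait: always blocks for the full 8 seconds, even if the page is ready in 1.
time.sleep(8)

# WebDriverWait: polls until the element is visible and returns
# immediately once it is, waiting 8 seconds only in the worst case.
WebDriverWait(browser, 8).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "some-class"))
)
browser.quit()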
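For note 2, a small sketch with made-up HTML shows how one CSS selector passed to soup.select expresses the same path as a nested find_all chain in a single call:

# Minimal sketch for note 2 (the HTML snippet is invented for illustration).
from bs4 import BeautifulSoup

html = '<div class="item"><a href="/p/1">A</a></div><div class="item"><a href="/p/2">B</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all: two nested steps, easy to get wrong as the structure grows.
links_find_all = [a for div in soup.find_all("div", class_="item")
                  for a in div.find_all("a")]

# select: one CSS selector expresses the same path in a single call.
links_select = soup.select("div.item a")

assert [a["href"] for a in links_find_all] == [a["href"] for a in links_select]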
# Prerequisites
# 1. Download the latest version of Python 3.
# 2. Create a virtual environment:
#    python -m venv <folder_name>
# 3. Activate the virtual environment (Windows Git Bash):
#    source <folder_name>/Scripts/activate

import time

import selenium
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Print the Selenium version.
print(selenium.__version__)

# Start Chrome with a desktop user agent so the request looks less like a bot.
# (Selenium 3 style: executable_path points at the chromedriver binary;
# the deprecated chrome_options= keyword is replaced by options=.)
options = webdriver.ChromeOptions()
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"')
browser = webdriver.Chrome(executable_path='./chromedriver', options=options)

# Load the Shopee Mall search results for "iphone 13".
browser.get("https://shopee.tw/mall/search?keyword=iphone%2013")

# Scroll down step by step so the lazy-loaded items get rendered.
for y in range(0, 10000, 500):
    browser.execute_script(f"window.scrollTo(0, {y})")
    time.sleep(0.5)

source_raw = browser.page_source

# Parse the raw HTML string into a BeautifulSoup tree.
soup = BeautifulSoup(source_raw, "html.parser")

products = []

# Visit every product link found on the results page.
for item in soup.select('.shopee-search-item-result__item a'):
    link = f"https://shopee.tw{item['href']}"
    browser.get(link)
    # Wait up to 8 seconds for the product-name element to become visible,
    # instead of sleeping for a fixed interval.
    WebDriverWait(browser, 8).until(
        EC.visibility_of_element_located((By.CLASS_NAME, 'attM6y'))
    )
    detail_soup = BeautifulSoup(browser.page_source, "html.parser")
    product = {
        'url': link,
        'name': detail_soup.select('.product-briefing .attM6y span')[0].text,
        'price': detail_soup.select('.product-briefing .Ybrg9j')[0].text,
    }
    products.append(product)

print('all products on page 1:', products)

# Quit the browser.
browser.quit()
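The driver setup above uses the Selenium 3 call signature. In Selenium 4 the executable_path and chrome_options keywords were deprecated and later removed, so if print(selenium.__version__) shows 4.x, the equivalent setup would look like the following sketch (assuming the same ./chromedriver binary next to the script):

# Selenium 4 style driver setup (a sketch; same driver path and
# user agent as in the course code above).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"')
browser = webdriver.Chrome(service=Service('./chromedriver'), options=options)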