[Notes] Even Complete Beginners Can Build a Price-Comparison Website with a Python Crawler – Saving to Google Spreadsheet @地瓜大的飛翔旅程

Section links

Course information
Key points
Code
Series articles

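I have been learning Python web scraping on and off from online resources and books, but anti-scraping techniques keep getting more sophisticated. I took the course "超新手也能用 Python 爬蟲打造貨比千家的比價網站" (Even Complete Beginners Can Build a Price-Comparison Website with a Python Crawler) to see whether it could clear up my questions. This note covers how to save scraped data to a Google Spreadsheet.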

Scope of this post: Chapter 4 (What to do when data is dirty and messy? Data cleaning and data consolidation)

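Please note: this series is my own summary of the course after digesting it. In other words, it does not necessarily cover all of the course content, and it may draw on other resources for explanation.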

Key points

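1. Some preparation is needed first: go to the Google API Developer Console and apply for the relevant APIs
2. Put the key file in the same directory as the program; the key file contains an email address
3. Create a new Google Spreadsheet and add that email under its sharing permissions
4. Install the pygsheets package and bring it in with import pygsheets
5. Name the newly created Google Spreadsheet sheet-sample, and name one of its worksheets sample
6. The spreadsheet's id is the string between /d/ and /edit in its URL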
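
Steps 2 and 6 can be double-checked with a few lines of standard-library Python. This is only a rough sketch: the helper names `read_service_account_email` and `extract_sheet_id` are made up for this note, and it assumes the key file is a service-account JSON key with a `client_email` field and that the spreadsheet URL has the usual /d/.../edit shape.

```python
import json
import re


def read_service_account_email(key_path):
    """Return the client_email stored in a service-account JSON key file."""
    # Assumption: the key file is JSON and has a top-level "client_email" field
    with open(key_path, encoding='utf-8') as f:
        return json.load(f)['client_email']


def extract_sheet_id(url):
    """Return the part of a Google Spreadsheet URL between /d/ and the next slash."""
    match = re.search(r'/d/([^/]+)', url)
    if match is None:
        raise ValueError(f'no spreadsheet id found in {url!r}')
    return match.group(1)
```

The email returned by `read_service_account_email` is the address to add in the sharing settings in step 3, and the value returned by `extract_sheet_id` is what the code uses as SHEET_ID.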

Code

import codecs
import sys
from datetime import date

import pandas as pd
import pygsheets
import requests
from bs4 import BeautifulSoup

# Force UTF-8 output so Chinese product names print correctly on the console
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.detach())

# Send a browser-like user agent with every request
headers = { 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36' }

totalPage = 6
links = []
products = []

for currentPage in range(1, totalPage+1):
    # Search result page for the keyword "iphone13" on momo's mobile site
    url = f'https://m.momoshop.com.tw/search.momo?_advFirst=N&_advCp=N&curPage={currentPage}&searchType=&cateLevel=-1&ent=k&searchKeyword=iphone13&_advThreeHours=N&_isFuzzy=0&_imgSH=fourCardType'
    rawRes = requests.get(url, headers=headers)
    soup = BeautifulSoup(rawRes.text, 'html.parser')

    for item in soup.find_all('li', class_='goodsItemLi'):
        # Keep the query string of the search-result link and attach it to the desktop product page
        link = 'https://www.momoshop.com.tw/goods/GoodsDetail.jsp?'+item.a['href'].split('?',1)[1]
        productRawRes = requests.get(link, headers=headers)
        productSoup = BeautifulSoup(productRawRes.text, 'html.parser')

        # Pull the product name and price from the product page
        product = {}
        product['name'] = productSoup.find_all('p', class_="fprdTitle")[0].text
        product['price'] = productSoup.select('.special span')[0].text

        products.append(product)
        links.append(link)

# Put the results into a DataFrame and tag each row with its source and scrape date
df1 = pd.DataFrame(products)
df1['Source'] = 'Momo'
df1['created_time'] = date.today()
print(df1)

# Alternative outputs: save to CSV or Excel instead
# df1.to_csv('./result/sample.csv', encoding='utf_8_sig')
# df1.to_excel('./result/sample.xls')

# Save the data to Google Sheets
KEY_FILE_LOCATION = 'ip-address-and-s-1559338183206-42b9f0f4c7a9.json'
SHEET_ID = '1ehFf6ziXHdxZ7czjgfbnvd_WOxsPtX6BoAZqTVuyPGc'

# Authorize with the key file, open the spreadsheet by id, and pick the "sample" worksheet
gc = pygsheets.authorize(service_file=KEY_FILE_LOCATION)
sht = gc.open_by_key(SHEET_ID)
wks = sht.worksheet_by_title('sample')

# Clear the old contents, then write the DataFrame (with its header row) starting at A1
wks.clear()
wks.set_dataframe(df1, 'A1', copy_head=True)
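
Since `product['price']` is taken straight from the page text, the price column holds strings rather than numbers. In the spirit of this chapter's data-cleaning theme, a minimal sketch of converting it might look like the following. The helper name `parse_price` is hypothetical, and it assumes prices are whole numbers that may carry thousands separators or currency symbols (the exact text momo returns was not checked here).

```python
import re


def parse_price(text):
    """Convert a scraped price string to an int, or None if it has no digits."""
    # Assumption: prices are whole numbers, so everything except digits can be dropped
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None


print(parse_price('27,900'))
print(parse_price(' $1,299 '))
print(parse_price('sold out'))
```

It could be applied before uploading, for example with `df1['price'] = df1['price'].map(parse_price)`.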
