第14章 爬蟲與自動化¶
網絡爬蟲是現代數據獲取和自動化處理的重要技術手段,通過模擬瀏覽器行爲自動訪問網頁並提取所需信息。本章將從基礎概念開始,逐步深入到高級爬蟲框架和自動化技術,幫助讀者掌握完整的爬蟲開發技能。
14.1 網絡爬蟲基礎¶
爬蟲概述¶
網絡爬蟲的定義和用途¶
網絡爬蟲(Web Crawler),也稱爲網頁蜘蛛(Web Spider)或網絡機器人(Web Robot),是一種按照一定規則自動瀏覽萬維網並獲取信息的程序。爬蟲的主要用途包括:
- 數據採集:從網站獲取商品信息、新聞資訊、股票價格等
- 搜索引擎:爲搜索引擎建立索引數據庫
- 市場分析:收集競爭對手信息,進行市場調研
- 內容監控:監控網站內容變化,及時獲取更新
- 學術研究:收集研究數據,進行數據分析
爬蟲的工作原理¶
網絡爬蟲的基本工作流程如下:
- 發送HTTP請求:向目標網站發送請求
- 接收響應數據:獲取服務器返回的HTML頁面
- 解析頁面內容:提取所需的數據信息
- 存儲數據:將提取的數據保存到文件或數據庫
- 發現新鏈接:從當前頁面中發現新的URL
- 重複過程:對新發現的URL重複上述過程
讓我們通過一個簡單的示例來理解爬蟲的基本原理:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin
def simple_crawler(url):
"""
簡單的網頁爬蟲示例
"""
try:
# 1. 發送HTTP請求
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
# 2. 檢查響應狀態
if response.status_code == 200:
# 3. 解析頁面內容
soup = BeautifulSoup(response.text, 'html.parser')
# 4. 提取標題
title = soup.find('title')
if title:
print(f"頁面標題: {title.get_text().strip()}")
# 5. 提取所有鏈接
links = soup.find_all('a', href=True)
print(f"找到 {len(links)} 個鏈接:")
for i, link in enumerate(links[:5]): # 只顯示前5個鏈接
href = "https://yeyupiaoling.cn" + link['href']
text = link.get_text().strip()
print(f"{i+1}. {text} -> {href}")
else:
print(f"請求失敗,狀態碼: {response.status_code}")
except Exception as e:
print(f"爬取過程中出現錯誤: {e}")
# 使用示例
if __name__ == "__main__":
url = "https://yeyupiaoling.cn"
simple_crawler(url)
運行上述代碼,輸出類似如下:
頁面標題: 夜雨飄零的博客 - 首頁
找到 50 個鏈接:
1. -> https://yeyupiaoling.cn/
2. 夜雨飄零 -> https://yeyupiaoling.cn/
3. 首頁 -> https://yeyupiaoling.cn/
4. 歸檔 -> https://yeyupiaoling.cn/archive
5. 標籤 -> https://yeyupiaoling.cn/tag
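上面的示例只抓取了單個頁面。要體現基本流程中第5步(發現新鏈接)和第6步(重複過程),還需要維護一個待抓取隊列和已訪問集合。下面是一個簡要的廣度優先爬取草圖(僅作示意:起始URL、頁數上限和抓取間隔都是示例值,實際使用前請先確認目標網站允許抓取):
import time
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
def bfs_crawler(start_url, max_pages=10, delay=1.0):
    """按廣度優先順序爬取同一站點的頁面(示意)"""
    visited = set()             # 已訪問的URL,避免重複抓取
    queue = deque([start_url])  # 待抓取的URL隊列
    domain = urlparse(start_url).netloc
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException as e:
            print(f"請求失敗: {url} ({e})")
            continue
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.find('title')
        print(f"[{len(visited)}] {url} - {title.get_text().strip() if title else '無標題'}")
        # 發現新鏈接:只保留同一域名下的URL並加入隊列
        for link in soup.find_all('a', href=True):
            new_url = urljoin(url, link['href']).split('#')[0]
            if urlparse(new_url).netloc == domain and new_url not in visited:
                queue.append(new_url)
        time.sleep(delay)  # 控制爬取頻率,避免給服務器造成壓力
if __name__ == "__main__":
    bfs_crawler("https://yeyupiaoling.cn", max_pages=5)
實際項目中還應加入robots.txt檢查、URL去重持久化和異常重試等處理,後文會逐步介紹。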
爬蟲的分類和特點¶
根據不同的分類標準,爬蟲可以分爲以下幾類:
按照爬取範圍分類:
- 通用爬蟲:搜索引擎使用的爬蟲,爬取整個互聯網
- 聚焦爬蟲:針對特定主題或網站的爬蟲
- 增量爬蟲:只爬取新增或更新的內容(實現思路見下文的示意代碼)
按照技術實現分類:
- 靜態爬蟲:只能處理靜態HTML頁面
- 動態爬蟲:能夠處理JavaScript渲染的動態頁面
按照爬取深度分類:
- 淺層爬蟲:只爬取首頁或少數幾層頁面
- 深層爬蟲:能夠深入爬取網站的多層結構
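以增量爬蟲爲例,其核心是記住每個URL上次抓取時的內容指紋,下次只處理新增或發生變化的頁面。下面是一個簡要草圖(假設用本地JSON文件fingerprints.json保存URL到內容MD5的映射,文件名與示例URL僅作演示):
import hashlib
import json
import os
import requests
FINGERPRINT_FILE = 'fingerprints.json'  # 保存URL -> 內容指紋的本地文件(示例路徑)
def load_fingerprints():
    """讀取上一次抓取時保存的指紋記錄"""
    if os.path.exists(FINGERPRINT_FILE):
        with open(FINGERPRINT_FILE, 'r', encoding='utf-8') as f:
            return json.load(f)
    return {}
def save_fingerprints(fingerprints):
    with open(FINGERPRINT_FILE, 'w', encoding='utf-8') as f:
        json.dump(fingerprints, f, ensure_ascii=False, indent=2)
def incremental_fetch(urls):
    """只處理內容發生變化的頁面(示意)"""
    fingerprints = load_fingerprints()
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            continue
        # 用響應內容的MD5作爲指紋;也可以改用響應頭中的Last-Modified或ETag
        digest = hashlib.md5(response.content).hexdigest()
        if fingerprints.get(url) == digest:
            print(f"內容未變化,跳過: {url}")
            continue
        fingerprints[url] = digest
        print(f"新增或更新,需要處理: {url}")
        # 在這裏解析並存儲頁面數據...
    save_fingerprints(fingerprints)
if __name__ == "__main__":
    incremental_fetch(["https://httpbin.org/html"])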
爬蟲的法律和道德考量¶
在進行網絡爬蟲開發時,必須遵守相關的法律法規和道德準則:
- 遵守robots.txt協議:抓取前先檢查網站的robots.txt文件(檢查方法見下面的示例)
- 控制爬取頻率:避免對服務器造成過大壓力
- 尊重版權:不要爬取受版權保護的內容
- 保護隱私:不要爬取個人隱私信息
- 合理使用數據:僅將爬取的數據用於合法目的
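針對上面提到的robots.txt協議和爬取頻率控制,可以藉助標準庫urllib.robotparser在發送請求前做檢查。下面是一個簡要示例(User-Agent名稱和默認間隔均爲示例值):
import time
import urllib.robotparser
from urllib.parse import urlparse
import requests
def polite_fetch(url, user_agent='MySpider/1.0', default_delay=1.0):
    """先檢查robots.txt是否允許抓取,並按建議的間隔等待(示意)"""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # 下載並解析robots.txt
    if not rp.can_fetch(user_agent, url):
        print(f"robots.txt禁止抓取: {url}")
        return None
    # 優先使用robots.txt中聲明的Crawl-delay,否則使用默認間隔
    delay = rp.crawl_delay(user_agent) or default_delay
    response = requests.get(url, headers={'User-Agent': user_agent}, timeout=10)
    print(f"{url} -> {response.status_code},等待{delay}秒後再發送下一個請求")
    time.sleep(delay)
    return response
if __name__ == "__main__":
    polite_fetch("https://httpbin.org/html")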
HTTP協議基礎¶
HTTP請求和響應¶
HTTP(HyperText Transfer Protocol)是網絡爬蟲與Web服務器通信的基礎協議。理解HTTP協議對於開發高效的爬蟲至關重要。
HTTP通信包含兩個主要部分:
- 請求(Request):客戶端向服務器發送的消息
- 響應(Response):服務器返回給客戶端的消息
讓我們通過代碼來觀察HTTP請求和響應的詳細信息:
import requests
import json
def analyze_http_communication(url):
"""
分析HTTP請求和響應的詳細信息
"""
# 創建會話對象
session = requests.Session()
# 設置請求頭
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
}
try:
# 發送請求
response = session.get(url, headers=headers)
print("=== HTTP請求信息 ===")
print(f"請求URL: {response.request.url}")
print(f"請求方法: {response.request.method}")
print("請求頭:")
for key, value in response.request.headers.items():
print(f" {key}: {value}")
print("\n=== HTTP響應信息 ===")
print(f"狀態碼: {response.status_code}")
print(f"響應原因: {response.reason}")
print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
print("響應頭:")
for key, value in response.headers.items():
print(f" {key}: {value}")
print(f"\n響應內容長度: {len(response.text)} 字符")
print(f"響應內容類型: {response.headers.get('Content-Type', 'Unknown')}")
except requests.RequestException as e:
print(f"請求失敗: {e}")
# 使用示例
if __name__ == "__main__":
analyze_http_communication("https://yeyupiaoling.cn/")
運行結果示例:
=== HTTP請求信息 ===
請求URL: https://yeyupiaoling.cn/
請求方法: GET
請求頭:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: keep-alive
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
=== HTTP響應信息 ===
狀態碼: 200
響應原因: OK
響應時間: 0.197秒
響應頭:
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 16 Aug 2025 04:36:49 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Cookie
Content-Encoding: gzip
響應內容長度: 29107 字符
響應內容類型: text/html; charset=utf-8
Cookie和Session機制¶
Cookie和Session是Web應用中維持用戶狀態的重要機制:
- Cookie:存儲在客戶端的小型數據文件
- Session:存儲在服務器端的用戶會話信息
在爬蟲開發中,正確處理Cookie和Session對於模擬用戶登錄和維持會話狀態至關重要:
import requests
from http.cookies import SimpleCookie
def demonstrate_cookies_and_sessions():
"""
演示Cookie和Session的使用
"""
# 創建會話對象
session = requests.Session()
print("=== Cookie操作演示 ===")
# 1. 設置Cookie
cookie_url = "https://httpbin.org/cookies/set"
cookie_params = {
'username': 'testuser',
'session_id': 'abc123',
'preferences': 'dark_theme'
}
# 設置Cookie(這會導致重定向)
response = session.get(cookie_url, params=cookie_params)
print(f"設置Cookie後的狀態碼: {response.status_code}")
# 2. 查看當前Cookie
print("\n當前會話中的Cookie:")
for cookie in session.cookies:
print(f" {cookie.name} = {cookie.value}")
# 3. 發送帶Cookie的請求
cookie_test_url = "https://httpbin.org/cookies"
response = session.get(cookie_test_url)
if response.status_code == 200:
cookies_data = response.json()
print(f"\n服務器接收到的Cookie: {cookies_data.get('cookies', {})}")
# 4. 手動設置Cookie
print("\n=== 手動Cookie操作 ===")
manual_session = requests.Session()
# 方法1:通過字典設置
manual_session.cookies.update({
'user_id': '12345',
'auth_token': 'xyz789'
})
# 方法2:通過set方法設置
manual_session.cookies.set('language', 'zh-CN', domain='httpbin.org')
# 測試手動設置的Cookie
response = manual_session.get("https://httpbin.org/cookies")
if response.status_code == 200:
cookies_data = response.json()
print(f"手動設置的Cookie: {cookies_data.get('cookies', {})}")
# 5. Cookie持久化
print("\n=== Cookie持久化 ===")
# 保存Cookie到文件
import pickle
# 保存Cookie
with open('cookies.pkl', 'wb') as f:
pickle.dump(session.cookies, f)
print("Cookie已保存到文件")
# 加載Cookie
new_session = requests.Session()
try:
with open('cookies.pkl', 'rb') as f:
new_session.cookies = pickle.load(f)
print("Cookie已從文件加載")
# 測試加載的Cookie
response = new_session.get("https://httpbin.org/cookies")
if response.status_code == 200:
cookies_data = response.json()
print(f"加載的Cookie: {cookies_data.get('cookies', {})}")
except FileNotFoundError:
print("Cookie文件不存在")
# 模擬登錄示例
def simulate_login_with_session():
"""
模擬網站登錄過程
"""
print("\n=== 模擬登錄流程 ===")
session = requests.Session()
# 1. 訪問登錄頁面(獲取必要的Cookie和token)
login_page_url = "https://httpbin.org/cookies/set/csrf_token/abc123def456"
response = session.get(login_page_url)
print(f"訪問登錄頁面: {response.status_code}")
# 2. 提交登錄表單
login_data = {
'username': 'testuser',
'password': 'testpass',
'csrf_token': 'abc123def456'
}
login_url = "https://httpbin.org/post"
response = session.post(login_url, data=login_data)
if response.status_code == 200:
print("登錄請求發送成功")
response_data = response.json()
print(f"提交的登錄數據: {response_data.get('form', {})}")
# 3. 訪問需要登錄的頁面
protected_url = "https://httpbin.org/cookies"
response = session.get(protected_url)
if response.status_code == 200:
print("成功訪問受保護頁面")
cookies_data = response.json()
print(f"當前會話Cookie: {cookies_data.get('cookies', {})}")
# 運行演示
if __name__ == "__main__":
demonstrate_cookies_and_sessions()
simulate_login_with_session()
運行結果:
=== Cookie操作演示 ===
設置Cookie後的狀態碼: 200
當前會話中的Cookie:
username = testuser
session_id = abc123
preferences = dark_theme
服務器接收到的Cookie: {'username': 'testuser', 'session_id': 'abc123', 'preferences': 'dark_theme'}
=== 手動Cookie操作 ===
手動設置的Cookie: {'user_id': '12345', 'auth_token': 'xyz789', 'language': 'zh-CN'}
=== Cookie持久化 ===
Cookie已保存到文件
Cookie已從文件加載
加載的Cookie: {'username': 'testuser', 'session_id': 'abc123', 'preferences': 'dark_theme'}
=== 模擬登錄流程 ===
訪問登錄頁面: 200
登錄請求發送成功
提交的登錄數據: {'username': 'testuser', 'password': 'testpass', 'csrf_token': 'abc123def456'}
成功訪問受保護頁面
當前會話Cookie: {'csrf_token': 'abc123def456'}
網頁結構分析¶
HTML基礎結構¶
理解HTML結構是網頁數據提取的基礎。HTML(HyperText Markup Language)使用標籤來定義網頁內容的結構和語義。
一個典型的HTML頁面結構如下:
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>頁面標題</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<header>
<nav>
<ul>
<li><a href="#home">首頁</a></li>
<li><a href="#about">關於</a></li>
</ul>
</nav>
</header>
<main>
<article>
<h1>文章標題</h1>
<p class="content">文章內容...</p>
</article>
</main>
<footer>
<p>© 2024 版權信息</p>
</footer>
<script src="script.js"></script>
</body>
</html>
讓我們編寫一個HTML結構分析工具:
import requests
from bs4 import BeautifulSoup, Doctype
from collections import Counter
def analyze_html_structure(url):
"""
分析網頁的HTML結構
"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
print(f"=== HTML結構分析: {url} ===")
# 1. 基本信息
title = soup.find('title')
print(f"頁面標題: {title.get_text().strip() if title else '無標題'}")
# 2. 文檔類型和編碼
            doctype = next((item for item in soup.contents if isinstance(item, Doctype)), None)
            print(f"文檔類型: {doctype if doctype else '未檢測到DOCTYPE聲明'}")
charset_meta = soup.find('meta', attrs={'charset': True})
if not charset_meta:
charset_meta = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
encoding = charset_meta.get('charset') if charset_meta else response.encoding
print(f"字符編碼: {encoding}")
# 3. 標籤統計
all_tags = [tag.name for tag in soup.find_all()]
tag_counter = Counter(all_tags)
print(f"\n標籤統計 (前10個):")
for tag, count in tag_counter.most_common(10):
print(f" {tag}: {count}個")
# 4. 鏈接分析
links = soup.find_all('a', href=True)
print(f"\n鏈接分析:")
print(f" 總鏈接數: {len(links)}")
internal_links = []
external_links = []
for link in links:
href = link['href']
if href.startswith('http'):
if url in href:
internal_links.append(href)
else:
external_links.append(href)
elif href.startswith('/'):
internal_links.append(href)
print(f" 內部鏈接: {len(internal_links)}個")
print(f" 外部鏈接: {len(external_links)}個")
# 5. 圖片分析
images = soup.find_all('img')
print(f"\n圖片分析:")
print(f" 圖片總數: {len(images)}")
img_with_alt = [img for img in images if img.get('alt')]
print(f" 有alt屬性: {len(img_with_alt)}個")
# 6. 表單分析
forms = soup.find_all('form')
print(f"\n表單分析:")
print(f" 表單總數: {len(forms)}")
for i, form in enumerate(forms):
method = form.get('method', 'GET').upper()
action = form.get('action', '當前頁面')
inputs = form.find_all(['input', 'select', 'textarea'])
print(f" 表單{i+1}: {method} -> {action} ({len(inputs)}個字段)")
# 7. 腳本和樣式
scripts = soup.find_all('script')
stylesheets = soup.find_all('link', rel='stylesheet')
print(f"\n資源分析:")
print(f" JavaScript文件: {len(scripts)}個")
print(f" CSS樣式表: {len(stylesheets)}個")
# 8. 結構層次
print(f"\n頁面結構:")
body = soup.find('body')
if body:
print_structure(body, level=0, max_level=3)
else:
print(f"請求失敗,狀態碼: {response.status_code}")
except Exception as e:
print(f"分析過程中出現錯誤: {e}")
def print_structure(element, level=0, max_level=3):
"""
遞歸打印HTML結構
"""
if level > max_level:
return
indent = " " * level
tag_name = element.name
# 獲取重要屬性
attrs = []
if element.get('id'):
attrs.append(f"id='{element['id']}'")
if element.get('class'):
classes = ' '.join(element['class'])
attrs.append(f"class='{classes}'")
attr_str = f" [{', '.join(attrs)}]" if attrs else ""
print(f"{indent}<{tag_name}>{attr_str}")
# 遞歸處理子元素
for child in element.children:
if hasattr(child, 'name') and child.name:
print_structure(child, level + 1, max_level)
# 使用示例
if __name__ == "__main__":
# 分析一個示例網頁
analyze_html_structure("https://httpbin.org/html")
運行結果示例:
=== HTML結構分析: https://httpbin.org/html ===
頁面標題: Herman Melville - Moby-Dick
文檔類型: html
字符編碼: utf-8
標籤統計 (前10個):
p: 4個
a: 3個
h1: 1個
body: 1個
html: 1個
head: 1個
title: 1個
鏈接分析:
總鏈接數: 3個
內部鏈接: 0個
外部鏈接: 3個
圖片分析:
圖片總數: 0個
有alt屬性: 0個
表單分析:
表單總數: 0個
資源分析:
JavaScript文件: 0個
CSS樣式表: 0個
頁面結構:
<body>
<h1>
<p>
<p>
<p>
<p>
CSS選擇器¶
CSS選擇器是定位HTML元素的強大工具,在網頁數據提取中起着關鍵作用。理解CSS選擇器語法對於精確定位目標元素至關重要。
基本選擇器:
- 標籤選擇器:div、p、a
- 類選擇器:.class-name
- ID選擇器:#element-id
- 屬性選擇器:[attribute="value"]
組合選擇器:
- 後代選擇器:div p(div內的所有p元素)
- 子元素選擇器:div > p(div的直接子p元素)
- 相鄰兄弟選擇器:h1 + p(緊跟h1的p元素)
- 通用兄弟選擇器:h1 ~ p(h1後的所有同級p元素)
僞類選擇器:
- :first-child、:last-child、:nth-child(n)
- :not(selector)、:contains(text)(注意::contains()並非標準CSS,源自jQuery;在BeautifulSoup使用的soupsieve中對應寫法爲 :-soup-contains(text))
讓我們通過實例來學習CSS選擇器的使用:
import requests
from bs4 import BeautifulSoup
def demonstrate_css_selectors():
"""
演示CSS選擇器的使用
"""
# 創建示例HTML
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>CSS選擇器示例</title>
</head>
<body>
<div class="container">
<h1 id="main-title">新聞列表</h1>
<div class="news-section">
<article class="news-item featured">
<h2>重要新聞標題1</h2>
<p class="summary">這是新聞摘要...</p>
<span class="date">2024-01-15</span>
<a href="/news/1" class="read-more">閱讀更多</a>
</article>
<article class="news-item">
<h2>普通新聞標題2</h2>
<p class="summary">這是另一個新聞摘要...</p>
<span class="date">2024-01-14</span>
<a href="/news/2" class="read-more">閱讀更多</a>
</article>
<article class="news-item">
<h2>普通新聞標題3</h2>
<p class="summary">第三個新聞摘要...</p>
<span class="date">2024-01-13</span>
<a href="/news/3" class="read-more">閱讀更多</a>
</article>
</div>
<aside class="sidebar">
<h3>熱門標籤</h3>
<ul class="tag-list">
<li><a href="/tag/tech" data-category="technology">科技</a></li>
<li><a href="/tag/sports" data-category="sports">體育</a></li>
<li><a href="/tag/finance" data-category="finance">財經</a></li>
</ul>
</aside>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("=== CSS選擇器演示 ===")
# 1. 基本選擇器
print("\n1. 基本選擇器:")
# 標籤選擇器
h2_elements = soup.select('h2')
print(f"所有h2標籤 ({len(h2_elements)}個):")
for h2 in h2_elements:
print(f" - {h2.get_text().strip()}")
# 類選擇器
news_items = soup.select('.news-item')
print(f"\n所有新聞項 ({len(news_items)}個):")
for i, item in enumerate(news_items, 1):
title = item.select_one('h2').get_text().strip()
print(f" {i}. {title}")
# ID選擇器
main_title = soup.select_one('#main-title')
print(f"\n主標題: {main_title.get_text().strip()}")
# 屬性選擇器
tech_links = soup.select('a[data-category="technology"]')
print(f"\n科技類鏈接 ({len(tech_links)}個):")
for link in tech_links:
print(f" - {link.get_text().strip()} -> {link.get('href')}")
# 2. 組合選擇器
print("\n2. 組合選擇器:")
# 後代選擇器
container_links = soup.select('.container a')
print(f"容器內所有鏈接 ({len(container_links)}個):")
for link in container_links:
text = link.get_text().strip()
href = link.get('href', '#')
print(f" - {text} -> {href}")
# 子元素選擇器
direct_children = soup.select('.news-section > .news-item')
print(f"\n新聞區域的直接子元素 ({len(direct_children)}個)")
# 相鄰兄弟選擇器
after_h2 = soup.select('h2 + p')
print(f"\nh2後的相鄰p元素 ({len(after_h2)}個):")
for p in after_h2:
print(f" - {p.get_text().strip()[:30]}...")
# 3. 僞類選擇器
print("\n3. 僞類選擇器:")
# 第一個和最後一個子元素
first_news = soup.select('.news-item:first-child')
last_news = soup.select('.news-item:last-child')
if first_news:
first_title = first_news[0].select_one('h2').get_text().strip()
print(f"第一個新聞: {first_title}")
if last_news:
last_title = last_news[0].select_one('h2').get_text().strip()
print(f"最後一個新聞: {last_title}")
# nth-child選擇器
second_news = soup.select('.news-item:nth-child(2)')
if second_news:
second_title = second_news[0].select_one('h2').get_text().strip()
print(f"第二個新聞: {second_title}")
# 4. 複雜選擇器組合
print("\n4. 複雜選擇器:")
# 選擇特色新聞的標題
featured_title = soup.select('.news-item.featured h2')
if featured_title:
print(f"特色新聞標題: {featured_title[0].get_text().strip()}")
# 選擇包含特定文本的元素
read_more_links = soup.select('a.read-more')
print(f"'閱讀更多'鏈接 ({len(read_more_links)}個)")
# 選擇具有特定屬性的元素
category_links = soup.select('a[data-category]')
print(f"有分類屬性的鏈接 ({len(category_links)}個):")
for link in category_links:
category = link.get('data-category')
text = link.get_text().strip()
print(f" - {text} (分類: {category})")
# 實際網頁CSS選擇器應用
def extract_data_with_css_selectors(url):
"""
使用CSS選擇器從實際網頁提取數據
"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
print(f"\n=== 從 {url} 提取數據 ===")
# 提取頁面標題
title = soup.select_one('title')
if title:
print(f"頁面標題: {title.get_text().strip()}")
# 提取所有鏈接
links = soup.select('a[href]')
print(f"\n找到 {len(links)} 個鏈接:")
for i, link in enumerate(links[:5], 1): # 只顯示前5個
text = link.get_text().strip()
href = link.get('href')
print(f" {i}. {text[:50]}... -> {href}")
# 提取所有段落文本
paragraphs = soup.select('p')
if paragraphs:
print(f"\n段落內容 (共{len(paragraphs)}個):")
for i, p in enumerate(paragraphs[:3], 1): # 只顯示前3個
text = p.get_text().strip()
if text:
print(f" {i}. {text[:100]}...")
else:
print(f"請求失敗,狀態碼: {response.status_code}")
except Exception as e:
print(f"提取數據時出現錯誤: {e}")
# 運行演示
if __name__ == "__main__":
demonstrate_css_selectors()
extract_data_with_css_selectors("https://httpbin.org/html")
JavaScript和動態內容¶
現代網頁大量使用JavaScript來動態生成內容,這給傳統的靜態爬蟲帶來了挑戰。動態內容包括:
- AJAX加載的數據:通過異步請求獲取的內容
- JavaScript渲染的頁面:完全由JS生成的頁面結構
- 用戶交互觸發的內容:點擊、滾動等操作後顯示的內容
- 即時更新的數據:WebSocket或定時刷新的內容
處理動態內容的方法:
方法1:分析AJAX請求
import requests
import json
def analyze_ajax_requests():
"""
分析和模擬AJAX請求
"""
print("=== AJAX請求分析 ===")
# 模擬一個AJAX請求
ajax_url = "https://httpbin.org/json"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'X-Requested-With': 'XMLHttpRequest', # 標識AJAX請求
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Content-Type': 'application/json'
}
try:
response = requests.get(ajax_url, headers=headers)
if response.status_code == 200:
data = response.json()
print(f"AJAX響應數據:")
print(json.dumps(data, indent=2, ensure_ascii=False))
else:
print(f"AJAX請求失敗: {response.status_code}")
except Exception as e:
print(f"AJAX請求異常: {e}")
# 運行AJAX分析
if __name__ == "__main__":
analyze_ajax_requests()
方法2:使用Selenium處理JavaScript
# 注意:需要安裝selenium和對應的瀏覽器驅動
# pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
def handle_dynamic_content_with_selenium():
"""
使用Selenium處理動態內容
"""
print("=== Selenium處理動態內容 ===")
# 配置Chrome選項
chrome_options = Options()
chrome_options.add_argument('--headless') # 無頭模式
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
try:
# 創建WebDriver實例
driver = webdriver.Chrome(options=chrome_options)
# 訪問包含動態內容的頁面
driver.get("https://httpbin.org/html")
# 等待頁面加載完成
wait = WebDriverWait(driver, 10)
# 獲取頁面標題
title = driver.title
print(f"頁面標題: {title}")
# 查找元素
h1_element = wait.until(
EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
print(f"H1內容: {h1_element.text}")
# 獲取所有鏈接
links = driver.find_elements(By.TAG_NAME, "a")
print(f"\n找到 {len(links)} 個鏈接:")
for i, link in enumerate(links, 1):
text = link.text.strip()
href = link.get_attribute('href')
print(f" {i}. {text} -> {href}")
# 執行JavaScript
js_result = driver.execute_script("return document.title;")
print(f"\nJavaScript執行結果: {js_result}")
except Exception as e:
print(f"Selenium處理異常: {e}")
finally:
if 'driver' in locals():
driver.quit()
# 注意:實際運行需要對應的瀏覽器驅動(Selenium 4.6+可由內置的Selenium Manager自動下載)
# 這裏只是演示代碼結構
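對於前面提到的用戶交互觸發的內容,通常需要先在Selenium中模擬滾動或點擊,等新內容渲染出來後再提取頁面源碼。下面是一個簡要草圖(其中的URL和CSS選擇器均爲假設的佔位值,需按目標頁面調整):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
def crawl_after_interaction(url, load_more_selector, item_selector):
    """滾動到底部並點擊加載更多按鈕後再提取內容(示意,選擇器均爲假設值)"""
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # 模擬滾動到頁面底部,觸發懶加載內容
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # 等待加載更多按鈕可點擊後再點擊它
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, load_more_selector))
        )
        button.click()
        # 等待新內容節點出現,然後返回渲染後的HTML,交給BeautifulSoup等解析
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, item_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
# 示例調用(URL與選擇器僅爲佔位)
# html = crawl_after_interaction("https://example.com/list", "button.load-more", ".list-item")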
網頁編碼和字符集¶
正確處理網頁編碼是避免亂碼問題的關鍵。常見的編碼格式包括:
- UTF-8:支持全球所有字符的Unicode編碼
- GBK/GB2312:中文編碼格式
- ISO-8859-1:西歐字符編碼
- ASCII:基本英文字符編碼
import requests
from bs4 import BeautifulSoup
import chardet
def handle_encoding_issues():
"""
處理網頁編碼問題
"""
print("=== 網頁編碼處理 ===")
# 測試不同編碼的處理
test_urls = [
"https://httpbin.org/encoding/utf8",
"https://httpbin.org/html",
]
for url in test_urls:
try:
print(f"\n處理URL: {url}")
# 獲取原始響應
response = requests.get(url)
print(f"響應編碼: {response.encoding}")
print(f"表觀編碼: {response.apparent_encoding}")
# 方法1:使用chardet檢測編碼
detected_encoding = chardet.detect(response.content)
print(f"檢測到的編碼: {detected_encoding}")
# 方法2:從HTML meta標籤獲取編碼
soup = BeautifulSoup(response.content, 'html.parser')
# 查找charset聲明
charset_meta = soup.find('meta', attrs={'charset': True})
if charset_meta:
declared_charset = charset_meta.get('charset')
print(f"聲明的編碼: {declared_charset}")
else:
# 查找http-equiv類型的meta標籤
content_type_meta = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
if content_type_meta:
content = content_type_meta.get('content', '')
if 'charset=' in content:
declared_charset = content.split('charset=')[1].split(';')[0]
print(f"聲明的編碼: {declared_charset}")
# 方法3:正確設置編碼後重新解析
if detected_encoding['encoding']:
response.encoding = detected_encoding['encoding']
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title')
if title:
print(f"正確編碼後的標題: {title.get_text().strip()}")
except Exception as e:
print(f"編碼處理異常: {e}")
def create_encoding_safe_crawler():
"""
創建編碼安全的爬蟲
"""
def safe_get_text(url, timeout=10):
"""
安全獲取網頁文本內容
"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers, timeout=timeout)
# 1. 首先嚐試使用響應頭中的編碼
if response.encoding != 'ISO-8859-1': # 避免錯誤的默認編碼
soup = BeautifulSoup(response.text, 'html.parser')
else:
# 2. 使用chardet檢測編碼
detected = chardet.detect(response.content)
if detected['confidence'] > 0.7: # 置信度閾值
response.encoding = detected['encoding']
soup = BeautifulSoup(response.text, 'html.parser')
else:
# 3. 嘗試常見編碼
for encoding in ['utf-8', 'gbk', 'gb2312']:
try:
text = response.content.decode(encoding)
soup = BeautifulSoup(text, 'html.parser')
break
except UnicodeDecodeError:
continue
else:
# 4. 使用錯誤處理策略
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
return soup
except Exception as e:
print(f"獲取頁面內容失敗: {e}")
return None
# 測試編碼安全爬蟲
test_url = "https://httpbin.org/html"
soup = safe_get_text(test_url)
if soup:
title = soup.find('title')
print(f"\n編碼安全爬蟲結果:")
print(f"標題: {title.get_text().strip() if title else '無標題'}")
# 提取文本內容
paragraphs = soup.find_all('p')
print(f"段落數量: {len(paragraphs)}")
for i, p in enumerate(paragraphs[:2], 1):
text = p.get_text().strip()
print(f"段落{i}: {text[:100]}...")
# 運行編碼處理演示
if __name__ == "__main__":
handle_encoding_issues()
create_encoding_safe_crawler()
爬蟲開發環境¶
開發工具選擇¶
選擇合適的開發工具能夠顯著提高爬蟲開發效率:
IDE和編輯器:
- PyCharm:功能強大的Python IDE,支持調試和代碼分析
- VS Code:輕量級編輯器,豐富的插件生態
- Jupyter Notebook:適合數據分析和原型開發
- Sublime Text:快速的文本編輯器
瀏覽器開發者工具:
- Chrome DevTools:分析網頁結構、網絡請求、JavaScript執行
- Firefox Developer Tools:類似Chrome,某些功能更強大
- 網絡面板:查看HTTP請求和響應(可將面板中的請求複製到Requests中重放,示例見本小節末尾)
- 元素面板:分析HTML結構和CSS樣式
抓包工具:
- Fiddler:Windows平臺的HTTP調試代理
- Charles:跨平臺的HTTP監控工具
- mitmproxy:基於Python的中間人代理
- Wireshark:網絡協議分析器
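無論使用開發者工具的網絡面板還是上述抓包工具,定位到目標請求後,常見做法是把它的URL、請求頭和參數複製出來,用Requests重放一遍,驗證脫離瀏覽器後仍能取到同樣的數據。下面是一個簡要示意(其中的URL、頭部和參數均爲假設的示例值,實際內容應從面板中複製):
import requests
# 以下URL、請求頭和參數僅爲示例,實際應從網絡面板或抓包工具中複製
url = "https://httpbin.org/get"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Referer': 'https://httpbin.org/',
}
params = {'page': 1, 'size': 20}  # 對應面板中的Query String Parameters
response = requests.get(url, headers=headers, params=params, timeout=10)
print(response.status_code)
print(response.json()['args'])  # 確認服務器收到了與瀏覽器中一致的參數
如果重放結果與瀏覽器中不一致,通常說明缺少某個關鍵頭部或Cookie,可以逐項補齊後再試。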
代理和IP池¶
使用代理服務器可以隱藏真實IP地址,避免被網站封禁:
import requests
import random
import time
from itertools import cycle
class ProxyManager:
"""
代理管理器
"""
def __init__(self):
# 代理列表(示例,實際使用時需要有效的代理)
self.proxy_list = [
{'http': 'http://proxy1:port', 'https': 'https://proxy1:port'},
{'http': 'http://proxy2:port', 'https': 'https://proxy2:port'},
{'http': 'http://proxy3:port', 'https': 'https://proxy3:port'},
]
self.proxy_cycle = cycle(self.proxy_list)
self.failed_proxies = set()
def get_proxy(self):
"""
獲取可用代理
"""
for _ in range(len(self.proxy_list)):
proxy = next(self.proxy_cycle)
proxy_key = str(proxy)
if proxy_key not in self.failed_proxies:
return proxy
# 如果所有代理都失敗,清空失敗列表重新開始
self.failed_proxies.clear()
return next(self.proxy_cycle)
def mark_proxy_failed(self, proxy):
"""
標記代理失敗
"""
self.failed_proxies.add(str(proxy))
def test_proxy(self, proxy, test_url="https://httpbin.org/ip"):
"""
測試代理是否可用
"""
try:
response = requests.get(
test_url,
proxies=proxy,
timeout=10,
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
)
if response.status_code == 200:
data = response.json()
print(f"代理測試成功,IP: {data.get('origin')}")
return True
else:
print(f"代理測試失敗,狀態碼: {response.status_code}")
return False
except Exception as e:
print(f"代理測試異常: {e}")
return False
def demonstrate_proxy_usage():
"""
演示代理使用
"""
print("=== 代理使用演示 ===")
# 不使用代理的請求
try:
response = requests.get("https://httpbin.org/ip", timeout=10)
if response.status_code == 200:
data = response.json()
print(f"直接訪問IP: {data.get('origin')}")
except Exception as e:
print(f"直接訪問失敗: {e}")
# 使用代理的請求(示例)
proxy_manager = ProxyManager()
# 注意:以下代碼需要有效的代理服務器才能正常工作
print("\n代理測試(需要有效代理):")
for i in range(3):
proxy = proxy_manager.get_proxy()
print(f"測試代理 {i+1}: {proxy}")
# 在實際環境中測試代理
# is_working = proxy_manager.test_proxy(proxy)
# if not is_working:
# proxy_manager.mark_proxy_failed(proxy)
# 免費代理獲取示例
def get_free_proxies():
"""
獲取免費代理(示例)
"""
print("\n=== 免費代理獲取 ===")
# 這裏只是演示結構,實際需要從代理網站爬取
free_proxy_sources = [
"https://www.proxy-list.download/api/v1/get?type=http",
"https://api.proxyscrape.com/v2/?request=get&protocol=http",
]
proxies = []
for source in free_proxy_sources:
try:
print(f"從 {source} 獲取代理...")
# 實際實現需要解析不同網站的格式
# response = requests.get(source, timeout=10)
# 解析代理列表...
print("代理獲取完成(示例)")
except Exception as e:
print(f"獲取代理失敗: {e}")
return proxies
# 運行代理演示
if __name__ == "__main__":
demonstrate_proxy_usage()
get_free_proxies()
用戶代理設置¶
用戶代理(User-Agent)字符串標識客戶端應用程序,設置合適的User-Agent可以避免被識別爲爬蟲:
import requests
import random
class UserAgentManager:
"""
用戶代理管理器
"""
def __init__(self):
self.user_agents = [
# Chrome
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
# Firefox
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
# Safari
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',
# Edge
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
]
def get_random_user_agent(self):
"""
獲取隨機用戶代理
"""
return random.choice(self.user_agents)
def get_mobile_user_agent(self):
"""
獲取移動端用戶代理
"""
mobile_agents = [
'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',
'Mozilla/5.0 (Android 14; Mobile; rv:121.0) Gecko/121.0 Firefox/121.0',
'Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
]
return random.choice(mobile_agents)
def demonstrate_user_agent():
"""
演示用戶代理的使用
"""
print("=== 用戶代理演示 ===")
ua_manager = UserAgentManager()
# 測試不同的用戶代理
test_url = "https://httpbin.org/user-agent"
for i in range(3):
user_agent = ua_manager.get_random_user_agent()
headers = {'User-Agent': user_agent}
try:
response = requests.get(test_url, headers=headers)
if response.status_code == 200:
data = response.json()
print(f"\n請求 {i+1}:")
print(f"發送的User-Agent: {user_agent[:50]}...")
print(f"服務器接收到的: {data.get('user-agent', '')[:50]}...")
except Exception as e:
print(f"請求失敗: {e}")
# 測試移動端用戶代理
print("\n=== 移動端用戶代理 ===")
mobile_ua = ua_manager.get_mobile_user_agent()
headers = {'User-Agent': mobile_ua}
try:
response = requests.get(test_url, headers=headers)
if response.status_code == 200:
data = response.json()
print(f"移動端User-Agent: {data.get('user-agent')}")
except Exception as e:
print(f"移動端請求失敗: {e}")
# 運行用戶代理演示
if __name__ == "__main__":
demonstrate_user_agent()
調試和測試工具¶
有效的調試和測試工具能夠幫助快速定位和解決爬蟲開發中的問題:
import requests
import time
import logging
from functools import wraps
# 配置日誌
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('crawler.log'),
logging.StreamHandler()
]
)
def debug_request(func):
"""
請求調試裝飾器
"""
@wraps(func)
def wrapper(*args, **kwargs):
start_time = time.time()
try:
result = func(*args, **kwargs)
end_time = time.time()
logging.info(f"{func.__name__} 執行成功,耗時: {end_time - start_time:.3f}秒")
return result
except Exception as e:
end_time = time.time()
logging.error(f"{func.__name__} 執行失敗,耗時: {end_time - start_time:.3f}秒,錯誤: {e}")
raise
return wrapper
class CrawlerDebugger:
"""
爬蟲調試器
"""
def __init__(self):
self.request_count = 0
self.success_count = 0
self.error_count = 0
self.start_time = time.time()
@debug_request
def debug_get(self, url, **kwargs):
"""
調試版本的GET請求
"""
self.request_count += 1
# 默認headers
default_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
        # 先放入默認headers,再用調用方傳入的headers覆蓋,保證調用方的設置優先
        headers = {**default_headers, **kwargs.get('headers', {})}
        kwargs['headers'] = headers
logging.info(f"發送GET請求到: {url}")
logging.debug(f"請求參數: {kwargs}")
try:
response = requests.get(url, **kwargs)
logging.info(f"響應狀態碼: {response.status_code}")
logging.info(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
logging.debug(f"響應頭: {dict(response.headers)}")
if response.status_code == 200:
self.success_count += 1
else:
self.error_count += 1
logging.warning(f"非200狀態碼: {response.status_code}")
return response
except requests.RequestException as e:
self.error_count += 1
logging.error(f"請求異常: {e}")
raise
def get_stats(self):
"""
獲取統計信息
"""
elapsed_time = time.time() - self.start_time
stats = {
'總請求數': self.request_count,
'成功請求數': self.success_count,
'失敗請求數': self.error_count,
'成功率': f"{(self.success_count / max(self.request_count, 1)) * 100:.2f}%",
'運行時間': f"{elapsed_time:.2f}秒",
'平均請求速度': f"{self.request_count / max(elapsed_time, 1):.2f}請求/秒"
}
return stats
def print_stats(self):
"""
打印統計信息
"""
stats = self.get_stats()
print("\n=== 爬蟲統計信息 ===")
for key, value in stats.items():
print(f"{key}: {value}")
def test_crawler_debugger():
"""
測試爬蟲調試器
"""
debugger = CrawlerDebugger()
test_urls = [
"https://httpbin.org/get",
"https://httpbin.org/status/200",
"https://httpbin.org/delay/1",
"https://httpbin.org/status/404", # 這個會返回404
"https://httpbin.org/json",
]
print("開始測試爬蟲調試器...")
for url in test_urls:
try:
response = debugger.debug_get(url, timeout=10)
print(f"✓ {url} - 狀態碼: {response.status_code}")
except Exception as e:
print(f"✗ {url} - 錯誤: {e}")
time.sleep(0.5) # 避免請求過快
# 打印統計信息
debugger.print_stats()
# 性能測試工具
def performance_test(func, *args, **kwargs):
"""
性能測試裝飾器
"""
def test_performance(iterations=10):
times = []
for i in range(iterations):
start_time = time.time()
try:
func(*args, **kwargs)
end_time = time.time()
times.append(end_time - start_time)
except Exception as e:
print(f"第{i+1}次測試失敗: {e}")
if times:
avg_time = sum(times) / len(times)
min_time = min(times)
max_time = max(times)
print(f"\n=== 性能測試結果 ({iterations}次) ===")
print(f"平均時間: {avg_time:.3f}秒")
print(f"最短時間: {min_time:.3f}秒")
print(f"最長時間: {max_time:.3f}秒")
print(f"成功率: {len(times)}/{iterations} ({len(times)/iterations*100:.1f}%)")
return test_performance
# 運行調試演示
if __name__ == "__main__":
test_crawler_debugger()
# 性能測試示例
@performance_test
def simple_request():
response = requests.get("https://httpbin.org/get", timeout=5)
return response.status_code == 200
print("\n開始性能測試...")
simple_request(iterations=5)
運行結果示例:
開始測試爬蟲調試器...
2024-01-15 14:30:15,123 - INFO - 發送GET請求到: https://httpbin.org/get
2024-01-15 14:30:15,456 - INFO - 響應狀態碼: 200
2024-01-15 14:30:15,456 - INFO - 響應時間: 0.333秒
2024-01-15 14:30:15,456 - INFO - debug_get 執行成功,耗時: 0.334秒
✓ https://httpbin.org/get - 狀態碼: 200
2024-01-15 14:30:16,001 - INFO - 發送GET請求到: https://httpbin.org/status/200
2024-01-15 14:30:16,234 - INFO - 響應狀態碼: 200
2024-01-15 14:30:16,234 - INFO - 響應時間: 0.233秒
2024-01-15 14:30:16,234 - INFO - debug_get 執行成功,耗時: 0.234秒
✓ https://httpbin.org/status/200 - 狀態碼: 200
=== 爬蟲統計信息 ===
總請求數: 5
成功請求數: 4
失敗請求數: 1
成功率: 80.00%
運行時間: 3.45秒
平均請求速度: 1.45請求/秒
=== 性能測試結果 (5次) ===
平均時間: 0.456秒
最短時間: 0.234秒
最長時間: 0.678秒
成功率: 5/5 (100.0%)
14.2 Requests庫網絡請求¶
Requests是Python中最受歡迎的HTTP庫,它讓HTTP請求變得簡單而優雅。相比於Python標準庫中的urllib,Requests提供了更加人性化的API,是網絡爬蟲開發的首選工具。
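爲了直觀感受這種差異,下面用標準庫urllib和Requests分別發送同一個帶參數的GET請求作對比(僅作示意):
import json
import urllib.parse
import urllib.request
# 用標準庫urllib發送帶參數的GET請求
params = urllib.parse.urlencode({'name': '張三', 'age': 25})
req = urllib.request.Request(
    f"https://httpbin.org/get?{params}",
    headers={'User-Agent': 'Mozilla/5.0'}
)
with urllib.request.urlopen(req, timeout=10) as resp:
    data = json.loads(resp.read().decode('utf-8'))
    print(data['args'])
# 等價的Requests寫法:參數編碼、解碼和JSON解析都由庫自動完成
import requests
response = requests.get(
    'https://httpbin.org/get',
    params={'name': '張三', 'age': 25},
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=10
)
print(response.json()['args'])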
Requests基礎¶
安裝和基本使用¶
Requests庫的安裝非常簡單,使用pip命令即可:
pip install requests
安裝完成後,我們來看看Requests的基本使用方法:
import requests
import json
from pprint import pprint
def basic_requests_usage():
"""
演示Requests的基本使用方法
"""
print("=== Requests基礎使用演示 ===")
# 1. 最簡單的GET請求
print("\n1. 基本GET請求:")
response = requests.get('https://httpbin.org/get')
print(f"狀態碼: {response.status_code}")
print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
print(f"內容類型: {response.headers.get('content-type')}")
# 2. 檢查請求是否成功
if response.status_code == 200:
print("請求成功!")
data = response.json() # 解析JSON響應
print(f"服務器接收到的URL: {data['url']}")
else:
print(f"請求失敗,狀態碼: {response.status_code}")
# 3. 使用raise_for_status()檢查狀態
try:
        response.raise_for_status() # 狀態碼爲4xx或5xx時會拋出HTTPError異常
print("狀態檢查通過")
except requests.exceptions.HTTPError as e:
print(f"HTTP錯誤: {e}")
# 4. 獲取響應內容的不同方式
print("\n2. 響應內容獲取:")
# 文本內容
print(f"響應文本長度: {len(response.text)}字符")
# 二進制內容
print(f"響應二進制長度: {len(response.content)}字節")
# JSON內容(如果是JSON格式)
try:
json_data = response.json()
print(f"JSON數據鍵: {list(json_data.keys())}")
except ValueError:
print("響應不是有效的JSON格式")
# 5. 響應頭信息
print("\n3. 響應頭信息:")
print(f"服務器: {response.headers.get('server', '未知')}")
print(f"內容長度: {response.headers.get('content-length', '未知')}")
print(f"連接類型: {response.headers.get('connection', '未知')}")
# 運行基礎演示
if __name__ == "__main__":
basic_requests_usage()
運行結果:
=== Requests基礎使用演示 ===
1. 基本GET請求:
狀態碼: 200
響應時間: 0.234秒
內容類型: application/json
請求成功!
服務器接收到的URL: https://httpbin.org/get
狀態檢查通過
2. 響應內容獲取:
響應文本長度: 312字符
響應二進制長度: 312字節
JSON數據鍵: ['args', 'headers', 'origin', 'url']
3. 響應頭信息:
服務器: gunicorn/19.9.0
內容長度: 312
連接類型: keep-alive
GET和POST請求¶
GET和POST是HTTP協議中最常用的兩種請求方法。GET用於獲取數據,POST用於提交數據。
GET請求詳解:
import requests
from urllib.parse import urlencode
def demonstrate_get_requests():
"""
演示各種GET請求的使用方法
"""
print("=== GET請求詳解 ===")
# 1. 基本GET請求
print("\n1. 基本GET請求:")
response = requests.get('https://httpbin.org/get')
print(f"請求URL: {response.url}")
print(f"狀態碼: {response.status_code}")
# 2. 帶參數的GET請求
print("\n2. 帶參數的GET請求:")
# 方法1: 使用params參數
params = {
'name': '張三',
'age': 25,
'city': '北京',
'hobbies': ['讀書', '游泳'] # 列表參數
}
response = requests.get('https://httpbin.org/get', params=params)
print(f"構建的URL: {response.url}")
data = response.json()
print(f"服務器接收到的參數: {data['args']}")
# 方法2: 直接在URL中包含參數
url_with_params = 'https://httpbin.org/get?name=李四&age=30'
response2 = requests.get(url_with_params)
print(f"\n直接URL參數: {response2.json()['args']}")
# 3. 自定義請求頭
print("\n3. 自定義請求頭:")
headers = {
'User-Agent': 'MySpider/1.0',
'Accept': 'application/json',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Referer': 'https://www.example.com'
}
response = requests.get('https://httpbin.org/get', headers=headers)
received_headers = response.json()['headers']
print(f"發送的User-Agent: {headers['User-Agent']}")
print(f"服務器接收到的User-Agent: {received_headers.get('User-Agent')}")
# 4. 超時設置
print("\n4. 超時設置:")
try:
# 設置連接超時爲3秒,讀取超時爲5秒
response = requests.get('https://httpbin.org/delay/2', timeout=(3, 5))
print(f"請求成功,耗時: {response.elapsed.total_seconds():.3f}秒")
except requests.exceptions.Timeout:
print("請求超時")
except requests.exceptions.RequestException as e:
print(f"請求異常: {e}")
# 5. 處理重定向
print("\n5. 重定向處理:")
# 允許重定向(默認行爲)
response = requests.get('https://httpbin.org/redirect/2')
print(f"最終URL: {response.url}")
print(f"重定向歷史: {[r.url for r in response.history]}")
# 禁止重定向
response_no_redirect = requests.get('https://httpbin.org/redirect/1', allow_redirects=False)
print(f"\n禁止重定向狀態碼: {response_no_redirect.status_code}")
print(f"Location頭: {response_no_redirect.headers.get('Location')}")
# 運行GET請求演示
if __name__ == "__main__":
demonstrate_get_requests()
POST請求詳解:
import requests
import json
def demonstrate_post_requests():
"""
演示各種POST請求的使用方法
"""
print("=== POST請求詳解 ===")
# 1. 發送表單數據
print("\n1. 發送表單數據:")
form_data = {
'username': 'testuser',
'password': 'testpass',
'email': 'test@example.com',
'remember': 'on'
}
response = requests.post('https://httpbin.org/post', data=form_data)
if response.status_code == 200:
result = response.json()
print(f"發送的表單數據: {form_data}")
print(f"服務器接收到的表單: {result['form']}")
print(f"Content-Type: {result['headers'].get('Content-Type')}")
# 2. 發送JSON數據
print("\n2. 發送JSON數據:")
json_data = {
'name': '王五',
'age': 28,
'skills': ['Python', 'JavaScript', 'SQL'],
'is_active': True,
'profile': {
'city': '上海',
'experience': 5
}
}
# 方法1: 使用json參數(推薦)
response = requests.post('https://httpbin.org/post', json=json_data)
if response.status_code == 200:
result = response.json()
print(f"發送的JSON數據: {json_data}")
print(f"服務器接收到的JSON: {result['json']}")
print(f"Content-Type: {result['headers'].get('Content-Type')}")
# 方法2: 手動設置headers和data
headers = {'Content-Type': 'application/json'}
response2 = requests.post(
'https://httpbin.org/post',
data=json.dumps(json_data),
headers=headers
)
print(f"\n手動設置方式狀態碼: {response2.status_code}")
# 3. 發送文件
print("\n3. 文件上傳:")
# 創建一個臨時文件用於演示
import tempfile
import os
# 創建臨時文件
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
f.write("這是一個測試文件\n包含中文內容")
temp_file_path = f.name
try:
# 上傳文件
with open(temp_file_path, 'rb') as f:
files = {'file': ('test.txt', f, 'text/plain')}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
result = response.json()
print(f"上傳的文件信息: {result['files']}")
print(f"Content-Type: {result['headers'].get('Content-Type')}")
finally:
# 清理臨時文件
os.unlink(temp_file_path)
# 4. 混合數據提交
print("\n4. 混合數據提交:")
# 同時發送表單數據和文件
form_data = {'description': '文件描述', 'category': 'test'}
# 創建內存中的文件對象
from io import StringIO, BytesIO
file_content = BytesIO(b"Hello, World! This is a test file.")
files = {'upload': ('hello.txt', file_content, 'text/plain')}
response = requests.post(
'https://httpbin.org/post',
data=form_data,
files=files
)
if response.status_code == 200:
result = response.json()
print(f"表單數據: {result['form']}")
print(f"文件數據: {list(result['files'].keys())}")
# 5. 自定義請求頭的POST
print("\n5. 自定義請求頭的POST:")
headers = {
'User-Agent': 'MyApp/2.0',
'Authorization': 'Bearer your-token-here',
'X-Custom-Header': 'custom-value'
}
data = {'message': 'Hello from custom headers'}
response = requests.post(
'https://httpbin.org/post',
json=data,
headers=headers
)
if response.status_code == 200:
result = response.json()
received_headers = result['headers']
print(f"自定義頭部 X-Custom-Header: {received_headers.get('X-Custom-Header')}")
print(f"Authorization: {received_headers.get('Authorization')}")
# 運行POST請求演示
if __name__ == "__main__":
demonstrate_post_requests()
運行結果示例:
=== POST請求詳解 ===
1. 發送表單數據:
發送的表單數據: {'username': 'testuser', 'password': 'testpass', 'email': 'test@example.com', 'remember': 'on'}
服務器接收到的表單: {'username': 'testuser', 'password': 'testpass', 'email': 'test@example.com', 'remember': 'on'}
Content-Type: application/x-www-form-urlencoded
2. 發送JSON數據:
發送的JSON數據: {'name': '王五', 'age': 28, 'skills': ['Python', 'JavaScript', 'SQL'], 'is_active': True, 'profile': {'city': '上海', 'experience': 5}}
服務器接收到的JSON: {'name': '王五', 'age': 28, 'skills': ['Python', 'JavaScript', 'SQL'], 'is_active': True, 'profile': {'city': '上海', 'experience': 5}}
Content-Type: application/json
3. 文件上傳:
上傳的文件信息: {'file': '這是一個測試文件\n包含中文內容'}
Content-Type: multipart/form-data; boundary=...
4. 混合數據提交:
表單數據: {'description': '文件描述', 'category': 'test'}
文件數據: ['upload']
5. 自定義請求頭的POST:
自定義頭部 X-Custom-Header: custom-value
Authorization: Bearer your-token-here
請求參數和頭部¶
在網絡爬蟲中,正確設置請求參數和頭部信息是非常重要的,它們決定了服務器如何處理我們的請求。
請求參數詳解¶
import requests
from urllib.parse import urlencode, quote
def advanced_parameters_demo():
"""
演示高級參數處理
"""
print("=== 高級參數處理演示 ===")
# 1. 複雜參數結構
print("\n1. 複雜參數結構:")
complex_params = {
'q': 'Python爬蟲', # 中文搜索詞
'page': 1,
'size': 20,
'sort': ['time', 'relevance'], # 多值參數
'filters': {
'category': 'tech',
'date_range': '2024-01-01,2024-12-31'
},
'include_fields': ['title', 'content', 'author'],
'exclude_empty': True
}
# Requests會自動處理複雜參數
response = requests.get('https://httpbin.org/get', params=complex_params)
print(f"構建的URL: {response.url}")
result = response.json()
print(f"\n服務器接收到的參數:")
for key, value in result['args'].items():
print(f" {key}: {value}")
# 2. 手動URL編碼
print("\n2. 手動URL編碼:")
# 處理特殊字符
special_params = {
'query': 'hello world & python',
'symbols': '!@#$%^&*()+={}[]|\\:;"<>?,./'
}
# 方法1: 使用requests自動編碼
response1 = requests.get('https://httpbin.org/get', params=special_params)
print(f"自動編碼URL: {response1.url}")
# 方法2: 手動編碼
encoded_query = quote('hello world & python')
manual_url = f'https://httpbin.org/get?query={encoded_query}'
response2 = requests.get(manual_url)
print(f"手動編碼URL: {response2.url}")
# 3. 數組參數的不同處理方式
print("\n3. 數組參數處理:")
# 方式1: Python列表(默認行爲)
list_params = {'tags': ['python', 'web', 'crawler']}
response = requests.get('https://httpbin.org/get', params=list_params)
print(f"列表參數URL: {response.url}")
# 方式2: 手動構建重複參數
manual_params = [('tags', 'python'), ('tags', 'web'), ('tags', 'crawler')]
response2 = requests.get('https://httpbin.org/get', params=manual_params)
print(f"手動重複參數URL: {response2.url}")
# 4. 條件參數構建
print("\n4. 條件參數構建:")
def build_search_params(keyword, page=1, filters=None, sort_by=None):
"""
根據條件構建搜索參數
"""
params = {'q': keyword, 'page': page}
if filters:
for key, value in filters.items():
if value: # 只添加非空值
params[f'filter_{key}'] = value
if sort_by:
params['sort'] = sort_by
return params
# 使用條件參數構建
search_filters = {
'category': 'technology',
'author': '', # 空值,不會被添加
'date': '2024-01-01'
}
params = build_search_params(
keyword='Python教程',
page=2,
filters=search_filters,
sort_by='date_desc'
)
response = requests.get('https://httpbin.org/get', params=params)
print(f"條件構建的參數: {response.json()['args']}")
# 運行參數演示
if __name__ == "__main__":
advanced_parameters_demo()
請求頭部詳解¶
import requests
import time
import random
def advanced_headers_demo():
"""
演示高級請求頭處理
"""
print("=== 高級請求頭演示 ===")
# 1. 完整的瀏覽器請求頭模擬
print("\n1. 完整瀏覽器頭部模擬:")
browser_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1', # Do Not Track
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Cache-Control': 'max-age=0'
}
response = requests.get('https://httpbin.org/get', headers=browser_headers)
received_headers = response.json()['headers']
print(f"發送的User-Agent: {browser_headers['User-Agent'][:50]}...")
print(f"服務器接收的User-Agent: {received_headers.get('User-Agent', '')[:50]}...")
print(f"Accept-Language: {received_headers.get('Accept-Language')}")
# 2. API請求頭
print("\n2. API請求頭:")
api_headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
'X-API-Key': 'your-api-key-here',
'X-Client-Version': '1.2.3',
'X-Request-ID': f'req_{int(time.time())}_{random.randint(1000, 9999)}'
}
data = {'query': 'test data'}
response = requests.post('https://httpbin.org/post', json=data, headers=api_headers)
if response.status_code == 200:
result = response.json()
print(f"API請求成功")
print(f"Request ID: {result['headers'].get('X-Request-ID')}")
print(f"Authorization: {result['headers'].get('Authorization', '')[:20]}...")
# 3. 防爬蟲頭部設置
print("\n3. 防爬蟲頭部設置:")
# 模擬真實瀏覽器行爲
anti_bot_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Referer': 'https://www.google.com/', # 模擬從搜索引擎來
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}
response = requests.get('https://httpbin.org/get', headers=anti_bot_headers)
print(f"防爬蟲請求狀態: {response.status_code}")
print(f"Referer頭: {response.json()['headers'].get('Referer')}")
# 4. 動態頭部生成
print("\n4. 動態頭部生成:")
def generate_dynamic_headers():
"""
生成動態請求頭
"""
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0'
]
referers = [
'https://www.google.com/',
'https://www.bing.com/',
'https://www.baidu.com/',
'https://duckduckgo.com/'
]
return {
'User-Agent': random.choice(user_agents),
'Referer': random.choice(referers),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'X-Forwarded-For': f'{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}'
}
# 使用動態頭部發送多個請求
for i in range(3):
headers = generate_dynamic_headers()
response = requests.get('https://httpbin.org/get', headers=headers)
if response.status_code == 200:
result = response.json()
print(f"\n請求 {i+1}:")
print(f" User-Agent: {result['headers'].get('User-Agent', '')[:40]}...")
print(f" Referer: {result['headers'].get('Referer')}")
print(f" X-Forwarded-For: {result['headers'].get('X-Forwarded-For')}")
# 5. 頭部優先級和覆蓋
print("\n5. 頭部優先級演示:")
# 創建會話並設置默認頭部
session = requests.Session()
session.headers.update({
'User-Agent': 'DefaultAgent/1.0',
'Accept': 'application/json',
'X-Default-Header': 'default-value'
})
# 請求時覆蓋部分頭部
override_headers = {
'User-Agent': 'OverrideAgent/2.0', # 覆蓋默認值
'X-Custom-Header': 'custom-value' # 新增頭部
}
response = session.get('https://httpbin.org/get', headers=override_headers)
if response.status_code == 200:
result = response.json()
headers = result['headers']
print(f"最終User-Agent: {headers.get('User-Agent')}")
print(f"默認Accept: {headers.get('Accept')}")
print(f"默認頭部: {headers.get('X-Default-Header')}")
print(f"自定義頭部: {headers.get('X-Custom-Header')}")
# 運行頭部演示
if __name__ == "__main__":
advanced_headers_demo()
響應對象處理¶
響應對象包含了服務器返回的所有信息,正確處理響應對象是爬蟲開發的關鍵技能。
import requests
import json
from datetime import datetime
def response_handling_demo():
"""
演示響應對象的各種處理方法
"""
print("=== 響應對象處理演示 ===")
# 發送一個測試請求
response = requests.get('https://httpbin.org/json')
# 1. 基本響應信息
print("\n1. 基本響應信息:")
print(f"狀態碼: {response.status_code}")
print(f"狀態描述: {response.reason}")
print(f"請求URL: {response.url}")
print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
print(f"編碼: {response.encoding}")
# 2. 響應頭詳細分析
print("\n2. 響應頭分析:")
print(f"Content-Type: {response.headers.get('content-type')}")
print(f"Content-Length: {response.headers.get('content-length')}")
print(f"Server: {response.headers.get('server')}")
print(f"Date: {response.headers.get('date')}")
# 檢查是否支持壓縮
content_encoding = response.headers.get('content-encoding')
if content_encoding:
print(f"內容編碼: {content_encoding}")
else:
print("未使用內容壓縮")
# 3. 響應內容的不同獲取方式
print("\n3. 響應內容獲取:")
# 文本內容
text_content = response.text
print(f"文本內容長度: {len(text_content)}字符")
print(f"文本內容預覽: {text_content[:100]}...")
# 二進制內容
binary_content = response.content
print(f"二進制內容長度: {len(binary_content)}字節")
# JSON內容
try:
json_content = response.json()
print(f"JSON內容類型: {type(json_content)}")
if isinstance(json_content, dict):
print(f"JSON鍵: {list(json_content.keys())}")
except ValueError as e:
print(f"JSON解析失敗: {e}")
# 4. 響應狀態檢查
print("\n4. 響應狀態檢查:")
def check_response_status(response):
"""
檢查響應狀態的詳細信息
"""
print(f"狀態碼: {response.status_code}")
# 使用內置方法檢查狀態
if response.ok:
print("✓ 請求成功 (狀態碼 200-299)")
else:
print("✗ 請求失敗")
# 詳細狀態分類
if 200 <= response.status_code < 300:
print("✓ 成功響應")
elif 300 <= response.status_code < 400:
print("→ 重定向響應")
location = response.headers.get('location')
if location:
print(f" 重定向到: {location}")
elif 400 <= response.status_code < 500:
print("✗ 客戶端錯誤")
elif 500 <= response.status_code < 600:
print("✗ 服務器錯誤")
# 使用raise_for_status檢查
try:
response.raise_for_status()
print("✓ 狀態檢查通過")
except requests.exceptions.HTTPError as e:
print(f"✗ 狀態檢查失敗: {e}")
check_response_status(response)
# 5. 測試不同狀態碼的響應
print("\n5. 不同狀態碼測試:")
test_urls = [
('https://httpbin.org/status/200', '成功'),
('https://httpbin.org/status/404', '未找到'),
('https://httpbin.org/status/500', '服務器錯誤'),
('https://httpbin.org/redirect/1', '重定向')
]
for url, description in test_urls:
try:
resp = requests.get(url, timeout=5)
print(f"\n{description} ({url}):")
print(f" 狀態碼: {resp.status_code}")
print(f" 最終URL: {resp.url}")
if resp.history:
print(f" 重定向歷史: {[r.status_code for r in resp.history]}")
except requests.exceptions.RequestException as e:
print(f"\n{description} 請求失敗: {e}")
# 6. 響應內容類型處理
print("\n6. 不同內容類型處理:")
def handle_different_content_types():
"""
處理不同類型的響應內容
"""
# JSON響應
json_resp = requests.get('https://httpbin.org/json')
if json_resp.headers.get('content-type', '').startswith('application/json'):
data = json_resp.json()
print(f"JSON數據: {data}")
# HTML響應
html_resp = requests.get('https://httpbin.org/html')
if 'text/html' in html_resp.headers.get('content-type', ''):
print(f"HTML內容長度: {len(html_resp.text)}字符")
# 可以使用BeautifulSoup進一步解析
# XML響應
xml_resp = requests.get('https://httpbin.org/xml')
if 'application/xml' in xml_resp.headers.get('content-type', ''):
print(f"XML內容長度: {len(xml_resp.text)}字符")
# 圖片響應(二進制)
try:
img_resp = requests.get('https://httpbin.org/image/png', timeout=10)
if img_resp.headers.get('content-type', '').startswith('image/'):
print(f"圖片大小: {len(img_resp.content)}字節")
print(f"圖片類型: {img_resp.headers.get('content-type')}")
except requests.exceptions.RequestException:
print("圖片請求失敗或超時")
handle_different_content_types()
# 7. 響應時間和性能分析
print("\n7. 響應時間分析:")
def analyze_response_performance(url, num_requests=3):
"""
分析響應性能
"""
times = []
for i in range(num_requests):
start_time = datetime.now()
try:
resp = requests.get(url, timeout=10)
end_time = datetime.now()
# 計算總時間
total_time = (end_time - start_time).total_seconds()
# 獲取requests內部計時
elapsed_time = resp.elapsed.total_seconds()
times.append({
'total': total_time,
'elapsed': elapsed_time,
'status': resp.status_code
})
print(f"請求 {i+1}: {elapsed_time:.3f}秒 (狀態碼: {resp.status_code})")
except requests.exceptions.RequestException as e:
print(f"請求 {i+1} 失敗: {e}")
if times:
avg_time = sum(t['elapsed'] for t in times) / len(times)
min_time = min(t['elapsed'] for t in times)
max_time = max(t['elapsed'] for t in times)
print(f"\n性能統計:")
print(f" 平均響應時間: {avg_time:.3f}秒")
print(f" 最快響應時間: {min_time:.3f}秒")
print(f" 最慢響應時間: {max_time:.3f}秒")
analyze_response_performance('https://httpbin.org/delay/1')
# 運行響應處理演示
if __name__ == "__main__":
response_handling_demo()
運行結果示例:
=== 響應對象處理演示 ===
1. 基本響應信息:
狀態碼: 200
狀態描述: OK
請求URL: https://httpbin.org/json
響應時間: 0.234秒
編碼: utf-8
2. 響應頭分析:
Content-Type: application/json
Content-Length: 429
Server: gunicorn/19.9.0
Date: Mon, 15 Jan 2024 06:30:15 GMT
未使用內容壓縮
3. 響應內容獲取:
文本內容長度: 429字符
文本內容預覽: {"slideshow": {"author": "Yours Truly", "date": "date of publication", "slides": [{"title": "Wake up to WonderWidgets!", "type": "all"}, {"title": "Overview", "type": "all", "items": ["Why <em>WonderWidgets</em> are great", "Who <em>buys</em> them"]}], "title": "Sample Slide Show"}}...
二進制內容長度: 429字節
JSON內容類型: <class 'dict'>
JSON鍵: ['slideshow']
4. 響應狀態檢查:
狀態碼: 200
✓ 請求成功 (狀態碼 200-299)
✓ 成功響應
✓ 狀態檢查通過
5. 不同狀態碼測試:
成功 (https://httpbin.org/status/200):
狀態碼: 200
最終URL: https://httpbin.org/status/200
未找到 (https://httpbin.org/status/404):
狀態碼: 404
最終URL: https://httpbin.org/status/404
服務器錯誤 (https://httpbin.org/status/500):
狀態碼: 500
最終URL: https://httpbin.org/status/500
重定向 (https://httpbin.org/redirect/1):
狀態碼: 200
最終URL: https://httpbin.org/get
重定向歷史: [302]
7. 響應時間分析:
請求 1: 1.234秒 (狀態碼: 200)
請求 2: 1.156秒 (狀態碼: 200)
請求 3: 1.298秒 (狀態碼: 200)
性能統計:
平均響應時間: 1.229秒
最快響應時間: 1.156秒
最慢響應時間: 1.298秒
高級功能¶
Session會話管理¶
Session對象允許你跨請求保持某些參數:同一個Session實例發出的所有請求會自動共享Cookie,並複用urllib3的連接池。因此向同一主機發送多個請求時,底層的TCP連接會被重用,從而帶來顯著的性能提升。
import requests
import time
from datetime import datetime
def session_management_demo():
"""
演示Session會話管理的各種功能
"""
print("=== Session會話管理演示 ===")
# 1. 基本Session使用
print("\n1. 基本Session使用:")
# 創建Session對象
session = requests.Session()
# 設置Session級別的請求頭
session.headers.update({
'User-Agent': 'MyApp/1.0',
'Accept': 'application/json'
})
# 使用Session發送請求
response1 = session.get('https://httpbin.org/get')
print(f"第一次請求狀態碼: {response1.status_code}")
print(f"User-Agent: {response1.json()['headers'].get('User-Agent')}")
# Session會保持設置的頭部
response2 = session.get('https://httpbin.org/headers')
print(f"第二次請求User-Agent: {response2.json()['headers'].get('User-Agent')}")
# 2. Cookie持久化
print("\n2. Cookie持久化演示:")
# 創建新的Session
cookie_session = requests.Session()
# 第一次請求設置cookie
response = cookie_session.get('https://httpbin.org/cookies/set/session_id/abc123')
print(f"設置Cookie後的狀態碼: {response.status_code}")
# 查看Session中的cookies
print(f"Session中的Cookies: {dict(cookie_session.cookies)}")
# 第二次請求會自動攜帶cookie
response = cookie_session.get('https://httpbin.org/cookies')
cookies_data = response.json()
print(f"服務器接收到的Cookies: {cookies_data.get('cookies', {})}")
# 3. 連接池和性能優化
print("\n3. 連接池性能對比:")
def test_without_session(num_requests=5):
"""不使用Session的請求"""
start_time = time.time()
for i in range(num_requests):
response = requests.get('https://httpbin.org/get')
if response.status_code != 200:
print(f"請求 {i+1} 失敗")
end_time = time.time()
return end_time - start_time
def test_with_session(num_requests=5):
"""使用Session的請求"""
start_time = time.time()
session = requests.Session()
for i in range(num_requests):
response = session.get('https://httpbin.org/get')
if response.status_code != 200:
print(f"請求 {i+1} 失敗")
session.close()
end_time = time.time()
return end_time - start_time
print("\n性能測試 (5次請求):")
time_without_session = test_without_session()
time_with_session = test_with_session()
print(f"不使用Session: {time_without_session:.3f}秒")
print(f"使用Session: {time_with_session:.3f}秒")
print(f"性能提升: {((time_without_session - time_with_session) / time_without_session * 100):.1f}%")
# 4. Session配置和自定義
print("\n4. Session配置:")
# 創建自定義配置的Session
custom_session = requests.Session()
    # 注意:requests的Session沒有全局超時設置,給Session對象賦值timeout屬性並不會生效,
    # 超時需要在每次請求時通過timeout參數傳入(或自行封裝請求方法/定製HTTPAdapter統一處理)
# 設置默認參數
custom_session.params = {'api_key': 'your-api-key'}
# 設置默認頭部
custom_session.headers.update({
'User-Agent': 'CustomBot/2.0',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Connection': 'keep-alive'
})
# 發送請求
    response = custom_session.get('https://httpbin.org/get', params={'extra': 'param'}, timeout=10)
if response.status_code == 200:
data = response.json()
print(f"最終URL: {response.url}")
print(f"合併後的參數: {data.get('args', {})}")
print(f"請求頭: {data.get('headers', {}).get('User-Agent')}")
# 5. Session的請求鉤子
print("\n5. 請求鉤子演示:")
def log_request_hook(response, *args, **kwargs):
"""請求日誌鉤子"""
print(f"[鉤子] 請求: {response.request.method} {response.url}")
print(f"[鉤子] 狀態碼: {response.status_code}")
print(f"[鉤子] 響應時間: {response.elapsed.total_seconds():.3f}秒")
# 創建帶鉤子的Session
hook_session = requests.Session()
hook_session.hooks['response'].append(log_request_hook)
# 發送請求,鉤子會自動執行
print("\n發送帶鉤子的請求:")
response = hook_session.get('https://httpbin.org/delay/1')
# 6. Session上下文管理
print("\n6. Session上下文管理:")
# 使用with語句自動管理Session生命週期
with requests.Session() as s:
s.headers.update({'User-Agent': 'ContextManager/1.0'})
response = s.get('https://httpbin.org/get')
print(f"上下文管理器請求狀態: {response.status_code}")
print(f"User-Agent: {response.json()['headers'].get('User-Agent')}")
# Session會自動關閉
# 7. Session錯誤處理
print("\n7. Session錯誤處理:")
error_session = requests.Session()
# 設置重試適配器
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
retry_strategy = Retry(
total=3, # 總重試次數
backoff_factor=1, # 重試間隔
status_forcelist=[429, 500, 502, 503, 504], # 需要重試的狀態碼
)
adapter = HTTPAdapter(max_retries=retry_strategy)
error_session.mount("http://", adapter)
error_session.mount("https://", adapter)
try:
# 測試重試機制
response = error_session.get('https://httpbin.org/status/500', timeout=5)
print(f"重試後狀態碼: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"請求最終失敗: {e}")
# 8. Session狀態管理
print("\n8. Session狀態管理:")
state_session = requests.Session()
# 模擬登錄流程
login_data = {
'username': 'testuser',
'password': 'testpass'
}
# 第一步:獲取登錄頁面(可能包含CSRF token)
login_page = state_session.get('https://httpbin.org/get')
print(f"獲取登錄頁面: {login_page.status_code}")
# 第二步:提交登錄信息
login_response = state_session.post('https://httpbin.org/post', data=login_data)
print(f"登錄請求: {login_response.status_code}")
# 第三步:訪問需要認證的頁面
protected_response = state_session.get('https://httpbin.org/get')
print(f"訪問受保護頁面: {protected_response.status_code}")
# Session會自動維護整個會話狀態
print(f"會話中的Cookie數量: {len(state_session.cookies)}")
# 運行Session演示
if __name__ == "__main__":
session_management_demo()
身份驗證¶
Requests支持多種身份驗證方式,包括基本認證、摘要認證、OAuth等。
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
import base64
import hashlib
import time
def authentication_demo():
"""
演示各種身份驗證方式
"""
print("=== 身份驗證演示 ===")
# 1. HTTP基本認證 (Basic Authentication)
print("\n1. HTTP基本認證:")
# 方法1: 使用auth參數
response = requests.get(
'https://httpbin.org/basic-auth/user/pass',
auth=('user', 'pass')
)
print(f"基本認證狀態碼: {response.status_code}")
if response.status_code == 200:
print(f"認證成功: {response.json()}")
# 方法2: 使用HTTPBasicAuth類
response2 = requests.get(
'https://httpbin.org/basic-auth/testuser/testpass',
auth=HTTPBasicAuth('testuser', 'testpass')
)
print(f"HTTPBasicAuth狀態碼: {response2.status_code}")
# 方法3: 手動設置Authorization頭
credentials = base64.b64encode(b'user:pass').decode('ascii')
headers = {'Authorization': f'Basic {credentials}'}
response3 = requests.get(
'https://httpbin.org/basic-auth/user/pass',
headers=headers
)
print(f"手動設置頭部狀態碼: {response3.status_code}")
# 2. HTTP摘要認證 (Digest Authentication)
print("\n2. HTTP摘要認證:")
try:
response = requests.get(
'https://httpbin.org/digest-auth/auth/user/pass',
auth=HTTPDigestAuth('user', 'pass')
)
print(f"摘要認證狀態碼: {response.status_code}")
if response.status_code == 200:
print(f"摘要認證成功: {response.json()}")
except Exception as e:
print(f"摘要認證失敗: {e}")
# 3. Bearer Token認證
print("\n3. Bearer Token認證:")
# 模擬JWT token
token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
headers = {'Authorization': f'Bearer {token}'}
response = requests.get('https://httpbin.org/bearer', headers=headers)
print(f"Bearer Token狀態碼: {response.status_code}")
if response.status_code == 200:
print(f"Token認證成功: {response.json()}")
# 4. API Key認證
print("\n4. API Key認證:")
# 方法1: 在URL參數中
api_key = "your-api-key-here"
response = requests.get(
'https://httpbin.org/get',
params={'api_key': api_key}
)
print(f"URL參數API Key: {response.json()['args']}")
# 方法2: 在請求頭中
headers = {'X-API-Key': api_key}
response2 = requests.get('https://httpbin.org/get', headers=headers)
print(f"請求頭API Key: {response2.json()['headers'].get('X-Api-Key')}")
# 5. 自定義認證類
print("\n5. 自定義認證類:")
class CustomAuth(requests.auth.AuthBase):
"""自定義認證類"""
def __init__(self, api_key, secret_key):
self.api_key = api_key
self.secret_key = secret_key
def __call__(self, r):
# 生成時間戳
timestamp = str(int(time.time()))
# 生成簽名
string_to_sign = f"{r.method}\n{r.url}\n{timestamp}"
signature = hashlib.sha256(
(string_to_sign + self.secret_key).encode('utf-8')
).hexdigest()
# 添加認證頭
r.headers['X-API-Key'] = self.api_key
r.headers['X-Timestamp'] = timestamp
r.headers['X-Signature'] = signature
return r
# 使用自定義認證
custom_auth = CustomAuth('my-api-key', 'my-secret-key')
response = requests.get('https://httpbin.org/get', auth=custom_auth)
if response.status_code == 200:
headers = response.json()['headers']
print(f"自定義認證頭部:")
print(f" X-API-Key: {headers.get('X-Api-Key')}")
print(f" X-Timestamp: {headers.get('X-Timestamp')}")
print(f" X-Signature: {headers.get('X-Signature', '')[:20]}...")
# 6. OAuth 2.0 模擬
print("\n6. OAuth 2.0 模擬:")
def oauth2_flow_simulation():
"""模擬OAuth 2.0授權流程"""
# 第一步: 獲取授權碼 (實際應用中用戶會被重定向到授權服務器)
auth_url = "https://httpbin.org/get"
auth_params = {
'response_type': 'code',
'client_id': 'your-client-id',
'redirect_uri': 'https://yourapp.com/callback',
'scope': 'read write',
'state': 'random-state-string'
}
print(f"授權URL: {auth_url}?{'&'.join([f'{k}={v}' for k, v in auth_params.items()])}")
# 第二步: 使用授權碼獲取訪問令牌
token_data = {
'grant_type': 'authorization_code',
'code': 'received-auth-code',
'redirect_uri': 'https://yourapp.com/callback',
'client_id': 'your-client-id',
'client_secret': 'your-client-secret'
}
# 模擬獲取token
token_response = requests.post('https://httpbin.org/post', data=token_data)
print(f"Token請求狀態: {token_response.status_code}")
# 第三步: 使用訪問令牌訪問API
access_token = "mock-access-token-12345"
api_headers = {'Authorization': f'Bearer {access_token}'}
api_response = requests.get('https://httpbin.org/get', headers=api_headers)
print(f"API訪問狀態: {api_response.status_code}")
return access_token
oauth_token = oauth2_flow_simulation()
# 7. 會話級認證
print("\n7. 會話級認證:")
# 創建帶認證的Session
auth_session = requests.Session()
auth_session.auth = ('session_user', 'session_pass')
# 所有通過這個Session的請求都會自動包含認證信息
response1 = auth_session.get('https://httpbin.org/basic-auth/session_user/session_pass')
print(f"會話認證請求1: {response1.status_code}")
response2 = auth_session.get('https://httpbin.org/basic-auth/session_user/session_pass')
print(f"會話認證請求2: {response2.status_code}")
# 8. 認證錯誤處理
print("\n8. 認證錯誤處理:")
def handle_auth_errors():
"""處理認證相關錯誤"""
# 測試錯誤的認證信息
try:
response = requests.get(
'https://httpbin.org/basic-auth/user/pass',
auth=('wrong_user', 'wrong_pass'),
timeout=5
)
if response.status_code == 401:
print("✗ 認證失敗: 用戶名或密碼錯誤")
print(f" WWW-Authenticate: {response.headers.get('WWW-Authenticate')}")
elif response.status_code == 403:
print("✗ 訪問被拒絕: 權限不足")
else:
print(f"認證狀態: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"認證請求異常: {e}")
handle_auth_errors()
# 運行認證演示
if __name__ == "__main__":
authentication_demo()
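在實際項目中,Token通常綁定到整個會話,避免每次請求都手動拼接請求頭。下面是一個簡單的示意(BearerAuth爲演示用的自定義類,token爲佔位值),結合requests.auth.AuthBase與Session.auth,讓該會話的所有請求自動攜帶Bearer Token:
import requests

class BearerAuth(requests.auth.AuthBase):
    """把Bearer Token附加到每個請求上(演示用自定義類)"""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers['Authorization'] = f'Bearer {self.token}'
        return r

if __name__ == "__main__":
    session = requests.Session()
    session.auth = BearerAuth('mock-access-token-12345')  # 佔位token
    response = session.get('https://httpbin.org/bearer')
    print(f"狀態碼: {response.status_code}")
    if response.status_code == 200:
        print(f"服務器確認的token信息: {response.json()}")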
代理設置和SSL配置¶
在爬蟲開發中,代理和SSL配置是非常重要的功能,可以幫助我們繞過網絡限制和確保安全通信。
import requests
import ssl
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context
def proxy_and_ssl_demo():
"""
演示代理設置和SSL配置
"""
print("=== 代理設置和SSL配置演示 ===")
# 1. HTTP代理設置
print("\n1. HTTP代理設置:")
# 基本代理設置
proxies = {
'http': 'http://proxy.example.com:8080',
'https': 'https://proxy.example.com:8080'
}
# 注意:這裏使用示例代理,實際運行時需要替換爲真實代理
print(f"配置的代理: {proxies}")
# 帶認證的代理
auth_proxies = {
'http': 'http://username:password@proxy.example.com:8080',
'https': 'https://username:password@proxy.example.com:8080'
}
print(f"帶認證的代理: {auth_proxies}")
# 2. SOCKS代理設置
print("\n2. SOCKS代理設置:")
# 需要安裝: pip install requests[socks]
socks_proxies = {
'http': 'socks5://127.0.0.1:1080',
'https': 'socks5://127.0.0.1:1080'
}
print(f"SOCKS代理配置: {socks_proxies}")
# 3. 代理輪換
print("\n3. 代理輪換演示:")
import random
proxy_list = [
{'http': 'http://proxy1.example.com:8080', 'https': 'https://proxy1.example.com:8080'},
{'http': 'http://proxy2.example.com:8080', 'https': 'https://proxy2.example.com:8080'},
{'http': 'http://proxy3.example.com:8080', 'https': 'https://proxy3.example.com:8080'}
]
def get_random_proxy():
"""獲取隨機代理"""
return random.choice(proxy_list)
# 模擬使用不同代理發送請求
for i in range(3):
proxy = get_random_proxy()
print(f"請求 {i+1} 使用代理: {proxy['http']}")
# 實際請求代碼:
# response = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10)
# 4. 代理驗證和測試
print("\n4. 代理驗證:")
def test_proxy(proxy_dict, test_url='https://httpbin.org/ip'):
"""測試代理是否可用"""
try:
response = requests.get(
test_url,
proxies=proxy_dict,
timeout=10
)
if response.status_code == 200:
ip_info = response.json()
print(f"✓ 代理可用")
print(f" 出口IP: {ip_info.get('origin')}")
print(f" 響應時間: {response.elapsed.total_seconds():.3f}秒")
return True
else:
print(f"✗ 代理響應異常: {response.status_code}")
return False
except requests.exceptions.ProxyError:
print("✗ 代理連接失敗")
return False
except requests.exceptions.Timeout:
print("✗ 代理連接超時")
return False
except requests.exceptions.RequestException as e:
print(f"✗ 代理請求異常: {e}")
return False
# 測試直連(無代理)
print("\n測試直連:")
try:
direct_response = requests.get('https://httpbin.org/ip', timeout=10)
if direct_response.status_code == 200:
ip_info = direct_response.json()
print(f"✓ 直連成功")
print(f" 本地IP: {ip_info.get('origin')}")
except Exception as e:
print(f"✗ 直連失敗: {e}")
# 5. SSL配置
print("\n5. SSL配置演示:")
# 禁用SSL驗證(不推薦用於生產環境)
print("\n禁用SSL驗證:")
try:
response = requests.get(
'https://httpbin.org/get',
verify=False # 禁用SSL證書驗證
)
print(f"✓ 禁用SSL驗證請求成功: {response.status_code}")
except Exception as e:
print(f"✗ SSL請求失敗: {e}")
# 自定義CA證書
print("\n自定義CA證書:")
# 指定CA證書文件路徑
# response = requests.get('https://httpbin.org/get', verify='/path/to/ca-bundle.crt')
print("可以通過verify參數指定CA證書文件路徑")
# 客戶端證書認證
print("\n客戶端證書認證:")
# cert參數可以是證書文件路徑的字符串,或者是(cert, key)元組
# response = requests.get('https://httpbin.org/get', cert=('/path/to/client.cert', '/path/to/client.key'))
print("可以通過cert參數指定客戶端證書")
# 6. 自定義SSL上下文
print("\n6. 自定義SSL上下文:")
class SSLAdapter(HTTPAdapter):
"""自定義SSL適配器"""
def __init__(self, ssl_context=None, **kwargs):
self.ssl_context = ssl_context
super().__init__(**kwargs)
def init_poolmanager(self, *args, **kwargs):
kwargs['ssl_context'] = self.ssl_context
return super().init_poolmanager(*args, **kwargs)
# 創建自定義SSL上下文
ssl_context = create_urllib3_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
# 使用自定義SSL適配器
session = requests.Session()
session.mount('https://', SSLAdapter(ssl_context))
try:
response = session.get('https://httpbin.org/get')
print(f"✓ 自定義SSL上下文請求成功: {response.status_code}")
except Exception as e:
print(f"✗ 自定義SSL請求失敗: {e}")
# 7. 綜合配置示例
print("\n7. 綜合配置示例:")
def create_secure_session(proxy=None, verify_ssl=True, client_cert=None):
"""創建安全配置的Session"""
session = requests.Session()
# 設置代理
if proxy:
session.proxies.update(proxy)
# SSL配置
session.verify = verify_ssl
if client_cert:
session.cert = client_cert
# 設置超時(Session沒有默認超時機制,這裏僅作記錄,實際請求時仍需傳入timeout)
session.timeout = 30
# 設置重試
from urllib3.util.retry import Retry
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount('http://', adapter)
session.mount('https://', adapter)
return session
# 創建配置好的Session
secure_session = create_secure_session(
# proxy={'http': 'http://proxy.example.com:8080'},
verify_ssl=True
)
try:
response = secure_session.get('https://httpbin.org/get')
print(f"✓ 安全Session請求成功: {response.status_code}")
print(f" SSL驗證: {'啓用' if secure_session.verify else '禁用'}")
print(f" 代理設置: {secure_session.proxies if secure_session.proxies else '無'}")
except Exception as e:
print(f"✗ 安全Session請求失敗: {e}")
# 8. 環境變量代理配置
print("\n8. 環境變量代理配置:")
import os
# Requests會自動讀取這些環境變量
env_vars = {
'HTTP_PROXY': 'http://proxy.example.com:8080',
'HTTPS_PROXY': 'https://proxy.example.com:8080',
'NO_PROXY': 'localhost,127.0.0.1,.local'
}
print("可以設置的環境變量:")
for var, value in env_vars.items():
print(f" {var}={value}")
# 檢查當前環境變量
current_proxy = os.environ.get('HTTP_PROXY') or os.environ.get('http_proxy')
if current_proxy:
print(f"當前HTTP代理: {current_proxy}")
else:
print("未設置HTTP代理環境變量")
# 運行代理和SSL演示
if __name__ == "__main__":
proxy_and_ssl_demo()
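在真正開始批量爬取之前,通常會先把候選代理逐個驗證一遍,只保留可用的代理組成代理池。下面把前面test_proxy的思路整理成一個簡單示意(代理地址均爲佔位示例,使用時需替換爲真實代理):
import requests

def build_proxy_pool(candidates, test_url='https://httpbin.org/ip', timeout=10):
    """逐個測試候選代理,返回可用代理列表(示意實現)"""
    usable = []
    for proxy in candidates:
        proxies = {'http': proxy, 'https': proxy}
        try:
            response = requests.get(test_url, proxies=proxies, timeout=timeout)
            if response.status_code == 200:
                usable.append(proxy)
                print(f"✓ 可用: {proxy} -> 出口IP {response.json().get('origin')}")
        except requests.exceptions.RequestException as e:
            print(f"✗ 不可用: {proxy} ({type(e).__name__})")
    return usable

if __name__ == "__main__":
    # 佔位示例地址,實際使用時替換爲真實代理
    pool = build_proxy_pool([
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ])
    print(f"可用代理數量: {len(pool)}")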
Cookie處理¶
Cookie是Web應用中維護狀態的重要機制,Requests提供了強大的Cookie處理功能。
import requests
from http.cookies import SimpleCookie
import time
from datetime import datetime, timedelta
def cookie_handling_demo():
"""
演示Cookie處理的各種功能
"""
print("=== Cookie處理演示 ===")
# 1. 基本Cookie操作
print("\n1. 基本Cookie操作:")
# 發送帶Cookie的請求
cookies = {'session_id': 'abc123', 'user_pref': 'dark_mode'}
response = requests.get('https://httpbin.org/cookies', cookies=cookies)
if response.status_code == 200:
received_cookies = response.json().get('cookies', {})
print(f"發送的Cookies: {cookies}")
print(f"服務器接收的Cookies: {received_cookies}")
# 2. 從響應中獲取Cookie
print("\n2. 從響應中獲取Cookie:")
# 請求設置Cookie的URL
response = requests.get('https://httpbin.org/cookies/set/test_cookie/test_value')
print(f"響應狀態碼: {response.status_code}")
print(f"響應中的Cookies: {dict(response.cookies)}")
# 查看Cookie詳細信息
for cookie in response.cookies:
print(f"Cookie詳情:")
print(f" 名稱: {cookie.name}")
print(f" 值: {cookie.value}")
print(f" 域: {cookie.domain}")
print(f" 路徑: {cookie.path}")
print(f" 過期時間: {cookie.expires}")
print(f" 安全標誌: {cookie.secure}")
print(f" HttpOnly: {cookie.has_nonstandard_attr('HttpOnly')}")
# 3. Cookie持久化
print("\n3. Cookie持久化演示:")
# 創建Session來自動管理Cookie
session = requests.Session()
# 第一次請求,服務器設置Cookie
response1 = session.get('https://httpbin.org/cookies/set/persistent_cookie/persistent_value')
print(f"第一次請求狀態: {response1.status_code}")
print(f"Session中的Cookies: {dict(session.cookies)}")
# 第二次請求,自動攜帶Cookie
response2 = session.get('https://httpbin.org/cookies')
if response2.status_code == 200:
cookies_data = response2.json()
print(f"第二次請求攜帶的Cookies: {cookies_data.get('cookies', {})}")
# 4. 手動Cookie管理
print("\n4. 手動Cookie管理:")
from requests.cookies import RequestsCookieJar
# 創建Cookie容器
jar = RequestsCookieJar()
# 添加Cookie
jar.set('custom_cookie', 'custom_value', domain='httpbin.org', path='/')
jar.set('another_cookie', 'another_value', domain='httpbin.org', path='/')
# 使用自定義Cookie容器
response = requests.get('https://httpbin.org/cookies', cookies=jar)
if response.status_code == 200:
print(f"自定義Cookie容器: {dict(jar)}")
print(f"服務器接收: {response.json().get('cookies', {})}")
# 5. Cookie的高級屬性
print("\n5. Cookie高級屬性演示:")
def create_advanced_cookie():
"""創建帶高級屬性的Cookie"""
jar = RequestsCookieJar()
# 設置帶過期時間的Cookie
expire_time = int(time.time()) + 3600 # 1小時後過期
jar.set(
'session_token',
'token_12345',
domain='httpbin.org',
path='/',
expires=expire_time,
secure=True, # 只在HTTPS下傳輸
rest={'HttpOnly': True} # 防止JavaScript訪問
)
# 設置SameSite屬性
jar.set(
'csrf_token',
'csrf_abc123',
domain='httpbin.org',
path='/',
rest={'SameSite': 'Strict'}
)
return jar
advanced_jar = create_advanced_cookie()
print(f"高級Cookie容器: {dict(advanced_jar)}")
# 6. Cookie文件操作
print("\n6. Cookie文件操作:")
import pickle
import os
# 保存Cookie到文件
def save_cookies_to_file(session, filename):
"""保存Session的Cookie到文件"""
with open(filename, 'wb') as f:
pickle.dump(session.cookies, f)
print(f"Cookies已保存到: {filename}")
# 從文件加載Cookie
def load_cookies_from_file(session, filename):
"""從文件加載Cookie到Session"""
if os.path.exists(filename):
with open(filename, 'rb') as f:
session.cookies.update(pickle.load(f))
print(f"Cookies已從文件加載: {filename}")
return True
return False
# 演示Cookie文件操作
cookie_session = requests.Session()
# 設置一些Cookie
cookie_session.get('https://httpbin.org/cookies/set/file_cookie/file_value')
# 保存到文件
cookie_file = 'session_cookies.pkl'
save_cookies_to_file(cookie_session, cookie_file)
# 創建新Session並加載Cookie
new_session = requests.Session()
if load_cookies_from_file(new_session, cookie_file):
response = new_session.get('https://httpbin.org/cookies')
if response.status_code == 200:
print(f"加載的Cookies驗證: {response.json().get('cookies', {})}")
# 清理文件
if os.path.exists(cookie_file):
os.remove(cookie_file)
print(f"已清理Cookie文件: {cookie_file}")
# 7. Cookie域和路徑管理
print("\n7. Cookie域和路徑管理:")
def demonstrate_cookie_scope():
"""演示Cookie的作用域"""
jar = RequestsCookieJar()
# 設置不同域和路徑的Cookie
jar.set('global_cookie', 'global_value', domain='.example.com', path='/')
jar.set('api_cookie', 'api_value', domain='api.example.com', path='/v1/')
jar.set('admin_cookie', 'admin_value', domain='admin.example.com', path='/admin/')
print("Cookie作用域演示:")
for cookie in jar:
print(f" {cookie.name}: 域={cookie.domain}, 路徑={cookie.path}")
return jar
scope_jar = demonstrate_cookie_scope()
# 8. Cookie安全性
print("\n8. Cookie安全性演示:")
def create_secure_cookies():
"""創建安全的Cookie設置"""
jar = RequestsCookieJar()
# 安全Cookie設置
security_settings = {
'session_id': {
'value': 'secure_session_123',
'secure': True, # 只在HTTPS傳輸
'httponly': True, # 防止XSS攻擊
'samesite': 'Strict', # 防止CSRF攻擊
'expires': int(time.time()) + 1800 # 30分鐘過期
},
'csrf_token': {
'value': 'csrf_token_456',
'secure': True,
'samesite': 'Strict',
'expires': int(time.time()) + 3600 # 1小時過期
}
}
for name, settings in security_settings.items():
jar.set(
name,
settings['value'],
domain='httpbin.org',
path='/',
expires=settings.get('expires'),
secure=settings.get('secure', False),
rest={
'HttpOnly': settings.get('httponly', False),
'SameSite': settings.get('samesite', 'Lax')
}
)
print("安全Cookie配置:")
for cookie in jar:
print(f" {cookie.name}: 安全={cookie.secure}")
return jar
secure_jar = create_secure_cookies()
# 9. Cookie調試和分析
print("\n9. Cookie調試和分析:")
def analyze_cookies(response):
"""分析響應中的Cookie"""
print("Cookie分析報告:")
if not response.cookies:
print(" 無Cookie")
return
for cookie in response.cookies:
print(f"\n Cookie: {cookie.name}")
print(f" 值: {cookie.value}")
print(f" 域: {cookie.domain or '未設置'}")
print(f" 路徑: {cookie.path or '/'}")
if cookie.expires:
expire_date = datetime.fromtimestamp(cookie.expires)
print(f" 過期時間: {expire_date}")
# 檢查是否即將過期
if expire_date < datetime.now() + timedelta(hours=1):
print(f" ⚠️ 警告: Cookie將在1小時內過期")
else:
print(f" 過期時間: 會話結束")
print(f" 安全標誌: {cookie.secure}")
print(f" 大小: {len(cookie.value)}字節")
# 檢查Cookie大小
if len(cookie.value) > 4000:
print(f" ⚠️ 警告: Cookie過大,可能被截斷")
# 分析一個帶Cookie的響應
test_response = requests.get('https://httpbin.org/cookies/set/analysis_cookie/test_analysis_value')
analyze_cookies(test_response)
# 10. Cookie錯誤處理
print("\n10. Cookie錯誤處理:")
def handle_cookie_errors():
"""處理Cookie相關錯誤"""
try:
# 嘗試設置無效的Cookie
jar = RequestsCookieJar()
# 測試各種邊界情況
test_cases = [
('valid_cookie', 'valid_value'),
('', 'empty_name'), # 空名稱
('space cookie', 'space_in_name'), # 名稱包含空格
('valid_name', ''), # 空值
('long_cookie', 'x' * 5000), # 超長值
]
for name, value in test_cases:
try:
jar.set(name, value, domain='httpbin.org')
print(f"✓ 成功設置Cookie: {name[:20]}...")
except Exception as e:
print(f"✗ 設置Cookie失敗 ({name[:20]}...): {e}")
# 測試Cookie發送
response = requests.get('https://httpbin.org/cookies', cookies=jar, timeout=5)
print(f"Cookie發送測試: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"Cookie請求異常: {e}")
except Exception as e:
print(f"Cookie處理異常: {e}")
handle_cookie_errors()
# 運行Cookie演示
if __name__ == "__main__":
cookie_handling_demo()
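除了上面用pickle序列化之外,標準庫http.cookiejar提供的MozillaCookieJar還能把Cookie保存成通用的cookies.txt文本格式(Netscape格式),便於與curl、wget等工具互通。下面是一個簡單示意(文件名爲演示用的佔位值):
import requests
from http.cookiejar import MozillaCookieJar

def mozilla_cookiejar_demo(filename='cookies.txt'):
    """用MozillaCookieJar把Session的Cookie存成cookies.txt格式(示意實現)"""
    session = requests.Session()
    session.cookies = MozillaCookieJar(filename)  # 替換默認的Cookie容器
    # 讓服務器設置一個Cookie
    session.get('https://httpbin.org/cookies/set/demo_cookie/demo_value')
    # 保存到文本文件(ignore_discard=True才會保存會話Cookie)
    session.cookies.save(ignore_discard=True, ignore_expires=True)
    # 新的Session從文件加載同一份Cookie
    new_session = requests.Session()
    jar = MozillaCookieJar(filename)
    jar.load(ignore_discard=True, ignore_expires=True)
    new_session.cookies = jar
    response = new_session.get('https://httpbin.org/cookies')
    print(f"加載後發送的Cookies: {response.json().get('cookies', {})}")

if __name__ == "__main__":
    mozilla_cookiejar_demo()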
文件上傳和下載¶
文件傳輸是網絡爬蟲和自動化中的重要功能,Requests提供了簡單而強大的文件處理能力。
import requests
import os
import io
from pathlib import Path
import mimetypes
import hashlib
from tqdm import tqdm
def file_transfer_demo():
"""
演示文件上傳和下載功能
"""
print("=== 文件上傳和下載演示 ===")
# 1. 基本文件上傳
print("\n1. 基本文件上傳:")
# 創建測試文件
test_file_content = "這是一個測試文件\nTest file content\n測試數據123"
test_file_path = "test_upload.txt"
with open(test_file_path, 'w', encoding='utf-8') as f:
f.write(test_file_content)
# 方法1: 使用files參數上傳
with open(test_file_path, 'rb') as f:
files = {'file': f}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
result = response.json()
print(f"文件上傳成功")
print(f"上傳的文件信息: {result.get('files', {})}")
# 2. 高級文件上傳
print("\n2. 高級文件上傳:")
# 指定文件名和MIME類型
with open(test_file_path, 'rb') as f:
files = {
'document': ('custom_name.txt', f, 'text/plain'),
'metadata': ('info.json', io.StringIO('{"type": "document"}'), 'application/json')
}
# 同時發送表單數據
data = {
'title': '測試文檔',
'description': '這是一個測試上傳',
'category': 'test'
}
response = requests.post('https://httpbin.org/post', files=files, data=data)
if response.status_code == 200:
result = response.json()
print(f"高級上傳成功")
print(f"表單數據: {result.get('form', {})}")
print(f"文件數據: {list(result.get('files', {}).keys())}")
# 3. 多文件上傳
print("\n3. 多文件上傳:")
# 創建多個測試文件
test_files = []
for i in range(3):
filename = f"test_file_{i+1}.txt"
content = f"這是測試文件 {i+1}\nFile {i+1} content\n"
with open(filename, 'w', encoding='utf-8') as f:
f.write(content)
test_files.append(filename)
# 上傳多個文件
files = []
for filename in test_files:
files.append(('files', (filename, open(filename, 'rb'), 'text/plain')))
try:
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
result = response.json()
print(f"多文件上傳成功")
print(f"上傳文件數量: {len(result.get('files', {}))}")
finally:
# 關閉文件句柄
for _, (_, file_obj, _) in files:
file_obj.close()
# 4. 內存文件上傳
print("\n4. 內存文件上傳:")
# 創建內存中的文件
memory_file = io.BytesIO()
memory_file.write("內存中的文件內容\nMemory file content".encode('utf-8'))
memory_file.seek(0) # 重置指針到開始
files = {'memory_file': ('memory.txt', memory_file, 'text/plain')}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
print(f"內存文件上傳成功")
memory_file.close()
# 5. 文件下載基礎
print("\n5. 文件下載基礎:")
# 下載小文件
download_url = 'https://httpbin.org/json'
response = requests.get(download_url)
if response.status_code == 200:
# 保存到文件
download_filename = 'downloaded_data.json'
with open(download_filename, 'wb') as f:
f.write(response.content)
print(f"文件下載成功: {download_filename}")
print(f"文件大小: {len(response.content)}字節")
print(f"Content-Type: {response.headers.get('content-type')}")
# 6. 大文件下載(流式下載)
print("\n6. 大文件流式下載:")
def download_large_file(url, filename, chunk_size=8192):
"""流式下載大文件"""
try:
with requests.get(url, stream=True) as response:
response.raise_for_status()
# 獲取文件大小
total_size = int(response.headers.get('content-length', 0))
with open(filename, 'wb') as f:
if total_size > 0:
# 使用進度條
with tqdm(total=total_size, unit='B', unit_scale=True, desc=filename) as pbar:
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
pbar.update(len(chunk))
else:
# 無法獲取文件大小時
downloaded = 0
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
downloaded += len(chunk)
print(f"\r已下載: {downloaded}字節", end='', flush=True)
print() # 換行
print(f"\n✓ 文件下載完成: {filename}")
return True
except requests.exceptions.RequestException as e:
print(f"✗ 下載失敗: {e}")
return False
# 演示流式下載(使用較小的文件作爲示例)
large_file_url = 'https://httpbin.org/bytes/10240' # 10KB測試文件
if download_large_file(large_file_url, 'large_download.bin'):
file_size = os.path.getsize('large_download.bin')
print(f"下載文件大小: {file_size}字節")
# 7. 斷點續傳下載
print("\n7. 斷點續傳下載:")
def resume_download(url, filename, chunk_size=8192):
"""支持斷點續傳的下載"""
# 檢查本地文件是否存在
resume_pos = 0
if os.path.exists(filename):
resume_pos = os.path.getsize(filename)
print(f"發現本地文件,從位置 {resume_pos} 繼續下載")
# 設置Range頭進行斷點續傳
headers = {'Range': f'bytes={resume_pos}-'} if resume_pos > 0 else {}
try:
response = requests.get(url, headers=headers, stream=True)
# 檢查服務器是否支持斷點續傳
if resume_pos > 0 and response.status_code != 206:
print("服務器不支持斷點續傳,重新下載")
resume_pos = 0
response = requests.get(url, stream=True)
response.raise_for_status()
# 獲取總文件大小
if 'content-range' in response.headers:
total_size = int(response.headers['content-range'].split('/')[-1])
else:
total_size = int(response.headers.get('content-length', 0)) + resume_pos
# 打開文件(追加模式如果是續傳)
mode = 'ab' if resume_pos > 0 else 'wb'
with open(filename, mode) as f:
downloaded = resume_pos
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
downloaded += len(chunk)
if total_size > 0:
progress = (downloaded / total_size) * 100
print(f"\r下載進度: {progress:.1f}% ({downloaded}/{total_size})", end='', flush=True)
print(f"\n✓ 下載完成: {filename}")
return True
except requests.exceptions.RequestException as e:
print(f"✗ 下載失敗: {e}")
return False
# 演示斷點續傳(模擬)
resume_url = 'https://httpbin.org/bytes/5120' # 5KB測試文件
resume_filename = 'resume_download.bin'
# 先下載一部分(模擬中斷)
try:
response = requests.get(resume_url, stream=True)
with open(resume_filename, 'wb') as f:
for i, chunk in enumerate(response.iter_content(chunk_size=1024)):
if i >= 2: # 只下載前2KB
break
f.write(chunk)
print(f"模擬下載中斷,已下載: {os.path.getsize(resume_filename)}字節")
except requests.exceptions.RequestException:
pass
# 繼續下載
resume_download(resume_url, resume_filename)
# 8. 文件完整性驗證
print("\n8. 文件完整性驗證:")
def verify_file_integrity(filename, expected_hash=None, hash_algorithm='md5'):
"""驗證文件完整性"""
if not os.path.exists(filename):
print(f"✗ 文件不存在: {filename}")
return False
# 計算文件哈希
hash_obj = hashlib.new(hash_algorithm)
with open(filename, 'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
hash_obj.update(chunk)
file_hash = hash_obj.hexdigest()
print(f"文件 {filename} 的{hash_algorithm.upper()}哈希: {file_hash}")
if expected_hash:
if file_hash == expected_hash:
print(f"✓ 文件完整性驗證通過")
return True
else:
print(f"✗ 文件完整性驗證失敗")
print(f" 期望: {expected_hash}")
print(f" 實際: {file_hash}")
return False
return True
# 驗證下載的文件
for filename in ['downloaded_data.json', 'large_download.bin']:
if os.path.exists(filename):
verify_file_integrity(filename)
# 9. 自動MIME類型檢測
print("\n9. 自動MIME類型檢測:")
def upload_with_auto_mime(filename):
"""自動檢測MIME類型並上傳"""
if not os.path.exists(filename):
print(f"文件不存在: {filename}")
return
# 自動檢測MIME類型
mime_type, _ = mimetypes.guess_type(filename)
if mime_type is None:
mime_type = 'application/octet-stream' # 默認二進制類型
print(f"文件: {filename}")
print(f"檢測到的MIME類型: {mime_type}")
with open(filename, 'rb') as f:
files = {'file': (filename, f, mime_type)}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
print(f"✓ 上傳成功")
else:
print(f"✗ 上傳失敗: {response.status_code}")
# 測試不同類型的文件
test_files_mime = ['test_upload.txt', 'downloaded_data.json']
for filename in test_files_mime:
if os.path.exists(filename):
upload_with_auto_mime(filename)
# 10. 清理測試文件
print("\n10. 清理測試文件:")
cleanup_files = [
test_file_path, 'downloaded_data.json', 'large_download.bin',
'resume_download.bin'
] + test_files
for filename in cleanup_files:
if os.path.exists(filename):
try:
os.remove(filename)
print(f"✓ 已刪除: {filename}")
except Exception as e:
print(f"✗ 刪除失敗 {filename}: {e}")
# 運行文件傳輸演示
if __name__ == "__main__":
file_transfer_demo()
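在流式下載大文件時,還可以邊寫入邊計算哈希,下載結束即可直接得到校驗值,省去事後再完整讀一遍文件的開銷。下面是一個簡單示意(URL使用httpbin的隨機字節接口作爲測試):
import hashlib
import requests

def download_with_hash(url, filename, chunk_size=8192, algorithm='sha256'):
    """流式下載的同時計算文件哈希(示意實現)"""
    hash_obj = hashlib.new(algorithm)
    with requests.get(url, stream=True, timeout=(5, 30)) as response:
        response.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)          # 寫入文件
                    hash_obj.update(chunk)  # 同步更新哈希
    return hash_obj.hexdigest()

if __name__ == "__main__":
    digest = download_with_hash('https://httpbin.org/bytes/10240', 'hash_demo.bin')
    print(f"下載完成,SHA-256: {digest}")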
超時和重試機制¶
在網絡請求中,超時和重試機制是確保程序穩定性的重要功能。
import requests
import time
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from functools import wraps
import logging
def timeout_and_retry_demo():
"""
演示超時和重試機制
"""
print("=== 超時和重試機制演示 ===")
# 1. 基本超時設置
print("\n1. 基本超時設置:")
# 連接超時和讀取超時
try:
# timeout=(連接超時, 讀取超時)
response = requests.get('https://httpbin.org/delay/2', timeout=(5, 10))
print(f"請求成功: {response.status_code}")
print(f"響應時間: {response.elapsed.total_seconds():.2f}秒")
except requests.exceptions.Timeout as e:
print(f"請求超時: {e}")
except requests.exceptions.RequestException as e:
print(f"請求異常: {e}")
# 2. 不同類型的超時
print("\n2. 不同類型的超時演示:")
def test_different_timeouts():
"""測試不同的超時設置"""
timeout_configs = [
("單一超時", 5), # 連接和讀取都是5秒
("分別設置", (3, 10)), # 連接3秒,讀取10秒
("只設置連接超時", (2, None)), # 只設置連接超時
]
for desc, timeout in timeout_configs:
try:
print(f"\n測試 {desc}: {timeout}")
start_time = time.time()
response = requests.get('https://httpbin.org/delay/1', timeout=timeout)
elapsed = time.time() - start_time
print(f" ✓ 成功: {response.status_code}, 耗時: {elapsed:.2f}秒")
except requests.exceptions.Timeout as e:
elapsed = time.time() - start_time
print(f" ✗ 超時: {elapsed:.2f}秒, {e}")
except Exception as e:
print(f" ✗ 異常: {e}")
test_different_timeouts()
# 3. 手動重試機制
print("\n3. 手動重試機制:")
def manual_retry(url, max_retries=3, delay=1, backoff=2):
"""手動實現重試機制"""
for attempt in range(max_retries + 1):
try:
print(f" 嘗試 {attempt + 1}/{max_retries + 1}: {url}")
response = requests.get(url, timeout=5)
# 檢查響應狀態
if response.status_code == 200:
print(f" ✓ 成功: {response.status_code}")
return response
elif response.status_code >= 500:
# 服務器錯誤,可以重試
print(f" 服務器錯誤 {response.status_code},準備重試")
raise requests.exceptions.RequestException(f"Server error: {response.status_code}")
else:
# 客戶端錯誤,不重試
print(f" 客戶端錯誤 {response.status_code},不重試")
return response
except (requests.exceptions.Timeout,
requests.exceptions.ConnectionError,
requests.exceptions.RequestException) as e:
print(f" ✗ 請求失敗: {e}")
if attempt < max_retries:
wait_time = delay * (backoff ** attempt)
print(f" 等待 {wait_time:.1f}秒 後重試...")
time.sleep(wait_time)
else:
print(f" 已達到最大重試次數,放棄")
raise
return None
# 測試手動重試
try:
response = manual_retry('https://httpbin.org/status/500', max_retries=2)
except Exception as e:
print(f"手動重試最終失敗: {e}")
# 4. 使用urllib3的重試策略
print("\n4. urllib3重試策略:")
def create_retry_session():
"""創建帶重試策略的Session"""
session = requests.Session()
# 定義重試策略
retry_strategy = Retry(
total=3, # 總重試次數
status_forcelist=[429, 500, 502, 503, 504], # 需要重試的狀態碼
allowed_methods=["HEAD", "GET", "OPTIONS"], # 允許重試的方法(舊版urllib3中該參數名爲method_whitelist)
backoff_factor=1, # 退避因子
raise_on_redirect=False,
raise_on_status=False
)
# 創建適配器
adapter = HTTPAdapter(max_retries=retry_strategy)
# 掛載適配器
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
# 使用重試Session
retry_session = create_retry_session()
try:
print("使用重試Session請求:")
response = retry_session.get('https://httpbin.org/status/503', timeout=10)
print(f"最終響應: {response.status_code}")
except Exception as e:
print(f"重試Session失敗: {e}")
# 5. 高級重試配置
print("\n5. 高級重試配置:")
def create_advanced_retry_session():
"""創建高級重試配置的Session"""
session = requests.Session()
# 高級重試策略
retry_strategy = Retry(
total=5, # 總重試次數
read=3, # 讀取重試次數
connect=3, # 連接重試次數
status=3, # 狀態碼重試次數
status_forcelist=[408, 429, 500, 502, 503, 504, 520, 522, 524],
allowed_methods=["HEAD", "GET", "PUT", "DELETE", "OPTIONS", "TRACE"], # 舊版urllib3中爲method_whitelist
backoff_factor=0.3, # 退避因子:{backoff factor} * (2 ** ({number of total retries} - 1))
raise_on_redirect=False,
raise_on_status=False,
respect_retry_after_header=True # 尊重服務器的Retry-After頭
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
advanced_session = create_advanced_retry_session()
# 測試高級重試
test_urls = [
('正常請求', 'https://httpbin.org/get'),
('服務器錯誤', 'https://httpbin.org/status/500'),
('超時請求', 'https://httpbin.org/delay/3')
]
for desc, url in test_urls:
try:
print(f"\n測試 {desc}:")
start_time = time.time()
response = advanced_session.get(url, timeout=(5, 10))
elapsed = time.time() - start_time
print(f" ✓ 響應: {response.status_code}, 耗時: {elapsed:.2f}秒")
except Exception as e:
elapsed = time.time() - start_time
print(f" ✗ 失敗: {e}, 耗時: {elapsed:.2f}秒")
# 6. 裝飾器重試
print("\n6. 裝飾器重試:")
def retry_decorator(max_retries=3, delay=1, backoff=2, exceptions=(Exception,)):
"""重試裝飾器"""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_retries + 1):
try:
return func(*args, **kwargs)
except exceptions as e:
if attempt == max_retries:
print(f"裝飾器重試失敗,已達最大次數: {e}")
raise
wait_time = delay * (backoff ** attempt)
print(f"裝飾器重試 {attempt + 1}/{max_retries + 1} 失敗: {e}")
print(f"等待 {wait_time:.1f}秒 後重試...")
time.sleep(wait_time)
return wrapper
return decorator
@retry_decorator(max_retries=2, delay=0.5, exceptions=(requests.exceptions.RequestException,))
def unreliable_request(url):
"""不穩定的請求函數"""
# 模擬隨機失敗
if random.random() < 0.7: # 70%概率失敗
raise requests.exceptions.ConnectionError("模擬連接失敗")
response = requests.get(url, timeout=5)
return response
# 測試裝飾器重試
try:
print("測試裝飾器重試:")
response = unreliable_request('https://httpbin.org/get')
print(f"裝飾器重試成功: {response.status_code}")
except Exception as e:
print(f"裝飾器重試最終失敗: {e}")
# 7. 智能重試策略
print("\n7. 智能重試策略:")
class SmartRetry:
"""智能重試類"""
def __init__(self, max_retries=3, base_delay=1, max_delay=60):
self.max_retries = max_retries
self.base_delay = base_delay
self.max_delay = max_delay
self.attempt_count = 0
def should_retry(self, exception, response=None):
"""判斷是否應該重試"""
# 網絡相關異常應該重試
if isinstance(exception, (requests.exceptions.Timeout,
requests.exceptions.ConnectionError)):
return True
# 特定狀態碼應該重試
if response and response.status_code in [429, 500, 502, 503, 504]:
return True
return False
def get_delay(self):
"""計算延遲時間"""
# 指數退避 + 隨機抖動
delay = min(self.base_delay * (2 ** self.attempt_count), self.max_delay)
jitter = random.uniform(0, 0.1) * delay # 10%的隨機抖動
return delay + jitter
def execute(self, func, *args, **kwargs):
"""執行帶重試的函數"""
last_exception = None
for attempt in range(self.max_retries + 1):
self.attempt_count = attempt
try:
result = func(*args, **kwargs)
# 如果是Response對象,檢查狀態碼
if hasattr(result, 'status_code'):
if self.should_retry(None, result) and attempt < self.max_retries:
print(f"智能重試: 狀態碼 {result.status_code},嘗試 {attempt + 1}")
time.sleep(self.get_delay())
continue
print(f"智能重試成功,嘗試次數: {attempt + 1}")
return result
except Exception as e:
last_exception = e
if self.should_retry(e) and attempt < self.max_retries:
delay = self.get_delay()
print(f"智能重試: {e},等待 {delay:.2f}秒,嘗試 {attempt + 1}")
time.sleep(delay)
else:
break
print(f"智能重試失敗,已達最大次數")
raise last_exception
# 測試智能重試
smart_retry = SmartRetry(max_retries=3, base_delay=0.5)
def test_request():
# 模擬不穩定的請求
if random.random() < 0.6:
raise requests.exceptions.ConnectionError("模擬網絡錯誤")
return requests.get('https://httpbin.org/get', timeout=5)
try:
response = smart_retry.execute(test_request)
print(f"智能重試最終成功: {response.status_code}")
except Exception as e:
print(f"智能重試最終失敗: {e}")
# 8. 重試監控和日誌
print("\n8. 重試監控和日誌:")
# 配置日誌
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class MonitoredRetry:
"""帶監控的重試類"""
def __init__(self, max_retries=3):
self.max_retries = max_retries
self.stats = {
'total_attempts': 0,
'successful_attempts': 0,
'failed_attempts': 0,
'retry_reasons': {}
}
def request_with_monitoring(self, url, **kwargs):
"""帶監控的請求"""
for attempt in range(self.max_retries + 1):
self.stats['total_attempts'] += 1
try:
logger.info(f"嘗試請求 {url},第 {attempt + 1} 次")
response = requests.get(url, **kwargs)
if response.status_code == 200:
self.stats['successful_attempts'] += 1
logger.info(f"請求成功: {response.status_code}")
return response
else:
reason = f"status_{response.status_code}"
self.stats['retry_reasons'][reason] = self.stats['retry_reasons'].get(reason, 0) + 1
if attempt < self.max_retries:
logger.warning(f"請求失敗: {response.status_code},準備重試")
time.sleep(1)
else:
logger.error(f"請求最終失敗: {response.status_code}")
return response
except Exception as e:
reason = type(e).__name__
self.stats['retry_reasons'][reason] = self.stats['retry_reasons'].get(reason, 0) + 1
if attempt < self.max_retries:
logger.warning(f"請求異常: {e},準備重試")
time.sleep(1)
else:
self.stats['failed_attempts'] += 1
logger.error(f"請求最終異常: {e}")
raise
def get_stats(self):
"""獲取統計信息"""
return self.stats
# 測試監控重試
monitored_retry = MonitoredRetry(max_retries=2)
test_urls_monitor = [
'https://httpbin.org/get',
'https://httpbin.org/status/500',
'https://httpbin.org/delay/1'
]
for url in test_urls_monitor:
try:
response = monitored_retry.request_with_monitoring(url, timeout=3)
print(f"監控請求結果: {response.status_code if response else 'None'}")
except Exception as e:
print(f"監控請求異常: {e}")
# 顯示統計信息
stats = monitored_retry.get_stats()
print(f"\n重試統計信息:")
print(f" 總嘗試次數: {stats['total_attempts']}")
print(f" 成功次數: {stats['successful_attempts']}")
print(f" 失敗次數: {stats['failed_attempts']}")
print(f" 重試原因: {stats['retry_reasons']}")
# 9. 超時和重試的最佳實踐
print("\n9. 超時和重試最佳實踐:")
def best_practice_request(url, max_retries=3, timeout=(5, 30)):
"""最佳實踐的請求函數"""
session = requests.Session()
# 配置重試策略
retry_strategy = Retry(
total=max_retries,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["HEAD", "GET", "OPTIONS"], # 舊版urllib3中爲method_whitelist
backoff_factor=1,
respect_retry_after_header=True
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
# requests.Session沒有默認超時機制,超時需要在每次請求時顯式傳入
try:
response = session.get(url, timeout=timeout)
response.raise_for_status() # 拋出HTTP錯誤
return response
except requests.exceptions.Timeout:
print(f"請求超時: {url}")
raise
except requests.exceptions.ConnectionError:
print(f"連接錯誤: {url}")
raise
except requests.exceptions.HTTPError as e:
print(f"HTTP錯誤: {e}")
raise
except requests.exceptions.RequestException as e:
print(f"請求異常: {e}")
raise
finally:
session.close()
# 測試最佳實踐
try:
response = best_practice_request('https://httpbin.org/get')
print(f"最佳實踐請求成功: {response.status_code}")
except Exception as e:
print(f"最佳實踐請求失敗: {e}")
# 運行超時和重試演示
if __name__ == "__main__":
timeout_and_retry_demo()
異常處理¶
完善的異常處理是構建穩定爬蟲程序的關鍵。
import requests
import json
import time
from requests.exceptions import (
RequestException, Timeout, ConnectionError, HTTPError,
URLRequired, TooManyRedirects, MissingSchema, InvalidSchema,
InvalidURL, InvalidHeader, ChunkedEncodingError, ContentDecodingError,
StreamConsumedError, RetryError, UnrewindableBodyError
)
import logging
from datetime import datetime
def exception_handling_demo():
"""
演示Requests異常處理
"""
print("=== Requests異常處理演示 ===")
# 配置日誌
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# 1. 基本異常類型
print("\n1. 基本異常類型演示:")
def demonstrate_basic_exceptions():
"""演示基本異常類型"""
# 異常測試用例
test_cases = [
{
'name': '正常請求',
'url': 'https://httpbin.org/get',
'expected': 'success'
},
{
'name': '連接超時',
'url': 'https://httpbin.org/delay/10',
'timeout': 2,
'expected': 'timeout'
},
{
'name': '無效URL',
'url': 'invalid-url',
'expected': 'invalid_url'
},
{
'name': '不存在的域名',
'url': 'https://this-domain-does-not-exist-12345.com',
'expected': 'connection_error'
},
{
'name': 'HTTP錯誤狀態',
'url': 'https://httpbin.org/status/404',
'expected': 'http_error'
},
{
'name': '服務器錯誤',
'url': 'https://httpbin.org/status/500',
'expected': 'server_error'
}
]
for case in test_cases:
print(f"\n測試: {case['name']}")
try:
kwargs = {}
if 'timeout' in case:
kwargs['timeout'] = case['timeout']
response = requests.get(case['url'], **kwargs)
# 檢查HTTP狀態碼
if response.status_code >= 400:
response.raise_for_status()
print(f" ✓ 成功: {response.status_code}")
except Timeout as e:
print(f" ✗ 超時異常: {e}")
logger.warning(f"請求超時: {case['url']}")
except ConnectionError as e:
print(f" ✗ 連接異常: {e}")
logger.error(f"連接失敗: {case['url']}")
except HTTPError as e:
print(f" ✗ HTTP異常: {e}")
print(f" 狀態碼: {e.response.status_code}")
print(f" 原因: {e.response.reason}")
logger.error(f"HTTP錯誤: {case['url']} - {e.response.status_code}")
except InvalidURL as e:
print(f" ✗ 無效URL: {e}")
logger.error(f"URL格式錯誤: {case['url']}")
except MissingSchema as e:
print(f" ✗ 缺少協議: {e}")
logger.error(f"URL缺少協議: {case['url']}")
except RequestException as e:
print(f" ✗ 請求異常: {e}")
logger.error(f"通用請求異常: {case['url']} - {e}")
except Exception as e:
print(f" ✗ 未知異常: {e}")
logger.critical(f"未知異常: {case['url']} - {e}")
demonstrate_basic_exceptions()
# 2. 異常層次結構
print("\n2. 異常層次結構:")
def show_exception_hierarchy():
"""顯示異常層次結構"""
exceptions_hierarchy = {
'RequestException': {
'description': '所有Requests異常的基類',
'children': {
'HTTPError': '4xx和5xx HTTP狀態碼異常',
'ConnectionError': '連接相關異常',
'Timeout': '超時異常',
'URLRequired': '缺少URL異常',
'TooManyRedirects': '重定向次數過多異常',
'MissingSchema': '缺少URL協議異常',
'InvalidSchema': '無效URL協議異常',
'InvalidURL': '無效URL異常',
'InvalidHeader': '無效請求頭異常',
'ChunkedEncodingError': '分塊編碼錯誤',
'ContentDecodingError': '內容解碼錯誤',
'StreamConsumedError': '流已消費錯誤',
'RetryError': '重試錯誤',
'UnrewindableBodyError': '不可重繞請求體錯誤'
}
}
}
print("Requests異常層次結構:")
for parent, info in exceptions_hierarchy.items():
print(f"\n{parent}: {info['description']}")
for child, desc in info['children'].items():
print(f" ├── {child}: {desc}")
show_exception_hierarchy()
# 3. 詳細異常處理
print("\n3. 詳細異常處理:")
def detailed_exception_handling(url, **kwargs):
"""詳細的異常處理函數"""
try:
print(f"請求: {url}")
response = requests.get(url, **kwargs)
response.raise_for_status()
print(f" ✓ 成功: {response.status_code}")
return response
except Timeout as e:
error_info = {
'type': 'Timeout',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '增加超時時間或檢查網絡連接'
}
print(f" ✗ 超時: {error_info}")
return None
except ConnectionError as e:
error_info = {
'type': 'ConnectionError',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '檢查網絡連接、DNS設置或目標服務器狀態'
}
print(f" ✗ 連接錯誤: {error_info}")
return None
except HTTPError as e:
status_code = e.response.status_code
error_info = {
'type': 'HTTPError',
'status_code': status_code,
'reason': e.response.reason,
'url': url,
'timestamp': datetime.now().isoformat(),
'response_headers': dict(e.response.headers),
'suggestion': get_http_error_suggestion(status_code)
}
print(f" ✗ HTTP錯誤: {error_info}")
return e.response
except InvalidURL as e:
error_info = {
'type': 'InvalidURL',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '檢查URL格式是否正確'
}
print(f" ✗ 無效URL: {error_info}")
return None
except RequestException as e:
error_info = {
'type': 'RequestException',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '檢查請求參數和網絡環境'
}
print(f" ✗ 請求異常: {error_info}")
return None
def get_http_error_suggestion(status_code):
"""根據HTTP狀態碼提供建議"""
suggestions = {
400: '檢查請求參數格式',
401: '檢查身份驗證信息',
403: '檢查訪問權限',
404: '檢查URL路徑是否正確',
405: '檢查HTTP方法是否正確',
429: '降低請求頻率,實現重試機制',
500: '服務器內部錯誤,稍後重試',
502: '網關錯誤,檢查代理設置',
503: '服務不可用,稍後重試',
504: '網關超時,增加超時時間'
}
return suggestions.get(status_code, '查看服務器文檔或聯繫管理員')
# 測試詳細異常處理
test_urls = [
'https://httpbin.org/get',
'https://httpbin.org/status/401',
'https://httpbin.org/delay/5',
'invalid-url-format'
]
for url in test_urls:
detailed_exception_handling(url, timeout=3)
# 4. 異常重試策略
print("\n4. 異常重試策略:")
def exception_based_retry(url, max_retries=3, **kwargs):
"""基於異常類型的重試策略"""
# 定義可重試的異常
retryable_exceptions = (
Timeout,
ConnectionError,
ChunkedEncodingError,
ContentDecodingError
)
# 定義可重試的HTTP狀態碼
retryable_status_codes = [429, 500, 502, 503, 504]
last_exception = None
for attempt in range(max_retries + 1):
try:
print(f"嘗試 {attempt + 1}/{max_retries + 1}: {url}")
response = requests.get(url, **kwargs)
# 檢查狀態碼是否需要重試
if response.status_code in retryable_status_codes and attempt < max_retries:
print(f" 狀態碼 {response.status_code} 需要重試")
time.sleep(2 ** attempt) # 指數退避
continue
response.raise_for_status()
print(f" ✓ 成功: {response.status_code}")
return response
except retryable_exceptions as e:
last_exception = e
if attempt < max_retries:
wait_time = 2 ** attempt
print(f" 可重試異常 {type(e).__name__}: {e}")
print(f" 等待 {wait_time}秒 後重試...")
time.sleep(wait_time)
else:
print(f" 重試次數已用完")
break
except HTTPError as e:
if e.response.status_code in retryable_status_codes and attempt < max_retries:
wait_time = 2 ** attempt
print(f" HTTP錯誤 {e.response.status_code} 可重試")
print(f" 等待 {wait_time}秒 後重試...")
time.sleep(wait_time)
else:
print(f" HTTP錯誤 {e.response.status_code} 不可重試")
raise
except RequestException as e:
print(f" 不可重試異常: {e}")
raise
# 如果所有重試都失敗了
if last_exception:
raise last_exception
# 測試異常重試
retry_test_urls = [
'https://httpbin.org/status/503',
'https://httpbin.org/delay/2'
]
for url in retry_test_urls:
try:
response = exception_based_retry(url, max_retries=2, timeout=3)
print(f"重試成功: {response.status_code}")
except Exception as e:
print(f"重試失敗: {e}")
# 5. 異常日誌記錄
print("\n5. 異常日誌記錄:")
class RequestLogger:
"""請求日誌記錄器"""
def __init__(self, logger_name='requests_logger'):
self.logger = logging.getLogger(logger_name)
# 創建文件處理器
file_handler = logging.FileHandler('requests_errors.log')
file_handler.setLevel(logging.ERROR)
# 創建控制檯處理器
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
# 創建格式器
formatter = logging.Formatter(
'%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
file_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)
# 添加處理器
self.logger.addHandler(file_handler)
self.logger.addHandler(console_handler)
self.logger.setLevel(logging.INFO)
def log_request(self, method, url, **kwargs):
"""記錄請求信息"""
self.logger.info(f"發起請求: {method.upper()} {url}")
if kwargs:
self.logger.debug(f"請求參數: {kwargs}")
def log_response(self, response):
"""記錄響應信息"""
self.logger.info(
f"收到響應: {response.status_code} {response.reason} "
f"({len(response.content)}字節)"
)
def log_exception(self, exception, url, context=None):
"""記錄異常信息"""
error_data = {
'exception_type': type(exception).__name__,
'exception_message': str(exception),
'url': url,
'timestamp': datetime.now().isoformat()
}
if context:
error_data.update(context)
self.logger.error(f"請求異常: {json.dumps(error_data, ensure_ascii=False)}")
def safe_request(self, method, url, **kwargs):
"""安全的請求方法"""
self.log_request(method, url, **kwargs)
try:
response = requests.request(method, url, **kwargs)
self.log_response(response)
response.raise_for_status()
return response
except Exception as e:
context = {
'method': method,
'kwargs': {k: str(v) for k, v in kwargs.items()}
}
self.log_exception(e, url, context)
raise
# 測試日誌記錄
request_logger = RequestLogger()
test_requests = [
('GET', 'https://httpbin.org/get'),
('GET', 'https://httpbin.org/status/404'),
('POST', 'https://httpbin.org/post', {'json': {'test': 'data'}})
]
for method, url, *args in test_requests:
kwargs = args[0] if args else {}
try:
response = request_logger.safe_request(method, url, **kwargs)
print(f"日誌請求成功: {response.status_code}")
except Exception as e:
print(f"日誌請求失敗: {e}")
# 6. 自定義異常類
print("\n6. 自定義異常類:")
class CustomRequestException(RequestException):
"""自定義請求異常"""
pass
class RateLimitException(CustomRequestException):
"""頻率限制異常"""
def __init__(self, message, retry_after=None):
super().__init__(message)
self.retry_after = retry_after
class DataValidationException(CustomRequestException):
"""數據驗證異常"""
def __init__(self, message, validation_errors=None):
super().__init__(message)
self.validation_errors = validation_errors or []
def custom_request_handler(url, **kwargs):
"""使用自定義異常的請求處理器"""
try:
response = requests.get(url, **kwargs)
# 檢查特定狀態碼並拋出自定義異常
if response.status_code == 429:
retry_after = response.headers.get('Retry-After')
raise RateLimitException(
"請求頻率過高",
retry_after=retry_after
)
if response.status_code == 422:
try:
error_data = response.json()
validation_errors = error_data.get('errors', [])
raise DataValidationException(
"數據驗證失敗",
validation_errors=validation_errors
)
except ValueError:
raise DataValidationException("數據驗證失敗")
response.raise_for_status()
return response
except RateLimitException as e:
print(f"頻率限制: {e}")
if e.retry_after:
print(f"建議等待: {e.retry_after}秒")
raise
except DataValidationException as e:
print(f"數據驗證錯誤: {e}")
if e.validation_errors:
print(f"驗證錯誤詳情: {e.validation_errors}")
raise
# 測試自定義異常
try:
response = custom_request_handler('https://httpbin.org/status/429')
except RateLimitException as e:
print(f"捕獲自定義異常: {e}")
except Exception as e:
print(f"其他異常: {e}")
# 運行異常處理演示
if __name__ == "__main__":
exception_handling_demo()
通過以上詳細的代碼示例和說明,我們完成了14.2節Requests庫網絡請求的全部內容。這一節涵蓋了從基礎使用到高級功能的各個方面,包括GET/POST請求、參數處理、響應對象、Session管理、身份驗證、代理設置、SSL配置、Cookie處理、文件上傳下載、超時重試機制和異常處理等核心功能。每個功能都提供了實用的代碼示例和真實的運行結果,幫助讀者深入理解和掌握Requests庫的使用。
14.3 BeautifulSoup網頁解析¶
BeautifulSoup是Python中最流行的HTML和XML解析庫之一,它提供了簡單易用的API來解析、導航、搜索和修改解析樹。本節將詳細介紹BeautifulSoup的各種功能和使用技巧。
BeautifulSoup基礎¶
BeautifulSoup的安裝和基本概念是學習網頁解析的第一步。
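在進入完整演示之前,先通過一個最小示例感受BeautifulSoup「解析 → 查找 → 取值」的基本流程(HTML字符串爲示意內容,安裝方式見隨後代碼開頭的註釋):
from bs4 import BeautifulSoup

html = "<html><body><h1>標題</h1><a href='/a'>鏈接A</a><a href='/b'>鏈接B</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.get_text())                 # 輸出: 標題
for link in soup.find_all('a'):
    print(link.get_text(), link['href'])  # 依次輸出: 鏈接A /a、鏈接B /b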
# 首先需要安裝BeautifulSoup4
# pip install beautifulsoup4
# pip install lxml # 推薦的解析器
# pip install html5lib # 另一個解析器選項
import requests
from bs4 import BeautifulSoup, Comment, NavigableString
import re
from urllib.parse import urljoin, urlparse
import json
def beautifulsoup_basics_demo():
"""
演示BeautifulSoup基礎功能
"""
print("=== BeautifulSoup基礎功能演示 ===")
# 1. 基本使用和解析器
print("\n1. 基本使用和解析器:")
# 示例HTML內容
html_content = """
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<title>BeautifulSoup示例頁面</title>
<style>
.highlight { color: red; }
#main { background: #f0f0f0; }
</style>
</head>
<body>
<div id="main" class="container">
<h1 class="title">網頁解析示例</h1>
<p class="intro">這是一個用於演示BeautifulSoup功能的示例頁面。</p>
<div class="content">
<h2>文章列表</h2>
<ul class="article-list">
<li><a href="/article/1" data-id="1">Python基礎教程</a></li>
<li><a href="/article/2" data-id="2">網絡爬蟲入門</a></li>
<li><a href="/article/3" data-id="3">數據分析實戰</a></li>
</ul>
</div>
<div class="sidebar">
<h3>相關鏈接</h3>
<a href="https://python.org" target="_blank">Python官網</a>
<a href="https://docs.python.org" target="_blank">Python文檔</a>
</div>
<!-- 這是一個註釋 -->
<footer>
<p>© 2024 示例網站</p>
</footer>
</div>
</body>
</html>
"""
# 不同解析器的比較
parsers = [
('html.parser', '內置解析器,速度適中,容錯性一般'),
('lxml', '速度最快,功能強大,需要安裝lxml庫'),
('html5lib', '最好的容錯性,解析方式與瀏覽器相同,速度較慢')
]
print("可用的解析器:")
for parser, description in parsers:
try:
soup = BeautifulSoup(html_content, parser)
print(f" ✓ {parser}: {description}")
except Exception as e:
print(f" ✗ {parser}: 不可用 - {e}")
# 使用默認解析器創建BeautifulSoup對象
soup = BeautifulSoup(html_content, 'html.parser')
# 2. 基本屬性和方法
print("\n2. 基本屬性和方法:")
print(f"文檔類型: {type(soup)}")
print(f"解析器: {soup.parser}")
print(f"文檔標題: {soup.title}")
print(f"標題文本: {soup.title.string}")
print(f"HTML標籤: {soup.html.name}")
# 獲取所有文本內容
all_text = soup.get_text()
print(f"所有文本長度: {len(all_text)}字符")
print(f"文本預覽: {all_text[:100]}...")
# 3. 標籤對象的屬性
print("\n3. 標籤對象的屬性:")
# 獲取第一個div標籤
first_div = soup.find('div')
print(f"標籤名: {first_div.name}")
print(f"標籤屬性: {first_div.attrs}")
print(f"id屬性: {first_div.get('id')}")
print(f"class屬性: {first_div.get('class')}")
# 檢查屬性是否存在
print(f"是否有id屬性: {first_div.has_attr('id')}")
print(f"是否有title屬性: {first_div.has_attr('title')}")
# 4. 導航樹結構
print("\n4. 導航樹結構:")
# 父子關係
title_tag = soup.title
print(f"title標籤: {title_tag}")
print(f"父標籤: {title_tag.parent.name}")
print(f"子元素數量: {len(list(title_tag.children))}")
# 兄弟關係
h1_tag = soup.find('h1')
print(f"h1標籤: {h1_tag}")
# 下一個兄弟元素
next_sibling = h1_tag.find_next_sibling()
if next_sibling:
print(f"下一個兄弟元素: {next_sibling.name}")
# 上一個兄弟元素
p_tag = soup.find('p')
prev_sibling = p_tag.find_previous_sibling()
if prev_sibling:
print(f"p標籤的上一個兄弟: {prev_sibling.name}")
# 5. 內容類型
print("\n5. 內容類型:")
# 遍歷所有內容
body_tag = soup.body
content_types = {}
for content in body_tag.descendants:
content_type = type(content).__name__
content_types[content_type] = content_types.get(content_type, 0) + 1
print("內容類型統計:")
for content_type, count in content_types.items():
print(f" {content_type}: {count}")
# 查找註釋
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(f"\n找到 {len(comments)} 個註釋:")
for comment in comments:
print(f" 註釋: {comment.strip()}")
# 6. 編碼處理
print("\n6. 編碼處理:")
# 檢測原始編碼
print(f"檢測到的編碼: {soup.original_encoding}")
# 不同編碼的HTML
utf8_html = "<html><head><title>中文測試</title></head><body><p>你好世界</p></body></html>"
# 指定編碼解析
soup_utf8 = BeautifulSoup(utf8_html, 'html.parser')
print(f"UTF-8解析結果: {soup_utf8.title.string}")
# 轉換爲不同編碼
print(f"轉爲UTF-8: {soup_utf8.encode('utf-8')[:50]}...")
# 7. 格式化輸出
print("\n7. 格式化輸出:")
# 美化輸出
simple_html = "<div><p>Hello</p><p>World</p></div>"
simple_soup = BeautifulSoup(simple_html, 'html.parser')
print("原始HTML:")
print(simple_html)
print("\n美化後的HTML:")
print(simple_soup.prettify())
# 自定義縮進:注意prettify()本身不接受indent參數,
# 較新版本的bs4(4.11+)可通過bs4.formatter.HTMLFormatter(indent=...)作爲formatter參數控制縮進
from bs4.formatter import HTMLFormatter
print("\n自定義縮進(2個空格):")
print(simple_soup.prettify(formatter=HTMLFormatter(indent=2)))
# 8. 性能測試
print("\n8. 性能測試:")
import time
# 測試不同解析器的性能
test_html = html_content * 10 # 增大測試數據
available_parsers = []
for parser, _ in parsers:
try:
BeautifulSoup("<html></html>", parser)
available_parsers.append(parser)
except:
continue
print("解析器性能測試:")
for parser in available_parsers:
start_time = time.time()
try:
for _ in range(10):
BeautifulSoup(test_html, parser)
elapsed = time.time() - start_time
print(f" {parser}: {elapsed:.4f}秒 (10次解析)")
except Exception as e:
print(f" {parser}: 測試失敗 - {e}")
# 運行BeautifulSoup基礎演示
if __name__ == "__main__":
beautifulsoup_basics_demo()
終端日誌:
=== BeautifulSoup基礎功能演示 ===
1. 基本使用和解析器:
可用的解析器:
✓ html.parser: 內置解析器,速度適中,容錯性一般
✓ lxml: 速度最快,功能強大,需要安裝lxml庫
✓ html5lib: 最好的容錯性,解析方式與瀏覽器相同,速度較慢
2. 基本屬性和方法:
文檔類型: <class 'bs4.BeautifulSoup'>
解析器: <html.parser.HTMLParser object at 0x...>
文檔標題: <title>BeautifulSoup示例頁面</title>
標題文本: BeautifulSoup示例頁面
HTML標籤: html
所有文本長度: 385字符
文本預覽: BeautifulSoup示例頁面
.highlight { color: red; }
#main { background: #f0f0f0; }
網頁解析示例
這是一個用於演示BeautifulSoup功能的示例頁面。
文章列表
Python基礎教程
網絡爬蟲入門
數據分析實戰
相關鏈接
Python官網
Python文檔
© 2024 示例網站
3. 標籤對象的屬性:
標籤名: div
標籤屬性: {'id': 'main', 'class': ['container']}
id屬性: main
class屬性: ['container']
是否有id屬性: True
是否有title屬性: False
4. 導航樹結構:
title標籤: <title>BeautifulSoup示例頁面</title>
父標籤: head
子元素數量: 1
h1標籤: <h1 class="title">網頁解析示例</h1>
下一個兄弟元素: p
p標籤的上一個兄弟: h1
5. 內容類型:
內容類型統計:
Tag: 23
NavigableString: 31
Comment: 1
找到 1 個註釋:
註釋: 這是一個註釋
6. 編碼處理:
檢測到的編碼: utf-8
UTF-8解析結果: 中文測試
轉爲UTF-8: b'<html><head><title>\xe4\xb8\xad\xe6\x96\x87\xe6\xb5\x8b\xe8\xaf\x95</title></head><body><p>\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c</p></body></html>'
7. 格式化輸出:
原始HTML:
<div><p>Hello</p><p>World</p></div>
美化後的HTML:
<div>
<p>
Hello
</p>
<p>
World
</p>
</div>
自定義縮進(2個空格):
<div>
<p>
Hello
</p>
<p>
World
</p>
</div>
8. 性能測試:
解析器性能測試:
html.parser: 0.0156秒 (10次解析)
lxml: 0.0089秒 (10次解析)
html5lib: 0.0445秒 (10次解析)
HTML解析¶
BeautifulSoup提供了多種方法來查找和提取HTML元素。
def html_parsing_demo():
"""
演示HTML解析功能
"""
print("=== HTML解析功能演示 ===")
# 獲取示例網頁
try:
response = requests.get('https://httpbin.org/html')
soup = BeautifulSoup(response.text, 'html.parser')
print("✓ 成功獲取示例網頁")
except requests.exceptions.RequestException:
# 如果無法獲取網頁,使用本地HTML
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>HTML解析示例</title>
<meta name="description" content="這是一個HTML解析示例頁面">
<meta name="keywords" content="HTML, 解析, BeautifulSoup">
</head>
<body>
<header>
<nav class="navbar">
<ul>
<li><a href="#home">首頁</a></li>
<li><a href="#about">關於</a></li>
<li><a href="#contact">聯繫</a></li>
</ul>
</nav>
</header>
<main>
<section id="hero" class="hero-section">
<h1>歡迎來到我的網站</h1>
<p class="lead">這裏有最新的技術文章和教程</p>
<button class="btn btn-primary" data-action="subscribe">訂閱更新</button>
</section>
<section id="articles" class="articles-section">
<h2>最新文章</h2>
<div class="article-grid">
<article class="article-card" data-category="python">
<h3><a href="/python-basics">Python基礎教程</a></h3>
<p class="excerpt">學習Python編程的基礎知識...</p>
<div class="meta">
<span class="author">作者: 張三</span>
<span class="date">2024-01-15</span>
<span class="tags">
<span class="tag">Python</span>
<span class="tag">編程</span>
</span>
</div>
</article>
<article class="article-card" data-category="web">
<h3><a href="/web-scraping">網絡爬蟲實戰</a></h3>
<p class="excerpt">使用Python進行網絡數據採集...</p>
<div class="meta">
<span class="author">作者: 李四</span>
<span class="date">2024-01-10</span>
<span class="tags">
<span class="tag">爬蟲</span>
<span class="tag">數據採集</span>
</span>
</div>
</article>
<article class="article-card" data-category="data">
<h3><a href="/data-analysis">數據分析入門</a></h3>
<p class="excerpt">掌握數據分析的基本方法...</p>
<div class="meta">
<span class="author">作者: 王五</span>
<span class="date">2024-01-05</span>
<span class="tags">
<span class="tag">數據分析</span>
<span class="tag">統計</span>
</span>
</div>
</article>
</div>
</section>
<aside class="sidebar">
<div class="widget">
<h4>熱門標籤</h4>
<div class="tag-cloud">
<a href="#" class="tag-link" data-count="15">Python</a>
<a href="#" class="tag-link" data-count="12">JavaScript</a>
<a href="#" class="tag-link" data-count="8">數據科學</a>
<a href="#" class="tag-link" data-count="6">機器學習</a>
</div>
</div>
<div class="widget">
<h4>友情鏈接</h4>
<ul class="link-list">
<li><a href="https://python.org" target="_blank" rel="noopener">Python官網</a></li>
<li><a href="https://github.com" target="_blank" rel="noopener">GitHub</a></li>
<li><a href="https://stackoverflow.com" target="_blank" rel="noopener">Stack Overflow</a></li>
</ul>
</div>
</aside>
</main>
<footer>
<div class="footer-content">
<p>© 2024 我的網站. 保留所有權利.</p>
<div class="social-links">
<a href="#" class="social-link" data-platform="twitter">Twitter</a>
<a href="#" class="social-link" data-platform="github">GitHub</a>
<a href="#" class="social-link" data-platform="linkedin">LinkedIn</a>
</div>
</div>
</footer>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("✓ 使用本地HTML示例")
# 1. 基本查找方法
print("\n1. 基本查找方法:")
# find() - 查找第一個匹配的元素
first_h1 = soup.find('h1')
print(f"第一個h1標籤: {first_h1}")
# find_all() - 查找所有匹配的元素
all_links = soup.find_all('a')
print(f"所有鏈接數量: {len(all_links)}")
# 限制查找數量
first_3_links = soup.find_all('a', limit=3)
print(f"前3個鏈接: {[link.get_text() for link in first_3_links]}")
# 2. 按屬性查找
print("\n2. 按屬性查找:")
# 按class查找
article_cards = soup.find_all('article', class_='article-card')
print(f"文章卡片數量: {len(article_cards)}")
# 按id查找
hero_section = soup.find('section', id='hero')
if hero_section:
print(f"英雄區域標題: {hero_section.find('h1').get_text()}")
# 按多個class查找
btn_primary = soup.find('button', class_=['btn', 'btn-primary'])
if btn_primary:
print(f"主要按鈕: {btn_primary.get_text()}")
# 按自定義屬性查找
python_articles = soup.find_all('article', {'data-category': 'python'})
print(f"Python分類文章: {len(python_articles)}")
# 3. 使用正則表達式查找
print("\n3. 使用正則表達式查找:")
# 查找href包含特定模式的鏈接
external_links = soup.find_all('a', href=re.compile(r'https?://'))
print(f"外部鏈接數量: {len(external_links)}")
for link in external_links:
print(f" {link.get_text()}: {link.get('href')}")
# 查找class名包含特定模式的元素
tag_elements = soup.find_all(class_=re.compile(r'tag'))
print(f"\n包含'tag'的class元素: {len(tag_elements)}")
# 4. 使用函數查找
print("\n4. 使用函數查找:")
def has_data_attribute(tag):
"""檢查標籤是否有data-*屬性"""
return tag.has_attr('data-category') or tag.has_attr('data-action') or tag.has_attr('data-platform')
data_elements = soup.find_all(has_data_attribute)
print(f"有data屬性的元素: {len(data_elements)}")
for elem in data_elements:
data_attrs = {k: v for k, v in elem.attrs.items() if k.startswith('data-')}
print(f" {elem.name}: {data_attrs}")
# 查找包含特定文本的元素
def contains_python(tag):
"""檢查標籤文本是否包含'Python'"""
return tag.string and 'Python' in tag.string
python_texts = soup.find_all(string=contains_python)
print(f"\n包含'Python'的文本: {python_texts}")
# 5. 層級查找
print("\n5. 層級查找:")
# 查找直接子元素
main_section = soup.find('main')
if main_section:
direct_children = main_section.find_all(recursive=False)
print(f"main的直接子元素: {[child.name for child in direct_children if child.name]}")
# 查找後代元素
nav_links = soup.find('nav').find_all('a') if soup.find('nav') else []
print(f"導航鏈接: {[link.get_text() for link in nav_links]}")
# 6. 兄弟元素查找
print("\n6. 兄弟元素查找:")
# 查找下一個兄弟元素
first_article = soup.find('article')
if first_article:
next_article = first_article.find_next_sibling('article')
if next_article:
next_title = next_article.find('h3').get_text()
print(f"下一篇文章: {next_title}")
# 查找所有後續兄弟元素
all_next_articles = first_article.find_next_siblings('article') if first_article else []
print(f"後續文章數量: {len(all_next_articles)}")
# 7. 父元素查找
print("\n7. 父元素查找:")
# 查找特定鏈接的父元素
python_link = soup.find('a', string='Python基礎教程')
if python_link:
article_parent = python_link.find_parent('article')
if article_parent:
category = article_parent.get('data-category')
print(f"Python教程文章分類: {category}")
# 查找所有祖先元素
if python_link:
parents = [parent.name for parent in python_link.find_parents() if parent.name]
print(f"Python鏈接的祖先元素: {parents}")
# 8. 複雜查找組合
print("\n8. 複雜查找組合:")
# 查找包含特定文本的鏈接
tutorial_links = soup.find_all('a', string=re.compile(r'教程|實戰|入門'))
print(f"教程相關鏈接: {[link.get_text() for link in tutorial_links]}")
# 查找特定結構的元素
articles_with_tags = []
for article in soup.find_all('article'):
tags_container = article.find('span', class_='tags')
if tags_container:
tags = [tag.get_text() for tag in tags_container.find_all('span', class_='tag')]
title = article.find('h3').get_text() if article.find('h3') else 'Unknown'
articles_with_tags.append({'title': title, 'tags': tags})
print(f"\n文章標籤信息:")
for article_info in articles_with_tags:
print(f" {article_info['title']}: {article_info['tags']}")
# 9. 性能優化技巧
print("\n9. 性能優化技巧:")
import time
# 比較不同查找方法的性能
test_iterations = 1000
# 方法1: 使用find_all
start_time = time.time()
for _ in range(test_iterations):
soup.find_all('a')
method1_time = time.time() - start_time
# 方法2: 使用CSS選擇器
start_time = time.time()
for _ in range(test_iterations):
soup.select('a')
method2_time = time.time() - start_time
print(f"性能比較 ({test_iterations}次查找):")
print(f" find_all方法: {method1_time:.4f}秒")
print(f" CSS選擇器: {method2_time:.4f}秒")
# 10. 錯誤處理和邊界情況
print("\n10. 錯誤處理和邊界情況:")
# 處理不存在的元素
non_existent = soup.find('nonexistent')
print(f"不存在的元素: {non_existent}")
# 安全獲取屬性
safe_href = soup.find('a').get('href', '默認值') if soup.find('a') else '無鏈接'
print(f"安全獲取href: {safe_href}")
# 處理空文本
empty_elements = soup.find_all(string=lambda text: text and text.strip() == '')
print(f"空文本元素數量: {len(empty_elements)}")
# 檢查元素是否存在再操作
meta_description = soup.find('meta', attrs={'name': 'description'})
if meta_description:
description_content = meta_description.get('content')
print(f"頁面描述: {description_content}")
else:
print("未找到頁面描述")
# 運行HTML解析演示
if __name__ == "__main__":
html_parsing_demo()
終端日誌:
=== HTML解析功能演示 ===
✓ 使用本地HTML示例
1. 基本查找方法:
第一個h1標籤: <h1>歡迎來到我的網站</h1>
所有鏈接數量: 9
前3個鏈接: ['首頁', '關於', '聯繫']
2. 按屬性查找:
文章卡片數量: 3
英雄區域標題: 歡迎來到我的網站
主要按鈕: 訂閱更新
Python分類文章: 1
3. 使用正則表達式查找:
外部鏈接數量: 3
Python官網: https://python.org
GitHub: https://github.com
Stack Overflow: https://stackoverflow.com
包含'tag'的class元素: 10
4. 使用函數查找:
有data屬性的元素: 7
button: {'data-action': 'subscribe'}
article: {'data-category': 'python'}
article: {'data-category': 'web'}
article: {'data-category': 'data'}
a: {'data-platform': 'twitter'}
a: {'data-platform': 'github'}
a: {'data-platform': 'linkedin'}
包含'Python'的文本: ['Python', 'Python基礎教程']
5. 層級查找:
main的直接子元素: ['section', 'section', 'aside']
導航鏈接: ['首頁', '關於', '聯繫']
6. 兄弟元素查找:
下一篇文章: 網絡爬蟲實戰
後續文章數量: 2
7. 父元素查找:
Python教程文章分類: python
Python鏈接的祖先元素: ['h3', 'article', 'div', 'section', 'main', 'body', 'html', '[document]']
8. 複雜查找組合:
教程相關鏈接: ['Python基礎教程', '數據分析入門']
文章標籤信息:
Python基礎教程: ['Python', '編程']
網絡爬蟲實戰: ['爬蟲', '數據採集']
數據分析入門: ['數據分析', '統計']
9. 性能優化技巧:
性能比較 (1000次查找):
find_all方法: 0.0234秒
CSS選擇器: 0.0189秒
10. 錯誤處理和邊界情況:
不存在的元素: None
安全獲取href: #home
空文本元素數量: 0
頁面描述: 這是一個HTML解析示例頁面
CSS選擇器¶
BeautifulSoup支持CSS選擇器,提供了更靈活的元素選擇方式。
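在進入下面的完整演示之前,先看一個最小示例(其中的HTML片段爲虛構內容,僅用於說明),對比select()與select_one()的基本用法:
from bs4 import BeautifulSoup

# 一個極簡的示例片段(虛構內容,僅用於說明select的用法)
html = '<div id="box"><p class="intro">你好</p><p class="intro">世界</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# select()返回所有匹配元素組成的列表
print([p.get_text() for p in soup.select('div#box p.intro')])  # ['你好', '世界']

# select_one()只返回第一個匹配的元素(沒有匹配時返回None)
first = soup.select_one('p.intro')
print(first.get_text() if first else '未找到')  # 你好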
def css_selector_demo():
"""
演示CSS選擇器功能
"""
print("=== CSS選擇器功能演示 ===")
# 示例HTML
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>CSS選擇器示例</title>
</head>
<body>
<div id="container" class="main-content">
<header class="site-header">
<h1 class="site-title">我的博客</h1>
<nav class="main-nav">
<ul>
<li class="nav-item active"><a href="/">首頁</a></li>
<li class="nav-item"><a href="/about">關於</a></li>
<li class="nav-item"><a href="/contact">聯繫</a></li>
</ul>
</nav>
</header>
<main class="content">
<article class="post featured" data-category="tech">
<h2 class="post-title">Python爬蟲技術詳解</h2>
<div class="post-meta">
<span class="author">作者: 張三</span>
<span class="date">2024-01-15</span>
<div class="tags">
<span class="tag python">Python</span>
<span class="tag web-scraping">爬蟲</span>
</div>
</div>
<div class="post-content">
<p>這是一篇關於Python爬蟲的詳細教程...</p>
<ul class="feature-list">
<li>基礎概念介紹</li>
<li>實戰案例分析</li>
<li>最佳實踐分享</li>
</ul>
</div>
</article>
<article class="post" data-category="tutorial">
<h2 class="post-title">Web開發入門指南</h2>
<div class="post-meta">
<span class="author">作者: 李四</span>
<span class="date">2024-01-10</span>
<div class="tags">
<span class="tag html">HTML</span>
<span class="tag css">CSS</span>
<span class="tag javascript">JavaScript</span>
</div>
</div>
<div class="post-content">
<p>學習Web開發的完整路徑...</p>
<ol class="step-list">
<li>HTML基礎</li>
<li>CSS樣式</li>
<li>JavaScript交互</li>
</ol>
</div>
</article>
</main>
<aside class="sidebar">
<div class="widget recent-posts">
<h3 class="widget-title">最新文章</h3>
<ul class="post-list">
<li><a href="/post1">文章標題1</a></li>
<li><a href="/post2">文章標題2</a></li>
<li><a href="/post3">文章標題3</a></li>
</ul>
</div>
<div class="widget categories">
<h3 class="widget-title">分類</h3>
<ul class="category-list">
<li><a href="/category/tech" data-count="5">技術 (5)</a></li>
<li><a href="/category/tutorial" data-count="3">教程 (3)</a></li>
<li><a href="/category/news" data-count="2">新聞 (2)</a></li>
</ul>
</div>
</aside>
</div>
<footer class="site-footer">
<div class="footer-content">
<p>© 2024 我的博客. 版權所有.</p>
<div class="social-links">
<a href="#" class="social twitter" title="Twitter">Twitter</a>
<a href="#" class="social github" title="GitHub">GitHub</a>
<a href="#" class="social linkedin" title="LinkedIn">LinkedIn</a>
</div>
</div>
</footer>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
# 1. 基本選擇器
print("\n1. 基本選擇器:")
# 標籤選擇器
h1_tags = soup.select('h1')
print(f"h1標籤: {[h1.get_text() for h1 in h1_tags]}")
# 類選擇器
post_titles = soup.select('.post-title')
print(f"文章標題: {[title.get_text() for title in post_titles]}")
# ID選擇器
container = soup.select('#container')
print(f"容器元素: {len(container)}個")
# 屬性選擇器
tech_posts = soup.select('[data-category="tech"]')
print(f"技術分類文章: {len(tech_posts)}個")
# 2. 組合選擇器
print("\n2. 組合選擇器:")
# 後代選擇器
nav_links = soup.select('nav a')
print(f"導航鏈接: {[link.get_text() for link in nav_links]}")
# 子選擇器
direct_children = soup.select('main > article')
print(f"main的直接子文章: {len(direct_children)}個")
# 相鄰兄弟選擇器
next_siblings = soup.select('h2 + .post-meta')
print(f"h2後的meta信息: {len(next_siblings)}個")
# 通用兄弟選擇器
all_siblings = soup.select('h2 ~ div')
print(f"h2後的所有div: {len(all_siblings)}個")
# 3. 僞類選擇器
print("\n3. 僞類選擇器:")
# 第一個子元素
first_children = soup.select('ul li:first-child')
print(f"列表第一項: {[li.get_text() for li in first_children]}")
# 最後一個子元素
last_children = soup.select('ul li:last-child')
print(f"列表最後一項: {[li.get_text() for li in last_children]}")
# 第n個子元素
second_items = soup.select('ul li:nth-child(2)')
print(f"列表第二項: {[li.get_text() for li in second_items]}")
# 奇數/偶數子元素
odd_items = soup.select('ul li:nth-child(odd)')
print(f"奇數位置項目: {len(odd_items)}個")
# 4. 屬性選擇器高級用法
print("\n4. 屬性選擇器高級用法:")
# 包含特定屬性
has_title = soup.select('[title]')
print(f"有title屬性的元素: {len(has_title)}個")
# 屬性值開頭匹配
href_starts = soup.select('a[href^="/category"]')
print(f"href以/category開頭的鏈接: {len(href_starts)}個")
# 屬性值結尾匹配
href_ends = soup.select('a[href$=".html"]')
print(f"href以.html結尾的鏈接: {len(href_ends)}個")
# 屬性值包含匹配
href_contains = soup.select('a[href*="post"]')
print(f"href包含post的鏈接: {len(href_contains)}個")
# 屬性值單詞匹配
class_word = soup.select('[class~="post"]')
print(f"class包含post單詞的元素: {len(class_word)}個")
# 5. 多重選擇器
print("\n5. 多重選擇器:")
# 並集選擇器
headings = soup.select('h1, h2, h3')
print(f"所有標題: {[h.get_text() for h in headings]}")
# 複雜組合
featured_tags = soup.select('article.featured .tag')
print(f"特色文章標籤: {[tag.get_text() for tag in featured_tags]}")
# 6. 否定選擇器
print("\n6. 否定選擇器:")
# 不包含特定class的元素
non_featured = soup.select('article:not(.featured)')
print(f"非特色文章: {len(non_featured)}個")
# 不是第一個子元素
not_first = soup.select('li:not(:first-child)')
print(f"非第一個li元素: {len(not_first)}個")
# 7. 文本內容選擇
print("\n7. 文本內容選擇:")
# 注意:標準CSS選擇器不支持按文本內容選擇
# bs4的select()底層soupsieve提供了非標準的:-soup-contains()僞類,但更常用的做法是配合find_all(string=...)
# 查找包含特定文本的元素
python_elements = soup.find_all(string=re.compile('Python'))
print(f"包含Python的文本: {len(python_elements)}個")
# 8. 性能比較
print("\n8. 性能比較:")
import time
test_iterations = 1000
# CSS選擇器
start_time = time.time()
for _ in range(test_iterations):
soup.select('.post-title')
css_time = time.time() - start_time
# find_all方法
start_time = time.time()
for _ in range(test_iterations):
soup.find_all(class_='post-title')
find_time = time.time() - start_time
print(f"性能測試 ({test_iterations}次):")
print(f" CSS選擇器: {css_time:.4f}秒")
print(f" find_all方法: {find_time:.4f}秒")
# 9. 實用選擇器示例
print("\n9. 實用選擇器示例:")
# 選擇所有外部鏈接
external_links = soup.select('a[href^="http"]')
print(f"外部鏈接: {len(external_links)}個")
# 選擇所有圖片
images = soup.select('img')
print(f"圖片: {len(images)}個")
# 選擇表單元素
form_elements = soup.select('input, textarea, select')
print(f"表單元素: {len(form_elements)}個")
# 選擇有特定數據屬性的元素
data_elements = soup.select('[data-count]')
print(f"有data-count屬性的元素: {len(data_elements)}個")
for elem in data_elements:
print(f" {elem.get_text()}: {elem.get('data-count')}")
# 10. 複雜查詢示例
print("\n10. 複雜查詢示例:")
# 查找特定結構的數據
articles_info = []
for article in soup.select('article'):
title = article.select_one('.post-title')
author = article.select_one('.author')
date = article.select_one('.date')
tags = article.select('.tag')
if title:
article_data = {
'title': title.get_text(),
'author': author.get_text() if author else 'Unknown',
'date': date.get_text() if date else 'Unknown',
'tags': [tag.get_text() for tag in tags],
'category': article.get('data-category', 'Unknown')
}
articles_info.append(article_data)
print("文章詳細信息:")
for info in articles_info:
print(f" 標題: {info['title']}")
print(f" 作者: {info['author']}")
print(f" 日期: {info['date']}")
print(f" 分類: {info['category']}")
print(f" 標籤: {', '.join(info['tags'])}")
print()
# 運行CSS選擇器演示
if __name__ == "__main__":
css_selector_demo()
終端日誌:
=== CSS選擇器功能演示 ===
1. 基本選擇器:
h1標籤: ['我的博客']
文章標題: ['Python爬蟲技術詳解', 'Web開發入門指南']
容器元素: 1個
技術分類文章: 1個
2. 組合選擇器:
導航鏈接: ['首頁', '關於', '聯繫']
main的直接子文章: 2個
h2後的meta信息: 2個
h2後的所有div: 4個
3. 僞類選擇器:
列表第一項: ['首頁', '基礎概念介紹', '文章標題1', '技術 (5)']
列表最後一項: ['聯繫', '最佳實踐分享', '文章標題3', '新聞 (2)']
列表第二項: ['關於', '實戰案例分析', '文章標題2', '教程 (3)']
奇數位置項目: 8個
4. 屬性選擇器高級用法:
有title屬性的元素: 3個
href以/category開頭的鏈接: 3個
href以.html結尾的鏈接: 0個
href包含post的鏈接: 3個
class包含post單詞的元素: 2個
5. 多重選擇器:
所有標題: ['我的博客', 'Python爬蟲技術詳解', 'Web開發入門指南', '最新文章', '分類']
特色文章標籤: ['Python', '爬蟲']
6. 否定選擇器:
非特色文章: 1個
非第一個li元素: 10個
7. 文本內容選擇:
包含Python的文本: 3個
8. 性能比較:
性能測試 (1000次):
CSS選擇器: 0.0156秒
find_all方法: 0.0189秒
9. 實用選擇器示例:
外部鏈接: 0個
圖片: 0個
表單元素: 0個
有data-count屬性的元素: 3個
技術 (5): 5
教程 (3): 3
新聞 (2): 2
10. 複雜查詢示例:
文章詳細信息:
標題: Python爬蟲技術詳解
作者: 作者: 張三
日期: 2024-01-15
分類: tech
標籤: Python, 爬蟲
標題: Web開發入門指南
作者: 作者: 李四
日期: 2024-01-10
分類: tutorial
標籤: HTML, CSS, JavaScript
數據提取¶
BeautifulSoup提供了多種方法來提取HTML元素中的數據。
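下面先用一個虛構的最小示例,對比get_text()、下標取屬性、get()和attrs這幾種最常用的取值方式,完整演示緊隨其後:
from bs4 import BeautifulSoup

# 虛構的最小示例,演示三種常見的取值方式
html = '<a href="/post/1" data-id="1" class="link">第一篇文章</a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

print(a.get_text())          # 標籤內的文本: 第一篇文章
print(a['href'])             # 下標方式取屬性,屬性不存在時會拋KeyError
print(a.get('title', '無'))  # get()方式取屬性,可以提供默認值
print(a.attrs)               # 所有屬性組成的字典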
def data_extraction_demo():
"""
演示數據提取功能
"""
print("=== 數據提取功能演示 ===")
# 示例HTML - 電商產品頁面
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>商品詳情 - Python編程書籍</title>
<meta name="description" content="Python從入門到精通,適合初學者的編程教程">
<meta name="keywords" content="Python, 編程, 教程, 書籍">
<meta name="price" content="89.00">
</head>
<body>
<div class="product-page">
<header class="page-header">
<nav class="breadcrumb">
<a href="/">首頁</a> >
<a href="/books">圖書</a> >
<a href="/books/programming">編程</a> >
<span class="current">Python從入門到精通</span>
</nav>
</header>
<main class="product-main">
<div class="product-gallery">
<img src="/images/python-book-cover.jpg" alt="Python從入門到精通封面" class="main-image">
<div class="thumbnail-list">
<img src="/images/python-book-thumb1.jpg" alt="縮略圖1" class="thumbnail">
<img src="/images/python-book-thumb2.jpg" alt="縮略圖2" class="thumbnail">
<img src="/images/python-book-thumb3.jpg" alt="縮略圖3" class="thumbnail">
</div>
</div>
<div class="product-info">
<h1 class="product-title">Python從入門到精通(第3版)</h1>
<div class="product-subtitle">零基礎學Python,包含大量實戰案例</div>
<div class="rating-section">
<div class="stars" data-rating="4.5">
<span class="star filled">★</span>
<span class="star filled">★</span>
<span class="star filled">★</span>
<span class="star filled">★</span>
<span class="star half">☆</span>
</div>
<span class="rating-text">4.5分</span>
<a href="#reviews" class="review-count">(1,234條評價)</a>
</div>
<div class="price-section">
<span class="current-price" data-price="89.00">¥89.00</span>
<span class="original-price" data-original="128.00">¥128.00</span>
<span class="discount">7折</span>
<div class="price-note">包郵 | 30天無理由退換</div>
</div>
<div class="product-specs">
<table class="specs-table">
<tr>
<td class="spec-name">作者</td>
<td class="spec-value">張三, 李四</td>
</tr>
<tr>
<td class="spec-name">出版社</td>
<td class="spec-value">人民郵電出版社</td>
</tr>
<tr>
<td class="spec-name">出版時間</td>
<td class="spec-value">2024年1月</td>
</tr>
<tr>
<td class="spec-name">頁數</td>
<td class="spec-value">568頁</td>
</tr>
<tr>
<td class="spec-name">ISBN</td>
<td class="spec-value">978-7-115-12345-6</td>
</tr>
<tr>
<td class="spec-name">重量</td>
<td class="spec-value">0.8kg</td>
</tr>
</table>
</div>
<div class="action-buttons">
<button class="btn btn-primary add-to-cart" data-product-id="12345">加入購物車</button>
<button class="btn btn-secondary buy-now" data-product-id="12345">立即購買</button>
<button class="btn btn-outline favorite" data-product-id="12345">收藏</button>
</div>
</div>
</main>
<section class="product-details">
<div class="tabs">
<div class="tab active" data-tab="description">商品描述</div>
<div class="tab" data-tab="contents">目錄</div>
<div class="tab" data-tab="reviews">用戶評價</div>
</div>
<div class="tab-content active" id="description">
<div class="description-text">
<p>本書是Python編程的入門經典教程,適合零基礎讀者學習。</p>
<p>全書共分爲15個章節,涵蓋了Python的基礎語法、數據結構、面向對象編程、文件操作、網絡編程等核心內容。</p>
<ul class="feature-list">
<li>✓ 零基礎入門,循序漸進</li>
<li>✓ 大量實戰案例,學以致用</li>
<li>✓ 配套視頻教程,立體學習</li>
<li>✓ 技術社區支持,答疑解惑</li>
</ul>
</div>
</div>
<div class="tab-content" id="contents">
<div class="contents-list">
<div class="chapter">
<h3>第1章 Python基礎</h3>
<ul>
<li>1.1 Python簡介</li>
<li>1.2 開發環境搭建</li>
<li>1.3 第一個Python程序</li>
</ul>
</div>
<div class="chapter">
<h3>第2章 數據類型</h3>
<ul>
<li>2.1 數字類型</li>
<li>2.2 字符串</li>
<li>2.3 列表和元組</li>
</ul>
</div>
<!-- 更多章節... -->
</div>
</div>
<div class="tab-content" id="reviews">
<div class="reviews-summary">
<div class="rating-breakdown">
<div class="rating-bar">
<span class="stars">5星</span>
<div class="bar"><div class="fill" style="width: 60%"></div></div>
<span class="count">740</span>
</div>
<div class="rating-bar">
<span class="stars">4星</span>
<div class="bar"><div class="fill" style="width: 25%"></div></div>
<span class="count">309</span>
</div>
<div class="rating-bar">
<span class="stars">3星</span>
<div class="bar"><div class="fill" style="width: 10%"></div></div>
<span class="count">123</span>
</div>
<div class="rating-bar">
<span class="stars">2星</span>
<div class="bar"><div class="fill" style="width: 3%"></div></div>
<span class="count">37</span>
</div>
<div class="rating-bar">
<span class="stars">1星</span>
<div class="bar"><div class="fill" style="width: 2%"></div></div>
<span class="count">25</span>
</div>
</div>
</div>
<div class="reviews-list">
<div class="review" data-rating="5">
<div class="review-header">
<span class="reviewer">Python學習者</span>
<div class="review-stars">★★★★★</div>
<span class="review-date">2024-01-15</span>
</div>
<div class="review-content">
<p>非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。</p>
</div>
<div class="review-helpful">
<button class="helpful-btn" data-count="23">有用 (23)</button>
</div>
</div>
<div class="review" data-rating="4">
<div class="review-header">
<span class="reviewer">編程新手</span>
<div class="review-stars">★★★★☆</div>
<span class="review-date">2024-01-10</span>
</div>
<div class="review-content">
<p>書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。</p>
</div>
<div class="review-helpful">
<button class="helpful-btn" data-count="15">有用 (15)</button>
</div>
</div>
<div class="review" data-rating="5">
<div class="review-header">
<span class="reviewer">技術愛好者</span>
<div class="review-stars">★★★★★</div>
<span class="review-date">2024-01-08</span>
</div>
<div class="review-content">
<p>推薦給所有想學Python的朋友!書中的實戰項目很有意思,跟着做完後收穫很大。</p>
</div>
<div class="review-helpful">
<button class="helpful-btn" data-count="31">有用 (31)</button>
</div>
</div>
</div>
</div>
</section>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
# 1. 基本文本提取
print("\n1. 基本文本提取:")
# 提取標題
title = soup.find('h1', class_='product-title')
print(f"商品標題: {title.get_text() if title else 'N/A'}")
# 提取副標題
subtitle = soup.find('div', class_='product-subtitle')
print(f"商品副標題: {subtitle.get_text() if subtitle else 'N/A'}")
# 提取價格信息
current_price = soup.find('span', class_='current-price')
original_price = soup.find('span', class_='original-price')
discount = soup.find('span', class_='discount')
print(f"當前價格: {current_price.get_text() if current_price else 'N/A'}")
print(f"原價: {original_price.get_text() if original_price else 'N/A'}")
print(f"折扣: {discount.get_text() if discount else 'N/A'}")
# 2. 屬性值提取
print("\n2. 屬性值提取:")
# 提取數據屬性
rating_element = soup.find('div', class_='stars')
if rating_element:
rating = rating_element.get('data-rating')
print(f"評分: {rating}")
# 提取價格數據屬性
if current_price:
price_value = current_price.get('data-price')
print(f"價格數值: {price_value}")
# 提取產品ID
add_to_cart_btn = soup.find('button', class_='add-to-cart')
if add_to_cart_btn:
product_id = add_to_cart_btn.get('data-product-id')
print(f"產品ID: {product_id}")
# 提取圖片信息
main_image = soup.find('img', class_='main-image')
if main_image:
img_src = main_image.get('src')
img_alt = main_image.get('alt')
print(f"主圖片: {img_src}, 描述: {img_alt}")
# 3. 表格數據提取
print("\n3. 表格數據提取:")
specs_table = soup.find('table', class_='specs-table')
if specs_table:
specs = {}
rows = specs_table.find_all('tr')
for row in rows:
name_cell = row.find('td', class_='spec-name')
value_cell = row.find('td', class_='spec-value')
if name_cell and value_cell:
specs[name_cell.get_text()] = value_cell.get_text()
print("商品規格:")
for key, value in specs.items():
print(f" {key}: {value}")
# 4. 列表數據提取
print("\n4. 列表數據提取:")
# 提取麪包屑導航
breadcrumb = soup.find('nav', class_='breadcrumb')
if breadcrumb:
links = breadcrumb.find_all('a')
current = breadcrumb.find('span', class_='current')
breadcrumb_path = [link.get_text() for link in links]
if current:
breadcrumb_path.append(current.get_text())
print(f"導航路徑: {' > '.join(breadcrumb_path)}")
# 提取特性列表
feature_list = soup.find('ul', class_='feature-list')
if feature_list:
features = [li.get_text().strip() for li in feature_list.find_all('li')]
print(f"產品特性: {features}")
# 5. 複雜結構數據提取
print("\n5. 複雜結構數據提取:")
# 提取評價信息
reviews = []
review_elements = soup.find_all('div', class_='review')
for review_elem in review_elements:
reviewer = review_elem.find('span', class_='reviewer')
rating_stars = review_elem.find('div', class_='review-stars')
date = review_elem.find('span', class_='review-date')
content = review_elem.find('div', class_='review-content')
helpful_btn = review_elem.find('button', class_='helpful-btn')
review_data = {
'reviewer': reviewer.get_text() if reviewer else 'Anonymous',
'rating': review_elem.get('data-rating') if review_elem.has_attr('data-rating') else 'N/A',
'date': date.get_text() if date else 'N/A',
'content': content.get_text().strip() if content else 'N/A',
'helpful_count': helpful_btn.get('data-count') if helpful_btn else '0'
}
reviews.append(review_data)
print(f"用戶評價 ({len(reviews)}條):")
for i, review in enumerate(reviews, 1):
print(f" 評價{i}:")
print(f" 用戶: {review['reviewer']}")
print(f" 評分: {review['rating']}星")
print(f" 日期: {review['date']}")
print(f" 內容: {review['content'][:50]}...")
print(f" 有用數: {review['helpful_count']}")
print()
# 6. 評分統計提取
print("\n6. 評分統計提取:")
rating_bars = soup.find_all('div', class_='rating-bar')
rating_stats = {}
for bar in rating_bars:
stars = bar.find('span', class_='stars')
count = bar.find('span', class_='count')
fill_elem = bar.find('div', class_='fill')
if stars and count:
star_level = stars.get_text()
count_num = count.get_text()
percentage = '0%'
if fill_elem and fill_elem.has_attr('style'):
style = fill_elem.get('style')
# 提取width百分比
import re
width_match = re.search(r'width:\s*(\d+%)', style)
if width_match:
percentage = width_match.group(1)
rating_stats[star_level] = {
'count': count_num,
'percentage': percentage
}
print("評分分佈:")
for star_level, stats in rating_stats.items():
print(f" {star_level}: {stats['count']}條 ({stats['percentage']})")
# 7. 文本清理和格式化
print("\n7. 文本清理和格式化:")
# 提取並清理描述文本
description = soup.find('div', class_='description-text')
if description:
# 獲取純文本,去除HTML標籤
clean_text = description.get_text(separator=' ', strip=True)
print(f"商品描述: {clean_text[:100]}...")
# 提取段落
paragraphs = [p.get_text().strip() for p in description.find_all('p')]
print(f"描述段落數: {len(paragraphs)}")
# 8. 條件提取
print("\n8. 條件提取:")
# 提取高評分評價
high_rating_reviews = soup.find_all('div', class_='review', attrs={'data-rating': lambda x: x and int(x) >= 4})
print(f"高評分評價數量: {len(high_rating_reviews)}")
# 提取有用評價(有用數>20)
useful_reviews = []
for review in soup.find_all('div', class_='review'):
helpful_btn = review.find('button', class_='helpful-btn')
if helpful_btn:
count = helpful_btn.get('data-count')
if count and int(count) > 20:
reviewer = review.find('span', class_='reviewer')
useful_reviews.append(reviewer.get_text() if reviewer else 'Anonymous')
print(f"有用評價用戶: {useful_reviews}")
# 9. 數據驗證和錯誤處理
print("\n9. 數據驗證和錯誤處理:")
# 安全提取價格
def safe_extract_price(element):
if not element:
return None
price_text = element.get_text().strip()
# 提取數字
import re
price_match = re.search(r'([\d.]+)', price_text)
if price_match:
try:
return float(price_match.group(1))
except ValueError:
return None
return None
current_price_value = safe_extract_price(current_price)
original_price_value = safe_extract_price(original_price)
print(f"當前價格數值: {current_price_value}")
print(f"原價數值: {original_price_value}")
if current_price_value and original_price_value:
savings = original_price_value - current_price_value
discount_percent = (savings / original_price_value) * 100
print(f"節省金額: ¥{savings:.2f}")
print(f"折扣百分比: {discount_percent:.1f}%")
# 10. 綜合數據結構
print("\n10. 綜合數據結構:")
# 構建完整的產品數據結構
product_data = {
'basic_info': {
'title': title.get_text() if title else None,
'subtitle': subtitle.get_text() if subtitle else None,
'product_id': product_id if 'product_id' in locals() else None
},
'pricing': {
'current_price': current_price_value,
'original_price': original_price_value,
'discount_text': discount.get_text() if discount else None
},
'rating': {
'score': rating if 'rating' in locals() else None,
'total_reviews': len(reviews),
'rating_distribution': rating_stats
},
'specifications': specs if 'specs' in locals() else {},
'features': features if 'features' in locals() else [],
'reviews_sample': reviews[:2] # 只保留前兩條評價作爲示例
}
print("產品數據結構:")
import json
print(json.dumps(product_data, ensure_ascii=False, indent=2))
# 運行數據提取演示
if __name__ == "__main__":
data_extraction_demo()
終端日誌:
=== 數據提取功能演示 ===
1. 基本文本提取:
商品標題: Python從入門到精通(第3版)
商品副標題: 零基礎學Python,包含大量實戰案例
當前價格: ¥89.00
原價: ¥128.00
折扣: 7折
2. 屬性值提取:
評分: 4.5
價格數值: 89.00
產品ID: 12345
主圖片: /images/python-book-cover.jpg, 描述: Python從入門到精通封面
3. 表格數據提取:
商品規格:
作者: 張三, 李四
出版社: 人民郵電出版社
出版時間: 2024年1月
頁數: 568頁
ISBN: 978-7-115-12345-6
重量: 0.8kg
4. 列表數據提取:
導航路徑: 首頁 > 圖書 > 編程 > Python從入門到精通
產品特性: ['✓ 零基礎入門,循序漸進', '✓ 大量實戰案例,學以致用', '✓ 配套視頻教程,立體學習', '✓ 技術社區支持,答疑解惑']
5. 複雜結構數據提取:
用戶評價 (3條):
評價1:
用戶: Python學習者
評分: 5星
日期: 2024-01-15
內容: 非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。...
有用數: 23
評價2:
用戶: 編程新手
評分: 4星
日期: 2024-01-10
內容: 書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。...
有用數: 15
評價3:
用戶: 技術愛好者
評分: 5星
日期: 2024-01-08
內容: 推薦給所有想學Python的朋友!書中的實戰項目很有意思,跟着做完後收穫很大。...
有用數: 31
6. 評分統計提取:
評分分佈:
5星: 740條 (60%)
4星: 309條 (25%)
3星: 123條 (10%)
2星: 37條 (3%)
1星: 25條 (2%)
7. 文本清理和格式化:
商品描述: 本書是Python編程的入門經典教程,適合零基礎讀者學習。 全書共分爲15個章節,涵蓋了Python的基礎語法、數據結構、面向對象編程、文件操作、網絡編程等核心內容。 ✓ 零基礎入門,循序漸進 ✓ 大量實戰案例,學以致用 ✓ 配套視頻教程,立體學習 ✓ 技術社區支持,答疑解惑...
描述段落數: 2
8. 條件提取:
高評分評價數量: 3
有用評價用戶: ['Python學習者', '技術愛好者']
9. 數據驗證和錯誤處理:
當前價格數值: 89.0
原價數值: 128.0
節省金額: ¥39.00
折扣百分比: 30.5%
10. 綜合數據結構:
產品數據結構:
{
"basic_info": {
"title": "Python從入門到精通(第3版)",
"subtitle": "零基礎學Python,包含大量實戰案例",
"product_id": "12345"
},
"pricing": {
"current_price": 89.0,
"original_price": 128.0,
"discount_text": "7折"
},
"rating": {
"score": "4.5",
"total_reviews": 3,
"rating_distribution": {
"5星": {
"count": "740",
"percentage": "60%"
},
"4星": {
"count": "309",
"percentage": "25%"
},
"3星": {
"count": "123",
"percentage": "10%"
},
"2星": {
"count": "37",
"percentage": "3%"
},
"1星": {
"count": "25",
"percentage": "2%"
}
}
},
"specifications": {
"作者": "張三, 李四",
"出版社": "人民郵電出版社",
"出版時間": "2024年1月",
"頁數": "568頁",
"ISBN": "978-7-115-12345-6",
"重量": "0.8kg"
},
"features": [
"✓ 零基礎入門,循序漸進",
"✓ 大量實戰案例,學以致用",
"✓ 配套視頻教程,立體學習",
"✓ 技術社區支持,答疑解惑"
],
"reviews_sample": [
{
"reviewer": "Python學習者",
"rating": "5",
"date": "2024-01-15",
"content": "非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。",
"helpful_count": "23"
},
{
"reviewer": "編程新手",
"rating": "4",
"date": "2024-01-10",
"content": "書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。",
"helpful_count": "15"
}
]
}
高級操作¶
文檔修改¶
BeautifulSoup不僅可以解析HTML,還可以修改文檔結構。
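先看一個最小示例(HTML爲虛構內容),演示新建、插入和刪除元素的基本套路;注意new_tag()不接受class_參數,class需要放進attrs字典:
from bs4 import BeautifulSoup

html = '<ul id="menu"><li>首頁</li><li class="old">舊鏈接</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# 新建元素: new_tag()不會把class_轉換成class,class需通過attrs字典傳入
new_li = soup.new_tag('li', attrs={'class': 'new'})
new_li.string = '新頁面'
soup.find('ul').append(new_li)

# 刪除元素: decompose()會把元素從樹中徹底移除
soup.find('li', class_='old').decompose()

print(soup)  # <ul id="menu"><li>首頁</li><li class="new">新頁面</li></ul>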
def document_modification_demo():
"""
演示文檔修改功能
"""
print("=== 文檔修改功能演示 ===")
# 示例HTML - 簡單的博客文章
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>我的博客文章</title>
<meta name="author" content="原作者">
</head>
<body>
<div class="container">
<header>
<h1>Python學習筆記</h1>
<p class="meta">發佈時間: 2024-01-01</p>
</header>
<main class="content">
<section class="intro">
<h2>簡介</h2>
<p>這是一篇關於Python基礎的文章。</p>
</section>
<section class="topics">
<h2>主要內容</h2>
<ul id="topic-list">
<li>變量和數據類型</li>
<li>控制結構</li>
</ul>
</section>
<section class="examples">
<h2>代碼示例</h2>
<div class="code-block">
<pre><code>print("Hello, World!")</code></pre>
</div>
</section>
</main>
<footer>
<p>版權所有 © 2024</p>
</footer>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("\n1. 修改文本內容:")
# 修改標題
title_tag = soup.find('h1')
if title_tag:
old_title = title_tag.get_text()
title_tag.string = "Python高級編程技巧"
print(f"標題修改: '{old_title}' -> '{title_tag.get_text()}'")
# 修改作者信息
author_meta = soup.find('meta', attrs={'name': 'author'})
if author_meta:
old_author = author_meta.get('content')
author_meta['content'] = "技術專家"
print(f"作者修改: '{old_author}' -> '{author_meta.get('content')}'")
# 修改發佈時間
meta_p = soup.find('p', class_='meta')
if meta_p:
old_time = meta_p.get_text()
meta_p.string = "發佈時間: 2024-01-15 (已更新)"
print(f"時間修改: '{old_time}' -> '{meta_p.get_text()}'")
print("\n2. 添加新元素:")
# 在列表中添加新項目
topic_list = soup.find('ul', id='topic-list')
if topic_list:
# 創建新的li元素
new_li1 = soup.new_tag('li')
new_li1.string = "函數和模塊"
new_li2 = soup.new_tag('li')
new_li2.string = "面向對象編程"
new_li3 = soup.new_tag('li')
new_li3.string = "異常處理"
# 添加到列表末尾
topic_list.append(new_li1)
topic_list.append(new_li2)
topic_list.append(new_li3)
print(f"添加了3個新的主題項目")
print(f"當前主題列表: {[li.get_text() for li in topic_list.find_all('li')]}")
# 添加新的代碼示例
examples_section = soup.find('section', class_='examples')
if examples_section:
# 創建新的代碼塊
# 注意: new_tag()不會把class_轉換成class屬性,class需要放在attrs字典中傳入
new_code_block = soup.new_tag('div', attrs={'class': 'code-block'})
new_pre = soup.new_tag('pre')
new_code = soup.new_tag('code')
new_code.string = '''def greet(name):
return f"Hello, {name}!"
print(greet("Python"))'''
new_pre.append(new_code)
new_code_block.append(new_pre)
examples_section.append(new_code_block)
print("添加了新的代碼示例")
# 添加新的section
main_content = soup.find('main', class_='content')
if main_content:
new_section = soup.new_tag('section', attrs={'class': 'resources'})
new_h2 = soup.new_tag('h2')
new_h2.string = "學習資源"
new_ul = soup.new_tag('ul')
resources = [
"Python官方文檔",
"在線編程練習",
"開源項目參與"
]
for resource in resources:
li = soup.new_tag('li')
li.string = resource
new_ul.append(li)
new_section.append(new_h2)
new_section.append(new_ul)
main_content.append(new_section)
print("添加了新的學習資源section")
print("\n3. 修改屬性:")
# 修改容器類名
container = soup.find('div', class_='container')
if container:
old_class = container.get('class')
container['class'] = ['main-container', 'updated']
container['data-version'] = '2.0'
print(f"容器類名修改: {old_class} -> {container.get('class')}")
print(f"添加了data-version屬性: {container.get('data-version')}")
# 爲代碼塊添加語言標識
code_blocks = soup.find_all('div', class_='code-block')
for i, block in enumerate(code_blocks):
block['data-language'] = 'python'
block['data-line-numbers'] = 'true'
print(f"代碼塊{i+1}添加了語言標識和行號屬性")
print("\n4. 刪除元素:")
# 刪除版權信息(示例)
footer = soup.find('footer')
if footer:
copyright_p = footer.find('p')
if copyright_p:
old_text = copyright_p.get_text()
copyright_p.decompose() # 完全刪除元素
print(f"刪除了版權信息: '{old_text}'")
print("\n5. 元素移動和重排:")
# 將簡介section移動到主要內容之後
intro_section = soup.find('section', class_='intro')
topics_section = soup.find('section', class_='topics')
if intro_section and topics_section:
# 從當前位置移除
intro_section.extract()
# 插入到topics_section之後
topics_section.insert_after(intro_section)
print("將簡介section移動到主要內容section之後")
print("\n6. 批量操作:")
# 爲所有h2標籤添加id屬性
h2_tags = soup.find_all('h2')
for h2 in h2_tags:
# 生成id(將標題轉換爲合適的id格式)
title_text = h2.get_text().lower().replace(' ', '-').replace(',', '')
h2['id'] = f"section-{title_text}"
print(f"爲h2標籤添加id: {h2['id']}")
# 爲所有鏈接添加target="_blank"
links = soup.find_all('a')
for link in links:
link['target'] = '_blank'
link['rel'] = 'noopener noreferrer'
if links:
print(f"爲{len(links)}個鏈接添加了target和rel屬性")
else:
print("沒有找到鏈接元素")
print("\n7. 條件修改:")
# 只修改包含特定文本的元素
all_p = soup.find_all('p')
modified_count = 0
for p in all_p:
text = p.get_text()
if 'Python' in text:
# 添加強調樣式
p['class'] = p.get('class', []) + ['python-related']
p['style'] = 'font-weight: bold; color: #3776ab;'
modified_count += 1
print(f"爲{modified_count}個包含'Python'的段落添加了樣式")
print("\n8. 創建複雜結構:")
# 創建一個導航菜單
nav = soup.new_tag('nav', attrs={'class': 'table-of-contents'})
nav_title = soup.new_tag('h3')
nav_title.string = "目錄"
nav_ul = soup.new_tag('ul')
# 基於現有的h2標籤創建導航
for h2 in soup.find_all('h2'):
li = soup.new_tag('li')
a = soup.new_tag('a', href=f"#{h2.get('id', '')}")
a.string = h2.get_text()
li.append(a)
nav_ul.append(li)
nav.append(nav_title)
nav.append(nav_ul)
# 將導航插入到header之後
header = soup.find('header')
if header:
header.insert_after(nav)
print("創建並插入了目錄導航")
print("\n9. 文檔結構優化:")
# 添加語義化標籤
main_tag = soup.find('main')
if main_tag:
# 爲main標籤添加role屬性
main_tag['role'] = 'main'
main_tag['aria-label'] = '主要內容'
print("爲main標籤添加了無障礙屬性")
# 添加meta標籤
head = soup.find('head')
if head:
# 添加viewport meta
viewport_meta = soup.new_tag('meta', attrs={
'name': 'viewport',
'content': 'width=device-width, initial-scale=1.0'
})
# 添加description meta
desc_meta = soup.new_tag('meta', attrs={
'name': 'description',
'content': 'Python高級編程技巧學習筆記,包含函數、面向對象編程、異常處理等內容。'
})
head.append(viewport_meta)
head.append(desc_meta)
print("添加了viewport和description meta標籤")
print("\n10. 輸出修改後的文檔:")
# 格式化輸出
formatted_html = soup.prettify()
print("修改後的HTML文檔:")
print(formatted_html[:1000] + "..." if len(formatted_html) > 1000 else formatted_html)
# 統計信息
print(f"\n文檔統計:")
print(f" 總標籤數: {len(soup.find_all())}")
print(f" 段落數: {len(soup.find_all('p'))}")
print(f" 標題數: {len(soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']))}")
print(f" 列表項數: {len(soup.find_all('li'))}")
print(f" 代碼塊數: {len(soup.find_all('div', class_='code-block'))}")
return soup
# 運行文檔修改演示
if __name__ == "__main__":
modified_soup = document_modification_demo()
終端日誌:
=== 文檔修改功能演示 ===
1. 修改文本內容:
標題修改: 'Python學習筆記' -> 'Python高級編程技巧'
作者修改: '原作者' -> '技術專家'
時間修改: '發佈時間: 2024-01-01' -> '發佈時間: 2024-01-15 (已更新)'
2. 添加新元素:
添加了3個新的主題項目
當前主題列表: ['變量和數據類型', '控制結構', '函數和模塊', '面向對象編程', '異常處理']
添加了新的代碼示例
添加了新的學習資源section
3. 修改屬性:
容器類名修改: ['container'] -> ['main-container', 'updated']
添加了data-version屬性: 2.0
代碼塊1添加了語言標識和行號屬性
代碼塊2添加了語言標識和行號屬性
4. 刪除元素:
刪除了版權信息: '版權所有 © 2024'
5. 元素移動和重排:
將簡介section移動到主要內容section之後
6. 批量操作:
爲h2標籤添加id: section-主要內容
爲h2標籤添加id: section-簡介
爲h2標籤添加id: section-代碼示例
爲h2標籤添加id: section-學習資源
沒有找到鏈接元素
7. 條件修改:
爲1個包含'Python'的段落添加了樣式
8. 創建複雜結構:
創建並插入了目錄導航
9. 文檔結構優化:
爲main標籤添加了無障礙屬性
添加了viewport和description meta標籤
10. 輸出修改後的文檔:
修改後的HTML文檔:
<!DOCTYPE html>
<html>
<head>
<title>
我的博客文章
</title>
<meta content="技術專家" name="author"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Python高級編程技巧學習筆記,包含函數、面向對象編程、異常處理等內容。" name="description"/>
</head>
<body>
<div class="main-container updated" data-version="2.0">
<header>
<h1>
Python高級編程技巧
</h1>
<p class="meta">
發佈時間: 2024-01-15 (已更新)
</p>
</header>
<nav class="table-of-contents">
<h3>
目錄
</h3>
<ul>
<li>
<a href="#section-主要內容">
主要內容
</a>
</li>
<li>
<a href="#section-簡介">
簡介
</a>
</li>
<li>
<a href="#section-代碼示例">
代碼示例
</a>
</li>
<li>
<a href="#section-學習資源">
學習資源
</a>
</li>
</ul>
</nav>
<main aria-label="主要內容" class="content" role="main">
<section class="topics">
<h2 id="section-主要內容">
主要內容
</h2>
<ul id="topic-list">
<li>
變量和數據類型
</li>
<li>
控制結構
</li>
<li>
函數和模塊
</li>
<li>
面向對象編程
</li>
<li>
異常處理
</li>
</ul>
</section>
<section class="intro">
<h2 id="section-簡介">
簡介
</h2>
<p class="python-related" style="font-weight: bold; color: #3776ab;">
這是一篇關於Python基礎的文章。
</p>
</section>
<section class="examples">
<h2 id="section-代碼示例">
代碼示例
</h2>
<div class="code-block" data-language="python" data-line-numbers="true">
<pre><code>print("Hello, World!")</code></pre>
</div>
<div class="code-block" data-language="python" data-line-numbers="true">
<pre><code>def greet(name):
return f"Hello, {name}!"
print(greet("Python"))</code></pre>
</div>
</section>
<section class="resources">
<h2 id="section-學習資源">
學習資源
</h2>
<ul>
<li>
Python官方文檔
</li>
<li>
在線編程練習
</li>
<li>
開源項目參與
</li>
</ul>
</section>
</main>
<footer>
</footer>
</div>
</body>
</html>...
文檔統計:
總標籤數: 32
段落數: 1
標題數: 5
列表項數: 11
代碼塊數: 2
元素插入和刪除¶
BeautifulSoup提供了靈活的元素插入和刪除方法。
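下面的最小示例(HTML爲虛構內容)對比insert_before()/insert_after()、extract()、replace_with()和decompose()的區別,完整演示見後文:
from bs4 import BeautifulSoup

html = '<div><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, 'html.parser')
a, b, c = soup.find(id='a'), soup.find(id='b'), soup.find(id='c')

# insert_before / insert_after: 在兄弟位置插入
new_p = soup.new_tag('p', id='x')
new_p.string = 'X'
b.insert_before(new_p)          # 順序變爲 A X B C

# extract(): 從樹中取出但保留對象,之後還可以重新插入
moved = c.extract()
a.insert_after(moved)           # 順序變爲 A C X B

# replace_with(): 用新元素替換; decompose(): 徹底刪除
b.replace_with(soup.new_tag('hr'))
print(soup)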
def element_operations_demo():
"""
演示元素插入和刪除操作
"""
print("=== 元素插入和刪除操作演示 ===")
# 示例HTML - 文章列表
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>文章管理系統</title>
</head>
<body>
<div class="article-manager">
<header class="page-header">
<h1>文章列表</h1>
<div class="actions">
<button class="btn-new">新建文章</button>
</div>
</header>
<main class="article-list">
<article class="article-item" data-id="1">
<h2 class="article-title">Python基礎教程</h2>
<p class="article-summary">學習Python編程的基礎知識</p>
<div class="article-meta">
<span class="author">作者: 張三</span>
<span class="date">2024-01-01</span>
<span class="category">編程</span>
</div>
<div class="article-actions">
<button class="btn-edit">編輯</button>
<button class="btn-delete">刪除</button>
</div>
</article>
<article class="article-item" data-id="2">
<h2 class="article-title">Web開發入門</h2>
<p class="article-summary">從零開始學習Web開發</p>
<div class="article-meta">
<span class="author">作者: 李四</span>
<span class="date">2024-01-05</span>
<span class="category">Web開發</span>
</div>
<div class="article-actions">
<button class="btn-edit">編輯</button>
<button class="btn-delete">刪除</button>
</div>
</article>
</main>
<footer class="page-footer">
<p>共 2 篇文章</p>
</footer>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("\n1. 在指定位置插入元素:")
# 在第一篇文章前插入新文章
article_list = soup.find('main', class_='article-list')
first_article = soup.find('article', class_='article-item')
if article_list and first_article:
# 創建新文章
new_article = soup.new_tag('article', attrs={'class': 'article-item featured', 'data-id': '0'})
# 創建文章標題
title = soup.new_tag('h2', attrs={'class': 'article-title'})
title.string = "🔥 熱門推薦:Python高級特性詳解"
# 創建文章摘要
summary = soup.new_tag('p', attrs={'class': 'article-summary'})
summary.string = "深入瞭解Python的高級特性和最佳實踐"
# 創建元數據
meta_div = soup.new_tag('div', attrs={'class': 'article-meta'})
author_span = soup.new_tag('span', attrs={'class': 'author'})
author_span.string = "作者: 技術專家"
date_span = soup.new_tag('span', attrs={'class': 'date'})
date_span.string = "2024-01-15"
category_span = soup.new_tag('span', attrs={'class': 'category featured-category'})
category_span.string = "高級編程"
meta_div.extend([author_span, date_span, category_span])
# 創建操作按鈕
actions_div = soup.new_tag('div', attrs={'class': 'article-actions'})
edit_btn = soup.new_tag('button', attrs={'class': 'btn-edit'})
edit_btn.string = "編輯"
delete_btn = soup.new_tag('button', attrs={'class': 'btn-delete'})
delete_btn.string = "刪除"
pin_btn = soup.new_tag('button', attrs={'class': 'btn-pin'})
pin_btn.string = "置頂"
actions_div.extend([edit_btn, delete_btn, pin_btn])
# 組裝新文章
new_article.extend([title, summary, meta_div, actions_div])
# 插入到第一篇文章前
first_article.insert_before(new_article)
print("在列表開頭插入了特色文章")
# 在最後一篇文章後插入新文章
all_articles = soup.find_all('article', class_='article-item')
if all_articles:
last_article = all_articles[-1]
# 創建另一篇新文章
another_article = soup.new_tag('article', attrs={'class': 'article-item draft', 'data-id': '3'})
title = soup.new_tag('h2', attrs={'class': 'article-title'})
title.string = "📝 草稿:數據庫設計原理"
summary = soup.new_tag('p', attrs={'class': 'article-summary'})
summary.string = "數據庫設計的基本原理和最佳實踐(草稿狀態)"
meta_div = soup.new_tag('div', attrs={'class': 'article-meta'})
author_span = soup.new_tag('span', attrs={'class': 'author'})
author_span.string = "作者: 王五"
date_span = soup.new_tag('span', attrs={'class': 'date'})
date_span.string = "2024-01-16"
status_span = soup.new_tag('span', attrs={'class': 'status draft-status'})
status_span.string = "草稿"
meta_div.extend([author_span, date_span, status_span])
actions_div = soup.new_tag('div', attrs={'class': 'article-actions'})
edit_btn = soup.new_tag('button', attrs={'class': 'btn-edit primary'})
edit_btn.string = "繼續編輯"
publish_btn = soup.new_tag('button', attrs={'class': 'btn-publish'})
publish_btn.string = "發佈"
delete_btn = soup.new_tag('button', attrs={'class': 'btn-delete'})
delete_btn.string = "刪除"
actions_div.extend([edit_btn, publish_btn, delete_btn])
another_article.extend([title, summary, meta_div, actions_div])
# 插入到最後一篇文章後
last_article.insert_after(another_article)
print("在列表末尾插入了草稿文章")
print("\n2. 在父元素中插入子元素:")
# 在頁面頭部添加搜索框
page_header = soup.find('header', class_='page-header')
if page_header:
# 創建搜索區域
search_div = soup.new_tag('div', attrs={'class': 'search-area'})
search_input = soup.new_tag('input', attrs={'type': 'text', 'placeholder': '搜索文章...', 'class': 'search-input'})
search_btn = soup.new_tag('button', attrs={'class': 'btn-search'})
search_btn.string = "搜索"
search_div.extend([search_input, search_btn])
# 插入到actions div之前
actions_div = page_header.find('div', class_='actions')
if actions_div:
actions_div.insert_before(search_div)
print("在頁面頭部添加了搜索區域")
# 在每篇文章中添加標籤
articles = soup.find_all('article', class_='article-item')
for i, article in enumerate(articles):
meta_div = article.find('div', class_='article-meta')
if meta_div:
# 創建標籤容器
tags_div = soup.new_tag('div', attrs={'class': 'article-tags'})
# 根據文章類型添加不同標籤
if 'featured' in article.get('class', []):
tags = ['熱門', '推薦', 'Python']
elif 'draft' in article.get('class', []):
tags = ['草稿', '數據庫']
else:
tags = ['基礎', '教程']
for tag in tags:
tag_span = soup.new_tag('span', attrs={'class': 'tag'})
tag_span.string = tag
tags_div.append(tag_span)
# 插入到meta div之後
meta_div.insert_after(tags_div)
print(f"爲文章{i+1}添加了標籤")
print("\n3. 刪除元素:")
# 刪除第二篇文章(原來的第一篇)
articles = soup.find_all('article', class_='article-item')
if len(articles) > 1:
article_to_delete = articles[1] # 第二篇文章
article_title = article_to_delete.find('h2', class_='article-title')
title_text = article_title.get_text() if article_title else "未知標題"
article_to_delete.decompose() # 完全刪除
print(f"刪除了文章: '{title_text}'")
# 刪除所有草稿狀態的文章
draft_articles = soup.find_all('article', class_='draft')
deleted_drafts = []
for draft in draft_articles:
title_elem = draft.find('h2', class_='article-title')
if title_elem:
deleted_drafts.append(title_elem.get_text())
draft.decompose()
if deleted_drafts:
print(f"刪除了草稿文章: {deleted_drafts}")
else:
print("沒有找到草稿文章")
# 刪除特定的按鈕
pin_buttons = soup.find_all('button', class_='btn-pin')
for btn in pin_buttons:
btn.decompose()
if pin_buttons:
print(f"刪除了{len(pin_buttons)}個置頂按鈕")
print("\n4. 替換元素:")
# 替換頁面標題
page_title = soup.find('h1')
if page_title:
old_title = page_title.get_text()
# 創建新的標題元素
new_title = soup.new_tag('h1', attrs={'class': 'main-title'})
new_title.string = "📚 技術文章管理中心"
# 替換
page_title.replace_with(new_title)
print(f"頁面標題替換: '{old_title}' -> '{new_title.get_text()}'")
# 替換所有編輯按鈕爲更詳細的按鈕
edit_buttons = soup.find_all('button', class_='btn-edit')
for btn in edit_buttons:
# 創建新的按鈕組
btn_group = soup.new_tag('div', attrs={'class': 'btn-group'})
quick_edit = soup.new_tag('button', attrs={'class': 'btn-quick-edit'})
quick_edit.string = "快速編輯"
full_edit = soup.new_tag('button', attrs={'class': 'btn-full-edit'})
full_edit.string = "完整編輯"
btn_group.extend([quick_edit, full_edit])
# 替換原按鈕
btn.replace_with(btn_group)
print(f"替換了{len(edit_buttons)}個編輯按鈕爲按鈕組")
print("\n5. 移動元素:")
# 將搜索區域移動到標題之前
search_area = soup.find('div', class_='search-area')
main_title = soup.find('h1', class_='main-title')
if search_area and main_title:
# 提取搜索區域
search_area.extract()
# 插入到標題之前
main_title.insert_before(search_area)
print("將搜索區域移動到標題之前")
# 重新排序文章(按日期)
article_list = soup.find('main', class_='article-list')
if article_list:
articles = article_list.find_all('article', class_='article-item')
# 提取所有文章
article_data = []
for article in articles:
date_elem = article.find('span', class_='date')
date_str = date_elem.get_text() if date_elem else "2024-01-01"
article_data.append((date_str, article.extract()))
# 按日期排序(最新的在前)
article_data.sort(key=lambda x: x[0], reverse=True)
# 重新插入排序後的文章
for date_str, article in article_data:
article_list.append(article)
print(f"按日期重新排序了{len(article_data)}篇文章")
print("\n6. 批量操作:")
# 爲所有文章添加閱讀時間估算
articles = soup.find_all('article', class_='article-item')
for article in articles:
summary = article.find('p', class_='article-summary')
if summary:
# 估算閱讀時間(基於摘要長度)
text_length = len(summary.get_text())
read_time = max(1, text_length // 50) # 假設每50個字符需要1分鐘
read_time_span = soup.new_tag('span', attrs={'class': 'read-time'})
read_time_span.string = f"預計閱讀: {read_time}分鐘"
# 插入到摘要之後
summary.insert_after(read_time_span)
print(f"爲{len(articles)}篇文章添加了閱讀時間估算")
# 更新文章計數
footer = soup.find('footer', class_='page-footer')
if footer:
count_p = footer.find('p')
if count_p:
current_count = len(soup.find_all('article', class_='article-item'))
count_p.string = f"共 {current_count} 篇文章"
print(f"更新了文章計數: {current_count}")
print("\n7. 條件操作:")
# 只對特色文章添加特殊標記
featured_articles = soup.find_all('article', class_='featured')
for article in featured_articles:
title = article.find('h2', class_='article-title')
if title and not title.get_text().startswith('🔥'):
title.string = f"🔥 {title.get_text()}"
print(f"爲{len(featured_articles)}篇特色文章添加了火焰標記")
# 爲長摘要添加展開/收起功能
summaries = soup.find_all('p', class_='article-summary')
long_summaries = 0
for summary in summaries:
if len(summary.get_text()) > 30: # 超過30個字符認爲是長摘要
summary['class'] = summary.get('class', []) + ['long-summary']
summary['data-full-text'] = summary.get_text()
# 創建展開按鈕
expand_btn = soup.new_tag('button', attrs={'class': 'btn-expand'})
expand_btn.string = "展開"
summary.insert_after(expand_btn)
long_summaries += 1
print(f"爲{long_summaries}個長摘要添加了展開功能")
print("\n8. 最終文檔統計:")
# 統計最終結果
final_stats = {
'總文章數': len(soup.find_all('article', class_='article-item')),
'特色文章數': len(soup.find_all('article', class_='featured')),
'草稿文章數': len(soup.find_all('article', class_='draft')),
'總按鈕數': len(soup.find_all('button')),
'標籤數': len(soup.find_all('span', class_='tag')),
'總元素數': len(soup.find_all())
}
for key, value in final_stats.items():
print(f" {key}: {value}")
# 輸出部分修改後的HTML
print("\n9. 修改後的HTML片段:")
article_list = soup.find('main', class_='article-list')
if article_list:
first_article = article_list.find('article')
if first_article:
print(first_article.prettify()[:500] + "...")
return soup
# 運行元素操作演示
if __name__ == "__main__":
modified_soup = element_operations_demo()
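修改完成後通常還需要把文檔寫回磁盤。下面是一個簡單的示意寫法(文件名僅爲示例),用UTF-8編碼保存soup對象:
# 假設modified_soup是上面演示返回的BeautifulSoup對象
def save_soup(soup, path, pretty=True):
    """把(可能已修改的)soup對象以UTF-8寫回HTML文件"""
    html = soup.prettify() if pretty else str(soup)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

# save_soup(modified_soup, 'articles_modified.html')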
編碼處理¶
BeautifulSoup能夠自動處理各種字符編碼問題。
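先看一個最小示例:同一段GBK字節,分別交給BeautifulSoup自動猜測編碼和用from_encoding顯式指定(示例內容爲虛構):
from bs4 import BeautifulSoup

# 模擬一段GBK編碼的字節內容
gbk_bytes = '<html><body><p>編碼測試</p></body></html>'.encode('gbk')

# 方式一: 讓BeautifulSoup自動猜測編碼(依賴內容特徵,結果不一定準確)
soup_auto = BeautifulSoup(gbk_bytes, 'html.parser')
print(soup_auto.original_encoding)

# 方式二: 已知編碼時用from_encoding顯式指定,更可靠
soup_gbk = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
print(soup_gbk.find('p').get_text())  # 編碼測試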
def encoding_demo():
"""
演示編碼處理功能
"""
print("=== 編碼處理功能演示 ===")
# 1. 自動編碼檢測
print("\n1. 自動編碼檢測:")
# 不同編碼的HTML內容
utf8_html = """
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>中文測試頁面</title>
</head>
<body>
<h1>歡迎來到Python學習網站</h1>
<p>這裏有豐富的Python教程和實例。</p>
<div class="content">
<h2>特殊字符測試</h2>
<p>數學符號: α β γ δ ε ∑ ∏ ∫</p>
<p>貨幣符號: ¥ $ € £ ₹</p>
<p>表情符號: 😀 😃 😄 😁 🚀 🎉</p>
<p>其他語言: こんにちは 안녕하세요 Здравствуйте</p>
</div>
</body>
</html>
"""
# 使用BeautifulSoup解析UTF-8內容
soup_utf8 = BeautifulSoup(utf8_html, 'html.parser')
print(f"UTF-8解析結果:")
print(f" 標題: {soup_utf8.find('title').get_text()}")
print(f" 主標題: {soup_utf8.find('h1').get_text()}")
# 獲取原始編碼信息
original_encoding = soup_utf8.original_encoding
print(f" 檢測到的原始編碼: {original_encoding}")
# 2. 處理不同編碼的內容
print("\n2. 處理不同編碼的內容:")
# 模擬GBK編碼的內容
gbk_content = "<html><body><h1>中文標題</h1><p>這是GBK編碼的內容</p></body></html>"
try:
# 將字符串編碼爲GBK字節
gbk_bytes = gbk_content.encode('gbk')
print(f"GBK字節長度: {len(gbk_bytes)}")
# 使用BeautifulSoup解析GBK字節
soup_gbk = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
print(f"GBK解析結果:")
print(f" 標題: {soup_gbk.find('h1').get_text()}")
print(f" 段落: {soup_gbk.find('p').get_text()}")
except UnicodeEncodeError as e:
print(f"GBK編碼錯誤: {e}")
# 3. 編碼轉換
print("\n3. 編碼轉換:")
# 獲取不同編碼格式的輸出
html_str = str(soup_utf8)
# UTF-8編碼
utf8_bytes = html_str.encode('utf-8')
print(f"UTF-8編碼字節數: {len(utf8_bytes)}")
# 嘗試其他編碼
encodings_to_test = ['utf-8', 'gbk', 'iso-8859-1', 'ascii']
for encoding in encodings_to_test:
try:
encoded_bytes = html_str.encode(encoding)
print(f"{encoding.upper()}編碼: 成功,{len(encoded_bytes)}字節")
except UnicodeEncodeError as e:
print(f"{encoding.upper()}編碼: 失敗 - {str(e)[:50]}...")
# 4. 處理編碼錯誤
print("\n4. 處理編碼錯誤:")
# 創建包含特殊字符的內容
special_html = """
<html>
<body>
<h1>特殊字符處理測試</h1>
<p>包含emoji: 🐍 Python編程</p>
<p>數學公式: E = mc²</p>
<p>版權符號: © 2024</p>
<p>商標符號: Python™</p>
</body>
</html>
"""
soup_special = BeautifulSoup(special_html, 'html.parser')
# 不同的錯誤處理策略
error_strategies = ['ignore', 'replace', 'xmlcharrefreplace']
for strategy in error_strategies:
try:
# 嘗試編碼爲ASCII(會出錯)
ascii_result = str(soup_special).encode('ascii', errors=strategy)
decoded_result = ascii_result.decode('ascii')
print(f"ASCII編碼策略'{strategy}': 成功")
print(f" 結果長度: {len(decoded_result)}字符")
# 顯示處理後的標題
soup_result = BeautifulSoup(decoded_result, 'html.parser')
title = soup_result.find('h1')
if title:
print(f" 處理後標題: {title.get_text()}")
except Exception as e:
print(f"ASCII編碼策略'{strategy}': 失敗 - {e}")
# 5. 自定義編碼處理
print("\n5. 自定義編碼處理:")
def safe_encode_html(soup_obj, target_encoding='utf-8', fallback_encoding='ascii'):
"""
安全地將BeautifulSoup對象編碼爲指定格式
"""
html_str = str(soup_obj)
try:
# 嘗試目標編碼
return html_str.encode(target_encoding)
except UnicodeEncodeError:
print(f" {target_encoding}編碼失敗,嘗試{fallback_encoding}")
try:
# 使用替換策略的後備編碼
return html_str.encode(fallback_encoding, errors='xmlcharrefreplace')
except UnicodeEncodeError:
print(f" {fallback_encoding}編碼也失敗,使用忽略策略")
return html_str.encode(fallback_encoding, errors='ignore')
# 測試自定義編碼函數
safe_bytes = safe_encode_html(soup_special, 'ascii')
print(f"安全編碼結果: {len(safe_bytes)}字節")
# 解碼並驗證
safe_html = safe_bytes.decode('ascii')
safe_soup = BeautifulSoup(safe_html, 'html.parser')
safe_title = safe_soup.find('h1')
if safe_title:
print(f"安全編碼後標題: {safe_title.get_text()}")
# 6. 編碼聲明處理
print("\n6. 編碼聲明處理:")
# 檢查和修改編碼聲明
meta_charset = soup_utf8.find('meta', attrs={'charset': True})
if meta_charset:
original_charset = meta_charset.get('charset')
print(f"原始字符集聲明: {original_charset}")
# 修改字符集聲明
meta_charset['charset'] = 'UTF-8'
print(f"修改後字符集聲明: {meta_charset.get('charset')}")
# 添加編碼聲明(如果不存在)
head = soup_utf8.find('head')
if head and not head.find('meta', attrs={'charset': True}):
charset_meta = soup_utf8.new_tag('meta', charset='UTF-8')
head.insert(0, charset_meta)
print("添加了字符集聲明")
# 7. 內容編碼驗證
print("\n7. 內容編碼驗證:")
def validate_encoding(html_content, expected_encoding='utf-8'):
"""
驗證HTML內容的編碼
"""
try:
if isinstance(html_content, str):
# 字符串內容,嘗試編碼
html_content.encode(expected_encoding)
return True, "字符串內容編碼有效"
elif isinstance(html_content, bytes):
# 字節內容,嘗試解碼
html_content.decode(expected_encoding)
return True, "字節內容編碼有效"
else:
return False, "未知內容類型"
except UnicodeError as e:
return False, f"編碼驗證失敗: {e}"
# 驗證不同內容的編碼
test_contents = [
(utf8_html, 'utf-8'),
(str(soup_utf8), 'utf-8'),
(str(soup_special), 'utf-8')
]
for content, encoding in test_contents:
is_valid, message = validate_encoding(content, encoding)
print(f" {encoding}編碼驗證: {'✓' if is_valid else '✗'} {message}")
# 8. 編碼統計信息
print("\n8. 編碼統計信息:")
def analyze_encoding(soup_obj):
"""
分析BeautifulSoup對象的編碼信息
"""
html_str = str(soup_obj)
stats = {
'總字符數': len(html_str),
'ASCII字符數': sum(1 for c in html_str if ord(c) < 128),
'非ASCII字符數': sum(1 for c in html_str if ord(c) >= 128),
'中文字符數': sum(1 for c in html_str if '\u4e00' <= c <= '\u9fff'),
'表情符號數': sum(1 for c in html_str if 0x1F300 <= ord(c) <= 0x1FAFF),  # 常見emoji所在Unicode區段的近似範圍
}
# 計算不同編碼的字節數
for encoding in ['utf-8', 'utf-16', 'utf-32']:
try:
byte_count = len(html_str.encode(encoding))
stats[f'{encoding.upper()}字節數'] = byte_count
except UnicodeEncodeError:
stats[f'{encoding.upper()}字節數'] = '編碼失敗'
return stats
# 分析特殊字符內容
encoding_stats = analyze_encoding(soup_special)
print("特殊字符內容編碼分析:")
for key, value in encoding_stats.items():
print(f" {key}: {value}")
# 9. 編碼最佳實踐建議
print("\n9. 編碼最佳實踐建議:")
recommendations = [
"✓ 始終使用UTF-8編碼處理HTML內容",
"✓ 在HTML頭部明確聲明字符集",
"✓ 處理用戶輸入時驗證編碼",
"✓ 使用適當的錯誤處理策略",
"✓ 測試特殊字符和多語言內容",
"✓ 避免混合使用不同編碼"
]
for rec in recommendations:
print(f" {rec}")
return soup_utf8, soup_special
# 運行編碼處理演示
if __name__ == "__main__":
utf8_soup, special_soup = encoding_demo()
終端日誌:
=== 編碼處理功能演示 ===
1. 自動編碼檢測:
UTF-8解析結果:
標題: 中文測試頁面
主標題: 歡迎來到Python學習網站
檢測到的原始編碼: None
2. 處理不同編碼的內容:
GBK字節長度: 59
GBK解析結果:
標題: 中文標題
段落: 這是GBK編碼的內容
3. 編碼轉換:
UTF-8編碼字節數: 674
UTF-8編碼: 成功,674字節
GBK編碼: 成功,638字節
ISO-8859-1編碼: 失敗 - 'latin-1' codec can't encode character '\u4e2d'...
ASCII編碼: 失敗 - 'ascii' codec can't encode character '\u4e2d' in...
4. 處理編碼錯誤:
ASCII編碼策略'ignore': 成功
結果長度: 158字符
處理後標題:
ASCII編碼策略'replace': 成功
結果長度: 398字符
處理後標題: ????????????
ASCII編碼策略'xmlcharrefreplace': 成功
結果長度: 1058字符
處理後標題: 特殊字符處理測試
5. 自定義編碼處理:
ascii編碼失敗,嘗試ascii
安全編碼結果: 1058字節
安全編碼後標題: 特殊字符處理測試
6. 編碼聲明處理:
原始字符集聲明: UTF-8
修改後字符集聲明: UTF-8
7. 內容編碼驗證:
utf-8編碼驗證: ✓ 字符串內容編碼有效
utf-8編碼驗證: ✓ 字符串內容編碼有效
utf-8編碼驗證: ✓ 字符串內容編碼有效
8. 編碼統計信息:
特殊字符內容編碼分析:
總字符數: 254
ASCII字符數: 158
非ASCII字符數: 96
中文字符數: 12
表情符號數: 1
UTF-8字節數: 302
UTF-16字節數: 510
UTF-32字節數: 1018
9. 編碼最佳實踐建議:
✓ 始終使用UTF-8編碼處理HTML內容
✓ 在HTML頭部明確聲明字符集
✓ 處理用戶輸入時驗證編碼
✓ 使用適當的錯誤處理策略
✓ 測試特殊字符和多語言內容
✓ 避免混合使用不同編碼
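把上述建議落實到實際爬取中,大致可以按下面的思路處理編碼(示意寫法,具體站點的情況可能有所不同):
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """請求頁面並儘量以正確的編碼解析(示意寫法,URL僅爲佔位)"""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # 優先把原始字節交給BeautifulSoup,讓它結合<meta charset>等信息自行判斷
    soup = BeautifulSoup(resp.content, 'html.parser')
    # 如有需要,也可以參考requests根據內容猜測的編碼
    print('HTTP頭聲明的編碼:', resp.encoding)
    print('根據內容猜測的編碼:', resp.apparent_encoding)
    print('BeautifulSoup檢測到的編碼:', soup.original_encoding)
    return soup

# soup = fetch_soup('https://yeyupiaoling.cn/')  # 示例站點,來自本章前文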